CN108564089B

CN108564089B - Manchu component set construction method

Info

Publication number: CN108564089B
Application number: CN201810371805.5A
Authority: CN
Inventors: 郑蕊蕊; 李敏; 贺建军; 许爽; 吴宝春
Original assignee: Dalian Minzu University
Current assignee: Dalian Minzu University
Priority date: 2018-04-24
Filing date: 2018-04-24
Publication date: 2020-10-23
Anticipated expiration: 2038-04-24
Also published as: CN108564089A

Abstract

A Manchu component set construction method, belonging to the field of character segmentation, aims to improve Manchu segmentation precision. An initial set of Manchu components is constructed and the Flag of each component is set to 0. Manchu word images are segmented using the Manchu component segmentation method, and the segmentation results are counted and analyzed: if a segmented component does not belong to the initial set, it is added to the initial set and its Flag is set to 1; if a segmented component belongs to the initial set, the Flag of the corresponding component is set to 1. The initial set is then checked for components whose Flag is 0, that is, components that never appear in the segmentation results; any such component is deleted from the initial set. Finally, the Manchu component set is sorted and output.

Description

Manchu component set construction method

Technical Field

The invention belongs to the field of character segmentation and relates to a method for recognizing printed Manchu characters based on Manchu component segmentation.

Background

Manchu is the spoken and written language used by ethnic minorities in China such as the Manchu and the Xibe. It was promoted and used as the official script during the Qing dynasty, leaving a large body of valuable Manchu literature. Because the Manchu language is now on the verge of disappearing, the rescue and protection of Manchu cultural heritage urgently needs recognition and attention from the state and society. Research on Manchu optical character recognition is therefore important for protecting and passing on the cultural heritage of the Qing dynasty. Manchu is a phonemic script with a total of 38 letters: 6 vowels, 22 consonants, and 10 special letters used only for spelling Chinese loanwords. Manchu is written from top to bottom within a column, with columns running from left to right. Manchu recognition usually first segments words into basic units (such as letters) and then recognizes those units, so improving segmentation accuracy improves recognition accuracy.

Disclosure of Invention

In order to improve Manchu segmentation precision, the invention provides the following technical scheme. A Manchu component set construction method comprises: constructing an initial set of Manchu components and setting the Flag of each Manchu component to 0; segmenting Manchu word images using a Manchu component segmentation method; and counting and analyzing the segmentation results. If a segmented component does not belong to the initial set, it is added to the initial set and its Flag is set to 1; if a segmented component belongs to the initial set, the Flag of the corresponding component is set to 1. The initial set is then checked for components whose Flag is 0, that is, components that never appear in the segmentation results; any such component is deleted from the initial set. Finally, the Manchu component set is sorted and output.

As a supplement to the technical scheme, the Manchu component set is the component set shown in attached Table 1.

Advantageous effects: using the Manchu component as the segmentation unit and forming a Manchu component set greatly reduces the over-segmentation and weak segmentation that occur in Manchu letter segmentation, so the segmentation precision is higher.

Drawings

FIG. 1 is a flow chart of Manchu component set construction;

FIG. 2 is a flow chart of Manchu component segmentation;

FIG. 3 shows examples of errors in central-axis extraction from Manchu word images by conventional methods;

FIG. 4 shows the central axis width of a Manchu word determined using the region-limited maximum run-length ratio method, in which: (1) an error example of the maximum run-length ratio method, (2) the search range defined by the invention, (3) the result of the method of the invention;

FIG. 5 shows the central-axis extraction results of the method of the invention;

FIG. 6 is a flow chart of Manchu component segmentation;

FIG. 7 shows Manchu component segmentation results, in which: (1) weak segmentation, (2) the weakly segmented region after fine segmentation, (3) over-segmentation, (4) the over-segmented regions after merging, and (5) some segmentation results.

Detailed Description

From the perspective of optical character recognition, Manchu has the following characteristics:

(1) The same Manchu letter has 4 different forms depending on its position in the word: isolated, initial, medial, and final. Counting the different forms, Manchu has 114 letter forms in total.

(2) The words in one column of a Manchu document all lie near the same central axis, and in printed Manchu documents the words of two adjacent columns essentially do not overlap, which facilitates column extraction. Manchu words in the same column are separated by gaps, which facilitates word extraction.

(3) A Manchu word is formed by connecting one or more Manchu letters along a vertical central axis, with no gap between letters in the same word. However, the junctions between letters lie on the central axis of the Manchu word image, so the pixel characteristics of the central axis can be used to segment Manchu letters.

(4) Some Manchu letters exhibit "one shape, multiple letters". For example, the glyph [glyph image] is a form of the letters a, e, and n; these can be distinguished during recognition according to the spelling rules of adjacent letters.

(5) Some Manchu letters share components. For example, the glyph [glyph image] (initial form of the letter o) can be regarded as a combination of two parts: the glyph [glyph image] (initial form of the letter e) and the glyph [glyph image] (medial form of the letter o). Consequently, over-segmentation and weak segmentation readily occur when Manchu letters are used as the basic segmentation unit.

(6) Some letter combinations are inseparable. For example, segmenting [glyph image] (bo) into [glyph image] (the letter b) and [glyph image] (the letter o) is very difficult.

Based on the above features of Manchu, this embodiment proposes reconstructing Manchu words from components: the Manchu component (hereinafter, component) is used as the basic unit for segmentation and recognition, which addresses the over-segmentation and weak segmentation caused by using Manchu letters as the basic segmentation unit. The Manchu component set draws on 3 sources: Manchu letters, parts of letters, and combinations of letters.

The purpose of constructing the Manchu component set is to reduce recognition errors caused by segmentation. As analyzed above, if letters are used as the basic segmentation unit, over-segmentation and weak segmentation are likely to occur, and a subsequent letter classifier is then likely to misrecognize the over-segmented and weakly segmented parts, or fail to recognize them at all. The Manchu component set proposed by the invention is designed with the output of the segmentation method as a guide: common over-segmentation results (parts of letters or letter combinations) and weak-segmentation results (letter combinations) are no longer treated as "wrong" but as "correct" segmentations, so that a subsequently designed classifier can recognize them, reducing recognition errors caused by segmentation errors.

The Manchu component can be understood by analogy with English word recognition. Taking the English word "study" as an example, the whole word can be recognized directly; or the word can be segmented into the letters s, t, u, d, and y, which are recognized individually and then combined into the word "study". Segmenting into letters is difficult to achieve, whereas segmenting into components is relatively easy, for example into st, u, and dy (where st, u, and dy are all components); the components are then recognized and combined into the word. Because of the features described above, however, Manchu is not as easy to segment into components as the English example.

As shown in FIG. 1, the Manchu component set is constructed as follows. With reference to the Manchu alphabet, the national standard of the People's Republic of China "Information technology: universal multiple-octet coded character set: Xibe and Manchu glyphs", and the Mongolian component sets in documents [1-2], an initial Manchu component set (hereinafter, the initial set) comprising 99 initial components is proposed, and the Flag of each Manchu component is set to 0. Manchu word images are segmented using the Manchu component segmentation method, and the segmentation results are counted and analyzed: if a segmented component does not belong to the initial set, it is added to the initial set and its Flag is set to 1; if a segmented component belongs to the initial set, the Flag of the corresponding component is set to 1. The initial set is then checked for components whose Flag is 0, that is, components that never appear in the segmentation results; any such component is deleted from the initial set. The Manchu component set is then sorted and output. The resulting Manchu component set comprises 106 components, detailed in attached Table 1. Documents [1-2] mentioned above are:

[1] Hongxi Wei, Guanglai Gao. A keyword retrieval system for historical Mongolian document images[J]. International Journal on Document Analysis and Recognition, 2014, 17(1): 33-45.

[2] Liangrui Peng, Changsong Liu, Xiaoqing Ding, Jianming Jin, Youshou Wu, Hua Wang, Yanhua Bao. Multi-font printed Mongolian document recognition system[J]. International Journal on Document Analysis and Recognition, 2010, 13(2): 93-106.
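The Flag bookkeeping of the component set construction procedure can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: components are represented by hypothetical string labels, the function name `build_component_set` is introduced here for illustration, and alphabetical sorting stands in for whatever ordering the final set uses.

```python
def build_component_set(initial_components, segmented_components):
    """Sketch of the Flag-based set construction (labels are hypothetical)."""
    # Every component of the initial set starts with Flag = 0.
    flags = {component: 0 for component in initial_components}

    # Each component observed in the segmentation results gets Flag = 1;
    # a component not yet in the set is added with Flag = 1.
    for component in segmented_components:
        flags[component] = 1

    # Components whose Flag is still 0 never appeared and are deleted;
    # the remaining set is sorted and output.
    return sorted(c for c, flag in flags.items() if flag == 1)


if __name__ == "__main__":
    print(build_component_set(["a", "b", "c"], ["b", "d", "b"]))
```

In this toy run, "a" and "c" never occur in the segmentation results and are dropped, while the new component "d" is added.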

As shown in FIG. 2, Manchu components are segmented as follows:

S1, converting the Manchu paper document, by means of a photoelectric conversion device, into a digital image document that can be stored and processed by a computer, and performing image preprocessing (smoothing and binarization) on the digital image of the Manchu document;

S2, layout analysis (tilt correction, column segmentation, and word segmentation);

S3, extracting Manchu word images;

S4, position normalization;

S5, extracting the central axis;

S6, segmenting the Manchu components according to the positional relation between the Manchu components and the central axis.

Tilt correction uses the Hough transform to determine the tilt angle of the layout, and the image is then rotated back to an upright vertical-text state. The tilt-corrected Manchu document is segmented into columns by the vertical projection method, each column is segmented into words by the horizontal projection method, the Manchu words in each column image are extracted, and the Manchu word images are position-normalized. These steps complete the preprocessing of the Manchu word image, whose height and width are denoted H and W, respectively. Note that position normalization crops away the surplus white background margin of the Manchu word image. For programming convenience, the flow in FIG. 2 shows the images after inversion: the original image is black text on a white background, whereas FIG. 2 shows white text on a black background with the surplus black margin already removed.
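The projection-based column and word segmentation and the position normalization can be illustrated with a small pure-Python sketch. It assumes a tilt-corrected, inverted binary page stored as a list of rows (1 = stroke pixel, 0 = background) with 0-based indices; the function names `projection_runs`, `crop_to_ink`, and `split_columns_and_words` are hypothetical, and tilt correction itself is omitted.

```python
def projection_runs(profile):
    """Return (start, end) index pairs of consecutive non-zero projection values."""
    runs, start = [], None
    for idx, value in enumerate(profile):
        if value and start is None:
            start = idx
        elif not value and start is not None:
            runs.append((start, idx - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs


def crop_to_ink(img):
    """Position normalization: trim background rows and columns around the strokes.
    Assumes the image contains at least one stroke pixel."""
    rows = [r for r, row in enumerate(img) if any(row)]
    cols = [c for c, col in enumerate(zip(*img)) if any(col)]
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


def split_columns_and_words(page):
    """Vertical projection separates text columns; horizontal projection separates words."""
    words = []
    col_profile = [sum(col) for col in zip(*page)]       # vertical projection
    for c0, c1 in projection_runs(col_profile):
        column = [row[c0:c1 + 1] for row in page]
        row_profile = [sum(row) for row in column]       # horizontal projection
        for r0, r1 in projection_runs(row_profile):
            words.append(crop_to_ink(column[r0:r1 + 1]))
    return words


if __name__ == "__main__":
    page = [
        [0, 1, 1, 0, 0, 1, 0],
        [0, 0, 1, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 1, 0],
    ]
    for word in split_columns_and_words(page):
        print(word)
```

Real scans contain noise, so a practical version would likely compare the projections against a small threshold rather than zero.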

In this embodiment, the axis extraction in the Manchu word image directly affects the accuracy of segmentation, and the following describes a specific scheme thereof in detail.

For central-axis extraction from the Manchu word image, i.e., step S5, the prior art generally uses the vertical projection method and the maximum cumulative vertical projection method. Both methods, however, suffer from offsets in the located axis position and errors in the estimated axis width, as shown in FIG. 3. This embodiment provides a method for extracting the central axis of a Manchu word image, comprising the following steps:

S5.1. Locating the central axis of the Manchu word image

First, the Manchu word image is inverted so that text pixels have the value 1 and background pixels have the value 0. Morphological thinning of the Manchu word image is performed with the morphological thinning function of the MATLAB Image Processing Toolbox using 3 × 3 structuring-element templates; each template contains 9 pixels, and each pixel can only take the value 0 or 1, so there are 512 possible template forms, which are divided into 8 directions. The Hough transform is then used to determine the column coordinate of the central axis in the thinned image, that is, the position of the central axis of the Manchu word image. During this extraction, the angle of the lines searched by the Hough transform is restricted to 90°, so only vertical lines are searched. Line segments at the same column position are connected: segments separated by less than the word image height H and longer than 1 pixel are treated as one line. The center position of the central axis is then calculated and denoted baseline. Note that the central axis of the Manchu word image refers to the column coordinate of the Manchu word's own central axis within the image, not the center line of the image.
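When the Hough transform is restricted to vertical lines only, each accumulator bin corresponds to one image column and simply counts the skeleton pixels in that column. The following Python sketch illustrates that degenerate case. It is a simplification, not the full procedure: it assumes the image has already been thinned, it skips the linking of line segments, and it takes the single column with the most votes as baseline. The name `locate_axis_column` is hypothetical.

```python
def locate_axis_column(skeleton):
    """Vertical-only Hough voting on a thinned binary image (1 = skeleton pixel).

    Returns the 1-based column index with the most votes, used here as a
    simplified stand-in for the center position 'baseline'."""
    width = len(skeleton[0])
    votes = [0] * width                  # one accumulator bin per column
    for row in skeleton:
        for col, pixel in enumerate(row):
            if pixel:
                votes[col] += 1
    return votes.index(max(votes)) + 1   # first column with the maximum vote


if __name__ == "__main__":
    skeleton = [
        [0, 0, 1, 0, 0],
        [0, 1, 1, 0, 0],
        [0, 0, 1, 1, 1],
        [0, 0, 1, 0, 0],
    ]
    print(locate_axis_column(skeleton))
```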

S5.2. Detecting the central axis width of the Manchu word image

S5.2.1. The maximum run-length ratio method is used for the central axis width. Each row of the Manchu word image is scanned, and the run lengths of consecutive black pixels and the number of occurrences of each run length are counted. After all rows have been scanned in sequence, the run length that occurs most often is taken as the width of the central axis of the Manchu word image, denoted w₀. The maximum run-length ratio method is effective for detecting the axis width, but errors such as the one shown in FIG. 4(1) still occur. The cause is that the method counts black-pixel runs over the entire Manchu word image, and the glyph deformation of Manchu characters in different fonts seriously interferes with these global statistics. Statistics on Manchu writing show that the axis width generally does not exceed 1/2 of the word width W. The search area of the maximum run-length ratio method is therefore restricted to the range specified by formula (1); this is called the region-limited maximum run-length ratio method.

In formula (1), sl is the left boundary of the limited search range, sr is the right boundary of the limited search range, baseline is the center position of the central axis, and round denotes rounding to the nearest integer. Limiting the search range weakens the influence of Manchu glyph deformation and branch strokes on the axis-width statistics. The maximum run-length ratio method is then applied within the limited search range to detect the central axis width of the Manchu word image; the result is shown in FIG. 4(3).

S5.2.2. The left boundary bl and the right boundary br of the central axis are calculated according to formula (2) from the central axis width baseline_width and the central axis center position baseline.
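The run-length statistics of S5.2.1 can be sketched as follows. Because formula (1) is not reproduced in this text, the sketch takes the search boundaries sl and sr as given inputs rather than computing them; the function name `axis_width` is hypothetical, and stroke pixels are assumed to be stored as 1 in a list-of-rows image with 1-based, inclusive column bounds.

```python
from collections import Counter


def axis_width(image, sl, sr):
    """Most frequent stroke run length within columns sl..sr (1-based, inclusive)."""
    counts = Counter()
    for row in image:
        run = 0
        for pixel in row[sl - 1:sr]:
            if pixel:
                run += 1
            elif run:
                counts[run] += 1     # a run just ended
                run = 0
        if run:                      # run reaching the right edge of the range
            counts[run] += 1
    if not counts:
        return 0
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Short branch strokes at both edges outnumber the axis runs globally.
    image = [[1, 0, 0, 1, 1, 0, 0, 1] for _ in range(3)]
    print(axis_width(image, 1, 8))   # statistics over the whole image
    print(axis_width(image, 3, 6))   # statistics over the limited region
```

In this toy image the global statistics are dominated by the one-pixel branch strokes, whereas the limited search range recovers the two-pixel axis, which mirrors the motivation for limiting the region.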

A total of 400 Manchu word images of different fonts and sizes were randomly selected, and the central axis was extracted using the region-limited maximum run-length ratio method of this embodiment and the vertical projection method, respectively; the results are shown in Table 1. Some examples of central axes correctly extracted by the method of the invention are shown in FIG. 5. The experimental results show that morphological thinning combined with the Hough transform accurately locates the central axis in the Manchu word image, and that the region-limited maximum run-length ratio method correctly determines the width of the central axis.

TABLE 1. Statistics of central-axis extraction results for Manchu word images

	The method of the invention	Vertical projection method
Number of correct samples	397	210
Number of wrong samples	3	190
Accuracy rate	99.25%	52.50%

In this embodiment, the accuracy of Manchu component segmentation is the bottleneck for improving Manchu recognition accuracy; the specific scheme is described in detail below.

Manchu component segmentation, step S6, as shown in FIG. 6, includes:

S6.1, coarse segmentation of Manchu components;

S6.2, weak-segmentation judgment and fine segmentation of candidate segmentation regions;

S6.3, over-segmentation judgment and merging of candidate segmentation regions.

These steps are explained in detail below:

S6.1. Coarse segmentation of Manchu components

Since Manchu components are connected along the central axis, the Manchu word is divided into 3 parts with the central axis as the center: a left part, a middle part, and a right part. The left part ranges from the 1st column to the (bl-1)-th column of the Manchu word, and the right part ranges from the (br+1)-th column to the W-th column. The left and right parts are each projected horizontally, and the projections are denoted pl and pr, respectively. The segmentation cost function for the i-th row is defined as:

Cost(i)＝pl(i)+pr(i),i＝1,2,…,H (3)

Ideally, the cost function value of a segmentation row should be 0, that is, apart from the central axis, neither the left part nor the right part has any stroke in that row. In practice, however, noise introduced by scanning, tilt correction, binarization, and other preprocessing makes this constraint too strict, which can cause serious weak segmentation. Let T1 be the coarse-segmentation threshold for Manchu components; a large number of experiments determined its value to be T1 = ⌈1/2 × baseline_width⌉. Only a row satisfying the condition

Cost(i) ≤ T1 (4)

is a candidate segmentation row, and the sequence formed by all candidate segmentation rows satisfying condition (4) is denoted Can_seg. In the experiments that determined T1, the Manchu component segmentation method was run with different multiples of baseline_width as T1, all multiples being fractions less than 1; the segmented images were compared, and the T1 that gave the better segmentation of the Manchu word images was selected as the final value.
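The cost function (3) and condition (4) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the word image is a list of rows with stroke pixels equal to 1, bl and br are 1-based column indices as in the text, T1 is taken as ⌈1/2 × baseline_width⌉, and the function name `candidate_rows` is hypothetical.

```python
import math


def candidate_rows(image, bl, br, baseline_width):
    """Return Can_seg: the 1-based rows whose cost satisfies Cost(i) <= T1."""
    t1 = math.ceil(baseline_width / 2)       # coarse-segmentation threshold T1
    can_seg = []
    for i, row in enumerate(image, start=1):
        pl = sum(row[:bl - 1])               # left part: columns 1 .. bl-1
        pr = sum(row[br:])                   # right part: columns br+1 .. W
        if pl + pr <= t1:                    # Cost(i) = pl(i) + pr(i)
            can_seg.append(i)
    return can_seg


if __name__ == "__main__":
    image = [
        [0, 0, 1, 0, 0],
        [1, 1, 1, 0, 0],
        [0, 0, 1, 1, 0],
        [1, 1, 1, 1, 1],
    ]
    # The central axis occupies column 3 only, so bl = br = 3 and the width is 1.
    print(candidate_rows(image, bl=3, br=3, baseline_width=1))
```

Row 3 has a cost of 1 and is still accepted because T1 = 1 here, which shows how the threshold tolerates a small amount of noise next to the axis.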

The set of candidate segmentation rows obtained by coarse segmentation of Manchu components can exhibit the following three situations:

1) row 1 of the image is a candidate segmentation row; this is clearly an unreasonable candidate, so the row is deleted from the candidate segmentation row set;

2) a run of consecutive adjacent rows starting at row 1 of the image, or a run of consecutive adjacent rows ending at the last row (row H), is an unreasonable candidate sub-segment, so the sub-segment is deleted from the candidate segmentation row set;

3) for any sub-segment of consecutive adjacent rows other than those in 2), only the single candidate segmentation row in the middle is needed and the rest are redundant; the middle candidate row replaces the whole sub-segment of consecutive adjacent rows.

As the above shows, Can_seg often contains redundant candidate segmentation rows. They are removed using the following strategy:

(1) if Can_seg contains only 1 candidate segmentation row and that row is row 1, delete the row; otherwise, go to step (2);

(2) search for each sub-segment conti_subseg formed by consecutive candidate segmentation rows, and delete all rows of the sub-segment if its first row is row 1 or its last row is row H; otherwise, go to step (3);

(3) replace all rows of the consecutive candidate sub-segment conti_subseg with the median of its rows sorted in ascending order (when the number of candidate rows is even, the average of the two middle values is taken and rounded up);

(4) output the new segmentation row sequence Can_seg_new, from which the redundant candidate segmentation rows have been deleted.
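The pruning strategy above can be sketched as a short Python function. Rows are 1-based as in the text; the function name `prune_candidate_rows` is hypothetical, and a single row is treated as a run of length 1, so that strategy step (1) becomes a special case of step (2).

```python
import math


def prune_candidate_rows(can_seg, height):
    """Return Can_seg_new: one representative row per run of consecutive candidates."""
    # Group the candidate rows into runs of consecutive row numbers (conti_subseg).
    runs = []
    for row in sorted(set(can_seg)):
        if runs and row == runs[-1][-1] + 1:
            runs[-1].append(row)
        else:
            runs.append([row])

    can_seg_new = []
    for run in runs:
        # Runs touching row 1 or row H are unreasonable and are dropped.
        if run[0] == 1 or run[-1] == height:
            continue
        # The median of consecutive integers is the mean of the first and last
        # row; for an even-length run this is x.5 and is rounded up.
        can_seg_new.append(math.ceil((run[0] + run[-1]) / 2))
    return can_seg_new


if __name__ == "__main__":
    print(prune_candidate_rows([1, 2, 5, 6, 7, 10, 11, 19, 20], height=20))
```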

S6.2. Weak-segmentation judgment and fine segmentation of candidate segmentation regions

Coarsely segmented Manchu components may be weakly segmented. Statistics show that the height of a Manchu component does not exceed 5 times baseline_width, so the weak-segmentation judgment threshold is set to T_less = 5. The height hl of each segmentation region in Can_seg_new is calculated, and a segmentation region whose height satisfies hl > (T_less × baseline_width) is judged to be a weakly segmented region. Each weakly segmented region is segmented a second time using the coarse segmentation method with the fine-segmentation threshold T2, and the resulting segmentation regions are stored in the sequence Seg1. The fine-segmentation threshold T2 further relaxes the constraint on candidate segmentation rows relative to coarse segmentation; its value was determined through a large number of experiments.

In the experiments that determined T2, the Manchu component segmentation method was run with different multiples of baseline_width as T2, all multiples being fractions less than 1; the segmented images were compared, and the T2 that gave the better segmentation of the Manchu word images was selected as the final value.
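The weak-segmentation test can be sketched as follows. How segmentation rows delimit regions is an assumption made here for illustration (each segmentation row closes the region above it); the function names `regions_from_cuts` and `weak_regions` are hypothetical, and the second pass with T2 is not shown because the value of T2 is not reproduced in this text.

```python
def regions_from_cuts(cut_rows, height):
    """Turn 1-based segmentation rows into (top, bottom) regions covering rows 1..H."""
    bounds = [0] + sorted(cut_rows) + [height]
    return [(bounds[k] + 1, bounds[k + 1]) for k in range(len(bounds) - 1)]


def weak_regions(regions, baseline_width, t_less=5):
    """Regions whose height hl exceeds T_less * baseline_width are weakly segmented."""
    return [(top, bottom) for top, bottom in regions
            if (bottom - top + 1) > t_less * baseline_width]


if __name__ == "__main__":
    regions = regions_from_cuts([6, 11], height=20)
    print(regions)
    print(weak_regions(regions, baseline_width=1))
```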

S6.3. Over-segmentation judgment and merging of candidate segmentation regions

After coarse and fine segmentation, the Seg1 sequence may still contain over-segmented regions. Statistics show that the height of a Manchu component is generally greater than baseline_width, so the over-segmentation judgment threshold is set to T_over = 1. The height ho of each segmentation region in Seg1 is calculated, a segmentation region whose height satisfies ho < (T_over × baseline_width) is judged to be an over-segmented region, and the over-segmented regions are merged. The following merging cases can arise:

1) counting from top to bottom, if the first segmentation region is judged to be over-segmented, it can only be merged with the 2nd region;

2) counting from bottom to top, if the 2nd region from the last is judged to be over-segmented, it can only be merged with the last region;

3) if the over-segmented region lies in the middle, both the adjacent upper region and the adjacent lower region must be considered. The height h_up of the region obtained by merging with the upper region and the height h_lw of the region obtained by merging with the lower region are calculated, and the merging scheme with the smaller merged height is selected;

4) if merging with the upper region and merging with the lower region give equal heights, so that the scheme cannot be decided according to case 3), the number of connected components after merging with the upper region and after merging with the lower region is calculated, and the merging scheme with fewer connected components is selected;

5) the segmentation rows after region merging are output.

To this end, the over-segmented regions are merged using the following rules:

(1) If the 1st segmentation region is over-segmented, merge it with the 2nd segmentation region; otherwise, go to step (2).

(2) If the 2nd segmentation region from the last is over-segmented, merge it with the last segmentation region; otherwise, go to step (3).

(3) If the over-segmented region is neither the 1st nor the 2nd from the last, calculate the heights h_up and h_lw of its 2 adjacent segmentation regions, upper and lower, respectively. If h_up < h_lw, merge with the previous (upper) segmentation region; if h_up > h_lw, merge with the next (lower) segmentation region; otherwise, go to step (4).

(4) If the heights of the 2 regions adjacent to the over-segmented region are equal, calculate the numbers of connected components num_up and num_lw after merging with the upper region and with the lower region, respectively. If num_up < num_lw, merge with the previous (upper) segmentation region; if num_up > num_lw, merge with the next (lower) segmentation region.

(5) Output the segmentation row sequence after merging the over-segmented regions.
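The merging rules can be sketched in Python as follows. Regions are (top, bottom) row pairs; `merge_over_segmented` and the optional `count_components` callback (which would count connected components in a merged region of the word image) are hypothetical names. Two behaviors are assumptions added for completeness, since the rules do not state them: an over-segmented last region is merged upward, and a tie that the connected-component count cannot break is resolved by merging upward.

```python
def _height(region):
    top, bottom = region
    return bottom - top + 1


def merge_over_segmented(regions, baseline_width, t_over=1, count_components=None):
    """Merge regions whose height ho < T_over * baseline_width into a neighbor."""
    regs = list(regions)
    k = 0
    while k < len(regs):
        if len(regs) == 1 or _height(regs[k]) >= t_over * baseline_width:
            k += 1
            continue
        last = len(regs) - 1
        if k == 0:                       # rule (1): first region merges with the 2nd
            nb = 1
        elif k == last:                  # assumption: last region merges upward
            nb = k - 1
        elif k == last - 1:              # rule (2): 2nd from last merges with the last
            nb = last
        else:
            h_up, h_lw = _height(regs[k - 1]), _height(regs[k + 1])
            if h_up != h_lw:             # rule (3): prefer the shorter neighbor
                nb = k - 1 if h_up < h_lw else k + 1
            else:                        # rule (4): fewer connected components wins
                nb = k - 1               # assumption: merge upward if still tied
                if count_components is not None:
                    num_up = count_components((regs[k - 1][0], regs[k][1]))
                    num_lw = count_components((regs[k][0], regs[k + 1][1]))
                    if num_up > num_lw:
                        nb = k + 1
        lo, hi = min(k, nb), max(k, nb)
        regs[lo:hi + 1] = [(regs[lo][0], regs[hi][1])]
        k = lo                           # re-check the merged region
    return regs


if __name__ == "__main__":
    seg1 = [(1, 1), (2, 10), (11, 12), (13, 25), (26, 26), (27, 40)]
    print(merge_over_segmented(seg1, baseline_width=3))
```

Each merge removes one region, so the loop always terminates; re-checking the merged region handles the case where a merged region is still shorter than the threshold.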

The above scheme yields the Manchu component segmentation results shown in FIG. 7: FIG. 7(1)-(2) show a weakly segmented region before and after fine segmentation, and FIG. 7(3)-(4) show over-segmented regions before and after merging.

The Manchu component segmentation result is then further processed to recognize the Manchu components. In addition to the segmentation of the Manchu word image, the recognition method comprises the following steps:

(1) Manchu component normalization

This includes position normalization and size normalization of the Manchu components.

Position normalization of a Manchu component crops away the background of the component image, using the uppermost, lowermost, leftmost, and rightmost stroke pixels as boundaries, so that only the part containing strokes is retained. Size normalization of a Manchu component scales the position-normalized images to the same size (for example, 64 pixels by 64 pixels).

(2) Manchu component feature extraction

First, features commonly used for minority-script characters are extracted separately, including contour features, grid features, directional line element features, visual direction features, and affine invariant features. These features are then fused, and principal component analysis is used to reduce the dimensionality of the fused features.

(3) Manchu component recognition

A support vector machine classifier with a Gaussian kernel function is adopted, and each Manchu component is recognized using the "one-versus-rest" multi-classifier combination rule.

(4) Post-processing after Manchu component recognition

For the recognized Manchu components, the components are recombined into words according to the recognition results of the vertically adjacent components and the spelling rules of Manchu letters, thereby achieving recognition of Manchu words.

Attached table 1:

Claims

1. A Manchu component set construction method, characterized in that:

an initial set of Manchu components is constructed and Flag = 0 is set for each Manchu component; Manchu word images are segmented using a Manchu component segmentation method, and the segmentation results are counted and analyzed: if a segmented component does not belong to the initial set, the component is added to the initial set and Flag = 1 is set for the component; if a segmented component belongs to the initial set of Manchu components, Flag = 1 is set for the corresponding component; the initial set is checked for any component with Flag = 0, that is, any component that never appears in the segmentation results, and if such a component exists, it is deleted from the initial set; the Manchu component set is sorted and output;

the Manchu component segmentation method comprises the following steps:

S1, coarse segmentation of Manchu components: the Manchu word image is divided into a left part, a middle part, and a right part with the central axis of the Manchu word image as the center, wherein the left part ranges from the 1st column to the (bl-1)-th column of the Manchu word image, and the right part ranges from the (br+1)-th column to the W-th column of the Manchu word image; the left and right parts are each projected horizontally, and the projections are denoted pl and pr, respectively; a threshold T1 is set, and only a row satisfying the condition Cost(i) ≤ T1 is a candidate segmentation row, where T1 = ⌈1/2 × baseline_width⌉ and baseline_width is the width of the central axis of the Manchu word image; the segmentation cost function of the i-th row is Cost(i) = pl(i) + pr(i), i = 1, 2, …, H; bl is the left boundary of the central axis, br is the right boundary of the central axis, W is the width of the Manchu word image, and H is the height of the Manchu word image;

S2, judgment and fine segmentation of weakly segmented regions;

S3, judgment and merging of over-segmented regions;

wherein step S3, judgment and merging of over-segmented regions, is as follows:

an over-segmentation judgment threshold T_over is set, the height ho of each segmentation region in the Seg1 sequence is calculated, and the over-segmented regions are determined by the following formula:

ho < T_over × baseline_width

a segmentation region whose height satisfies the formula is judged to be an over-segmented region;

the over-segmented regions are merged using the following rules:

(1) if the 1st segmentation region is over-segmented, it is merged with the 2nd segmentation region; otherwise, go to step (2);

(2) if the 2nd segmentation region from the last is over-segmented, it is merged with the last segmentation region; otherwise, go to step (3);

(3) if the over-segmented region is neither the 1st nor the 2nd from the last, the heights h_up and h_lw of its 2 adjacent segmentation regions, upper and lower, are calculated respectively; if h_up < h_lw, the region is merged with the previous (upper) segmentation region; if h_up > h_lw, the region is merged with the next (lower) segmentation region; otherwise, go to step (4);

(4) if the heights of the 2 regions adjacent to the over-segmented region are equal, the number of connected components num_up after merging with the upper region and the number of connected components num_lw after merging with the lower region are calculated respectively; if num_up < num_lw, the region is merged with the previous (upper) segmentation region; if num_up > num_lw, the region is merged with the next (lower) segmentation region.

2. The Manchu component set construction method according to claim 1, characterized in that the over-segmentation judgment threshold is set to T_over = 1.