CN108537229B - Manchu component segmentation-based print style Manchu recognition method - Google Patents


Info

Publication number
CN108537229B
CN108537229B (application CN201810371757.XA)
Authority
CN
China
Prior art keywords
manchu
segmentation
component
central axis
word image
Prior art date
Legal status
Active
Application number
CN201810371757.XA
Other languages
Chinese (zh)
Other versions
CN108537229A (en)
Inventor
郑蕊蕊
李敏
贺建军
许爽
吴宝春
Current Assignee
Dalian Minzu University
Original Assignee
Dalian Minzu University
Priority date
Filing date
Publication date
Application filed by Dalian Minzu University filed Critical Dalian Minzu University
Priority to CN201810371757.XA
Publication of CN108537229A
Application granted
Publication of CN108537229B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/158 Segmentation of character regions using character size, text spacings or pitch estimation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/155 Segmentation; Edge detection involving morphological operators
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)
  • Character Input (AREA)
  • Image Analysis (AREA)

Abstract

A method for recognizing printed Manchu based on Manchu component segmentation belongs to the field of character recognition and aims to improve Manchu segmentation precision. The method comprises the following steps: S1, segmenting Manchu components; S2, normalizing the Manchu components; S3, extracting and fusing Manchu component features; S4, recognizing the Manchu components; and S5, recombining the Manchu components and recognizing Manchu words. By using Manchu components as the segmentation units before recognition, the over-segmentation and weak-segmentation phenomena in the Manchu segmentation process are greatly reduced.

Description

Manchu component segmentation-based print style Manchu recognition method
Technical Field
The invention belongs to the field of character recognition, and relates to a method for recognizing printed Manchu based on the segmentation of Manchu components.
Background
Manchu is a script used by ethnic minorities in China such as the Manchu and Xibe peoples. It was promoted as an official script of the Qing dynasty, and a large number of precious Manchu documents were produced. Because the Manchu language is now on the verge of extinction, the rescue and protection of the Manchu cultural heritage urgently need the attention of the nation and society. Research on Manchu optical character recognition technology is therefore important for protecting and passing on the cultural heritage of the Qing dynasty. Manchu is a phonemic script with 38 letters in total: 6 vowels, 22 consonants, and 10 special letters dedicated to spelling Chinese loanwords. Manchu is written from top to bottom within a column, with columns running from left to right. In Manchu recognition, basic units (such as letters) are usually segmented before recognition; improving the segmentation accuracy therefore improves the recognition accuracy.
Disclosure of Invention
In order to improve the Manchu segmentation precision, the invention provides the following technical scheme: a printed Manchu recognition method based on Manchu component segmentation, comprising the following steps:
S1, segmenting Manchu components;
S2, normalizing the Manchu components;
S3, extracting and fusing Manchu component features;
S4, recognizing the Manchu components;
and S5, recombining the Manchu components and recognizing Manchu words.
As a supplement to the technical scheme, the segmentation of the Manchu component comprises two steps of extracting a central axis of the Manchu word image and segmenting the Manchu component.
As a supplement to the technical solution, the method for extracting the central axis of the Manchu word image comprises the following steps:
S1.1, positioning the central axis of the Manchu word image;
S1.2, detecting the width of the central axis in the Manchu word image.
As a further supplement to the technical solution, said step S1.1 specifically comprises:
S1.1.1, inverting the Manchu word image so that the pixel value of the character part is 1 and the pixel value of the background part is 0;
S1.1.2, using the morphological thinning function of the MATLAB image processing toolbox to perform morphological thinning of the Manchu word image;
and S1.1.3, applying the Hough transform to the morphologically thinned Manchu word image to determine the column coordinate of the thinned central axis, which is taken as the position of the central axis of the Manchu word image; the angle of the Hough transform line search is limited to 90 degrees so that only vertical lines are searched; lines at the same column position whose mutual distance is smaller than the height of the Manchu word image and whose length is larger than 1 pixel are connected into a single line, and the center position of the central axis is obtained.
As a further supplement to the technical solution, said step S1.2 specifically comprises:
S1.2.1, determining the search area of the maximum run-length ratio method;
S1.2.2, applying the maximum run-length ratio method to the Manchu word image within the search area to determine the width of the central axis of the Manchu word image;
and S1.2.3, calculating the left and right boundaries of the central axis from the center position of the central axis of the Manchu word image and the width of the central axis.
As a further supplement to the technical solution, the step S1.2.1 is specifically:
the search area of the maximum run length ratio method is determined by the range specified by the following formula:
sl = baseline - round(W/4), sr = baseline + round(W/4)
where sl is the left boundary of the defined search range, sr is the right boundary of the defined search range, baseline is the center position of the central axis, round represents rounding to the nearest integer, and W is the width of the Manchu word image.
As a further supplement to the technical solution, the step S1.2.2 applies the maximum run-length ratio method: each row of the search area of the Manchu word image is scanned, and the run lengths of consecutive black pixels and their frequencies of occurrence are counted; the run length with the highest frequency is taken as the width of the central axis of the Manchu word image.
As a further supplement to the technical solution, the left boundary and the right boundary of the central axis in step S1.2.3 are calculated by the following formula:
bl = baseline - round(baseline_width/2), br = baseline + round(baseline_width/2)
wherein: bl is the left boundary of the central axis, br is the right boundary of the central axis, baseline is the center position of the central axis of the Manchu word image, baseline_width is the width of the central axis of the Manchu word image, and round represents rounding to the nearest integer.
As a supplement to the technical solution, the method for segmenting the Manchu components comprises the following steps:
S1, roughly segmenting the Manchu components;
S2, judging and finely segmenting the weakly segmented regions;
and S3, judging and merging the over-segmented regions.
As a further supplement to the technical solution, the step of roughly segmenting the Manchu components comprises:
dividing the Manchu word image into a left part, a middle part and a right part with the central axis of the Manchu word image as the center, wherein the left part ranges from the 1st column to the (bl-1)th column of the Manchu word image and the right part ranges from the (br+1)th column to the Wth column of the Manchu word image; the left part and the right part are projected horizontally, and the projections are denoted pl and pr;
setting a threshold T1, wherein only the rows satisfying Cost(i) ≤ T1 are candidate segmentation rows;
wherein: the segmentation cost function of the ith row is Cost(i) = pl(i) + pr(i), i = 1, 2, …, H, bl is the left boundary of the central axis, br is the right boundary of the central axis, W is the width of the Manchu word image, H is the height of the Manchu word image, and baseline_width is the width of the central axis of the Manchu word image.
As a further supplement to the technical solution, the threshold T1 is set to a fixed fraction of baseline_width determined experimentally (the specific value is given as a formula image in the original document).
as a further supplement to the technical solution, the sequence formed by the candidate segmentation lines is Can _ seg, and the step of deleting redundant candidate segmentation lines in the sequence Can _ seg:
(1) if only 1 candidate cutting line exists in the sequence Can _ seg and the candidate cutting line is the 1 st line, deleting the line; otherwise, turning to the step (2);
(2) searching sub-segment conti _ subseg formed by continuous candidate segmentation lines, and deleting all lines of the sub-segment if the initial line of the sub-segment is the 1 st line or the ending line of the sub-segment is the H th line; otherwise, turning to the step (3), wherein H is the height of the Manchu word image;
(3) in the continuous candidate segmentation subsegment conti _ subseg, replacing all lines of the subsegment with median in the sequence from small to large, and taking the average value of two middle values and then rounding up when even candidate lines exist;
(4) the segmentation line sequence Can _ seg _ new from which the redundant candidate segmentation lines are deleted is output.
As a further supplement to the technical solution, the step of judging and finely segmenting the weakly segmented regions is as follows:
a weak-segmentation decision threshold T_less is set, the height hl of each segmented region between the segmentation rows is calculated, and a weakly segmented region is judged according to the following formula:
hl > T_less × baseline_width
a segmented region whose height satisfies the above formula is judged to be a weakly segmented region; for each weakly segmented region, secondary segmentation is performed using the rough segmentation step for Manchu components with the fine segmentation threshold T2, and the result is stored in the Seg1 sequence.
As a further supplement to the above, the weak-segmentation decision threshold T_less is set to 5, and the fine segmentation threshold T2 is set to a fixed fraction of baseline_width determined experimentally (the specific value is given as a formula image in the original document).
As a further supplement to the technical solution, the over-segmented regions are judged and merged as follows:
an over-segmentation decision threshold T_over is set, the height ho of each segmented region in the Seg1 sequence is calculated, and an over-segmented region is judged according to the following formula:
ho < T_over × baseline_width
a segmented region whose height satisfies the above formula is judged to be an over-segmented region;
the over-segmented regions are merged using the following rules:
(1) if the 1st segmented region is over-segmented, merge it with the 2nd segmented region; otherwise, go to step (2);
(2) if the 2nd-from-last segmented region is over-segmented, merge it with the last segmented region; otherwise, go to step (3);
(3) if the over-segmented region is neither the 1st nor the 2nd from last, calculate the heights h_up and h_lw of the upper and lower adjacent segmented regions; if h_up < h_lw, merge the over-segmented region with the previous segmented region; if h_up > h_lw, merge it with the next segmented region; otherwise, go to step (4);
(4) if the heights of the two adjacent regions above and below the over-segmented region are equal, calculate the number num_up of connected components obtained by merging with the upper region and the number num_lw obtained by merging with the lower region; if num_up < num_lw, merge with the previous segmented region; if num_up > num_lw, merge with the next segmented region;
(5) output the segmentation row sequence after merging the over-segmented regions.
As a further addition to the above, the over-segmentation decision threshold T _ over is set to 1.
As a supplement to the technical solution, the Manchu component normalization includes two steps of Manchu component position normalization and size normalization:
manchu part position normalization: taking the uppermost, lowermost, leftmost and rightmost pixel points of the stroke pixel points of the segmented Manchu component image as boundaries, cutting off the background and reserving the part of the Manchu component with strokes;
manchu component size normalization: images normalized by Manchu parts positions are normalized to the same size.
As a supplement to the technical solution, the Manchu component feature extraction and fusion step: the contour feature, grid feature, directional line-element feature, visual direction feature and affine-invariant distance feature of the normalized Manchu component are extracted respectively, the features are fused, and the dimensionality of the fused feature is reduced by principal component analysis.
As a supplement to the technical solution, the Manchu component recognition step: the dimension-reduced fused features of the Manchu component are classified by a support vector machine classifier with a Gaussian kernel function, thereby realizing the recognition of the Manchu component.
As a supplement to the technical solution, the Manchu component recombination and Manchu word recognition step: for the recognized Manchu components, the recombination from Manchu components to Manchu words is completed according to the recognition results of the vertically adjacent components and the spelling rules of the Manchu letters, whereby the Manchu words are recognized.
Advantageous effects: before recognition, the Manchu components are used as the segmentation units, which greatly reduces the over-segmentation and weak-segmentation phenomena in the Manchu segmentation process.
Drawings
FIG. 1 is a flow diagram of the construction of the Manchu component set;
FIG. 2 is a flow diagram of the Manchu component segmentation;
FIG. 3 is a diagram illustrating examples of errors in extracting the central axis of a Manchu word image by conventional methods;
FIG. 4 is a diagram of determining the central-axis width in Manchu using the region-limited maximum run-length ratio method, in which: (1) an example of an error of the plain maximum run-length ratio method, (2) the search range defined by the invention, (3) the result of the method of the invention;
FIG. 5 is a diagram showing the effect of central-axis extraction by the method of the invention;
FIG. 6 is a flow chart of the Manchu component segmentation;
FIG. 7 is a diagram of Manchu component segmentation results, in which: (1) a weak-segmentation phenomenon, (2) the weakly segmented region after fine segmentation, (3) an over-segmentation phenomenon, (4) the over-segmented region after merging, (5) partial segmentation results.
Detailed Description
From the point of view of optical character recognition technology, Manchu has the following characteristics: (1) The same Manchu letter takes 4 different forms (isolated, initial, medial and final) depending on its position in the word, and the total number of letter forms in Manchu is 114. (2) The words in the same column of a Manchu document all lie near the same central axis, and the words of two adjacent columns of a printed Manchu document basically do not cross, which facilitates column extraction; there are also certain gaps between Manchu words in the same column, which facilitates word extraction. (3) A Manchu word is formed by connecting one or more Manchu letters along a vertical central axis, and there is no gap between the letters within a word; however, the spelling positions of the letters lie on the central axis of the Manchu word image, which makes it possible to segment the Manchu letters using the pixel characteristics along the central axis. (4) Some Manchu letters exhibit the phenomenon of "one shape, multiple letters"; for example, a single character shape (shown as an image in the original document) may represent the letters a, e or n, which can only be distinguished during recognition according to the spelling rules of the adjacent letters. (5) Some Manchu letters share the same components; for example, the initial form of the letter o (shown as an image in the original document) can be regarded as the combination of two parts, the initial form of the letter e and the medial form of the letter o. Therefore, over-segmentation and weak segmentation easily occur when Manchu letters are used as the basic segmentation units. (6) Some letter combinations are not separable; for example, it is very difficult to segment the syllable bo into the letters b and o.
Based on the above characteristics of Manchu, this embodiment proposes the concept of reconstructing Manchu words from components, using the Manchu component (hereinafter referred to as the component) as the basic unit of segmentation and recognition. This solves the over-segmentation and weak-segmentation problems caused by using Manchu letters as the basic segmentation unit. The Manchu component set has 3 sources: Manchu letters, parts of letters, and letter combinations. The purpose of constructing the component set is to reduce recognition errors caused by segmentation: if letters were used as the basic segmentation unit, over-segmentation and weak segmentation would easily occur as analysed above, and a subsequent letter classifier would then misrecognize, or fail to recognize, the over-segmented and weakly segmented parts. The Manchu component set proposed by the invention is designed with the results of the segmentation method as a guide; that is, common over-segmentations (parts of letters or letter combinations) and weak segmentations (letter combinations) are no longer treated as "wrong" but as "correct" segmentations, so that the subsequently designed classifier can recognize these components, thereby reducing the recognition errors caused by segmentation errors.
The Manchu component can be understood by analogy with the recognition of English words. Taking the English word "study" as an example, the whole word can be recognized directly; or the word can be cut into the letters s, t, u, d, y, each letter recognized separately, and the letters then combined into the word "study". Cutting into letters is difficult, whereas cutting into components is relatively easy, for example into st, u, dy (where st, u and dy are all components); the components are then recognized and combined into the word. Manchu, however, cannot be split into components as easily as in this English example, because of the characteristics described above.
As shown in FIG. 1, the Manchu component set is constructed as follows. Referring to the Manchu alphabet, the national standard of the People's Republic of China on the universal multiple-octet coded character set for information technology covering the Sibe and Manchu scripts, and the Mongolian component sets in documents [1-2], an initial Manchu component set (hereinafter the initial set) containing 99 initial components is established, and the Flag of each component is set to 0. The Manchu word images are segmented by the Manchu component segmentation, and the segmentation results are counted and analysed: if a segmented component does not belong to the initial set, it is added to the initial set and its Flag is set to 1; if a segmented component belongs to the initial set, the Flag of the corresponding component is set to 1. The components whose Flag is still 0, i.e. components that never appear in any segmentation result, are then deleted from the initial set. Finally the Manchu component set is sorted and output. The Manchu component set contains 106 components in total, detailed in the attached Table 1. Documents [1-2] mentioned above:
[1] Hongxi Wei, Guanglai Gao. A keyword retrieval system for historical Mongolian document images[J]. International Journal on Document Analysis and Recognition, 2014, 17(1): 33-45.
[2] Liangrui Peng, Changsong Liu, Xiaoqing Ding, Jianming Jin, Youshou Wu, Hua Wang, Yanhua Bao. Multi-font printed Mongolian document recognition system[J]. International Journal on Document Analysis and Recognition, 2010, 13(2): 93-106.
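For illustration, the Flag-based construction procedure described above can be summarized by the following minimal Python sketch; it is not part of the patent, and the representation of components as hashable labels (e.g. strings) is an assumption.

def build_component_set(initial_components, segmented_components):
    """Build the Manchu component set following the flow of FIG. 1.
    initial_components: the initial set (99 components), all flags start at 0.
    segmented_components: components observed in the segmentation results."""
    flags = {c: 0 for c in initial_components}
    for comp in segmented_components:
        flags[comp] = 1              # observed component: mark it; add it if it is new
    # components never observed in any segmentation result are removed
    return sorted(c for c, flag in flags.items() if flag == 1)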
As shown in FIG. 2, the Manchu component segmentation proceeds as follows:
S1, converting a paper Manchu document into a digital image that can be stored and processed by a computer through a photoelectric conversion device, and preprocessing the digital image of the Manchu document (smoothing and binarization);
S2, layout analysis (skew correction, column segmentation and word segmentation);
S3, extracting the Manchu word images;
S4, position normalization;
S5, extracting the central axis;
and S6, segmenting the Manchu components according to their positional relation to the central axis.
The skew correction uses the Hough transform to determine the skew angle of the page and then rotates the image back to an upright text orientation. Column segmentation of the deskewed Manchu document is performed by the vertical projection method, words are segmented by the horizontal projection method, the Manchu words in each column image are extracted, and the Manchu word images are position-normalized. The preprocessing of the Manchu word image is thereby completed; the height and width of the Manchu word image are denoted H and W respectively. Position normalization cuts off the unnecessary white background margins of the Manchu word image. The flow shown in FIG. 2 displays the images inverted (white strokes on a black background), which is done purely for programming convenience; the original images are black characters on a white background.
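For illustration, a minimal Python sketch of projection-based splitting (used here for column and word segmentation) is given below; it is a simplified stand-in for the layout-analysis step, assumes a binary image with text pixels equal to 1, and omits the gap-size thresholds a production implementation would likely need.

import numpy as np

def split_by_projection(binary_img, axis):
    """Split a binary document image (text pixels == 1) into stripes using a
    projection profile: axis=0 sums over rows and yields column boundaries
    (vertical projection); axis=1 sums over columns and yields word/line
    boundaries (horizontal projection)."""
    profile = binary_img.sum(axis=axis)
    is_text = profile > 0                      # positions containing text pixels
    stripes, start = [], None
    for idx, flag in enumerate(is_text):
        if flag and start is None:
            start = idx
        elif not flag and start is not None:
            stripes.append((start, idx - 1))
            start = None
    if start is not None:
        stripes.append((start, len(is_text) - 1))
    return stripes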
In this embodiment, the extraction of the central axis of the Manchu word image directly affects the segmentation accuracy; its specific scheme is described in detail below.
For the extraction of the central axis of the Manchu word image, i.e. step S5, the prior art generally uses the vertical projection method or the maximum cumulative vertical projection method; however, both methods suffer from central-axis positioning offsets and central-axis width estimation errors, as shown in FIG. 3. This embodiment provides the following method for extracting the central axis of the Manchu word image:
S5.1, positioning the central axis of the Manchu word image:
First, the Manchu word image is inverted, so that the pixel value of the character part is 1 and the pixel value of the background is 0. The morphological thinning function of the MATLAB image processing toolbox is used with 3 × 3 structuring-element templates; each template contains 9 pixels and each pixel can only take the value 0 or 1, so there are 512 different template forms, which are divided into 8 directions to perform the morphological thinning of the Manchu word image. The Hough transform is then used to determine the column coordinate corresponding to the thinned central axis of the Manchu word image, i.e. the position of the central axis. In this extraction, the angle of the Hough transform line search is limited to 90 degrees, i.e. only vertical lines are searched; lines at the same column position whose mutual distance is smaller than the height H of the word image and whose length is larger than 1 pixel are connected into a single line, and the center position of the central axis, denoted baseline, is calculated. The central axis of the Manchu word image refers to the column coordinate of the central axis of the Manchu word in the image, not the center line of the image itself.
S5.2, detecting the width of the central axis of the Manchu word image
S5.2.1, the maximum run-length ratio method for the central-axis width: each row of the Manchu word image is scanned, and the run lengths of consecutive black pixels and their frequencies of occurrence are counted; after all rows are scanned, the run length with the highest frequency is taken as the width of the central axis of the Manchu word image, denoted w0. The maximum run-length ratio method is effective for detecting the central-axis width of Manchu word images, but errors still occur, as shown in FIG. 4(1). The reason is that the maximum run-length ratio method performs the run statistics of consecutive black pixels over the whole Manchu word image, and the varying stroke deformations of different Manchu fonts seriously interfere with these global statistics. Statistics on Manchu writing show that the width of the central axis generally does not exceed 1/2 of the word width W; the search area of the maximum run-length ratio method is therefore restricted to the range specified by formula (1), which is called the region-limited maximum run-length ratio method.
sl = baseline - round(W/4), sr = baseline + round(W/4)   (1)
In formula (1), sl is the left boundary of the restricted search range, sr is the right boundary of the restricted search range, baseline is the center position of the central axis, and round represents rounding to the nearest integer. Restricting the search area weakens the statistical interference of the free and branch strokes of Manchu characters on the central-axis width; the maximum run-length ratio method is then applied within the restricted search range to complete the detection of the central-axis width of the Manchu word image, with the result shown in FIG. 4(3).
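For illustration, a minimal Python sketch of the region-limited maximum run-length ratio method is given below; the search bounds sl and sr are taken as inputs, and the sketch is not the implementation of the embodiment.

from collections import Counter

def baseline_width_max_run(word_img, sl, sr):
    """Region-limited maximum run-length ratio method.
    word_img: binary array with text pixels == 1; sl, sr: column bounds of the
    restricted search area from formula (1).  Returns the most frequent run
    length of consecutive text pixels, taken as the central-axis width."""
    runs = Counter()
    for row in word_img[:, sl:sr + 1]:
        run = 0
        for px in row:
            if px:
                run += 1                      # extend the current run of text pixels
            elif run:
                runs[run] += 1                # a run just ended: record its length
                run = 0
        if run:
            runs[run] += 1                    # run touching the right edge of the area
    return runs.most_common(1)[0][0] if runs else 0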
And S5.2.2, the left boundary bl and the right boundary br of the central axis are calculated from the central-axis width baseline_width and the central-axis center position baseline according to formula (2).
bl = baseline - round(baseline_width/2), br = baseline + round(baseline_width/2)   (2)
A total of 400 Manchu word images of different fonts and sizes were randomly selected, and the region-limited maximum run-length ratio method of this embodiment and the vertical projection method were used to extract the central axis; the results are shown in Table 1. Examples of central axes correctly extracted by the method of the invention are shown in FIG. 5. The experimental results show that morphological thinning combined with the Hough transform accurately locates the position of the central axis of the Manchu word image, and the region-limited maximum run-length ratio method correctly determines the width of the central axis of the Manchu word image.
TABLE 1 Statistics of the central-axis extraction results for Manchu word images

                            Method of the invention    Vertical projection method
Number of correct samples   397                        210
Number of wrong samples     3                          190
Accuracy                    99.25%                     52.50%
In this embodiment, the accuracy of Manchu component segmentation is the bottleneck for improving Manchu recognition accuracy; its specific scheme is described in detail below.
The Manchu component segmentation, i.e. step S6, as shown in FIG. 6, includes:
S6.1, rough segmentation of the Manchu components;
S6.2, weak-segmentation judgment and fine segmentation of the candidate segmented regions;
and S6.3, over-segmentation judgment and merging of the candidate segmented regions.
The above steps are specifically explained as follows:
S6.1. Rough segmentation of the Manchu components
Since the Manchu components are connected around the central axis, the Manchu word is divided into 3 parts, left, middle and right, with the central axis as the center. The left part ranges from the 1st column to the (bl-1)th column of the Manchu word, and the right part ranges from the (br+1)th column to the Wth column. The left and right parts are projected horizontally, and the projections are denoted pl and pr, respectively. The segmentation cost function of the ith row is defined as:
Cost(i)=pl(i)+pr(i),i=1,2,…,H (3)
Ideally the cost of a segmentation row should be 0, i.e. apart from the central axis, neither the left part nor the right part has any stroke in that row. In practice, however, because of the noise introduced by preprocessing such as scanning, skew correction and binarization, an overly strict constraint on the segmentation rows would cause a severe weak-segmentation problem. Let T1 be the rough-segmentation threshold for Manchu components; its value was determined through a number of experiments to be
a fixed fraction of baseline_width (the specific value is given as a formula image in the original document).
Only the rows satisfying the condition:
Cost(i)≤T1 (4)
are candidate segmentation rows; the sequence formed by all candidate rows satisfying condition (4) is denoted Can_seg. In the experiment for determining T1, different multiples of baseline_width, all fractions smaller than 1, were selected as T1, the Manchu component segmentation was executed for each, the segmented images were compared, the T1 corresponding to the Manchu word images with the better segmentation effect was selected, and this value was finally taken as T1.
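For illustration, a minimal Python sketch of the rough segmentation step is given below; it assumes a binary image with text pixels equal to 1 and returns 1-based candidate rows, and is not the implementation of the embodiment.

import numpy as np

def rough_candidate_rows(word_img, bl, br, T1):
    """Rough segmentation of Manchu components: compute the horizontal
    projections pl and pr of the parts left and right of the central axis and
    return the (1-based) rows whose cost pl(i) + pr(i) does not exceed T1.
    The text above uses 1-based columns, so columns 1..bl-1 map to indices
    0..bl-2 and columns br+1..W map to indices br..W-1."""
    pl = word_img[:, :bl - 1].sum(axis=1)     # left part, columns 1 .. bl-1
    pr = word_img[:, br:].sum(axis=1)         # right part, columns br+1 .. W
    cost = pl + pr
    return [i + 1 for i, c in enumerate(cost) if c <= T1]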
The following three situations can occur in the candidate segmentation row set obtained by rough segmentation of the Manchu component:
1) the 1st row of the image may appear as a candidate segmentation row, which is obviously unreasonable, so it is deleted from the candidate set;
2) sub-segments of consecutive adjacent rows that start at the 1st row, or end at the last (Hth) row, are unreasonable candidate sub-segments and should be deleted from the candidate set;
3) for any other sub-segment formed by consecutive adjacent rows, only one candidate row in the middle is needed and the rest are redundant; the middle candidate row should replace the whole sub-segment of consecutive adjacent rows.
From the above, Can_seg often contains redundant candidate segmentation rows; they are deleted using the following strategy (a code sketch follows the list below):
(1) if Can_seg contains only 1 candidate segmentation row and that row is the 1st row, delete it; otherwise, go to step (2);
(2) search for the sub-segments conti_subseg formed by consecutive candidate segmentation rows, and delete all rows of a sub-segment if its starting row is the 1st row or its ending row is the Hth row; otherwise, go to step (3);
(3) within each consecutive candidate sub-segment conti_subseg, replace all rows of the sub-segment by their median in ascending order (when the number of candidate rows is even, take the average of the two middle values and round up);
(4) output the new segmentation row sequence Can_seg_new with the redundant candidate rows deleted.
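For illustration, a minimal Python sketch of this pruning strategy is given below; it assumes Can_seg is a sorted list of 1-based row indices and is not the implementation of the embodiment.

import math

def prune_candidate_rows(can_seg, H):
    """Delete redundant candidate segmentation rows following steps (1)-(4).
    can_seg: sorted 1-based candidate row indices; H: image height."""
    if not can_seg:
        return []
    if len(can_seg) == 1 and can_seg[0] == 1:         # step (1)
        return []
    # group consecutive rows into sub-segments conti_subseg
    segments, current = [], [can_seg[0]]
    for r in can_seg[1:]:
        if r == current[-1] + 1:
            current.append(r)
        else:
            segments.append(current)
            current = [r]
    segments.append(current)

    can_seg_new = []
    for seg in segments:
        if seg[0] == 1 or seg[-1] == H:               # step (2): drop the sub-segment
            continue
        n = len(seg)
        if n % 2:                                     # step (3): odd count, keep the median row
            can_seg_new.append(seg[n // 2])
        else:                                         # even count: mean of the middle pair, rounded up
            can_seg_new.append(math.ceil((seg[n // 2 - 1] + seg[n // 2]) / 2))
    return can_seg_new                                # step (4)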
S6.2. Weak-segmentation judgment and fine segmentation of the candidate segmented regions
A roughly segmented Manchu component may still be weakly segmented. Statistics show that the height of a Manchu component does not exceed 5 times baseline_width, so the weak-segmentation decision threshold T_less is set to 5. The height hl of each segmented region in Can_seg_new is calculated, and a region with height hl > (T_less × baseline_width) is judged to be weakly segmented. Each weakly segmented region is segmented a second time using the rough segmentation method with the fine segmentation threshold T2, and the resulting regions are stored in the Seg1 sequence. The fine segmentation threshold T2 relaxes the constraint on the candidate segmentation rows relative to the rough segmentation; its value was determined through a large number of experiments to be a fixed fraction of baseline_width (the specific value is given as formula images in the original document). In the experiment for determining T2, different multiples of baseline_width, all fractions smaller than 1, were selected as T2, the Manchu component segmentation was executed for each, the segmented images were compared, the T2 corresponding to the Manchu word images with the better segmentation effect was selected, and this value was finally taken as T2.
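For illustration, a minimal Python sketch of the weak-segmentation test is given below; regions are assumed to be represented as (top_row, bottom_row) pairs, which is a representational choice, not one specified in the text.

def find_weak_regions(regions, baseline_width, T_less=5):
    """Flag weakly segmented regions: a region taller than
    T_less * baseline_width should be re-segmented with the relaxed
    threshold T2.  regions: list of (top_row, bottom_row) pairs."""
    weak = []
    for top, bottom in regions:
        hl = bottom - top + 1                 # region height
        if hl > T_less * baseline_width:
            weak.append((top, bottom))
    return weak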
S6.3. Over-segmentation judgment and merging of the candidate segmented regions
After the rough and fine segmentation, the Seg1 sequence may still contain over-segmented regions. Statistics show that the height of a Manchu component is generally larger than baseline_width, so the over-segmentation decision threshold T_over is set to 1. The height ho of each segmented region in Seg1 is calculated, and a region with height ho < (T_over × baseline_width) is judged to be over-segmented; the over-segmented regions are then merged, and the merging may proceed as follows:
1) counting from top to bottom, if the first segmented region is judged to be over-segmented, it can only be merged with the 2nd region;
2) counting from bottom to top, if the 2nd-from-last region is judged to be over-segmented, it can only be merged with the last region;
3) if the over-segmented region lies in the middle, its upper and lower neighbours must both be considered: the height h_up obtained by merging with the upper region and the height h_lw obtained by merging with the lower region are calculated, and the merging scheme that gives the smaller height is selected;
4) if the heights obtained by merging with the upper and lower regions are equal, i.e. the choice cannot be made by step 3), the numbers of connected components obtained by merging with the upper and lower regions are calculated, and the merging scheme that gives the smaller number of connected components is selected;
5) the segmentation rows after region merging are output.
To this end, the over-segmented regions are merged using the following rules (a code sketch follows the list below):
(1) If the 1st segmented region is over-segmented, merge it with the 2nd segmented region; otherwise, go to step (2).
(2) If the 2nd-from-last segmented region is over-segmented, merge it with the last segmented region; otherwise, go to step (3).
(3) If the over-segmented region is neither the 1st nor the 2nd from last, calculate the heights h_up and h_lw of the upper and lower adjacent segmented regions. If h_up < h_lw, merge with the previous segmented region; if h_up > h_lw, merge with the next segmented region; otherwise, go to step (4).
(4) If the heights of the two adjacent regions above and below the over-segmented region are equal, calculate the numbers num_up and num_lw of connected components obtained by merging with the upper or lower region respectively. If num_up < num_lw, merge with the previous segmented region; if num_up > num_lw, merge with the next segmented region.
(5) Output the segmentation row sequence after merging the over-segmented regions.
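For illustration, a minimal Python sketch of the merging rules is given below; regions are assumed to be (top_row, bottom_row) pairs ordered top to bottom, and the connected-component tie-break of rule (4) is replaced by a preference for the upper neighbour because it would require the image data.

def merge_over_segmented(regions, baseline_width, T_over=1):
    """Merge over-segmented regions following rules (1)-(5) above."""
    def height(r):
        return r[1] - r[0] + 1

    regions = list(regions)
    i = 0
    while i < len(regions) and len(regions) > 1:
        if height(regions[i]) >= T_over * baseline_width:
            i += 1
            continue
        if i == 0:                                # rule (1): merge with the 2nd region
            j = 1
        elif i >= len(regions) - 2:               # rule (2): merge with the last region
            j = len(regions) - 1 if i == len(regions) - 2 else i - 1
        else:                                     # rules (3)/(4): the shorter neighbour wins
            h_up, h_lw = height(regions[i - 1]), height(regions[i + 1])
            j = i - 1 if h_up <= h_lw else i + 1
        k, m = min(i, j), max(i, j)
        regions[k:m + 1] = [(regions[k][0], regions[m][1])]   # merge the adjacent pair
        i = 0                                     # rescan from the top after each merge
    return regions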
With the above scheme, the Manchu component segmentation result is obtained, as shown in FIG. 7: FIGS. 7(1)-(2) show a weakly segmented region before and after fine segmentation, and FIGS. 7(3)-(4) show an over-segmented region before and after merging.
The segmented Manchu components are further processed to recognize the Manchu components; besides the segmentation of the Manchu word image, the recognition method comprises the following steps:
(1) Manchu component normalization
This includes Manchu component position normalization and size normalization.
Position normalization cuts off the background of the Manchu component image, taking the uppermost, lowermost, leftmost and rightmost stroke pixels as the boundaries, and retains only the part containing strokes. Size normalization scales the position-normalized images to the same size (e.g. 64 × 64 pixels).
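For illustration, a minimal Python sketch of the two normalization steps is given below; it uses scikit-image's resize as a stand-in for whatever scaling the embodiment uses and is not the implementation of the embodiment.

import numpy as np
from skimage.transform import resize

def normalize_component(comp_img, size=(64, 64)):
    """Position and size normalization of a segmented Manchu component.
    comp_img: binary array with stroke pixels == 1.  The stroke bounding box
    is cropped (position normalization) and rescaled to a fixed size
    (size normalization); 64 x 64 is the example size given above."""
    rows = np.any(comp_img, axis=1)
    cols = np.any(comp_img, axis=0)
    top, bottom = np.where(rows)[0][[0, -1]]
    left, right = np.where(cols)[0][[0, -1]]
    cropped = comp_img[top:bottom + 1, left:right + 1]
    # nearest-neighbour resizing keeps the image binary
    return resize(cropped.astype(float), size, order=0, anti_aliasing=False)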
(2) Manchu component feature extraction
First, features commonly used for minority-script character recognition are extracted separately, including: contour features, grid features, directional line-element features, visual direction features and affine-invariant features. These features are then fused, and principal component analysis is used to reduce the dimensionality of the fused features.
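For illustration, a minimal Python sketch of the fusion and dimensionality-reduction step is given below; the individual feature extractors are not reproduced, and the number of principal components (128) is an assumed value, not one specified in the text.

import numpy as np
from sklearn.decomposition import PCA

def fuse_and_reduce(feature_blocks, n_components=128):
    """Fuse several feature matrices and reduce the dimensionality with PCA.
    feature_blocks: list of (n_samples, d_i) arrays, one per feature type
    (contour, grid, directional line-element, visual direction,
    affine-invariant)."""
    fused = np.hstack(feature_blocks)         # concatenate the features per sample
    return PCA(n_components=n_components).fit_transform(fused)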
(3) Manchu component recognition
A support vector machine classifier with a Gaussian kernel function is adopted, and the recognition of a given Manchu component is realized using a "one-versus-rest" multi-classifier combination rule.
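For illustration, a minimal Python sketch of the classifier is given below, using scikit-learn's SVC with an RBF (Gaussian) kernel under a one-versus-rest scheme; the kernel parameters shown are library defaults, not values from the embodiment.

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def train_component_classifier(X_train, y_train):
    """Manchu component classifier: an SVM with a Gaussian (RBF) kernel
    combined under a one-versus-rest rule."""
    clf = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale", C=1.0))
    clf.fit(X_train, y_train)
    return clf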
(4) Manchu component recombination after recognition
For the recognized Manchu components, the recombination from components to words is completed according to the recognition results of the vertically adjacent components and the spelling rules of the Manchu letters, thereby realizing the recognition of the Manchu words.
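For illustration, a minimal Python sketch of the recombination step is given below; the spelling rules are assumed to be supplied as a mapping from tuples of adjacent component labels (strings) to letter sequences, which is a hypothetical representation not specified in the text.

def recompose_word(component_labels, spelling_rules):
    """Recombine recognized components into a Manchu word.
    component_labels: recognized labels in top-to-bottom order;
    spelling_rules: dict mapping tuples of adjacent labels to letter strings."""
    letters, i = [], 0
    while i < len(component_labels):
        # prefer the longest run of adjacent components that matches a rule
        for span in range(len(component_labels) - i, 0, -1):
            key = tuple(component_labels[i:i + span])
            if key in spelling_rules:
                letters.extend(spelling_rules[key])
                i += span
                break
        else:
            letters.append(component_labels[i])   # no rule matches: keep the label
            i += 1
    return "".join(letters)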
Attached table 1:
(The attached Table 1, listing the 106 Manchu components, is given as a series of images in the original document.)

Claims (6)

1. A method for recognizing printed Manchu based on Manchu component segmentation, characterized by comprising the following steps:
S1, segmenting Manchu components;
S2, normalizing the Manchu components;
S3, extracting and fusing Manchu component features;
S4, recognizing the Manchu components;
S5, recombining the Manchu components and recognizing Manchu words;
the segmentation of the Manchu parts comprises two steps of extracting the central axis of the Manchu word image and segmenting the Manchu parts after extracting the central axis of the Manchu word image;
the method for extracting the central axis of the Manchu word image comprises the following steps:
S1.1, positioning the central axis of the Manchu word image;
S1.2, detecting the width of the central axis in the Manchu word image;
step S1.1 specifically includes:
S1.1.1, inverting the Manchu word image so that the pixel value of the character part is 1 and the pixel value of the background part is 0;
S1.1.2, using the morphological thinning function of the MATLAB image processing toolbox to perform morphological thinning of the Manchu word image;
S1.1.3, applying the Hough transform to the morphologically thinned Manchu word image to determine the column coordinate of the thinned central axis, which is taken as the position of the central axis of the Manchu word image; the angle of the Hough transform line search is limited to 90 degrees so that only vertical lines are searched; lines at the same column position whose mutual distance is smaller than the height of the Manchu word image and whose length is larger than 1 pixel are connected into a single line, and the center position of the central axis is calculated;
step S1.2 specifically includes:
S1.2.1, determining the search area of the maximum run-length ratio method;
S1.2.2, applying the maximum run-length ratio method to the Manchu word image within the search area to determine the width of the central axis of the Manchu word image;
S1.2.3, calculating the left and right boundaries of the central axis from the center position of the central axis of the Manchu word image and the width of the central axis;
the step S1.2.1 specifically comprises the following steps:
the search area of the maximum run length ratio method is determined by the range specified by the following formula:
sl = baseline - round(W/4), sr = baseline + round(W/4)
where sl is the left boundary of the defined search range, sr is the right boundary of the defined search range, baseline is the center position of the central axis, round represents rounding to the nearest integer, and W is the width of the Manchu word image.
2. The method for recognizing Manchu characters based on Manchu character component segmentation of claim 1, wherein: the method for segmenting the Manchu component after extracting the central axis of the Manchu word image comprises the following steps:
roughly segmenting the Manchu components;
judging and finely segmenting the weakly segmented regions;
and judging and merging the over-segmented regions.
3. The method for recognizing Manchu characters based on Manchu character component segmentation of claim 1, wherein: manchu component normalization includes two steps of Manchu component position normalization and size normalization:
manchu part position normalization: taking the uppermost, lowermost, leftmost and rightmost pixel points of the stroke pixel points of the segmented Manchu component image as boundaries, cutting off the background and reserving the part of the Manchu component with strokes;
manchu component size normalization: images normalized by Manchu parts positions are normalized to the same size.
4. The method for recognizing Manchu characters based on Manchu character component segmentation of claim 1, wherein the Manchu component feature extraction and fusion step comprises: respectively extracting the contour feature, grid feature, directional line-element feature, visual direction feature and affine-invariant distance feature of the normalized Manchu component, fusing the features, and reducing the dimensionality of the fused feature by principal component analysis.
5. The method for recognizing Manchu characters based on Manchu character component segmentation of claim 1, wherein: a Manchu component identification step: and identifying the dimensionality-reduced fusion features of the Manchu component by using a support vector machine classifier with a Gaussian kernel function, thereby realizing the identification of the Manchu component.
6. The method for recognizing Manchu characters based on Manchu character component segmentation of claim 1, wherein the Manchu component recombination and Manchu word recognition step comprises: for the recognized Manchu components, completing the recombination from Manchu components to Manchu words according to the recognition results of the vertically adjacent components and the spelling rules of the Manchu letters, thereby recognizing the Manchu words.
CN201810371757.XA 2018-04-24 2018-04-24 Manchu component segmentation-based print style Manchu recognition method Active CN108537229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810371757.XA CN108537229B (en) 2018-04-24 2018-04-24 Manchu component segmentation-based print style Manchu recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810371757.XA CN108537229B (en) 2018-04-24 2018-04-24 Manchu component segmentation-based print style Manchu recognition method

Publications (2)

Publication Number Publication Date
CN108537229A CN108537229A (en) 2018-09-14
CN108537229B true CN108537229B (en) 2020-06-02

Family

ID=63478332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810371757.XA Active CN108537229B (en) 2018-04-24 2018-04-24 Manchu component segmentation-based print style Manchu recognition method

Country Status (1)

Country Link
CN (1) CN108537229B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101025791A (en) * 2007-04-06 2007-08-29 清华大学 Printed Monggol language text segmentation method
KR20130047248A (en) * 2011-10-31 2013-05-08 한밭대학교 산학협력단 The input equipment of the manchu script

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868759B (en) * 2015-01-22 2019-11-05 阿里巴巴集团控股有限公司 The method and device of segmented image character
CN108830270B (en) * 2015-09-29 2021-10-08 大连民族大学 Method for positioning axle wire of Manchu word for correctly segmenting each recognized Manchu word
CN205451106U (en) * 2016-03-18 2016-08-10 大连民族大学 Write by hand language of manchus letter collection system
CN106372639B (en) * 2016-08-19 2019-03-08 西安电子科技大学 Block letter Uighur document cutting method based on morphology and integral projection
CN106127266A (en) * 2016-08-29 2016-11-16 大连民族大学 Hand-written Manchu alphabet recognition methods
CN106778752A (en) * 2016-11-16 2017-05-31 广西大学 A kind of character recognition method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101025791A (en) * 2007-04-06 2007-08-29 清华大学 Printed Monggol language text segmentation method
KR20130047248A (en) * 2011-10-31 2013-05-08 한밭대학교 산학협력단 The input equipment of the manchu script

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Manchu text extraction from image backgrounds; Zhu Manqiong et al.; Journal of Dalian Nationalities University; 2014-01-31; Vol. 16, No. 1, pp. 78-81 *
Research and analysis of Manchu recognition technology; Xu Shuang et al.; Journal of Dalian Nationalities University; 2014-09-30; Vol. 16, No. 5, pp. 546-551 *

Also Published As

Publication number Publication date
CN108537229A (en) 2018-09-14

Similar Documents

Publication Publication Date Title
Roy et al. HMM-based Indic handwritten word recognition using zone segmentation
Pal et al. Script line separation from Indian multi-script documents
Razak et al. Off-line handwriting text line segmentation: A review
JP3452774B2 (en) Character recognition method
Pal et al. Identification of different script lines from multi-script documents
US10140556B2 (en) Arabic optical character recognition method using hidden markov models and decision trees
CN108830270B (en) Method for positioning axle wire of Manchu word for correctly segmenting each recognized Manchu word
US8559718B1 (en) Defining a layout of text lines of CJK and non-CJK characters
JP2000315247A (en) Character recognizing device
Din et al. Line and ligature segmentation in printed Urdu document images
CN108596182B (en) Manchu component cutting method
CN108564078B (en) Method for extracting axle wire of Manchu word image
CN102314252A (en) Character segmentation method and device for handwritten character string
Modi et al. Text line detection and segmentation in Handwritten Gurumukhi Scripts
CN108537229B (en) Manchu component segmentation-based print style Manchu recognition method
Lehal et al. Text segmentation of machine-printed Gurmukhi script
Ladwani et al. Novel approach to segmentation of handwritten Devnagari word
CN108596183B (en) Over-segmentation region merging method for Manchu component segmentation
CN108564089B (en) Manchu component set construction method
CN108564139B (en) Manchu component segmentation-based printed style Manchu recognition device
CN108549896B (en) Method for deleting redundant candidate segmentation lines in Manchu component segmentation
Alshameri et al. A combined algorithm for layout analysis of Arabic document images and text lines extraction
Cao et al. Robust page segmentation based on smearing and error correction unifying top-down and bottom-up approaches
Singh et al. Document layout analysis for Indian newspapers using contour based symbiotic approach
Razak et al. A real-time line segmentation algorithm for an offline overlapped handwritten Jawi character recognition chip

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant