CN102867178B - Method and device for Chinese character recognition - Google Patents
- Publication number: CN102867178B
- Application number: CN201110187137.9A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Character Input (AREA)
- Character Discrimination (AREA)
Abstract
The invention provides a device and a method for Chinese character recognition. The device comprises a first recognition unit, an error detection unit, an error correction unit and a second recognition unit. The first recognition unit segments and recognizes a text image to obtain recognition information. The error detection unit uses the position information of the image units and the recognition code information, both contained in the recognition information obtained by the first recognition unit, to detect erroneously segmented image units. The error correction unit corrects the erroneously segmented image units detected by the error detection unit, and the second recognition unit recognizes the image units corrected by the error correction unit to obtain the corresponding recognition code information. Because erroneously segmented image units are detected and corrected using the recognition code information and the image unit position information obtained by recognizing the text image, recognition accuracy can be improved and the problems of the prior art are solved.
Description
Technical Field
The invention relates to a Chinese character recognition technology, in particular to a Chinese character recognition method and a Chinese character recognition device.
Background
As the accuracy of Optical Character Recognition (OCR) improves, OCR is being applied more and more widely, for example in office automation.
FIG. 1 is a schematic diagram of an optical recognition engine of the prior art; FIG. 2 is a schematic diagram of an image cell after a text image is segmented using a segmentation module 101 of an optical recognition engine; FIG. 3 is a schematic diagram of the selection of a sliced image element; FIG. 4 is a schematic diagram of a standard Chinese character with a left-right structure.
As shown in fig. 1, the optical recognition engine mainly includes a segmentation module 101 and a recognition module 102. The segmentation module 101 is configured to segment a text image containing a plurality of characters into image units (segments). In fig. 2, the text image 201 reads "information peripheral"; the segmentation module 101 segments the text image 201 into a plurality of image units 202, which are separated from one another by vertical lines. The recognition module 102 is configured to recognize the image units 202 obtained by the segmentation module 101 so as to obtain editable text 203, as shown in fig. 2.
As shown in fig. 2, a segmentation error may occur when the segmentation module 101 segments a text image. For example, the image of a single character may be segmented into multiple image units: the single character "信" is segmented into "亻" and "言", and the single character "外" is segmented into "夕" and "卜", which ultimately leads to recognition errors.
Because the image unit of a standard Chinese character is recognized with higher similarity than the image unit of a non-standard character, segmentation errors such as a single character being split into several image units can be corrected by combining segmentation with recognition: the recognition similarities are compared and the image unit with the highest recognition similarity is selected, so that the segmentation error is avoided.
For example, after the segmentation module 101 segments the text image 201 into the image units 202, two adjacent image units may be merged in order to avoid segmentation errors. As shown in fig. 3, the image units "亻" and "言" are merged; then "亻", "言" and the merged image unit "信" are each recognized and their recognition similarities are compared. Because "信" is the image unit of a standard Chinese character whereas the split image units are treated as non-standard characters, the recognition similarity of "信" is higher than that of "亻" and "言". The merged image unit "信", which has the highest recognition similarity, is therefore selected to replace the erroneously segmented image units "亻" and "言".
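The merge-and-compare selection described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the engine's actual code: image units are represented as hypothetical `(left, right)` spans, and `recognize` is a hypothetical callable that returns a `(text, similarity)` pair for a span.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: an image unit is its (left, right) extent on the text line.
Span = Tuple[float, float]
# Hypothetical recognizer interface: returns (recognized text, similarity score).
Recognizer = Callable[[Span], Tuple[str, float]]


def select_by_similarity(units: List[Span], recognize: Recognizer) -> List[Span]:
    """Merge two adjacent units when the merged unit scores higher than either part."""
    result: List[Span] = []
    i = 0
    while i < len(units):
        if i + 1 < len(units):
            merged = (units[i][0], units[i + 1][1])
            sim_merged = recognize(merged)[1]
            sim_parts = max(recognize(units[i])[1], recognize(units[i + 1])[1])
            if sim_merged > sim_parts:
                # The merged unit looks more like a standard character: keep it.
                result.append(merged)
                i += 2
                continue
        result.append(units[i])
        i += 1
    return result
```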
As can be seen from the above, a correct segmentation can be obtained by selecting the image unit with the highest similarity; fig. 3 shows an example of such segmentation selection combined with recognition. However, in the process of implementing the invention, the inventor found that this technique has the following defect: as shown in fig. 4, for a standard Chinese character with a left-right structure whose left part and right part are themselves both standard Chinese characters, such as "外" with its parts "夕" and "卜", the optical recognition (OCR) engine segments the character into two image units, and even the above segmentation error correction method cannot resolve the segmentation error, which ultimately causes a recognition error.
Disclosure of Invention
The embodiment of the invention aims to provide a Chinese character recognition method and a Chinese character recognition device, which can improve recognition accuracy and solve the problems in the prior art by detecting a wrongly-split image unit and correcting the wrongly-split image unit by using recognition coding information and image unit position information in recognition information obtained by recognizing a text image.
According to an aspect of an embodiment of the present invention, there is provided a Chinese character recognition apparatus including:
a first recognition unit for segmenting and recognizing a text image to obtain identification information, the identification information comprising position information, in the text image, of a plurality of image units obtained by segmenting the text image, and identification code information obtained by recognizing the image units;
an error detection unit for detecting an erroneously segmented image unit using the identification code information and the position information obtained by the first recognition unit;
an error correction unit for correcting the erroneously segmented image unit detected by the error detection unit;
and a second recognition unit for recognizing the image unit corrected by the error correction unit to obtain corresponding identification code information.
According to another aspect of the embodiments of the present invention, there is provided a method for identifying chinese characters, the method including:
a first identification step, namely segmenting and identifying the text image to obtain identification information; the identification information comprises position information of a plurality of image units in the text image obtained after the text image is segmented and identification coding information obtained by identifying the image units;
an error detection step of detecting an error-segmented image unit using the identification code information and the position information in the identification information;
an error correction step of correcting the detected erroneously cut image unit;
and a second identification step, namely identifying the corrected image unit to obtain corresponding identification code information.
The embodiment of the invention has the beneficial effects that: the text image is segmented and recognized to obtain the recognition information, the recognition coding information and the image unit position information in the recognition information are utilized to search for the image unit which is segmented wrongly, the adjacent image units which are segmented wrongly are combined, and the combined image unit is used for replacing the image unit which is segmented wrongly, so that the recognition precision can be improved, and the problems in the prior art are solved.
Specific embodiments of the present invention are disclosed in detail with reference to the following description and drawings, indicating the manner in which the principles of the invention may be employed. It should be understood that the embodiments of the invention are not so limited in scope. The embodiments of the invention include many variations, modifications and equivalents within the spirit and scope of the appended claims.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments, in combination with or instead of the features of the other embodiments.
It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, integers, steps or components but does not preclude the presence or addition of one or more other features, integers, steps or components.
Drawings
FIG. 1 is a schematic diagram of an optical recognition engine of the prior art;
FIG. 2 is a schematic diagram of an image cell after a text image is segmented using a segmentation module 101 of an optical recognition engine;
FIG. 3 is a schematic diagram of the selection of a sliced image element;
FIG. 4 is a schematic diagram of a standard Chinese character in left and right configuration;
FIG. 5 is a schematic diagram of three fixed-width Chinese characters;
FIG. 6 is a schematic diagram showing the construction of a Chinese character recognition apparatus according to embodiment 1 of the present invention;
FIG. 7 is a schematic diagram showing the configuration of the error detection unit in FIG. 6;
FIG. 8 is a schematic diagram showing the configuration of the word width calculating unit in FIG. 7;
FIG. 9 is a schematic diagram showing the position information of each image unit in embodiment 1 of the present invention;
FIG. 10 is a schematic diagram showing the configuration of the detecting unit in FIG. 7;
FIG. 11 is a schematic diagram showing the configuration of the error correction unit in FIG. 6;
FIG. 12 is a flow chart of a Chinese character recognition method according to embodiment 2 of the present invention;
FIG. 13 is a flowchart of an application example of the Chinese character recognition method according to embodiment 3 of the present invention;
fig. 14 is a schematic diagram of recognition results of a text image recognized by using a conventional OCR technology and a recognition method according to an embodiment of the present invention.
Detailed Description
Various embodiments of the present invention will be described below with reference to the accompanying drawings. These embodiments are merely exemplary and are not intended to limit the present invention. In order to make the principle and the embodiment of the present invention easily understood by those skilled in the art, the embodiment of the present invention will be described by taking the following chinese character recognition apparatus as an example.
In the process of implementing the invention, the inventor found that Chinese typesetting generally uses fixed-width Chinese fonts, which make the typeset result neat, clear and easy to read; fig. 5 shows three examples of fixed-width Chinese characters. Therefore, for a Chinese text image typeset in a fixed-width font, the Chinese character image units in the image can all be considered to have the same width. If this width information is used to detect erroneously segmented image units, the recognition accuracy can be improved; the method is simple and easy to implement, and it solves the above-mentioned problems in the prior art.
fig. 6 is a schematic configuration diagram of a chinese character recognition apparatus according to embodiment 1 of the present invention. As shown in fig. 6, the apparatus includes: a first recognition unit 601, an error detection unit 602, an error correction unit 603, and a second recognition unit 604; wherein,
a first recognition unit 601, configured to segment and recognize a text image to obtain recognition information; the identification information comprises position information of a plurality of image units (Segments) obtained after segmenting the text image in the text image and identification coding information obtained by identifying the image units;
an error detection unit 602, configured to detect an erroneously segmented image unit by using the identification code information and the position information obtained by the first recognition unit 601;
an error correction unit 603 for correcting the erroneously sliced image unit detected by the error detection unit 602;
a second identifying unit 604, configured to identify the image unit modified by the error correcting unit 603, so as to obtain corresponding identification code information.
In this embodiment, the image units obtained after the first recognition unit 601 segments the text image may be represented by rectangles, and the position information of an image unit in the text image may be the coordinate information of that image unit in a one-dimensional direction. The representation is not limited to rectangles; for position information represented in another form, such as a unit represented by its contour (the outer boundary of the image unit), the one-dimensional coordinate information can be obtained by projecting the coordinate information of the contour. The identification code information obtained when the first recognition unit 601 recognizes an image unit is a digital representation of a standard Chinese character; for example, it can be expressed in the Chinese Internal Code Extension Specification (GBK) or in Unicode, with each piece of identification code information corresponding to one Chinese character. After the first recognition unit 601 segments and recognizes the text image, an erroneous segmentation may lead to a final recognition error. As shown in fig. 2, "信" is segmented into "亻" and "言", and "外" is segmented into "夕" and "卜"; in the latter case the two segmented image units each correspond to a standard Chinese character.
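As a rough illustration of the two representations just described, the sketch below projects a contour onto one axis and shows a character's code in Unicode and GBK. It assumes a horizontal text line (so the one-dimensional direction is the x axis) and a contour given as a list of `(x, y)` points; the function names are hypothetical and chosen only for this example.

```python
from typing import Dict, Iterable, Tuple, Union


def project_contour(contour: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Project contour points onto the x axis and return the (left, right) extent."""
    xs = [x for x, _ in contour]
    if not xs:
        raise ValueError("contour must contain at least one point")
    return (min(xs), max(xs))


def recognition_codes(char: str) -> Dict[str, Union[int, bytes]]:
    """Return two digital representations of one character: Unicode code point and GBK bytes."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    return {"unicode": ord(char), "gbk": char.encode("gbk")}
```

Subtracting the two projected coordinates then gives the width of the image unit along the text line.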
The method in the background art can detect and correct some erroneously segmented image units by combining segmentation with recognition, but it cannot handle a Chinese character with a left-right structure whose left part and right part are both standard Chinese characters after segmentation. With the apparatus of embodiment 1 of the present invention, the error detection unit 602 can detect erroneously segmented image units by using the identification code information together with the Average Character Width (ACW) obtained from the position information of the plurality of image units in the text image. The error correction unit 603 may then merge two detected adjacent erroneously segmented image units and replace them with the merged image unit, thereby improving the recognition accuracy and solving the above-mentioned problems in the prior art.
After the error correction unit 603 corrects two adjacent erroneously segmented image units, the second recognition unit 604 may re-recognize the corrected image unit to obtain the identification code information corresponding to it. The identification code information obtained by the first recognition unit 601 and the identification code information obtained by the second recognition unit 604 for the corrected image units are then combined into the final identification code information for the text image.
In this embodiment, the apparatus may further include a text output unit (not shown) for outputting the standard Chinese characters corresponding to the identification code information obtained by the first recognition unit 601 and the second recognition unit 604.
In the above embodiment, the first recognition unit 601 and the second recognition unit 604 may be implemented by using an OCR engine, and the specific segmentation and recognition manner is similar to that of the prior art, and will not be described herein again.
In the above-described embodiments, each unit may be configured using a logic section such as a field programmable logic section, a microprocessor, a processor used in a computer, or the like.
It can be seen from the above embodiment that identification information is obtained by segmenting and recognizing a text image, erroneously segmented image units are found using the identification code information and the image unit position information in the identification information, adjacent erroneously segmented image units are merged, and the merged image unit replaces them, so that the recognition accuracy can be improved and the problems in the prior art can be solved.
In the above-described embodiment, the error detection unit 602 may detect an error-sliced image unit using an Average Character Width (ACW) obtained from position information of a plurality of image units in a text image and identification code information.
Fig. 7 is a schematic diagram of the configuration of the error detection unit shown in fig. 6. As shown in fig. 7, the error detection unit 602 may specifically include a word width calculation unit 701 and a detection unit 702; wherein,
a word width calculation unit 701 for determining an average word width (ACW) using the position information obtained by the first recognition unit 601; a detection unit 702, configured to examine the plurality of image units one by one using the identification code information obtained by the first recognition unit 601 and the average word width, so as to find half-word-width image units (HWS); each half-word-width image unit is a detected erroneously segmented image unit.
Erroneously segmented image units can thus be detected using the identification code information and the average word width derived from the position information. The detection method is simple, and it solves the problem that the prior art cannot detect the segmentation error in which a standard Chinese character with a left-right structure is split into a left part and a right part that are themselves standard Chinese characters.
Fig. 8 is a schematic diagram of the configuration of the word width calculation unit 701 in fig. 7. As shown in fig. 8, the word width calculation unit 701 includes: a width calculation unit 801, a sorting unit 802, and a word width determination unit 803; wherein,
a width calculation unit 801 for calculating the width of each image unit using the position information obtained by the first recognition unit 601; a sorting unit 802, configured to place the widths of all the image units in an array and sort them; a word width determination unit 803 for taking the median of the array as the average word width.
In the above embodiment, if the image units are represented by rectangles, the position information of each image unit in the text image can be represented by the coordinate information of each image unit in the one-dimensional direction, so that the width of each image unit is calculated by the coordinate values of the two end points of the side length of the rectangle in which each image unit is located in the one-dimensional direction.
Fig. 9 is a schematic diagram (in cm) of the position information of each image unit in embodiment 1 of the present invention. As shown in fig. 9, for example, the two end points of the side of the rectangle enclosing the first image unit have the coordinate values 0 and 0.5, those of the second image unit have the coordinate values 0.5 and 1.5, those of the third image unit have the coordinate values 1.5 and 3.0, and so on. The width calculation unit 801 can thus calculate the width of each image unit from the coordinate values: the width of the first image unit is 0.5, the width of the second is 1.0, the width of the third is 1.5, and so on for all image units.
The sorting unit 802 places the widths of all image units in an array and sorts them, either in ascending or in descending order. The word width determination unit 803 takes the median of the sorted array as the average word width (ACW): if the number of widths is odd, the median is the middle value; if the number of widths is even, the median is the mean of the two middle values.
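The width calculation, sorting and median steps can be sketched as follows. This is a minimal illustration, assuming each image unit is given as a hypothetical `(left, right)` pair of one-dimensional coordinates; the function names are chosen for this example and do not come from the patent.

```python
from typing import List, Sequence, Tuple

Span = Tuple[float, float]  # hypothetical: (left, right) coordinates of one image unit


def unit_widths(spans: Sequence[Span]) -> List[float]:
    """Width of each image unit, computed from its two end-point coordinates."""
    return [right - left for left, right in spans]


def average_word_width(spans: Sequence[Span]) -> float:
    """Median of the sorted widths, used as the average word width (ACW)."""
    widths = sorted(unit_widths(spans))
    if not widths:
        raise ValueError("at least one image unit is required")
    mid = len(widths) // 2
    if len(widths) % 2 == 1:
        return widths[mid]  # odd count: the middle value
    return (widths[mid - 1] + widths[mid]) / 2  # even count: mean of the two middle values
```

With the three units from the Fig. 9 example, `(0, 0.5)`, `(0.5, 1.5)` and `(1.5, 3.0)`, the widths are 0.5, 1.0 and 1.5, and the median is 1.0. Using the median rather than the arithmetic mean keeps a few unusually narrow or wide units from skewing the estimate.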
Fig. 10 is a schematic diagram of the configuration of the detection unit in fig. 7. As shown in fig. 10, for detecting one image unit using the identification code information and the average word width, the detection unit 702 includes a first judgment unit 1001, a second judgment unit 1002, and a first determination unit 1003. The first judgment unit 1001 is configured to judge whether the first candidate identification code corresponding to the image unit under detection is a standard Chinese character code; the first candidate identification code is the first piece of code information in the identification code information, and the standard Chinese character it corresponds to is the recognized Chinese character closest to the actual one. The second judgment unit 1002 is configured to judge, when the judgment result of the first judgment unit 1001 is yes, whether the width of the image unit is smaller than the product of the average word width and a predetermined parameter; the predetermined parameter may be any value greater than 0 and smaller than 1, for example 2/3. The first determination unit 1003 is configured to determine, when the judgment result of the second judgment unit 1002 is yes, that the image unit under detection is a half-word-width image unit (HWS), and to take it as a detected erroneously segmented image unit.
Thus, all the image units can be detected by the detection unit, and finally all the image units which are cut by mistake are obtained.
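The two-stage check performed for each image unit might look like the sketch below. Several details are assumptions made for illustration only: the `ImageUnit` class and its candidate list are hypothetical, and "standard Chinese character code" is approximated by membership in the CJK Unified Ideographs block (U+4E00 to U+9FFF), which the patent does not specify. The default ratio of 2/3 is the example value given in the text.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ImageUnit:
    """Hypothetical container for one segmented image unit."""
    left: float
    right: float
    candidates: List[str]  # recognition candidates, best candidate first

    @property
    def width(self) -> float:
        return self.right - self.left


def is_standard_chinese(char: str) -> bool:
    # Assumption: treat the CJK Unified Ideographs block as "standard Chinese characters".
    return len(char) == 1 and "\u4e00" <= char <= "\u9fff"


def find_half_width_units(units: Sequence[ImageUnit], acw: float,
                          ratio: float = 2 / 3) -> List[int]:
    """Return the indices of units judged to be half-word-width (HWS) units."""
    if not 0 < ratio < 1:
        raise ValueError("ratio must be greater than 0 and smaller than 1")
    flagged = []
    for index, unit in enumerate(units):
        # First judgment: is the first candidate a standard Chinese character?
        if not unit.candidates or not is_standard_chinese(unit.candidates[0]):
            continue
        # Second judgment: is the unit narrower than ACW times the predetermined parameter?
        if unit.width < acw * ratio:
            flagged.append(index)
    return flagged
```

The first judgment keeps naturally narrow non-Chinese symbols, such as digits or punctuation, from being flagged solely because of their width.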
Fig. 11 is a schematic diagram of the configuration of the error correction unit in fig. 6. As shown in fig. 11, the error correction unit 603 includes: a merging unit 1101 and a replacing unit 1102; wherein,
a merging unit 1101 for merging two adjacent half-word wide image units detected by the detection unit; a replacing unit 1102, configured to replace the two adjacent half-word wide image units with the image unit merged by the merging unit 1101.
In this way, erroneously segmented image units can be corrected, in particular those arising from a Chinese character with a left-right structure whose left and right parts are both standard Chinese characters, and the recognition accuracy is ultimately improved.
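The merge-and-replace correction can be sketched as follows, assuming image units are hypothetical `(left, right)` spans and that the indices of the half-word-width units are already known. Leaving an isolated half-word-width unit (one with no flagged neighbour) unchanged is an assumption of this sketch; the text does not say how that case is handled.

```python
from typing import Iterable, List, Sequence, Tuple

Span = Tuple[float, float]  # hypothetical: (left, right) coordinates of one image unit


def merge_half_width_pairs(spans: Sequence[Span],
                           half_width_indices: Iterable[int]) -> Tuple[List[Span], List[int]]:
    """Replace each pair of adjacent half-word-width units with one merged unit.

    Returns the corrected spans and the positions (in the corrected list) of the
    merged units, which are the ones that need to be recognized again.
    """
    flagged = set(half_width_indices)
    corrected: List[Span] = []
    merged_positions: List[int] = []
    i = 0
    while i < len(spans):
        if i in flagged and (i + 1) in flagged:
            # Merge: from the left edge of the first unit to the right edge of the second.
            merged_positions.append(len(corrected))
            corrected.append((spans[i][0], spans[i + 1][1]))
            i += 2
        else:
            corrected.append(spans[i])
            i += 1
    return corrected, merged_positions
```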
In the above-described embodiments, each unit may be configured using a logic section such as a field programmable logic section, a microprocessor, a processor used in a computer, or the like.
It can be seen from the above embodiment that identification information is obtained by segmenting and recognizing a text image, erroneously segmented image units are found using the identification code information and the image unit position information in the identification information, adjacent erroneously segmented image units are merged, and the merged image unit replaces them, so that the recognition accuracy can be improved and the problems in the prior art can be solved.
Fig. 12 is a flowchart of a method for identifying chinese characters according to embodiment 2 of the present invention. As shown in fig. 12, the method includes:
step 1201, a first identification step, namely segmenting and identifying the text image to obtain identification information; the identification information comprises position information of a plurality of image units in the text image obtained after the text image is segmented and identification coding information obtained by identifying the image units;
step 1202, an error detection step, which detects an error-segmented image unit by using the identification code information and the position information in the identification information;
step 1203, an error correction step, correcting the detected erroneously segmented image unit;
step 1204, a second identification step, namely identifying the corrected image unit to obtain corresponding identification code information.
It can be seen from the above embodiment that identification information is obtained by segmenting and recognizing a text image, erroneously segmented image units are found using the identification code information and the image unit position information in the identification information, adjacent erroneously segmented image units are merged, and the merged image unit replaces them, so that the recognition accuracy can be improved and the problems in the prior art can be solved.
In this embodiment, in step 1202, the erroneously segmented image units can be detected by using the identification code information and the Average Character Width (ACW) obtained from the position information of the plurality of image units in the text image, specifically as follows:
determining an average word width using the position information; and detecting the plurality of image units one by using the identification coding information and the average word width to find a half-word-width image unit, wherein the half-word-width image unit is the detected image unit which is segmented by errors.
The specific method for determining the average word width by using the position information is as described in embodiment 1, and may include: calculating the width of each image unit by using the position information; arranging the widths of all the image units in an array and sequencing the widths; the median in the array is taken as the average word width.
Further, when detecting one image unit of the plurality of image units using the identification code information and the average word width, the following manner may be specifically adopted:
judging whether the first candidate identification code corresponding to the detected image unit is a standard Chinese character code or not; if yes, further judging whether the width of the image unit is smaller than the product of the average word width and a preset parameter, wherein the preset parameter is a numerical value smaller than 1 and larger than 0; and if so, determining that the detected image unit is a half-word-width image unit, and taking the half-word-width image unit as an image unit which is segmented incorrectly.
In this embodiment, in step 1203, the adjacent half-word wide image unit may be corrected as follows: merging the detected two adjacent half-word-width image units; and replacing the two adjacent half word wide image units with the merged image unit.
The following describes the method for identifying Chinese characters according to the present invention with reference to specific examples. Fig. 13 is a flowchart of an application example of the method for recognizing a chinese character according to embodiment 3 of the present invention, and fig. 14 is a schematic diagram of a recognition result of a text image recognized by using the conventional OCR technology and the recognition method according to the embodiment of the present invention, respectively.
As shown in fig. 13, the method may include the steps of:
step 1301, segmenting and identifying the text image to obtain identification information; the identification information comprises position information of a plurality of image units in the text image obtained after the text image is segmented and identification coding information obtained by identifying the image units;
the identification code information and the position information are as described in embodiment 1, and are not described herein again.
Step 1302, determining an average word width by using the position information obtained in step 1301;
wherein, specifically include: calculating the width of each image unit by using the position information; arranging the widths of all the image units in an array and sequencing the widths; the median in the array is taken as the average word width.
Step 1303, detecting the multiple image units one by using the identification coding information and the average word width to find out half-word-width image units, wherein the half-word-width image units are detected image units which are segmented by errors;
when detecting one image unit of the plurality of image units, the following method can be adopted: judging whether the first candidate identification code corresponding to the detected image unit is a standard Chinese character code or not; if yes, further judging whether the width of the image unit is smaller than the product of the average word width and a preset parameter, wherein the preset parameter is a numerical value smaller than 1 and larger than 0; and if so, determining that the detected image unit is a half-word-width image unit, and taking the half-word-width image unit as an image unit which is segmented incorrectly.
Step 1304, merging two adjacent detected half-word-width image units.
Step 1305, replacing the two adjacent half-word-width image units with the merged image unit.
Step 1306, identifying the corrected image unit to obtain the corresponding identification code information.
Step 1307, outputting the Chinese characters corresponding to all of the identification code information after correction;
wherein all of the identification code information may include the identification code information obtained in step 1301 for the image units that were not corrected, and the identification code information obtained in step 1306 for the corrected image units.
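Steps 1302 and 1303 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the `ImageUnit` record, the `is_chinese_code` test (which approximates a "standard Chinese character code" by the CJK Unified Ideographs range U+4E00 to U+9FFF), the default parameter value 0.7, and the sample data are all hypothetical. The text only requires the parameter to lie strictly between 0 and 1, and it does not say how the median is taken for an even number of units; `statistics.median` averages the two middle values in that case.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ImageUnit:
    left: int   # x-coordinate of the left edge (hypothetical position format)
    right: int  # x-coordinate of the right edge
    code: str   # first candidate recognition result for this unit

    @property
    def width(self) -> int:
        return self.right - self.left


def is_chinese_code(code: str) -> bool:
    # Assumption: approximate "standard Chinese character code" by the
    # CJK Unified Ideographs block.
    return len(code) == 1 and "\u4e00" <= code <= "\u9fff"


def average_word_width(units):
    # Step 1302: the "average word width" is the median of the unit widths.
    return median(u.width for u in units)


def find_half_width_units(units, ratio=0.7):
    # Step 1303: return the indices of units whose first candidate is a
    # Chinese character and whose width is below ratio * average word width.
    if not 0 < ratio < 1:
        raise ValueError("ratio must be greater than 0 and smaller than 1")
    if not units:
        return []
    threshold = average_word_width(units) * ratio
    return [
        i for i, u in enumerate(units)
        if is_chinese_code(u.code) and u.width < threshold
    ]


# Hypothetical line in which one character was split into two narrow units.
units = [
    ImageUnit(0, 40, "中"),
    ImageUnit(44, 84, "文"),
    ImageUnit(88, 106, "夕"),
    ImageUnit(108, 128, "卜"),
    ImageUnit(132, 172, "字"),
]
print(find_half_width_units(units))  # -> [2, 3]
```

Using the median rather than the arithmetic mean keeps a few narrow, wrongly split units from pulling the reference width down.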
As shown in Fig. 14, when recognition is performed using conventional OCR technology, the result is as shown at 1401: the character "outer" is split and recognized as two characters, "outer" and "halter", and the character "letter" is likewise recognized as two characters, "alpha" and "Yu", which lowers recognition accuracy. With the recognition method according to the embodiment of the present invention, the text image can be recognized accurately and the occurrence of recognition errors can be reduced; see 1402.
It can be seen from the above embodiment that identification information is obtained by segmenting and identifying a text image; erroneously segmented image units are found by using the identification code information and the image unit position information in the identification information; adjacent erroneously segmented image units are merged; and the merged image unit replaces the erroneously segmented image units. Recognition accuracy can thereby be improved, and the problems in the prior art can be solved.
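The merge-and-replace correction summarized above can be sketched as follows. This is an illustration under assumptions rather than the patented implementation: the `Box` record and the function names are hypothetical, and the greedy left-to-right pairing is a choice made for the example, since the text does not say how a run of three or more consecutive half-word-width units is paired.

```python
from dataclasses import dataclass


@dataclass
class Box:
    left: int
    top: int
    right: int
    bottom: int


def merge_boxes(a: Box, b: Box) -> Box:
    # The merged unit is the smallest box that covers both units.
    return Box(min(a.left, b.left), min(a.top, b.top),
               max(a.right, b.right), max(a.bottom, b.bottom))


def merge_adjacent_half_width(boxes, flagged):
    """Replace each pair of adjacent flagged units with one merged unit.

    boxes   -- image unit boxes in reading order
    flagged -- set of indices detected as half-word-width units
    Returns the corrected list of boxes and the indices (in that list)
    of the merged units, which are the ones to recognize a second time.
    """
    corrected, to_recognize = [], []
    i = 0
    while i < len(boxes):
        if i in flagged and (i + 1) in flagged:
            corrected.append(merge_boxes(boxes[i], boxes[i + 1]))
            to_recognize.append(len(corrected) - 1)
            i += 2  # both half-width units are consumed by the merge
        else:
            corrected.append(boxes[i])  # unchanged, including lone flagged units
            i += 1
    return corrected, to_recognize
```

Only the units listed in `to_recognize` need to go through the second recognition pass; the results for all other units can be reused from the first pass.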
The above devices and methods of the present invention can be implemented by hardware, or by a combination of hardware and software. The present invention also relates to a computer-readable program which, when executed by a logic unit, enables the logic unit to implement the above-described apparatus or its constituent parts, or to carry out the above-described methods or steps. The logic unit is, for example, a field programmable logic unit, a microprocessor, or a processor used in a computer. The present invention also relates to a storage medium for storing the above program, such as a hard disk, a magnetic disk, an optical disk, a DVD, or a flash memory.
While the invention has been described with reference to specific embodiments, it will be apparent to those skilled in the art that these descriptions are illustrative and not intended to limit the scope of the invention. Various modifications and alterations of this invention will become apparent to those skilled in the art based upon the spirit and principles of this invention, and such modifications and alterations are also within the scope of this invention.
Claims (8)
1. A Chinese character recognition apparatus, the apparatus comprising:
a first identification unit for segmenting and identifying a text image to obtain identification information, the identification information comprising position information of a plurality of image units obtained by segmenting the text image and identification code information obtained by identifying the image units;
an error detection unit for detecting an erroneously segmented image unit using the identification code information and the position information obtained by the first identification unit;
an error correction unit for correcting the erroneously segmented image unit detected by the error detection unit;
a second identification unit for identifying the image unit corrected by the error correction unit to obtain corresponding identification code information,
wherein the error detection unit includes:
a word width calculation unit for determining an average word width using the position information obtained by the first identification unit;
a detection unit for detecting the plurality of image units one by one, using the identification code information obtained by the first identification unit and the average word width, to find half-word-width image units, wherein a half-word-width image unit is an image unit detected as erroneously segmented.
2. The apparatus of claim 1, wherein the word width calculation unit comprises:
a width calculation unit for calculating a width of each image unit using the position information obtained by the first identification unit;
a sorting unit for placing the widths of all the image units in an array and sorting the array;
a word width determination unit for taking the median of the array as the average word width.
3. The apparatus according to claim 1, wherein the detection unit, when detecting an image unit using the identification code information and the average word width, comprises:
a first judging unit for judging whether the first candidate identification code corresponding to the detected image unit is a standard Chinese character code;
a second judging unit for judging, when the judgment result of the first judging unit is yes, whether the width of the image unit is smaller than the product of the average word width and a predetermined parameter, wherein the predetermined parameter is a value greater than 0 and smaller than 1;
a first determination unit for determining that the detected image unit is a half-word-width image unit when the judgment result of the second judging unit is yes.
4. The apparatus of claim 1, wherein the error correction unit comprises:
a merging unit configured to merge two adjacent half-word-width image units detected by the detection unit;
a replacing unit for replacing the two adjacent half-word-width image units with the image unit merged by the merging unit.
5. A method of Chinese character recognition, the method comprising:
a first identification step of segmenting and identifying a text image to obtain identification information, the identification information comprising position information of a plurality of image units obtained by segmenting the text image and identification code information obtained by identifying the image units;
an error detection step of detecting an erroneously segmented image unit using the identification code information and the position information in the identification information;
an error correction step of correcting the detected erroneously segmented image unit;
a second identification step of identifying the corrected image unit to obtain corresponding identification code information,
wherein the error detection step comprises:
determining an average word width using the position information;
detecting the plurality of image units one by one, using the identification code information and the average word width, to find half-word-width image units, wherein a half-word-width image unit is an image unit detected as erroneously segmented.
6. The method of claim 5, wherein the detecting an image unit using the identification code information and the average word width comprises:
judging whether the first candidate identification code corresponding to the detected image unit is a standard Chinese character code or not;
if the judgment result is yes, further judging whether the width of the image unit is smaller than the product of the average word width and a predetermined parameter, wherein the predetermined parameter is a value greater than 0 and smaller than 1;
and if so, determining that the detected image unit is a half-word-width image unit.
7. The method of claim 5, wherein determining the average word width using the position information comprises:
calculating the width of each image unit by using the position information;
placing the widths of all the image units in an array and sorting the array;
taking the median of the array as the average word width.
8. The method of claim 5, wherein the error correction step comprises:
merging two adjacent detected half-word-width image units;
and replacing the two adjacent half word width image units with the merged image unit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110187137.9A CN102867178B (en) | 2011-07-05 | 2011-07-05 | Method and device for Chinese character recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110187137.9A CN102867178B (en) | 2011-07-05 | 2011-07-05 | Method and device for Chinese character recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102867178A CN102867178A (en) | 2013-01-09 |
CN102867178B CN102867178B (en) | 2015-06-10 |
Family
ID=47446042
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110187137.9A (Active) CN102867178B (en) | 2011-07-05 | 2011-07-05 | Method and device for Chinese character recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102867178B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104134064B (en) * | 2013-05-02 | 2018-05-04 | 百度国际科技(深圳)有限公司 | Character recognition method and device |
CN103400132B (en) * | 2013-07-02 | 2017-08-25 | Tcl集团股份有限公司 | A kind of character segmentation method and device |
CN109977713A (en) * | 2017-12-27 | 2019-07-05 | 田雪松 | Dot matrix code identification method and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1273401A (en) * | 1999-05-06 | 2000-11-15 | 富士通株式会社 | Character recognition device |
CN1916938A (en) * | 2005-08-18 | 2007-02-21 | 富士通株式会社 | Identifying distance regulator and method thereof and text lines identifier and method thereof |
CN101251892A (en) * | 2008-03-07 | 2008-08-27 | 北大方正集团有限公司 | Method and apparatus for cutting character |
CN101872344A (en) * | 2009-04-27 | 2010-10-27 | 上海百测电气有限公司 | Control method for image scanning |
CN102024138A (en) * | 2009-09-15 | 2011-04-20 | 富士通株式会社 | Character identification method and character identification device |
CN102096814A (en) * | 2009-12-15 | 2011-06-15 | 富士通株式会社 | Font element determining device and font element determining method |
Also Published As
Publication number | Publication date |
---|---|
CN102867178A (en) | 2013-01-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108595410B (en) | Automatic correction method and device for handwritten composition | |
CN113158808B (en) | Method, medium and equipment for Chinese ancient book character recognition, paragraph grouping and layout reconstruction | |
US20150095769A1 (en) | Layout Analysis Method And System | |
US8565474B2 (en) | Paragraph recognition in an optical character recognition (OCR) process | |
CN106255979B (en) | line segmentation method | |
US20180089835A1 (en) | Image processing apparatus for identifying region within image, information processing method, and storage medium | |
CN115984859B (en) | Image character recognition method, device and storage medium | |
CN102867178B (en) | Method and device for Chinese character recognition | |
EP2541468B1 (en) | Method of and device for identifying the direction of characters in an image block | |
US11551461B2 (en) | Text classification | |
US9152876B1 (en) | Methods and systems for efficient handwritten character segmentation | |
CN108596182B (en) | Manchu component cutting method | |
CN108564078B (en) | Method for extracting axle wire of Manchu word image | |
JP2004046723A (en) | Method for recognizing character, program and apparatus used for implementing the method | |
US11354890B2 (en) | Information processing apparatus calculating feedback information for partial region of image and non-transitory computer readable medium storing program | |
CN117373050B (en) | Method for identifying drawing pipeline with high precision | |
US11710331B2 (en) | Systems and methods for separating ligature characters in digitized document images | |
CN108564139B (en) | Manchu component segmentation-based printed style Manchu recognition device | |
CN108564089B (en) | Manchu component set construction method | |
CN102236638A (en) | Method and device for correcting capital and lowercase forms of characters in western language words | |
KR930012142B1 (en) | Individual character extracting method of letter recognition apparatus | |
CN105787415B (en) | Document image processing device and method and scanner | |
CN116386053A (en) | Electronic drawing character recognition result calibration method, system, equipment and medium | |
JPH07319998A (en) | Method for segmenting character | |
CN114048524A (en) | Multi-direction text comparison method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |