CN113673511B - Character segmentation method based on OCR - Google Patents

Character segmentation method based on OCR

Info

Publication number
CN113673511B
CN113673511B (application CN202110869780.3A)
Authority
CN
China
Prior art keywords
character
characters
standard
segmentation
ocr
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110869780.3A
Other languages
Chinese (zh)
Other versions
CN113673511A (en)
Inventor
Qin Yinghua (秦应化)
Li An (李安)
Wu Kun (吴昆)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Dinnar Automation Technology Co Ltd
Original Assignee
Suzhou Dinnar Automation Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Dinnar Automation Technology Co Ltd filed Critical Suzhou Dinnar Automation Technology Co Ltd
Priority to CN202110869780.3A
Publication of CN113673511A
Application granted
Publication of CN113673511B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Abstract

The invention relates to an OCR-based character segmentation method, which comprises the following steps. Step 1: a template word stock is acquired based on OCR technology, the template word stock comprising standard characters and the feature data of the standard characters. Step 2: a part of the characters in the same batch as the characters to be recognized is recognized by a character recognition model in the OCR technology to obtain a character segmentation result, error items in the segmentation result are marked manually, and the character recognition model is updated. Step 3: the characters to be recognized are scanned and initially recognized based on the updated character recognition model, and a character is forcibly segmented when the score of its recognition result is smaller than a first threshold. Step 4: normalization processing. Step 5: a certain character is matched with the standard characters according to the normalized feature data, the standard character with the highest score is computed, and the segmentation position of the current character is determined based on that standard character. The invention improves the accuracy of character segmentation.

Description

Character segmentation method based on OCR
Technical Field
The invention relates to the field of optical character recognition, in particular to a character segmentation method based on OCR.
Background
In fields involving optical character recognition, such as printed characters and laser marking, OCR (Optical Character Recognition) plays an important role. At present, almost every product carries a production batch number and similar information, and OCR technology is generally used to ensure product traceability. However, when text is actually printed, differences in the printing environment (inconsistencies between printing devices, printing while in motion, and so on) can cause character deformation, spacing changes, size changes and the like. After a model is trained on the standard characters of a conventional OCR character library, these conditions are likely to make the model merge two characters into one or cut one character into two, which reduces the recognition rate.
Therefore, it is an urgent technical problem to be solved by those skilled in the art to provide an OCR-based character segmentation method which is simple in operation and can improve the recognition rate of subsequent characters.
Disclosure of Invention
The invention provides a character segmentation method based on OCR, which aims to solve the technical problem.
In order to solve the above technical problem, the present invention provides an OCR-based character segmentation method, including the steps of:
step 1, data collection: acquiring a template word stock based on an OCR technology, wherein the template word stock comprises standard characters and characteristic data of the standard characters, and the characteristic data at least comprises the gray scale, the size, the length-width ratio, the area gravity center, the area and the space of the standard characters;
step 2, manual marking: recognizing a part of characters in the same batch with the characters to be recognized by utilizing a character recognition model in an OCR technology to obtain a character segmentation result, manually checking the segmentation result, marking error items in the segmentation result, recording the error items and corresponding feature data into a template word stock, and manually modifying the weight of each feature data in the character recognition model according to the updated template word stock to obtain an updated character recognition model;
step 3, pre-segmentation: scanning the character to be recognized, initially recognizing the character to be recognized based on the updated character recognition model, and forcibly segmenting the character when the score of the recognition result of a certain character is smaller than a first threshold value;
step 4, normalization treatment: normalizing the characteristic data of the pre-divided characters and the characteristic data in the template word stock;
step 5, fine adjustment of the segmentation position: matching a certain character with the standard character according to the characteristic data after normalization processing, calculating to obtain the standard character with the highest score, and determining the segmentation position of the current character based on the standard character with the highest score.
Preferably, in step 1, the method for obtaining the template word stock based on the OCR technology includes: and collecting the pictures of the standard characters, and obtaining the template word stock by utilizing the OCR technology for segmentation.
Preferably, in step 2, the number of characters from the same batch as the characters to be recognized that are recognized by the character recognition model is 20-1000.
Preferably, in step 2, the manually modifying the weight of each feature data in the character recognition model according to the updated template word stock includes: and counting the numerical distribution of each kind of characteristic data according to the updated template word stock, and manually modifying the weight based on the stable interval and the change rule of each kind of characteristic data.
Preferably, in step 3, line scanning the characters to be identified comprises: setting one pixel as the scanning width and scanning each line of characters.
Preferably, in step 3, the basis for forcibly dividing the character is as follows: the number of characters per line is made to coincide with the number of standard characters that a line can accommodate.
Preferably, in step 5, before matching a certain character with the standard character according to the normalized feature data, feature points of the character are obtained and filtered.
Preferably, the condition for filtering the feature points of the character includes: the size of the feature point is less than a second threshold.
Compared with the prior art, the character segmentation method based on OCR provided by the invention has the following advantages:
1. according to the invention, a large number of samples can be obtained in the conventional OCR technology to form a template word stock and a trained character recognition model, a collection process of a large number of samples and a model training process are not needed, and the algorithm flow is greatly simplified;
2. the method only needs to take 'a part of characters in the same batch with the characters to be recognized' as samples to test the small batch of data of the character recognition model, manually distributes the weight of the characteristic parameters of the character recognition model according to the test result, and can greatly improve the accuracy of character segmentation on the premise of not increasing a large number of samples;
3. in the invention, the weight of the characteristic parameter is converted from the static parameter in the prior art into the dynamic parameter, so that the characteristic parameter is closer to the actual production condition, and the character segmentation accuracy is improved;
4. the characteristic data of the character obtained after the dynamic parameter segmentation can be provided for a certain proportion weight of a subsequent recognition algorithm to achieve more stable recognition rate;
5. the characters are pre-segmented (forcibly segmented) according to the number of characters a line can accommodate, which avoids character adhesion; the highest-scoring standard character is selected from the standard characters as the basis for segmenting the current character, which ensures the match between the current character and the standard character and thereby ensures accurate segmentation.
Drawings
FIG. 1 is a flowchart of an OCR-based character segmentation method according to an embodiment of the present invention.
FIGS. 2-6 are line graphs of character analysis features in accordance with one embodiment of the present invention.
Detailed Description
In order to more thoroughly express the technical scheme of the invention, the following specific examples are listed to demonstrate the technical effect; it is emphasized that these examples are intended to illustrate the invention and are not to be construed as limiting the scope of the invention.
The character segmentation method based on OCR provided by the invention, as shown in figure 1, comprises the following steps:
step 1, data collection: the method for acquiring the template word stock based on the OCR technology comprises the following steps of acquiring a template word stock based on the OCR technology, wherein the template word stock comprises standard characters and characteristic data of the standard characters, and the characteristic data at least comprises gray scale, size, aspect ratio, area gravity center, area and space of the standard characters, and particularly, the method for acquiring the template word stock based on the OCR technology can comprise the following steps: collecting the image of the standard character, obtaining the template word stock by utilizing the OCR technology for segmentation, for example, scanning the image with the standard character by utilizing a character recognition model in the OCR technology so as to collect the standard character, and then obtaining the characteristic data of the standard character.
Step 2, manual marking: a part of the characters in the same batch as the characters to be recognized is recognized by a character recognition model in the OCR technology to obtain a character segmentation result. Preferably, characters in the same batch share a uniform style, font size, deformation amount and the like; the same batch may specifically refer to the same production batch, the same kind of product, the same specification, and so on. The number of same-batch characters recognized by the character recognition model is 20-1000, far fewer than the training samples required by traditional OCR technology; yet because same-batch characters share a uniform style, the accuracy of character segmentation within the batch can be greatly improved. The segmentation result is then checked manually, error items in it are marked, and the error items together with their feature data are recorded into the template word stock; the weight of each kind of feature data in the character recognition model is then modified manually according to the updated template word stock to obtain the updated character recognition model. Specifically, manually modifying the weight of each kind of feature data according to the updated template word stock comprises: counting the numerical distribution of each kind of feature data according to the updated template word stock, and manually modifying the weights based on the stable interval and variation pattern of each kind of feature data.
In the invention, the weight of the characteristic parameter is converted from the static parameter in the prior art into the dynamic parameter, so that the characteristic parameter is closer to the actual production condition, and the character segmentation accuracy is improved; and the characteristic data of the character obtained after the dynamic parameter segmentation is adopted can be provided for a certain proportion weight of a subsequent recognition algorithm, so that a more stable recognition rate is achieved.
In the above, the numerical distribution of each feature data is counted according to the updated template word stock, and the weight is manually modified based on the stable interval and the change rule of each feature data, which is specifically as follows:
Step a: perform traditional OCR pre-recognition on the same batch of products; where the traditional algorithm fails because of segmentation or other problems, correct the recognition after manual forced adjustment. Collect the feature information of every character of the batch, take the five features of gray scale, size, aspect ratio, area and spacing of each character as the main analysis features, and write them, together with the other feature information as auxiliary data, into a local CSV file.
Step b: import the data obtained in step a using Excel and generate a line graph of each character analysis feature in the word stock (shown in FIGS. 2-6). The percentage by which a feature floats from its standard value is calculated as follows:
δ = ((Max − Standard) + (Standard − Min)) / Standard × 100% = (Max − Min) / Standard × 100%, wherein Max denotes the maximum value of the feature, Standard denotes its standard value, and Min denotes its minimum value;
Step c: obtain the floating percentage of each feature; here the gray-scale float is 4.85%, the width float 9.6%, the height float 17.11%, the area float 12.21%, and the spacing float 18.25%. After sorting, adjust the weights in favor of the feature information with the smallest floating percentages. Each feature dimension defaults to an equal share of the weight; the weight of each dimension is then manually adjusted according to the current floating-percentage data, and features whose floating percentage is stable are given a larger weight when subsequently used for segmentation. For example, with 5 features each defaulting to a 20% weight, the gray scale and width can be raised to 25% on the basis of these data, and the spacing and area lowered to 15%.
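As an illustration of steps b and c, the sketch below computes the floating percentage of each feature and redistributes the default equal weights toward the most stable features. The two-up/two-down rule and both function names are assumptions; the patent adjusts the weights manually from the plotted data.

```python
# Floating percentage of one feature over the batch, relative to its
# standard value (per the formula in step b).
def float_percent(values, standard):
    return (max(values) - min(values)) / standard * 100.0

# Start from equal weights, then shift `step` points from the two least
# stable features to the two most stable ones (a simple stand-in for the
# manual adjustment described in step c).
def adjust_weights(deltas, step=5.0):
    base = 100.0 / len(deltas)
    ranked = sorted(deltas, key=deltas.get)   # most stable first
    weights = {f: base for f in deltas}
    for f in ranked[:2]:
        weights[f] += step
    for f in ranked[-2:]:
        weights[f] -= step
    return weights
```

With the five floats quoted above, this rule raises gray scale and width to 25% and lowers the two least stable features to 15%, in the spirit of the worked example.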
Step 3, pre-segmentation: perform line scanning on the characters to be recognized, for example with a scanning width of one pixel, scanning each line of characters in turn. The characters to be recognized are initially recognized based on the updated character recognition model; when the score of the recognition result for a certain character is smaller than the first threshold, meaning that the feature data of the currently segmented character differ greatly from the standard-character feature data collected in the template word stock and the character's feature data are considered abnormal, the character is forcibly segmented. The basis for the forced segmentation is that the number of characters in each line must match the number of standard characters that a line can accommodate. A specific method of forced segmentation may comprise: splitting the character whose recognition score is below the first threshold into two characters, matching against the standard characters in the template word stock in a left-to-right scan, and adjusting the position of the dividing line until the two resulting characters best match the standard characters. Pre-segmenting (forcibly segmenting) the characters according to the character count avoids character adhesion.
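The forced-segmentation scan just described can be sketched as follows; `match` stands for an assumed scorer that returns the best standard-character matching score for an image slice and is not specified in the patent.

```python
# Sweep the dividing line left to right across a suspect span and keep the
# cut where the two halves together best match the template word stock.
def force_split(columns, match):
    """columns: per-column pixel data of the low-scoring span.
    match: callable scoring a slice against the standard characters."""
    best_cut, best_score = None, float("-inf")
    for cut in range(1, len(columns)):
        score = match(columns[:cut]) + match(columns[cut:])
        if score > best_score:
            best_cut, best_score = cut, score
    return best_cut
```

For instance, with a toy scorer that prefers halves of five columns each, the sweep settles on the middle of a ten-column span.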
Step 4, normalization treatment: and carrying out normalization processing on the characteristic data of the pre-segmented character and the characteristic data in the template word stock so as to obtain the percentage of the characteristic data of each dimension, and facilitating subsequent calculation.
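A minimal sketch of this normalization, assuming it means expressing each feature as a percentage of the corresponding standard value in the template word stock (the patent states only that per-dimension percentages are obtained):

```python
# Express every feature of a pre-segmented character as a percentage of the
# template word stock's standard value, so all dimensions are comparable.
def normalize(features, standards):
    return {k: v / standards[k] * 100.0 for k, v in features.items()}
```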
Step 5, fine adjustment of the segmentation position: match a certain character against the standard characters according to the normalized feature data, compute the standard character with the highest score, and determine the segmentation position of the current character based on that standard character, for example by automatically adjusting the length and width of the current character's selection rectangle. Selecting the highest-scoring standard character as the basis for segmenting the current character ensures the match between the current character and the standard character and thus the accuracy of segmentation; for example, if the current character matches the standard character 'B' with a score of 90 and the standard character '3' with a score of 80, the standard character 'B', which has the higher matching score, is selected as the segmentation standard for the current character.
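The highest-score matching in step 5 might look like the following sketch; the similarity measure (weighted, normalized feature differences) is an assumption, since the patent only states that the standard character with the highest score is computed.

```python
# Weighted similarity between a character's features and one standard
# character; 1.0 per dimension means identical, 0.0 means far apart.
def match_score(char_feats, std_feats, weights):
    score = 0.0
    for k, w in weights.items():
        diff = abs(char_feats[k] - std_feats[k]) / max(std_feats[k], 1e-9)
        score += w * max(0.0, 1.0 - diff)
    return score

# Pick the standard character with the highest score, e.g. 'B' over '3'.
def best_standard(char_feats, templates, weights):
    return max(templates,
               key=lambda name: match_score(char_feats, templates[name], weights))
```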
Preferably, in step 5, before a certain character is matched with the standard characters according to the normalized feature data, the feature points of the character are obtained and filtered. Specifically, the conditions for filtering out a feature point include: the size (area) of the feature point is smaller than a second threshold, or the distance from the feature point to its nearest neighboring feature point is larger than a third threshold. The filtering directly deletes small or completely unrelated interference points, reducing interference, reducing the amount of calculation and improving accuracy.
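The two filtering conditions can be sketched as below; the point representation `(x, y, area)` and the threshold parameter names are illustrative assumptions.

```python
import math

# Drop feature points that are too small (area below the second threshold)
# or isolated (nearest neighbour farther than the third threshold).
def filter_points(points, min_area, max_gap):
    """points: list of (x, y, area) tuples; expects at least two points."""
    kept = []
    for i, (x, y, a) in enumerate(points):
        if a < min_area:
            continue                     # small interference point
        nearest = min(math.dist((x, y), (px, py))
                      for j, (px, py, _) in enumerate(points) if j != i)
        if nearest > max_gap:
            continue                     # completely unrelated point
        kept.append((x, y, a))
    return kept
```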
The segmentation method in traditional OCR technology is unstable and has a high error rate; through algorithmic optimization, the adjusted position reliably segments low-contrast gap regions, avoiding the problem of one character being split into several, reducing character adhesion, and markedly improving behavior at small character spacings.
In summary, the OCR-based character segmentation method provided by the present invention includes the following steps: step 1, data collection: acquiring a template word stock based on an OCR technology, wherein the template word stock comprises standard characters and characteristic data of the standard characters, and the characteristic data at least comprises the gray scale, the size, the length-width ratio, the area gravity center, the area and the space of the standard characters; step 2, manual marking: recognizing a part of characters in the same batch with the characters to be recognized by utilizing a character recognition model in an OCR technology to obtain a character segmentation result, manually checking the segmentation result, marking error items in the segmentation result, recording the error items and corresponding feature data into a template word stock, and manually modifying the weight of each feature data in the character recognition model according to the updated template word stock to obtain an updated character recognition model; step 3, pre-segmentation: scanning the character to be recognized, initially recognizing the character to be recognized based on the updated character recognition model, and forcibly segmenting the character when the score of the recognition result of a certain character is smaller than a first threshold value; step 4, normalization treatment: normalizing the characteristic data of the pre-divided characters and the characteristic data in the template word stock; step 5, fine adjustment of the segmentation position: matching a certain character with the standard character according to the characteristic data after normalization processing, calculating to obtain the standard character with the highest score, and determining the segmentation position of the current character based on the standard character with the highest score. 
The method and the device do not need to collect a large number of samples to train the character recognition model, and can improve the accuracy of character segmentation.
It will be apparent to those skilled in the art that various changes and modifications may be made in the invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (6)

1. An OCR-based character segmentation method is characterized by comprising the following steps:
step 1, data collection: acquiring a template word stock based on an OCR technology, wherein the template word stock comprises standard characters and characteristic data of the standard characters, and the characteristic data at least comprises the gray scale, the size, the length-width ratio, the area gravity center, the area and the space of the standard characters;
step 2, manual marking: recognizing a part of characters in the same batch with the characters to be recognized by utilizing a character recognition model in an OCR technology to obtain a character segmentation result, manually checking the segmentation result, marking error items in the segmentation result, recording the error items and corresponding feature data into a template word stock, and manually modifying the weight of each feature data in the character recognition model according to the updated template word stock to obtain an updated character recognition model;
step 3, pre-segmentation: scanning the character to be recognized, initially recognizing the character to be recognized based on the updated character recognition model, and forcibly segmenting the character; in step 3, the basis for forcibly dividing the character is as follows: the number of characters in each line is consistent with the number of standard characters which can be accommodated in one line;
step 4, normalization treatment: normalizing the characteristic data of the pre-divided characters and the characteristic data in the template word stock;
step 5, fine adjustment of the segmentation position: matching a certain character with the standard character according to the characteristic data after normalization processing, calculating to obtain the standard character with the highest score, and determining the segmentation position of the current character based on the standard character with the highest score;
in step 2, the manually modifying the weight of each feature data in the character recognition model according to the updated template word stock includes: and counting the numerical distribution of each kind of characteristic data according to the updated template word stock, and manually modifying the weight based on the stable interval and the change rule of each kind of characteristic data.
2. An OCR-based character segmentation method as claimed in claim 1, wherein in step 1, the method for acquiring the template word stock based on the OCR technology comprises: collecting pictures of the standard characters and segmenting them by the OCR technology to obtain the template word stock.
3. An OCR-based character segmentation method as claimed in claim 1, wherein in step 2, the number of characters from the same batch as the characters to be recognized that are recognized by the character recognition model is 20-1000.
4. An OCR-based character segmentation method as claimed in claim 1, wherein in step 3, scanning the characters to be identified comprises: setting one pixel as the scanning width and scanning each line of characters.
5. An OCR-based character segmentation method as claimed in claim 1, wherein in step 5, before a character is matched with the standard character based on the normalized feature data, feature points of the character are obtained and filtered.
6. An OCR-based character segmentation method as claimed in claim 5, wherein the condition for filtering the feature points of the character includes: the size of the feature point is less than a second threshold.
CN202110869780.3A 2021-07-30 2021-07-30 Character segmentation method based on OCR Active CN113673511B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110869780.3A CN113673511B (en) 2021-07-30 2021-07-30 Character segmentation method based on OCR

Publications (2)

Publication Number Publication Date
CN113673511A (en) 2021-11-19
CN113673511B (en) 2022-03-18

Family

ID=78540858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110869780.3A Active CN113673511B (en) 2021-07-30 2021-07-30 Character segmentation method based on OCR

Country Status (1)

Country Link
CN (1) CN113673511B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117710987A (en) * 2024-02-06 2024-03-15 武汉卓目科技有限公司 Crown word size segmentation method, device, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8934676B2 (en) * 2012-05-04 2015-01-13 Xerox Corporation Robust character segmentation for license plate images
CN104992152A (en) * 2015-06-30 2015-10-21 深圳訾岽科技有限公司 Character recognition method and system based on template character library
CN106295646B (en) * 2016-08-10 2019-08-23 东方网力科技股份有限公司 A kind of registration number character dividing method and device based on deep learning
CN110275874B (en) * 2019-02-25 2022-04-05 广州金越软件技术有限公司 Intelligent resource cataloguing method for big data resource management
CN111401031A (en) * 2020-03-05 2020-07-10 支付宝(杭州)信息技术有限公司 Target text determination method, device and equipment
CN112926563B (en) * 2021-02-23 2024-01-02 辽宁科技大学 Fault diagnosis system for steel coil spray printing mark

Also Published As

Publication number Publication date
CN113673511A (en) 2021-11-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant