CN113673511B - Character segmentation method based on OCR - Google Patents

Character segmentation method based on OCR

Info

Publication number
CN113673511B
CN113673511B (application CN202110869780.3A)
Authority
CN
China
Prior art keywords
character
characters
standard
segmentation
ocr
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110869780.3A
Other languages
Chinese (zh)
Other versions
CN113673511A (en)
Inventor
Qin Yinghua (秦应化)
Li An (李安)
Wu Kun (吴昆)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Dinnar Automation Technology Co Ltd
Original Assignee
Suzhou Dinnar Automation Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Dinnar Automation Technology Co Ltd filed Critical Suzhou Dinnar Automation Technology Co Ltd
Priority to CN202110869780.3A
Publication of CN113673511A
Application granted
Publication of CN113673511B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Abstract

The invention relates to an OCR-based character segmentation method, which comprises the following steps. Step 1: a template word stock is acquired based on OCR technology, the template word stock comprising standard characters and the feature data of the standard characters. Step 2: a part of the characters in the same batch as the characters to be recognized is recognized by a character recognition model in the OCR technology to obtain a character segmentation result, error items in the segmentation result are marked manually, and the character recognition model is updated. Step 3: the characters to be recognized are scanned and initially recognized based on the updated character recognition model, and a character is forcibly segmented when the score of its recognition result is smaller than a first threshold. Step 4: normalization processing. Step 5: a certain character is matched with the standard characters according to the normalized feature data, the standard character with the highest score is computed, and the segmentation position of the current character is determined based on that standard character. The invention improves the accuracy of character segmentation.

Description

Character segmentation method based on OCR
Technical Field
The invention relates to the field of optical character recognition, in particular to a character segmentation method based on OCR.
Background
In fields involving optical character recognition, such as printed characters and laser marking, OCR (Optical Character Recognition) plays an important role. At present, almost every product carries a production batch number and similar information, and OCR technology is generally used to ensure product traceability. However, when text is actually printed, differences in the printing environment (inconsistencies between printing devices, printing while in motion, and so on) can cause character deformation, spacing changes, size changes and the like. After a model is trained on the standard characters of a conventional OCR character library, these conditions are likely to make the model merge two characters into one or cut one character into two, which reduces the recognition rate.
Therefore, it is an urgent technical problem to be solved by those skilled in the art to provide an OCR-based character segmentation method which is simple in operation and can improve the recognition rate of subsequent characters.
Disclosure of Invention
The invention provides a character segmentation method based on OCR, which aims to solve the technical problem.
In order to solve the above technical problem, the present invention provides an OCR-based character segmentation method, including the steps of:
step 1, data collection: acquiring a template word stock based on an OCR technology, wherein the template word stock comprises standard characters and characteristic data of the standard characters, and the characteristic data at least comprises the gray scale, the size, the length-width ratio, the area gravity center, the area and the space of the standard characters;
step 2, manual marking: recognizing a part of characters in the same batch with the characters to be recognized by utilizing a character recognition model in an OCR technology to obtain a character segmentation result, manually checking the segmentation result, marking error items in the segmentation result, recording the error items and corresponding feature data into a template word stock, and manually modifying the weight of each feature data in the character recognition model according to the updated template word stock to obtain an updated character recognition model;
step 3, pre-segmentation: scanning the character to be recognized, initially recognizing the character to be recognized based on the updated character recognition model, and forcibly segmenting the character when the score of the recognition result of a certain character is smaller than a first threshold value;
step 4, normalization treatment: normalizing the characteristic data of the pre-divided characters and the characteristic data in the template word stock;
step 5, fine adjustment of the segmentation position: matching a certain character with the standard character according to the characteristic data after normalization processing, calculating to obtain the standard character with the highest score, and determining the segmentation position of the current character based on the standard character with the highest score.
Preferably, in step 1, the method for obtaining the template word stock based on the OCR technology includes: and collecting the pictures of the standard characters, and obtaining the template word stock by utilizing the OCR technology for segmentation.
Preferably, in step 2, the number of characters from the same batch as the characters to be recognized that are recognized by the character recognition model is 20-1000.
Preferably, in step 2, the manually modifying the weight of each feature data in the character recognition model according to the updated template word stock includes: and counting the numerical distribution of each kind of characteristic data according to the updated template word stock, and manually modifying the weight based on the stable interval and the change rule of each kind of characteristic data.
Preferably, in step 3, line scanning the characters to be identified comprises: setting one pixel as the scanning width and scanning each line of characters.
Preferably, in step 3, the basis for forcibly dividing the character is as follows: the number of characters per line is made to coincide with the number of standard characters that a line can accommodate.
Preferably, in step 5, before matching a certain character with the standard character according to the normalized feature data, feature points of the character are obtained and filtered.
Preferably, the condition for filtering the feature points of the character includes: the size of the feature point is less than a second threshold.
Compared with the prior art, the character segmentation method based on OCR provided by the invention has the following advantages:
1. according to the invention, a large number of samples can be obtained in the conventional OCR technology to form a template word stock and a trained character recognition model, a collection process of a large number of samples and a model training process are not needed, and the algorithm flow is greatly simplified;
2. the method only needs to take 'a part of characters in the same batch with the characters to be recognized' as samples to test the small batch of data of the character recognition model, manually distributes the weight of the characteristic parameters of the character recognition model according to the test result, and can greatly improve the accuracy of character segmentation on the premise of not increasing a large number of samples;
3. in the invention, the weight of the characteristic parameter is converted from the static parameter in the prior art into the dynamic parameter, so that the characteristic parameter is closer to the actual production condition, and the character segmentation accuracy is improved;
4. the characteristic data of the character obtained after the dynamic parameter segmentation can be provided for a certain proportion weight of a subsequent recognition algorithm to achieve more stable recognition rate;
5. the characters are pre-segmented (forcibly segmented) according to the number of characters a line can accommodate, which avoids character adhesion; the highest-scoring standard character is selected from the standard characters as the basis for segmenting the current character, which ensures the match between the current character and the standard character and thereby ensures accurate segmentation.
Drawings
FIG. 1 is a flowchart of an OCR-based character segmentation method according to an embodiment of the present invention.
FIGS. 2-6 are line graphs of character analysis features in accordance with one embodiment of the present invention.
Detailed Description
In order to more thoroughly express the technical scheme of the invention, the following specific examples are listed to demonstrate the technical effect; it is emphasized that these examples are intended to illustrate the invention and are not to be construed as limiting the scope of the invention.
The character segmentation method based on OCR provided by the invention, as shown in figure 1, comprises the following steps:
step 1, data collection: the method for acquiring the template word stock based on the OCR technology comprises the following steps of acquiring a template word stock based on the OCR technology, wherein the template word stock comprises standard characters and characteristic data of the standard characters, and the characteristic data at least comprises gray scale, size, aspect ratio, area gravity center, area and space of the standard characters, and particularly, the method for acquiring the template word stock based on the OCR technology can comprise the following steps: collecting the image of the standard character, obtaining the template word stock by utilizing the OCR technology for segmentation, for example, scanning the image with the standard character by utilizing a character recognition model in the OCR technology so as to collect the standard character, and then obtaining the characteristic data of the standard character.
Step 2, manual marking: a part of the characters in the same batch as the characters to be recognized is recognized by a character recognition model in the OCR technology to obtain a character segmentation result. Preferably, characters in the same batch share a uniform style, font size, deformation amount and the like; the same batch may specifically refer to the same production batch, the same kind of product, the same specification, and so on. The number of same-batch characters recognized by the character recognition model is 20-1000, far fewer than the training samples required by traditional OCR technology; yet because same-batch characters share a uniform style, the accuracy of character segmentation within the batch can be greatly improved. The segmentation result is then checked manually, error items in it are marked, and the error items together with their feature data are recorded into the template word stock; the weight of each kind of feature data in the character recognition model is then modified manually according to the updated template word stock to obtain the updated character recognition model. Specifically, manually modifying the weight of each kind of feature data according to the updated template word stock comprises: counting the numerical distribution of each kind of feature data according to the updated template word stock, and manually modifying the weights based on the stable interval and variation pattern of each kind of feature data.
In the invention, the weight of the characteristic parameter is converted from the static parameter in the prior art into the dynamic parameter, so that the characteristic parameter is closer to the actual production condition, and the character segmentation accuracy is improved; and the characteristic data of the character obtained after the dynamic parameter segmentation is adopted can be provided for a certain proportion weight of a subsequent recognition algorithm, so that a more stable recognition rate is achieved.
In the above, the numerical distribution of each feature data is counted according to the updated template word stock, and the weight is manually modified based on the stable interval and the change rule of each feature data, which is specifically as follows:
Step a: perform traditional OCR pre-recognition on the same batch of products; where the traditional algorithm fails because of segmentation or other problems, correct the recognition after manual forced adjustment. Collect the feature information of every character of the batch, take the five features of gray scale, size, aspect ratio, area and spacing of each character as the main analysis features, and write them, together with the other feature information as auxiliary data, into a local CSV file.
Step b: import the data obtained in step a using Excel and generate a line graph of each character analysis feature in the word stock (shown in FIGS. 2-6). The percentage by which a feature floats from its standard value is calculated as follows:
δ = ((Max − Standard) + (Standard − Min)) / Standard × 100% = (Max − Min) / Standard × 100%, wherein Max denotes the maximum value of the feature, Standard denotes its standard value, and Min denotes its minimum value;
Step c: obtain the floating percentage of each feature; here the gray-scale float is 4.85%, the width float 9.6%, the height float 17.11%, the area float 12.21%, and the spacing float 18.25%. After sorting, adjust the weights in favor of the feature information with the smallest floating percentages. Each feature dimension defaults to an equal share of the weight; the weight of each dimension is then manually adjusted according to the current floating-percentage data, and features whose floating percentage is stable are given a larger weight when subsequently used for segmentation. For example, with 5 features each defaulting to a 20% weight, the gray scale and width can be raised to 25% on the basis of these data, and the spacing and area lowered to 15%.
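As an illustration of steps b and c, the sketch below computes the floating percentage of each feature and redistributes the default equal weights toward the most stable features. The two-up/two-down rule and both function names are assumptions; the patent adjusts the weights manually from the plotted data.

```python
# Floating percentage of one feature over the batch, relative to its
# standard value (per the formula in step b).
def float_percent(values, standard):
    return (max(values) - min(values)) / standard * 100.0

# Start from equal weights, then shift `step` points from the two least
# stable features to the two most stable ones (a simple stand-in for the
# manual adjustment described in step c).
def adjust_weights(deltas, step=5.0):
    base = 100.0 / len(deltas)
    ranked = sorted(deltas, key=deltas.get)   # most stable first
    weights = {f: base for f in deltas}
    for f in ranked[:2]:
        weights[f] += step
    for f in ranked[-2:]:
        weights[f] -= step
    return weights
```

With the five floats quoted above, this rule raises gray scale and width to 25% and lowers the two least stable features to 15%, in the spirit of the worked example.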
Step 3, pre-segmentation: perform line scanning on the characters to be recognized, for example with a scanning width of one pixel, scanning each line of characters in turn. The characters to be recognized are initially recognized based on the updated character recognition model; when the score of the recognition result for a certain character is smaller than the first threshold, meaning that the feature data of the currently segmented character differ greatly from the standard-character feature data collected in the template word stock and the character's feature data are considered abnormal, the character is forcibly segmented. The basis for the forced segmentation is that the number of characters in each line must match the number of standard characters that a line can accommodate. A specific method of forced segmentation may comprise: splitting the character whose recognition score is below the first threshold into two characters, matching against the standard characters in the template word stock in a left-to-right scan, and adjusting the position of the dividing line until the two resulting characters best match the standard characters. Pre-segmenting (forcibly segmenting) the characters according to the character count avoids character adhesion.
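The forced-segmentation scan just described can be sketched as follows; `match` stands for an assumed scorer that returns the best standard-character matching score for an image slice and is not specified in the patent.

```python
# Sweep the dividing line left to right across a suspect span and keep the
# cut where the two halves together best match the template word stock.
def force_split(columns, match):
    """columns: per-column pixel data of the low-scoring span.
    match: callable scoring a slice against the standard characters."""
    best_cut, best_score = None, float("-inf")
    for cut in range(1, len(columns)):
        score = match(columns[:cut]) + match(columns[cut:])
        if score > best_score:
            best_cut, best_score = cut, score
    return best_cut
```

For instance, with a toy scorer that prefers halves of five columns each, the sweep settles on the middle of a ten-column span.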
Step 4, normalization treatment: and carrying out normalization processing on the characteristic data of the pre-segmented character and the characteristic data in the template word stock so as to obtain the percentage of the characteristic data of each dimension, and facilitating subsequent calculation.
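A minimal sketch of this normalization, assuming it means expressing each feature as a percentage of the corresponding standard value in the template word stock (the patent states only that per-dimension percentages are obtained):

```python
# Express every feature of a pre-segmented character as a percentage of the
# template word stock's standard value, so all dimensions are comparable.
def normalize(features, standards):
    return {k: v / standards[k] * 100.0 for k, v in features.items()}
```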
Step 5, fine adjustment of the segmentation position: match a certain character against the standard characters according to the normalized feature data, compute the standard character with the highest score, and determine the segmentation position of the current character based on that standard character, for example by automatically adjusting the length and width of the current character's selection rectangle. Selecting the highest-scoring standard character as the basis for segmenting the current character ensures the match between the current character and the standard character and thus the accuracy of segmentation; for example, if the current character matches the standard character 'B' with a score of 90 and the standard character '3' with a score of 80, the standard character 'B', which has the higher matching score, is selected as the segmentation standard for the current character.
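The highest-score matching in step 5 might look like the following sketch; the similarity measure (weighted, normalized feature differences) is an assumption, since the patent only states that the standard character with the highest score is computed.

```python
# Weighted similarity between a character's features and one standard
# character; 1.0 per dimension means identical, 0.0 means far apart.
def match_score(char_feats, std_feats, weights):
    score = 0.0
    for k, w in weights.items():
        diff = abs(char_feats[k] - std_feats[k]) / max(std_feats[k], 1e-9)
        score += w * max(0.0, 1.0 - diff)
    return score

# Pick the standard character with the highest score, e.g. 'B' over '3'.
def best_standard(char_feats, templates, weights):
    return max(templates,
               key=lambda name: match_score(char_feats, templates[name], weights))
```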
Preferably, in step 5, before a certain character is matched with the standard characters according to the normalized feature data, the feature points of the character are obtained and filtered. Specifically, the conditions for filtering out a feature point include: the size (area) of the feature point is smaller than a second threshold, or the distance from the feature point to its nearest neighboring feature point is larger than a third threshold. The filtering directly deletes small or completely unrelated interference points, reducing interference, reducing the amount of calculation and improving accuracy.
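The two filtering conditions can be sketched as below; the point representation `(x, y, area)` and the threshold parameter names are illustrative assumptions.

```python
import math

# Drop feature points that are too small (area below the second threshold)
# or isolated (nearest neighbour farther than the third threshold).
def filter_points(points, min_area, max_gap):
    """points: list of (x, y, area) tuples; expects at least two points."""
    kept = []
    for i, (x, y, a) in enumerate(points):
        if a < min_area:
            continue                     # small interference point
        nearest = min(math.dist((x, y), (px, py))
                      for j, (px, py, _) in enumerate(points) if j != i)
        if nearest > max_gap:
            continue                     # completely unrelated point
        kept.append((x, y, a))
    return kept
```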
The segmentation method in traditional OCR technology is unstable and has a high error rate; through algorithmic optimization, the adjusted position reliably segments low-contrast gap regions, avoiding the problem of one character being split into several, reducing character adhesion, and markedly improving behavior at small character spacings.
In summary, the OCR-based character segmentation method provided by the present invention includes the following steps: step 1, data collection: acquiring a template word stock based on an OCR technology, wherein the template word stock comprises standard characters and characteristic data of the standard characters, and the characteristic data at least comprises the gray scale, the size, the length-width ratio, the area gravity center, the area and the space of the standard characters; step 2, manual marking: recognizing a part of characters in the same batch with the characters to be recognized by utilizing a character recognition model in an OCR technology to obtain a character segmentation result, manually checking the segmentation result, marking error items in the segmentation result, recording the error items and corresponding feature data into a template word stock, and manually modifying the weight of each feature data in the character recognition model according to the updated template word stock to obtain an updated character recognition model; step 3, pre-segmentation: scanning the character to be recognized, initially recognizing the character to be recognized based on the updated character recognition model, and forcibly segmenting the character when the score of the recognition result of a certain character is smaller than a first threshold value; step 4, normalization treatment: normalizing the characteristic data of the pre-divided characters and the characteristic data in the template word stock; step 5, fine adjustment of the segmentation position: matching a certain character with the standard character according to the characteristic data after normalization processing, calculating to obtain the standard character with the highest score, and determining the segmentation position of the current character based on the standard character with the highest score. 
The method and the device do not need to collect a large number of samples to train the character recognition model, and can improve the accuracy of character segmentation.
It will be apparent to those skilled in the art that various changes and modifications may be made in the invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (6)

1. An OCR-based character segmentation method is characterized by comprising the following steps:
step 1, data collection: acquiring a template word stock based on an OCR technology, wherein the template word stock comprises standard characters and characteristic data of the standard characters, and the characteristic data at least comprises the gray scale, the size, the length-width ratio, the area gravity center, the area and the space of the standard characters;
step 2, manual marking: recognizing a part of characters in the same batch with the characters to be recognized by utilizing a character recognition model in an OCR technology to obtain a character segmentation result, manually checking the segmentation result, marking error items in the segmentation result, recording the error items and corresponding feature data into a template word stock, and manually modifying the weight of each feature data in the character recognition model according to the updated template word stock to obtain an updated character recognition model;
step 3, pre-segmentation: scanning the character to be recognized, initially recognizing the character to be recognized based on the updated character recognition model, and forcibly segmenting the character; in step 3, the basis for forcibly dividing the character is as follows: the number of characters in each line is consistent with the number of standard characters which can be accommodated in one line;
step 4, normalization treatment: normalizing the characteristic data of the pre-divided characters and the characteristic data in the template word stock;
step 5, fine adjustment of the segmentation position: matching a certain character with the standard character according to the characteristic data after normalization processing, calculating to obtain the standard character with the highest score, and determining the segmentation position of the current character based on the standard character with the highest score;
in step 2, the manually modifying the weight of each feature data in the character recognition model according to the updated template word stock includes: and counting the numerical distribution of each kind of characteristic data according to the updated template word stock, and manually modifying the weight based on the stable interval and the change rule of each kind of characteristic data.
2. An OCR-based character segmentation method as claimed in claim 1, wherein in step 1, the method for acquiring the template word stock based on the OCR technology comprises: collecting pictures of the standard characters and segmenting them by the OCR technology to obtain the template word stock.
3. An OCR-based character segmentation method as claimed in claim 1, wherein in step 2, the number of characters from the same batch as the characters to be recognized that are recognized by the character recognition model is 20-1000.
4. An OCR-based character segmentation method as claimed in claim 1, wherein in step 3, scanning the characters to be identified comprises: setting one pixel as the scanning width and scanning each line of characters.
5. An OCR-based character segmentation method as claimed in claim 1, wherein in step 5, before a character is matched with the standard character based on the normalized feature data, feature points of the character are obtained and filtered.
6. An OCR-based character segmentation method as claimed in claim 5, wherein the condition for filtering the feature points of the character includes: the size of the feature point is less than a second threshold.
CN202110869780.3A 2021-07-30 2021-07-30 Character segmentation method based on OCR Active CN113673511B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110869780.3A CN113673511B (en) 2021-07-30 2021-07-30 Character segmentation method based on OCR

Publications (2)

Publication Number Publication Date
CN113673511A (en) 2021-11-19
CN113673511B (en) 2022-03-18

Family

ID=78540858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110869780.3A Active CN113673511B (en) 2021-07-30 2021-07-30 Character segmentation method based on OCR

Country Status (1)

Country Link
CN (1) CN113673511B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117710987A (en) * 2024-02-06 2024-03-15 武汉卓目科技有限公司 Crown word size segmentation method, device, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8934676B2 (en) * 2012-05-04 2015-01-13 Xerox Corporation Robust character segmentation for license plate images
CN104992152A (en) * 2015-06-30 2015-10-21 深圳訾岽科技有限公司 Character recognition method and system based on template character library
CN106295646B (en) * 2016-08-10 2019-08-23 东方网力科技股份有限公司 A kind of registration number character dividing method and device based on deep learning
CN110275874B (en) * 2019-02-25 2022-04-05 广州金越软件技术有限公司 Intelligent resource cataloguing method for big data resource management
CN111401031A (en) * 2020-03-05 2020-07-10 支付宝(杭州)信息技术有限公司 Target text determination method, device and equipment
CN112926563B (en) * 2021-02-23 2024-01-02 辽宁科技大学 Fault diagnosis system for steel coil spray printing mark

Also Published As

Publication number Publication date
CN113673511A (en) 2021-11-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant