CN118366175B

CN118366175B - Document image classification method based on word frequency

Info

Publication number: CN118366175B
Application number: CN202410794321.7A
Authority: CN
Inventors: 张志坚; 陈友斌; 申意萍; 徐一波
Original assignee: Hubei Micropattern Technology Development Co ltd
Current assignee: Hubei Micropattern Technology Development Co ltd
Priority date: 2024-06-19
Filing date: 2024-06-19
Publication date: 2024-09-24
Anticipated expiration: 2044-06-19
Also published as: CN118366175A

Abstract

The invention discloses a word frequency-based document image classification method in the technical field of document image classification. The method comprises two stages: document image registration and classification of document images of unknown type. With the invention, each document image type can be registered from a single sample (an example document image); for form-type document images in particular, a single blank form suffices. The method is robust to errors (false alarms and missed detections) of the text line detection algorithm and to errors (misrecognitions and missed recognitions) of the text line recognition algorithm, whatever their cause, and it accommodates the addition of new document image types.

Description

Document image classification method based on word frequency

Technical Field

The invention relates to the technical field of document image classification, in particular to a document image classification method based on word frequency.

Background

As banks, securities firms, insurers, and other financial institutions move into the digital age, the number of electronic document images keeps growing. A bank account system generates many kinds of document images during operation, covering account opening, transaction vouchers, business files, and so on. These document images are not only important records of banking processes but also an important basis for the bank's information management and risk control, so managing and processing them efficiently and accurately matters for the normal operation of banking business and for risk prevention and control. The insurance industry likewise generates many types of document images in its business processes: the sale of insurance products produces large numbers of document images such as insurance application forms, which record the client's personal information, insurance requirements, insurance clauses, and so on in detail and are an important basis for the insurer's risk assessment and underwriting; insurance claim settlement produces document images such as claim applications, medical certificates, and accident assessments, which are the key basis for claim review and claim payment. These electronic document images are numerous and rich in industry-related image and text information, and they must be sorted, classified, and passed on to specialized recognition processing. Doing this manually is time-consuming and labor-intensive, and the cost rises sharply.

Therefore, there is a need to develop an automatic classification method for electronic document images to meet the growing processing demand.

The existing automatic classification method of document images mainly comprises the following steps:

1) Rule-based classification method: the method judges different layout modules in the page through predefined rules and templates. For example, rules may be used to determine whether a larger font text at the top of a page is a title or its type based on the location, font size, format, etc. characteristics of the text.

2) Machine learning-based classification method: this approach utilizes a large amount of training data to learn a model of document image classification. The automatic classification of the new document image is realized by extracting the characteristics (such as the positions, the sizes, the shapes and the like of the characters and the images) in the document image and training by using a machine learning algorithm (such as a support vector machine, random forest, deep learning and the like).

3) Classification method based on image processing: the document image is preprocessed and feature extracted by image processing techniques such as edge detection, connected domain analysis, projection, etc., and then classified based on these features.

4) Document image segmentation based on hierarchical or non-hierarchical methods: hierarchical methods, such as top-down or bottom-up, identify layout structures by progressively segmenting document images. The non-hierarchical method is more focused on overall analysis and processing, and is suitable for complex document images.

5) Hybrid methods: these combine several of the above methods so that they complement one another, improving the accuracy and efficiency of document image classification.

In actual production use, the five classes of classification algorithm above face various challenges: they are not robust enough to cope with the diversity and complexity of document images. Specifically, the challenges fall into the following five aspects:

(1) Differences in shooting environment and equipment: differences in illumination, angle, and background can cause uneven brightness, heavy shadows, or blurred text. These problems degrade image quality and make it difficult for a classification algorithm to recognize the image accurately. Differences in device resolution, exposure time, and lens distortion lead to differences in image sharpness, color rendition, and geometry, which further increases the difficulty of classification.

(2) Deformation of the paper: during scanning or shooting, bending, wrinkling, or folding of a paper document may distort the text, which hinders recognition and analysis of the text.

(3) Text occlusion: text in a document image may be covered by seals, stickers, or other items, so that some information is lost or hard to recognize, which is a challenge for classification algorithms.

(4) Complexity and diversity of document image content: the content of document images tends to be complex and variable; even within the same type of document image, the page layout, text layout, and image layout may differ greatly.

(5) The growing number of document image types: new document image types may appear over time. For a pre-trained classifier, this means that data must be re-collected and the model retrained to accommodate the new classification requirements. This not only increases effort and cost but may also affect the performance and stability of the classifier.

Disclosure of Invention

To solve the technical problems of document image classification described above, the invention provides a document image classification method based on word frequency, adopting the following technical scheme:

a method for classifying document images based on word frequency comprises two stages: document image registration and classification of a document image of unknown type;

the document image registration includes the steps of:

Step 10, obtaining one example document image for each type of document image to be classified, forming an example document image set of N images, where N is the number of types of document images to be classified and the i-th image of the set is the example document image of the i-th type;

Step 11, counting the title character set of all document images by using the example document images;

Step 12, counting the registered title word frequency of each type of document image;

Step 13, when a new document image type is added, updating the title character set of all document images and the registered title word frequency of each type of document image;

Step 14, counting the key character set of all document images by using the example document images;

Step 15, counting the registered key word frequency of each type of document image;

Step 16, when a new document image type is added, updating the key character set of all document images and the registered key word frequency of each type of document image;

The classification of the document image of unknown type comprises the following steps:

Step 20, performing text line detection and text line recognition on the document image to be classified;

Step 21, dividing the text line detection results into two classes, title and body;

Step 22, obtaining the title word frequency of the document image to be classified;

Step 23, obtaining the key word frequency of the document image to be classified;

Step 24, calculating a score based on the cosine distance between the registered title word frequency and the title word frequency of the document image to be classified and on the cosine distance between the registered key word frequency and the key word frequency of the document image to be classified, and obtaining the type label of the document image from an analysis of the scores.

With this technical scheme, each document image type can be registered from a single sample (an example document image); for form-type document images in particular, a single blank form suffices.

The method is robust to errors (false alarms and missed detections) of the text line detection algorithm, whatever their cause;

the method is robust to errors (misrecognitions and missed recognitions) of the text line recognition algorithm, whatever their cause;

the addition of new document image types can be accommodated.

Optionally, step 11 comprises the sub-steps of:

Step 111: obtaining the title of each example document image, forming the title set T = {T_i | i = 0, 1, …, N-1}, where N is the number of types of document images to be classified and T_i is the title of the example document image of the i-th type;

Step 112: forming the title character set A from the characters of every title in the title set T, where P is the number of characters in the title character set.

In step 111, each example document image is viewed manually to obtain its title.

It should be noted that step 111 must be performed manually and the titles must not be obtained by OCR, so that the correctness of the title character set A is ensured.
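Step 112 can be sketched in Python as follows. This is a minimal illustration that assumes the titles have already been collected by hand as plain strings; the function name `build_char_set`, the sample titles, the choice to skip whitespace, and the choice to sort the result are all assumptions of the sketch, not part of the method as described.

```python
def build_char_set(strings):
    """Collect the distinct characters of all strings into an ordered list.

    Sorting (by code point) is an assumption of this sketch: any fixed
    order works as long as registration and classification share it.
    """
    chars = set()
    for s in strings:
        for c in s:
            if not c.isspace():  # assumption: whitespace carries no information
                chars.add(c)
    return sorted(chars)


# Hypothetical, manually entered titles (one per type that has a title).
titles = ["Income Proof", "Post Qualification"]
A = build_char_set(titles)  # title character set
P = len(A)                  # number of characters in the set
```

Giving every character a fixed position p in the list is what later lets a title be represented as a P-dimensional frequency vector.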

Optionally, in step 13, when a new document image type is added, an example document image of the new type is obtained and added to the example document image set, steps 11 and 12 are repeated, and the title character set and the registered title word frequencies are updated.

Optionally, step 14 comprises the sub-steps of:

Step 141: manually viewing each example document image and obtaining its keys (the field labels printed on the document, such as name or gender), forming the key set K;

Step 142: forming the key character set B from the characters of every key in the key set K, where Q is the number of characters in the key character set.

It should be noted that step 141 must be performed manually and the keys must not be obtained by OCR, so that the correctness of the key character set B is ensured.

Optionally, in step 16, when a new document image type is added, an example document image of the new type is obtained and added to the example document image set, steps 14 and 15 are repeated, and the key character set and the registered key word frequencies are updated.

Optionally, the division in step 21 relies on three features of a title: its character size is greater than or equal to that of the body text, it lies within the first 1 to 3 text lines of the document image, and it is centered.

A computer-readable storage medium storing a document image classification program designed using a document image classification method based on word frequency.

A document image classification device based on word frequency comprises a memory and a processor, wherein the memory stores a document image classification program designed by a document image classification method based on word frequency, and the processor is in communication connection with the memory, runs the document image classification program and outputs a document image classification result.

Optionally, the device further comprises a display, wherein the display is in communication connection with the processor, and the processor controls the display to display the document image classification result.

In summary, the invention has at least the following beneficial technical effects:

The invention can provide a document image classification method based on word frequency, each document image can finish document image registration by only one sample (namely an example document image), and particularly, for document images of form type, only one blank form is needed;

The addition of new document image types can be accommodated.

Drawings

FIG. 1 is a flow chart of a document image classification method based on word frequency according to the present invention;

FIG. 2 is an example document image of an identity card of a particular embodiment;

FIG. 3 is an example document image of a post qualification of a particular embodiment;

FIG. 4 is an example document image of an income proof of a particular embodiment;

FIG. 5 is a post qualification document image to be classified in a particular embodiment;

FIG. 6 shows the text line detection result for the post qualification document image to be classified in a particular embodiment.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings.

The embodiment of the invention discloses a document image classification method based on word frequency.

Referring to FIG. 1 to FIG. 6, a document image classification method based on word frequency comprises two stages: document image registration and classification of a document image of unknown type.

The classification of a document image of unknown type includes the following steps; the sub-steps of document image registration are detailed after them:

Step 20: text line detection is performed on the document image of unknown type to be classified, image_test, and each detected text line is recognized. No restriction is placed on which text line detection algorithm or which text line recognition algorithm is used.

Step 21: the text line detection results are divided into two classes, title and body. The division relies on three features of a title: its character size is greater than or equal to that of the body text, it lies within the first 1 to 3 text lines of the document image, and it is centered.
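The three title features can be turned into a simple rule. The Python sketch below is only one possible reading: the `TextLine` fields, the use of bounding-box height as a proxy for character size, the median height as the "body" size, and the parameters `max_title_rows` and `center_tol` are all hypothetical choices, since the text does not specify how the features are measured.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TextLine:
    text: str
    left: float    # bounding-box left edge, in pixels
    top: float     # bounding-box top edge, in pixels
    width: float
    height: float  # used here as a proxy for character size


def split_title_body(lines, page_width, max_title_rows=3, center_tol=0.1):
    """Return (title_lines, body_lines) using the three title features."""
    ordered = sorted(lines, key=lambda ln: ln.top)  # top-to-bottom reading order
    if not ordered:
        return [], []
    typical_height = median(ln.height for ln in ordered)
    titles, body = [], []
    for row, ln in enumerate(ordered):
        in_top_rows = row < max_title_rows            # within the first 1 to 3 lines
        large_enough = ln.height >= typical_height    # size >= body text size
        line_center = ln.left + ln.width / 2
        centered = abs(line_center - page_width / 2) <= center_tol * page_width
        if in_top_rows and large_enough and centered:
            titles.append(ln)
        else:
            body.append(ln)
    return titles, body


# Hypothetical detection output for a 1000-pixel-wide page.
page = [
    TextLine("Post Qualification", 300, 20, 400, 48),
    TextLine("Name: A", 50, 120, 200, 24),
    TextLine("Department: B", 50, 160, 260, 24),
]
title_lines, body_lines = split_title_body(page, page_width=1000)
```

A document with no line passing all three checks simply yields an empty title list, which corresponds to the "no title" case handled in the later steps.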

Step 22: the title word frequency u of image_test is obtained.

    u = P-dimensional all-zero vector
    if image_test has a title
        text = title content obtained in step 20 and step 21
        for c in text                      # traverse each character c of text
            for p in 0, 1, 2, …, P-1
                if c is the p-th character of the title character set A
                    u[p] = u[p] + 1
                    break
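The frequency vector of step 22 can be written compactly in Python. In this sketch the function name `char_frequency` is hypothetical, a dictionary lookup replaces the inner loop with `break`, and incrementing the count on each match is an assumption consistent with the term "word frequency".

```python
def char_frequency(text, charset):
    """Count how often each character of `charset` occurs in `text`."""
    index = {c: p for p, c in enumerate(charset)}  # character -> position p
    freq = [0] * len(charset)                      # all-zero vector
    for c in text:
        p = index.get(c)
        if p is not None:  # characters outside the set are ignored
            freq[p] += 1
    return freq


# Toy character set and text, for illustration only.
example = char_frequency("abba x", ["a", "b", "c", "d"])  # [2, 2, 0, 0]
```

Because characters outside the set are ignored, a misrecognized or spurious character does not corrupt the vector; it only fails to contribute, which is one way to read the robustness claimed for the method.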

Step 23: the key word frequency v of image_test is obtained.

    v = Q-dimensional all-zero vector
    if image_test has keys
        text = the recognition results of the body text lines obtained in step 20 and step 21, combined
        for c in text                      # traverse each character c of text
            for q in 0, 1, 2, …, Q-1
                if c is the q-th character of the key character set B
                    v[q] = v[q] + 1
                    break

Step 24: the type of image_test is determined. Specifically, the following logic is adopted. Set max_score = 0 and max_id = -1, and traverse the N types of document images one by one. For the i-th type, if both the document image to be classified and the i-th type have a title, the title similarity score_t is assigned the cosine distance between the title word frequency u of the document image to be classified and the registered title word frequency of the i-th type; otherwise score_t is assigned 0. Likewise, if both the document image to be classified and the i-th type have keys, the key similarity score_k is assigned the cosine distance between the key word frequency v of the document image to be classified and the registered key word frequency of the i-th type; otherwise score_k is assigned 0. Here the cosine distance is the cosine of the angle between the two vectors, so a larger value means greater similarity. The similarity score between the document image to be classified and the i-th type is then computed in four cases: if score_t > 0 and score_k > 0, score = (score_t + score_k)/2; if score_t > 0 and score_k = 0, score = score_t; if score_t = 0 and score_k > 0, score = score_k; if score_t = 0 and score_k = 0, score = 0. If score is greater than max_score, max_score is updated to score and the current type index is recorded as max_id = i. After every type has been processed, max_id gives the type of the document image to be classified.

    max_score = 0
    max_id = -1
    for i in 0, 1, 2, 3, …, N-1
        score_t = 0
        if image_test has a title and the i-th type of document image has a title
            score_t = cosine distance between u and the registered title word frequency of the i-th type
        score_k = 0
        if image_test has keys and the i-th type of document image has keys
            score_k = cosine distance between v and the registered key word frequency of the i-th type
        if score_t > 0 and score_k > 0
            score = (score_t + score_k)/2
        if score_t > 0 and score_k == 0
            score = score_t
        if score_t == 0 and score_k > 0
            score = score_k
        if score_t == 0 and score_k == 0
            score = 0
        if score > max_score
            max_score = score
            max_id = i

image_test is a document image of type max_id.
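The scoring loop can be sketched as runnable Python. The names `cosine`, `classify`, `reg_titles`, and `reg_keys` are hypothetical, and `None` is used here as an assumed way to mark "no title" or "no keys". The sketch computes the cosine of the angle between the vectors, so that larger scores mean greater similarity, as the maximization in the logic requires.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(u, v, reg_titles, reg_keys):
    """Return (max_id, max_score) for test vectors u (title) and v (keys).

    reg_titles[i] / reg_keys[i] are the registered vectors of type i, or
    None when that type has no title / no keys. Pass u=None or v=None when
    the test image has no title / no keys.
    """
    max_score, max_id = 0.0, -1
    for i, (rt, rk) in enumerate(zip(reg_titles, reg_keys)):
        score_t = cosine(u, rt) if u is not None and rt is not None else 0.0
        score_k = cosine(v, rk) if v is not None and rk is not None else 0.0
        if score_t > 0 and score_k > 0:
            score = (score_t + score_k) / 2
        else:
            # Covers the other three cases: at most one score is non-zero.
            score = score_t + score_k
        if score > max_score:
            max_score, max_id = score, i
    return max_id, max_score


# Toy registered vectors for three types (illustrative values only).
reg_titles = [None, [1, 0, 1], [0, 1, 0]]
reg_keys = [[1, 1, 0], [0, 1, 1], None]
best_id, best_score = classify([1, 0, 1], [0, 1, 1], reg_titles, reg_keys)
```

Averaging only when both scores are positive means a type without a title is judged on its keys alone rather than being penalized for the missing title.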

Only one sample (i.e., an example document image) is needed for each document image to complete document image registration, and particularly, for a document image of a form type, only one blank form is needed.

The addition of new document image types can be accommodated.

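Step 11 comprises the following sub-steps, in which the title character set A is built from the titles:

    for i in 0, 1, 2, …, N-1
        for c in T_i                       # traverse each character c of the title T_i of the i-th type
            if c does not belong to the title character set A
                add c to the title character set A

Step 12: the registered title word frequency of each type of document image is counted.

    for i in 0, 1, 2, …, N-1
        registered title word frequency of the i-th type = P-dimensional all-zero vector
        if the example document image of the i-th type has a title
            for c in T_i                   # traverse each character c of the title T_i
                for p in 0, 1, 2, …, P-1
                    if c is the p-th character of the title character set A
                        increase component p of the registered title word frequency of the i-th type by 1
                        break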

In step 13, when a new document image type is added, an example document image of the new type is obtained and added to the example document image set, and steps 11 and 12 are repeated.
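Repeating steps 11 and 12 after a new type is added amounts to rebuilding the character set and every registered vector, with no model training involved. A minimal Python sketch of that idea follows; the class `TitleRegistry`, its attributes, and the sample titles are hypothetical, and rebuilding everything on each addition is a simplification chosen for clarity.

```python
class TitleRegistry:
    """Holds one manually entered title per document type (None if no title)."""

    def __init__(self):
        self.titles = []   # titles[i] is the title of type i, or None
        self.charset = []  # title character set A, in a fixed (sorted) order
        self.freqs = []    # freqs[i] is the registered vector of type i

    def add_type(self, title):
        """Register a new type, then rebuild A and every registered vector."""
        self.titles.append(title)
        # Step 11 again: collect the characters of all titles.
        self.charset = sorted({c for t in self.titles if t for c in t})
        index = {c: p for p, c in enumerate(self.charset)}
        # Step 12 again: recount every type against the enlarged set.
        self.freqs = []
        for t in self.titles:
            vec = [0] * len(self.charset)
            for c in t or "":
                vec[index[c]] += 1
            self.freqs.append(vec)
        return len(self.titles) - 1  # index of the new type


registry = TitleRegistry()
registry.add_type("ab")    # type 0
registry.add_type(None)    # type 1 has no title
new_id = registry.add_type("bcc")  # type 2 extends the character set with 'c'
```

All vectors are recomputed because adding characters changes the dimension P, so previously registered vectors would otherwise have the wrong length.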

Step 14 comprises the following sub-steps:

Step 141: each example document image is manually viewed and its keys are obtained, forming the key set K. K contains one group of keys per type, where N is the number of types of document images to be classified and the i-th group holds the keys of the example document image of the i-th type; the number of keys may differ from type to type.

Step 142: the characters in the keys of each type of document image to be classified are added to the key character set B, where Q is the number of characters in the key character set.

    for i in 0, 1, 2, …, N-1
        for j in 0, 1, 2, …, (number of keys of the i-th type) - 1
            for c in the j-th key of the i-th type     # traverse each character c of that key
                if c does not belong to the key character set B
                    add c to the key character set B

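Step 15: the registered key word frequency of each type of document image is counted.

    for i in 0, 1, 2, …, N-1
        registered key word frequency of the i-th type = Q-dimensional all-zero vector
        if the example document image of the i-th type has keys
            for j in 0, 1, 2, …, (number of keys of the i-th type) - 1
                for c in the j-th key of the i-th type     # traverse each character c of that key
                    for q in 0, 1, 2, …, Q-1
                        if c is the q-th character of the key character set B
                            increase component q of the registered key word frequency of the i-th type by 1
                            break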

The division in step 21 relies on three features of a title: its character size is greater than or equal to that of the body text, it lies within the first 1 to 3 text lines of the document image, and it is centered.

The device also comprises a display, wherein the display is in communication connection with the processor, and the processor controls the display to display the document image classification result.

The implementation principle of the document image classification method based on word frequency is explained below using a specific embodiment:

document images of three types are classified: the identity card shown in FIG. 2, the post qualification shown in FIG. 3, and the income proof shown in FIG. 4.

Step 10: one example document image is obtained for each type of document image to be classified, forming an example document image set with N = 3. The first example document image is the identity card shown in FIG. 2, the second is the post qualification shown in FIG. 3, and the third is the income proof shown in FIG. 4.

Step 11: the title character set of all document images is counted.

Step 111: each example document image is manually viewed and its title is obtained, forming the title set T, where N is the number of types of document images to be classified and T_i is the title of the example document image of the i-th type. For the post qualification, "post qualification" is added to the title set T, and for the income proof, "income proof" is added to the title set T; the identity card has no title and contributes nothing. The title set is therefore T = {post qualification, income proof}.

Step 112 yields the title character set A = {'bit', 'enter', 'post', 'receive', 'bright', 'grid', 'syndrome', 'resource'}, with P = 8 (each entry stands for a single character of the original-language titles).

Step 12: the registered title word frequency of each type of document image is counted.

Step 14: the key character set of all document images is counted.

Step 141: the example document images shown in FIG. 2, FIG. 3, and FIG. 4 are manually viewed and their keys are obtained, forming the key set K. For the identity card, its keys (name, gender, ethnicity, birthday, address, citizen identity number) are added to the key set K; for the post qualification, its keys (name, gender, department, position, business) are added to the key set K. This gives the key set K = {name, gender, ethnicity, birthday, address, citizen identity number, name, gender, department, position, business}.

Step 142: the resulting key character set is B = {'industry', 'part', 'position', 'living', 'public', 'category', 'business', 'number', 'first name', 'address', 'last name', 'sex', 'family', 'day', 'people', 'birth', 'code', 'job', 'body', 'part', 'door'}, with Q = 21 (each entry stands for a single character of the original-language keys).

Step 15: the registered key word frequency of each type of document image is counted.

The classification of a document image of unknown type comprises five steps: text line detection and recognition, division of the text lines into title and body, obtaining the title word frequency, obtaining the key word frequency, and obtaining the type label.

Step 20: referring to FIG. 5, text line detection and text line recognition are performed on image_test shown in FIG. 5; the text line detection result is shown in FIG. 6.

The recognition result of each text line is: post card, name Zhang Taili, sex, female, department research, advanced software engineer, business security.

It can be seen that both the text line detection result and the text line recognition result contain errors.

Step 21: the text line detection results are divided into two classes, title and body. The result is that "post card" is the title and the remaining lines are body.

Step 22: the title word frequency of image_test is obtained: u = [1, 0, 1, 0, 0, 0, 1, 0].

Step 23: the key word frequency of image_test is obtained: v = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1].

Step 24: the type of image_test is determined. The program output is:

i=0 score_t=0.000 score_k=0.333 score=0.333 max_score=0.333 max_id=0

i=1 score_t=0.775 score_k=0.894 score=0.835 max_score=0.835 max_id=1

i=2 score_t=0.289 score_k=0.000 score=0.289 max_score=0.835 max_id=1

image_test is judged to be a document image of type max_id = 1, which is the correct conclusion.
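The score_t column of the trace above can be reproduced with a few lines of Python. This sketch rests on a loud assumption: the Chinese strings below are back-translated guesses at the original-language titles and at the recognized title (they are not given in this text), chosen because they have the stated sizes (P = 8, titles for two of the three types). The helper names `freq` and `cosine` are hypothetical.

```python
import math

# ASSUMED original-language strings (back-translated guesses, not from the text).
A = ["位", "入", "岗", "收", "明", "格", "证", "资"]    # title character set, P = 8
registered_titles = [None, "岗位资格证", "收入证明"]   # type 0 (identity card): no title
recognized_title = "岗位证"                            # OCR output with missing characters


def freq(text):
    """Frequency vector of `text` over the character set A."""
    return [text.count(c) for c in A]


def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


u = freq(recognized_title)
scores = [cosine(u, freq(t)) if t else 0.0 for t in registered_titles]
print(u)                             # [1, 0, 1, 0, 0, 0, 1, 0]
print([round(s, 3) for s in scores]) # [0.0, 0.775, 0.289]
```

Under these assumptions the incomplete title still scores 3/√15 ≈ 0.775 against the post qualification and only 1/√12 ≈ 0.289 against the income proof, which illustrates how the method tolerates recognition errors in the title.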

The above embodiments do not limit the scope of protection of the invention; all equivalent changes made according to the structure, shape, and principle of the invention shall fall within the scope of protection of the invention.

Claims

1. A method for classifying document images based on word frequency, characterized by comprising two stages: document image registration and classification of a document image of unknown type;

the document image registration includes the steps of:

Step 10, obtaining one example document image for each type of document image, forming an example document image set in which the i-th image is the example document image of the i-th type;

Step 24, calculating a score based on the cosine distance between the registered title word frequency and the title word frequency of the document image to be classified and on the cosine distance between the registered key word frequency and the key word frequency of the document image to be classified, and obtaining the type label of the document image from an analysis of the scores;

step 24 uses the following logic for classification: setting max_score = 0 and max_id = -1; traversing the N types of document images one by one; for the i-th type of document image, if both the document image to be classified and the i-th type of document image have titles, the title similarity score_t is assigned the cosine distance between the title word frequency u of the document image to be classified and the registered title word frequency of the i-th type, and if not, score_t is assigned 0; if both the document image to be classified and the i-th type of document image have keys, the key similarity score_k is assigned the cosine distance between the key word frequency v of the document image to be classified and the registered key word frequency of the i-th type; calculating the similarity score between the document image to be classified and the i-th type of document image in four cases: score = (score_t + score_k)/2 if score_t > 0 and score_k > 0, score = score_t if score_t > 0 and score_k = 0, score = score_k if score_t = 0 and score_k > 0, and score = 0 if score_t = 0 and score_k = 0; if the similarity score between the document image to be classified and the i-th type of document image is greater than max_score, updating max_score to score and recording the current type index max_id = i; and iteratively processing each type of document image until the type of the document image to be classified is obtained.

2. The word frequency based document image classification method according to claim 1, wherein: step 11 comprises the sub-steps of:

Step 111: obtaining the title of each example document image, forming the title set T, in which T_i is the title of the example document image of the i-th type;

3. The word frequency based document image classification method according to claim 2, wherein: each example document image is manually reviewed and a title of each example document image is obtained.

4. The word frequency based document image classification method according to claim 1, wherein: in step 13, when there is a newly added document image type, an example document image of the newly added document image type is acquired, an example document image set is added, and steps 11 and 12 are repeated.

5. The word frequency based document image classification method according to claim 1, wherein: step 14 comprises the sub-steps of:

Step 141: obtaining keys of each example document image to form a key set K;

6. The word frequency based document image classification method according to claim 5, wherein: each example document image is manually viewed and the keys of each example document image are obtained.

7. The word frequency based document image classification method according to claim 1, wherein: in step 21, the division is performed based on three features of a title: its character size is greater than or equal to that of the body text, it lies within the first 1 to 3 text lines of the document image, and it is centered.

8. A computer-readable storage medium, characterized by: the storage medium stores a document image classification program designed using a document image classification method based on a word frequency as claimed in any one of claims 1 to 7.

9. A document image classification device based on word frequency, characterized in that: the device comprises a memory and a processor, wherein the memory stores a document image classification program designed using the document image classification method based on word frequency according to any one of claims 1-7, and the processor is in communication connection with the memory, runs the document image classification program, and outputs a document image classification result.

10. The word frequency based document image classification device according to claim 9, wherein: the device further comprises a display, wherein the display is in communication connection with the processor, and the processor controls the display to display the document image classification result.