CN118366175B - Document image classification method based on word frequency - Google Patents
- Publication number
- CN118366175B (application number CN202410794321.7A)
- Authority
- CN
- China
- Prior art keywords
- document image
- score
- title
- type
- word frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a word-frequency-based document image classification method in the technical field of document image classification. The method comprises two stages: document image registration, and classification of document images of unknown type. Each document image type can be registered with only one sample (namely, an example document image); in particular, for a form-type document image, a single blank form suffices. The method is robust to errors (false alarms and missed detections) of the text line detection algorithm, whatever their cause, and to errors (misrecognitions and missed recognitions) of the text line recognition algorithm, whatever their cause. New document image types can be added.
Description
Technical Field
The invention relates to the technical field of document image classification, in particular to a document image classification method based on word frequency.
Background
As banks, securities firms, insurers, and other financial institutions move into the digital age, the number of electronic document images keeps growing. A bank account system generates many kinds of document images during operation, covering account opening, transaction vouchers, business files, and so on. These document images are not only important records of banking processes but also an important basis for information management and risk control by banks. Managing and processing them efficiently and accurately is therefore important for the normal operation of banking business and for risk prevention and control. As another example, the insurance industry generates many types of document images in its business processes. In the sales stage of insurance products, large numbers of document images such as insurance application forms are generated; they record in detail the client's personal information, insurance requirements, policy clauses, and so on, and are an important basis for risk assessment and underwriting by insurance companies. In the claim settlement stage, document images such as claim applications, medical certificates, and accident appraisals are generated; they are the key basis for claim review and claim payment. These electronic document images are numerous and rich in industry-related image and text information, so they need to be sorted, classified, and then processed by specialized recognition. Doing this manually is time-consuming and labor-intensive, and the cost rises sharply.
Therefore, there is a need to develop an automatic classification method for electronic document images to meet the increasing processing demands.
The existing automatic classification method of document images mainly comprises the following steps:
1) Rule-based classification method: the method identifies the different layout modules in a page through predefined rules and templates. For example, rules based on the position, font size, and format of text may be used to decide whether large-font text at the top of a page is a title and what type it indicates.
2) Machine learning-based classification method: this approach utilizes a large amount of training data to learn a model of document image classification. The automatic classification of the new document image is realized by extracting the characteristics (such as the positions, the sizes, the shapes and the like of the characters and the images) in the document image and training by using a machine learning algorithm (such as a support vector machine, random forest, deep learning and the like).
3) Classification method based on image processing: the document image is preprocessed and feature extracted by image processing techniques such as edge detection, connected domain analysis, projection, etc., and then classified based on these features.
4) Document image segmentation based on hierarchical or non-hierarchical methods: hierarchical methods, such as top-down or bottom-up, identify layout structures by progressively segmenting document images. The non-hierarchical method is more focused on overall analysis and processing, and is suitable for complex document images.
5) Hybrid method: this method combines several of the methods above so that they complement one another, improving the accuracy and efficiency of document image classification.
In actual production, the five classes of document image classification algorithms above face various challenges: they are not robust enough to cope with the diversity and complexity of document images. Specifically, the challenges include the following five aspects:
(1) Differences in shooting environment and equipment: differences in illumination, angle, and background can cause uneven brightness, heavy shadows, or blurred text, which degrade image quality and make accurate recognition difficult for the classification algorithm. Differences in device resolution, exposure time, and distortion lead to differences in image sharpness, color rendition, and geometry, which increases the difficulty of classification.
(2) Deformation of paper material: during scanning or shooting, bending, wrinkling, or folding of the paper may distort the text, which hinders text recognition and analysis.
(3) Text occlusion: text in a document image may be obscured by seals, stickers, or other items, so that some information is lost or hard to recognize, which is a challenge for classification algorithms.
(4) Complexity and diversity of document image content: the content of document images tends to be complex and variable; even within the same type of document image, the page layout, text arrangement, and image arrangement may differ greatly.
(5) The increasing number of new document image types: new document image types may appear over time. For a pre-trained classifier, this means that data must be re-collected and the model retrained to accommodate the new classification requirements. This not only increases effort and cost but may also affect the performance and stability of the classifier.
Disclosure of Invention
In order to solve the technical problem of document image classification, the invention provides a document image classification method based on word frequency. The following technical scheme is adopted:
A document image classification method based on word frequency comprises two stages: document image registration, and classification of a document image of unknown type;
the document image registration includes the following steps:
Step 10, obtaining one example document image for each type of document image to be classified, forming an example document image set $\{image_0, image_1, \ldots, image_{N-1}\}$, where $N$ is the number of types of document images to be classified and $image_i$ is the example document image of the $i$-th type;
Step 11, counting the title character set of all document images by using the example document images;
Step 12, counting the registered title word frequency of each type of document image;
Step 13, when a new document image type is added, updating the title character set of all document images and the registered title word frequency of each type of document image;
Step 14, counting the key character set of all document images by using the example document images;
Step 15, counting the registered key word frequency of each type of document image;
Step 16, when a new document image type is added, updating the key character set of all document images and the registered key word frequency of each type of document image;
the classification of a document image of unknown type includes the following steps:
Step 20, performing text line detection and text line recognition on the document image to be classified;
Step 21, dividing the text line detection results into two classes, title and body;
Step 22, obtaining the title word frequency of the document image to be classified;
Step 23, obtaining the key word frequency of the document image to be classified;
Step 24, calculating a score based on the cosine distance between the registered title word frequency and the title word frequency of the document image to be classified and the cosine distance between the registered key word frequency and the key word frequency of the document image to be classified, and obtaining the type label of the document image from the scores.
With this technical scheme, each document image type can be registered with only one sample (namely, an example document image); in particular, for a form-type document image, a single blank form suffices.
The method is robust to errors (false alarms and missed detections) of the text line detection algorithm, whatever their cause;
the method is robust to errors (misrecognitions and missed recognitions) of the text line recognition algorithm, whatever their cause;
new document image types can be added.
Optionally, step 11 comprises the sub-steps of:
Step 111: obtaining the title of each example document image to form the title set $T = \{t_0, t_1, \ldots, t_{N-1}\}$, where $N$ is the number of types of document images to be classified and $t_i$ is the title of the $i$-th example document image;
Step 112: forming the title character set $A = \{a_0, a_1, \ldots, a_{P-1}\}$ from the characters of every title in the title set T, where P is the number of characters in the title character set.
In step 111, each example document image is viewed manually to obtain its title.
It should be noted that step 111 must be performed manually and must not rely on OCR, so that the correctness of the title character set A is ensured.
Optionally, in step 13, when a new document image type is added, an example document image of the new type is obtained and added to the example document image set, steps 11 and 12 are repeated, and the title character set and the registered title word frequencies are updated.
Optionally, step 14 comprises the sub-steps of:
Step 141: manually viewing each example document image and obtaining its keys to form the key set K;
Step 142: forming the key character set $B = \{b_0, b_1, \ldots, b_{Q-1}\}$ from the characters of every key in the key set K, where Q is the number of characters in the key character set.
It should be noted that step 141 must be performed manually and must not rely on OCR, so that the correctness of the key character set B is ensured.
Optionally, in step 16, when a new document image type is added, an example document image of the new type is obtained and added to the example document image set, steps 14 and 15 are repeated, and the key character set and the registered key word frequencies are updated.
Optionally, the division in step 21 is based on three features of a title: its character size is greater than or equal to that of the body text, it lies within the first 1 to 3 lines of the document image, and it is centered.
A computer-readable storage medium storing a document image classification program designed using a document image classification method based on word frequency.
A document image classification device based on word frequency comprises a memory and a processor, wherein the memory stores a document image classification program designed by a document image classification method based on word frequency, and the processor is in communication connection with the memory, runs the document image classification program and outputs a document image classification result.
Optionally, the system further comprises a display, wherein the display is in communication connection with the processor, and the processor controls the display to display the document image classification result.
In summary, the invention has at least the following beneficial technical effects:
The invention provides a document image classification method based on word frequency in which each document image type can be registered with only one sample (namely, an example document image); in particular, for a form-type document image, a single blank form suffices;
the method is robust to errors (false alarms and missed detections) of the text line detection algorithm, whatever their cause;
the method is robust to errors (misrecognitions and missed recognitions) of the text line recognition algorithm, whatever their cause;
new document image types can be added.
Drawings
FIG. 1 is a flow chart of a document image classification method based on word frequency according to the present invention;
FIG. 2 is an example document image of an identity card in a specific embodiment;
FIG. 3 is an example document image of a post qualification certificate in a specific embodiment;
FIG. 4 is an example document image of a proof of income in a specific embodiment;
FIG. 5 is a post qualification certificate to be classified in a specific embodiment;
FIG. 6 is the text line detection result for the post qualification certificate to be classified in a specific embodiment.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The embodiment of the invention discloses a document image classification method based on word frequency.
Referring to FIG. 1 to FIG. 6, a document image classification method based on word frequency comprises two stages: document image registration, and classification of a document image of unknown type;
the document image registration includes the following steps:
Step 10, obtaining one example document image for each type of document image to be classified, forming an example document image set $\{image_0, image_1, \ldots, image_{N-1}\}$, where $N$ is the number of types of document images to be classified and $image_i$ is the example document image of the $i$-th type;
Step 11, counting the title character set of all document images by using the example document images;
Step 12, counting the registered title word frequency of each type of document image;
Step 13, when a new document image type is added, updating the title character set of all document images and the registered title word frequency of each type of document image;
Step 14, counting the key character set of all document images by using the example document images;
Step 15, counting the registered key word frequency of each type of document image;
Step 16, when a new document image type is added, updating the key character set of all document images and the registered key word frequency of each type of document image;
the classification of a document image of unknown type includes the following steps:
Step 20: text line detection is performed on the unknown-type document image to be classified, image_test, and each detected text line is recognized. No restriction is placed on which text line detection algorithm or which text line recognition algorithm is used.
Step 21: the text line detection results are divided into two classes, title and body. The division is based on three features of a title: its character size is greater than or equal to that of the body text, it lies within the first 1 to 3 lines of the document image, and it is centered.
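The three title features of step 21 can be sketched as a simple filter over detected text lines. This is only an illustration: the `TextLine` fields, the `center_tol` tolerance, and the use of the median line height as a stand-in for the body character size are assumptions, not part of the method as described.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TextLine:
    text: str   # recognized content of the line
    x0: float   # left edge of the bounding box
    y0: float   # top edge of the bounding box
    x1: float   # right edge of the bounding box
    y1: float   # bottom edge of the bounding box


def split_title_body(lines, page_width, max_title_rows=3, center_tol=0.1):
    """Return (title_lines, body_lines) using the three title features."""
    if not lines:
        return [], []
    ordered = sorted(lines, key=lambda ln: ln.y0)          # top to bottom
    # Assumption: the median line height approximates the body character size.
    body_size = median(ln.y1 - ln.y0 for ln in ordered)
    titles, body = [], []
    for row, ln in enumerate(ordered):
        big_enough = (ln.y1 - ln.y0) >= body_size           # feature 1: size
        near_top = row < max_title_rows                     # feature 2: first 1-3 lines
        mid = (ln.x0 + ln.x1) / 2
        centered = abs(mid - page_width / 2) <= center_tol * page_width  # feature 3
        (titles if big_enough and near_top and centered else body).append(ln)
    return titles, body
```

A line must satisfy all three features to count as a title, so a left-aligned field near the top of the page stays in the body even if its characters are large.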
Step 22: the title word frequency u of image_test is obtained.
u = P-dimensional all-zero vector
if image_test has a title:
    # steps 20 and 21 have already produced the title content, text
    for c in text:    # traverse each character c of text
        for p in 0, 1, 2, …, P-1:
            if c == A[p]:
                u[p] = u[p] + 1
                break
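The counting loop of step 22 can be written as a small Python function. This is a minimal sketch: the name `char_frequency` is hypothetical, and the dictionary lookup that replaces the inner loop over p is a design choice of this sketch, not something the method prescribes.

```python
def char_frequency(text, charset):
    """Count how often each character of `charset` occurs in `text`.

    `charset` is an ordered sequence of distinct characters (for example the
    title character set A). Characters of `text` that are not in `charset`
    are ignored, which is what makes stray OCR characters harmless.
    """
    index = {ch: p for p, ch in enumerate(charset)}   # character -> position p
    freq = [0] * len(charset)
    for c in text:
        p = index.get(c)
        if p is not None:
            freq[p] += 1
    return freq
```

For example, `char_frequency("abcax", "abc")` counts two occurrences of `a`, one each of `b` and `c`, and ignores `x`.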
Step 23: the key word frequency v of image_test is obtained.
v = Q-dimensional all-zero vector
if image_test has keys:
    # merge the recognition results of the body text lines obtained in steps 20 and 21; the merged result is text
    for c in text:    # traverse each character c of text
        for q in 0, 1, 2, …, Q-1:
            if c == B[q]:
                v[q] = v[q] + 1
                break
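Step 23 differs from step 22 mainly in that the body text lines are merged before counting. A possible one-pass version using `collections.Counter` is sketched below; the function name and the English sample strings are illustrative assumptions.

```python
from collections import Counter


def key_frequency(body_lines, key_charset):
    """Merge the recognized body lines and count the key characters in them."""
    text = "".join(body_lines)        # merged recognition result
    counts = Counter(text)            # one pass over the merged text
    # Counter returns 0 for characters that never occurred.
    return [counts[ch] for ch in key_charset]


# Hypothetical key character set built from the keys "name" and "gender".
B = ["a", "d", "e", "g", "m", "n", "r"]
v = key_frequency(["name: Li", "gender: F"], B)
```

Field values such as `Li` or `F` contribute to v only if they happen to contain characters of B, so the vector is dominated by the keys printed on the form.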
Step 24: a score is calculated based on the cosine distance between the registered title word frequency and the title word frequency of the document image to be classified and the cosine distance between the registered key word frequency and the key word frequency of the document image to be classified, and the type label of the document image is obtained from the scores.
Specifically, the following logic is used for classification. Set max_score = 0 and max_id = -1, then traverse the N types of document images one by one. For the i-th type: if both the document image to be classified and the i-th type document image have titles, the title similarity score_t is set to the cosine distance between the title word frequency u of the document image to be classified and the registered title word frequency $u_i$ of the i-th type; otherwise score_t is set to 0. If both the document image to be classified and the i-th type document image have keys, the key similarity score_k is set to the cosine distance between the key word frequency v of the document image to be classified and the registered key word frequency $v_i$ of the i-th type; otherwise score_k is set to 0. Here the cosine distance is the cosine of the angle between the two vectors, so a larger value means greater similarity. The similarity score between the document image to be classified and the i-th type document image is then computed in four cases: score = (score_t + score_k)/2 if score_t > 0 and score_k > 0; score = score_t if score_t > 0 and score_k == 0; score = score_k if score_t == 0 and score_k > 0; score = 0 if score_t == 0 and score_k == 0. If score is greater than max_score, max_score is updated to score and the current type index is recorded as max_id = i. After all types have been processed, max_id gives the type of the document image to be classified.
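Written as formulas, the scoring logic reads as follows. The symbols $s_t$ and $s_k$ (for score_t and score_k) are introduced here only for illustration, and the formula assumes that "cosine distance" denotes the usual cosine of the angle between two vectors, which is consistent with the scores printed in the embodiment.

```latex
$$
\cos(x, y) = \frac{\sum_{k} x_k\, y_k}{\sqrt{\sum_{k} x_k^2}\;\sqrt{\sum_{k} y_k^2}}
$$

$$
\mathrm{score}_i =
\begin{cases}
(s_t + s_k)/2, & s_t > 0,\ s_k > 0 \\
s_t, & s_t > 0,\ s_k = 0 \\
s_k, & s_t = 0,\ s_k > 0 \\
0, & s_t = 0,\ s_k = 0
\end{cases}
\qquad
\mathrm{max\_id} = \arg\max_{i} \mathrm{score}_i
$$
```

Because word frequency vectors have no negative entries, the cosine is never negative, so the four cases are exhaustive.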
max_score = 0
max_id = -1
for i in 0, 1, 2, 3, …, N-1:
    score_t = 0
    if image_test has a title and the i-th type document image has a title:
        score_t = cosine distance between u and u_i    # u_i: registered title word frequency of the i-th type
    score_k = 0
    if image_test has keys and the i-th type document image has keys:
        score_k = cosine distance between v and v_i    # v_i: registered key word frequency of the i-th type
    if score_t > 0 and score_k > 0:
        score = (score_t + score_k)/2
    if score_t > 0 and score_k == 0:
        score = score_t
    if score_t == 0 and score_k > 0:
        score = score_k
    if score_t == 0 and score_k == 0:
        score = 0
    if score > max_score:
        max_score = score
        max_id = i
image_test is a document image of type max_id.
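The pseudocode above translates almost line by line into Python. The sketch below is one possible rendering, not a reference implementation: the function names, the use of `None` to mark a missing title or missing keys, and the zero return for all-zero vectors are assumptions.

```python
import math


def cosine(x, y):
    """Cosine of the angle between x and y; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


def classify(u, v, reg_titles, reg_keys):
    """u, v: title/key vectors of the test image, or None when absent.

    reg_titles[i], reg_keys[i]: registered vectors of type i, or None.
    Returns (max_id, max_score); max_id stays -1 if every score is 0.
    """
    max_score, max_id = 0.0, -1
    for i, (ut, vk) in enumerate(zip(reg_titles, reg_keys)):
        score_t = cosine(u, ut) if u is not None and ut is not None else 0.0
        score_k = cosine(v, vk) if v is not None and vk is not None else 0.0
        if score_t > 0 and score_k > 0:
            score = (score_t + score_k) / 2
        else:
            score = score_t + score_k   # at most one term is non-zero here
        if score > max_score:
            max_score, max_id = score, i
    return max_id, max_score
```

Note that max_id remains -1 when no registered type shares any character with the test image; a caller could treat that as "unknown type", although the text above does not discuss this case.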
Each document image type can be registered with only one sample (namely, an example document image); in particular, for a form-type document image, a single blank form suffices.
The method is robust to errors (false alarms and missed detections) of the text line detection algorithm, whatever their cause;
the method is robust to errors (misrecognitions and missed recognitions) of the text line recognition algorithm, whatever their cause;
new document image types can be added.
Step 11 comprises the following sub-steps:
Step 111: obtaining the title of each example document image to form the title set $T = \{t_0, t_1, \ldots, t_{N-1}\}$, where $N$ is the number of types of document images to be classified and $t_i$ is the title of the $i$-th example document image;
Step 112: forming the title character set $A = \{a_0, a_1, \ldots, a_{P-1}\}$ from the characters of every title in the title set T, where P is the number of characters in the title character set.
for i in 0, 1, 2, …, N-1:
    for c in t_i:    # traverse each character c of the title t_i of the i-th type document image
        if c does not belong to the title character set A:
            add c to the title character set A
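Step 12: the registered title word frequency of each type of document image is counted.
for i in 0, 1, 2, …, N-1:
    u_i = P-dimensional all-zero vector    # registered title word frequency of the i-th type
    if the i-th example document image has a title:
        for c in t_i:    # traverse each character c of the title t_i
            for p in 0, 1, 2, …, P-1:
                if c == A[p]:
                    u_i[p] = u_i[p] + 1
                    break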
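Steps 11 and 12 together can be sketched as follows. The function names and the convention of passing `None` for a type without a title are assumptions of this sketch; the characters are kept in first-seen order, but any fixed order works as long as registration and classification use the same one.

```python
def build_charset(strings):
    """Ordered list of distinct characters, in first-seen order."""
    charset, seen = [], set()
    for s in strings:
        for c in s:
            if c not in seen:
                seen.add(c)
                charset.append(c)
    return charset


def register_titles(titles):
    """titles[i] is the manually read title of type i, or None if it has none.

    Returns (A, vectors) where vectors[i] is the registered title word
    frequency of type i over the title character set A.
    """
    A = build_charset(t for t in titles if t)
    pos = {c: p for p, c in enumerate(A)}
    vectors = []
    for t in titles:
        vec = [0] * len(A)
        for c in t or "":          # a type without a title keeps the zero vector
            vec[pos[c]] += 1
        vectors.append(vec)
    return A, vectors
```

Because the titles are read manually, every character of every title is guaranteed to be in A, so the lookup `pos[c]` cannot fail.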
In step 111, each example document image is viewed manually to obtain its title.
It should be noted that step 111 must be performed manually and must not rely on OCR, so that the correctness of the title character set A is ensured.
In step 13, when a new document image type is added, an example document image of the new type is obtained and added to the example document image set, and steps 11 and 12 are repeated.
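Adding a type therefore needs no retraining, only a recount. The sketch below illustrates this with a hypothetical `TitleRegistry` class that simply repeats steps 11 and 12 whenever a type is added; the class and its method names are not from the method description.

```python
class TitleRegistry:
    """Keeps the title character set and registered vectors up to date."""

    def __init__(self):
        self.titles = []    # one entry per type; None when a type has no title
        self.charset = []   # title character set A, first-seen order
        self.vectors = []   # registered title word frequency per type

    def add_type(self, title):
        """Register a new type and return its index."""
        self.titles.append(title)
        self._rebuild()
        return len(self.titles) - 1

    def _rebuild(self):
        # Step 11: recompute the character set over all titles.
        charset = []
        for t in self.titles:
            for c in t or "":
                if c not in charset:
                    charset.append(c)
        self.charset = charset
        # Step 12: recount every type, since the vector dimension may have grown.
        self.vectors = [[(t or "").count(c) for c in charset] for t in self.titles]
```

When the new title introduces new characters, the dimension P grows and the vectors of the existing types gain extra zero entries, which leaves their cosine scores against each other unchanged.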
Step 14 comprises the sub-steps of:
Step 141: each example document image is viewed manually and its keys are obtained, forming the key set $K = \{K_0, K_1, \ldots, K_{N-1}\}$, where $N$ is the number of types of document images to be classified, $K_i = \{k_{i,0}, k_{i,1}, \ldots, k_{i,M_i-1}\}$ is the set of keys of the $i$-th example document image, and $M_i$ is the number of keys of the $i$-th example document image.
It should be noted that step 141 must be performed manually and must not rely on OCR, so that the correctness of the key character set B is ensured.
Step 142: the characters in the keys of each type of document image to be classified are added to the key character set $B = \{b_0, b_1, \ldots, b_{Q-1}\}$, where Q is the number of characters in the key character set.
for i in 0, 1, 2, …, N-1:
    for j in 0, 1, 2, …, M_i-1:    # M_i is the number of keys of the i-th type
        for c in k_{i,j}:    # traverse each character c of the j-th key k_{i,j} of the i-th type
            if c does not belong to the key character set B:
                add c to the key character set B
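Step 15: the registered key word frequency of each type of document image is counted.
for i in 0, 1, 2, …, N-1:
    v_i = Q-dimensional all-zero vector    # registered key word frequency of the i-th type
    if the i-th example document image has keys:
        for j in 0, 1, 2, …, M_i-1:    # M_i is the number of keys of the i-th type
            for c in k_{i,j}:    # traverse each character c of the j-th key of the i-th type
                for q in 0, 1, 2, …, Q-1:
                    if c == B[q]:
                        v_i[q] = v_i[q] + 1
                        break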
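Key registration differs from title registration in that each type contributes several keys, and a character that occurs in more than one key of the same type is counted more than once. A minimal sketch, with a hypothetical function name, English sample keys, and a sorted character order chosen only for reproducibility:

```python
def register_keys(key_sets):
    """key_sets[i] is the list of manually read keys of type i (may be empty).

    Returns (B, vectors) where vectors[i] is the registered key word
    frequency of type i over the key character set B.
    """
    B = sorted({c for keys in key_sets for k in keys for c in k})
    pos = {c: q for q, c in enumerate(B)}
    vectors = []
    for keys in key_sets:
        vec = [0] * len(B)
        for k in keys:
            for c in k:
                vec[pos[c]] += 1   # repeated characters accumulate
        vectors.append(vec)
    return B, vectors


B, vectors = register_keys([["name", "age"], ["name", "date"], []])
```

Here the first type gets a count of 2 for both `a` and `e`, because those characters occur in both of its keys, and the type without keys keeps an all-zero vector.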
The division in step 21 is based on three features of a title: its character size is greater than or equal to that of the body text, it lies within the first 1 to 3 lines of the document image, and it is centered.
A computer-readable storage medium storing a document image classification program designed using a document image classification method based on word frequency.
A document image classification device based on word frequency comprises a memory and a processor, wherein the memory stores a document image classification program designed by a document image classification method based on word frequency, and the processor is in communication connection with the memory, runs the document image classification program and outputs a document image classification result.
The system also comprises a display, wherein the display is in communication connection with the processor, and the processor controls the display to display the document image classification result.
The following description adopts specific embodiments to explain the implementation principle of a document image classification method based on word frequency:
The identity card shown in FIG. 2, the post qualification certificate shown in FIG. 3, and the proof of income shown in FIG. 4 are the types to be classified.
Step 10: one example document image is obtained for each type of document image to be classified, forming the example document image set $\{image_0, image_1, image_2\}$, with $N = 3$. $image_0$ is the identity card shown in FIG. 2, $image_1$ is the post qualification certificate shown in FIG. 3, and $image_2$ is the proof of income shown in FIG. 4.
Step 11: the title character set of all document images is counted.
Step 111: each example document image is viewed manually and its title is obtained, forming the title set $T = \{t_0, t_1, \ldots, t_{N-1}\}$, where $N$ is the number of types of document images to be classified and $t_i$ is the title of the $i$-th example document image. For the post qualification certificate, "post qualification certificate" (岗位资格证) is added to the title set T, and for the proof of income, "proof of income" (收入证明) is added to the title set T, so that T = {post qualification certificate, proof of income}. The identity card has no title.
Step 112: the resulting title character set is A = {'位', '入', '岗', '收', '明', '格', '证', '资'}, P = 8.
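Step 12: the registered title word frequency of each type of document image is counted. In the order of the characters of A, the result is $u_0$ = [0, 0, 0, 0, 0, 0, 0, 0] for the identity card (which has no title), $u_1$ = [1, 0, 1, 0, 0, 1, 1, 1] for the post qualification certificate, and $u_2$ = [0, 1, 0, 1, 1, 0, 1, 0] for the proof of income.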
Step 14: the key character set of all document images is counted.
Step 141: the example document images shown in FIG. 2, FIG. 3, and FIG. 4 are viewed manually and their keys are obtained, forming the key set K. For the second-generation identity card, its keys (name, gender, ethnicity, birthday, address, citizen identity number) are added to the key set K; for the post qualification certificate, its keys (name, gender, department, position, business) are added to the key set K. This gives the key set K = {name, gender, ethnicity, birthday, address, citizen identity number, name, gender, department, position, business}.
Step 142: the resulting key character set is B = {'业', '份', '位', '住', '公', '别', '务', '号', '名', '址', '姓', '性', '族', '日', '民', '生', '码', '职', '身', '部', '门'}, Q = 21.
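Step 15: the registered key word frequency of each type of document image is counted. In the order of the characters of B, the result is $v_0$ = [0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 0] for the identity card, $v_1$ = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1] for the post qualification certificate, and $v_2$ = the 21-dimensional all-zero vector for the proof of income, which has no keys.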
The classification of a document image of unknown type comprises five steps: text line detection and recognition, dividing the text lines into title and body, obtaining the title word frequency, obtaining the key word frequency, and obtaining the type label.
Step 20: text line detection and text line recognition are performed on image_test, shown in FIG. 5; the text line detection result is shown in FIG. 6.
The recognition result of each text line is: post card, name Zhang Taili, sex, female, department research, advanced software engineer, business security.
It can be seen that both the text line detection result and the text line recognition result contain errors.
Step 21: the text line detection results are divided into two classes, title and body. The result is: "post card" is the title and the remaining lines are body text.
Step 22: the title word frequency of image_test is obtained: u = [1, 0, 1, 0, 0, 0, 1, 0].
Step 23: the key word frequency of image_test is obtained: v = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1].
Step 24: the type of image_test is determined. The output of the program is:
i= 0 score_t= 0.000 score_k= 0.333 score= 0.333 max_score= 0.333 max_id= 0;
i= 1 score_t= 0.775 score_k= 0.894 score= 0.835 max_score= 0.835 max_id= 1;
i= 2 score_t= 0.289 score_k= 0.000 score= 0.289 max_score= 0.835 max_id= 1;
image_test is judged to be a document image of type max_id = 1, the post qualification certificate, which is correct.
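The scores above can be reproduced end to end with a few lines of Python. The Chinese strings below are an assumption: the original titles, keys, and OCR output are not given in the text, so they are back-translations chosen because they yield P = 8, Q = 21, and the printed scores. The body text in particular is only a stand-in that contains the keys for name, gender, department, and business.

```python
import math

# Assumed back-translations of the embodiment's strings (not given in the text).
TITLES = [None, "岗位资格证", "收入证明"]      # the identity card has no title
KEYS = [
    ["姓名", "性别", "民族", "生日", "住址", "公民身份号码"],  # identity card
    ["姓名", "性别", "部门", "职位", "业务"],                  # post qualification certificate
    [],                                                        # proof of income
]
# Assumed stand-in for the OCR output of image_test (title with missing characters).
TEST_TITLE = "岗位证"
TEST_BODY = "姓名性别女部门研发高级软件工程师业务安全"


def charset(strings):
    """Distinct characters of all strings, sorted by code point."""
    return sorted({c for s in strings for c in s})


def freq(text, chars):
    """Word frequency vector of `text` over the ordered character list."""
    return [text.count(c) for c in chars]


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return dot / norm if norm else 0.0


A = charset(t for t in TITLES if t)
B = charset(k for keys in KEYS for k in keys)
u, v = freq(TEST_TITLE, A), freq(TEST_BODY, B)

rows = []
for title, keys in zip(TITLES, KEYS):
    s_t = cosine(u, freq(title, A)) if title else 0.0
    s_k = cosine(v, freq("".join(keys), B)) if keys else 0.0
    score = (s_t + s_k) / 2 if s_t > 0 and s_k > 0 else s_t + s_k
    rows.append((round(s_t, 3), round(s_k, 3), round(score, 3)))
max_id = max(range(len(rows)), key=lambda i: rows[i][2])
```

Under these assumptions the misrecognized title still scores 0.775 against the post qualification certificate, which illustrates how the method tolerates characters dropped by OCR.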
The above embodiments are not intended to limit the scope of the present invention, and therefore: all equivalent changes in structure, shape and principle of the invention should be covered in the scope of protection of the invention.
Claims (10)
1. A method for classifying document images based on word frequency is characterized by comprising two steps of document image registration and unknown type document image classification to be classified;
the document image registration includes the steps of:
Step 10, obtaining one example document image for each type of document image to form an example document image set $\{image_0, image_1, \ldots, image_{N-1}\}$, where $image_i$ is the example document image of the $i$-th type;
Step 11, counting the title character set of all document images by using the example document images;
Step 12, counting the registered title word frequency of each type of document image;
step 13, when the newly added document image type exists, updating the title character set of all the document images and the registration title word frequency of each type of document images;
step 14, counting key character sets of all document images by using the example document images;
Step 15, counting the registration key word frequency of each type of document image;
Step 16, when the newly added document image type exists, updating the key character set of all the document images and the registration key word frequency of each type of document images;
The classification of the unknown type of document image to be classified comprises the following steps:
Step 20, performing text line detection and text line recognition on the document image to be classified;
Step 21, dividing the text line detection result into two types of title and text;
Step 22, obtaining the title word frequency of the document image to be classified;
Step 23, obtaining the key word frequency of the document image to be classified;
Step 24, calculating a score based on the cosine distance between the registered keyword frequency and the keyword frequency of the document image to be classified and the cosine distance between the registered keyword frequency and the keyword frequency of the document image to be classified, and obtaining a type label result of the document image classification based on score analysis;
Step 24 uses the following logic for classification: setting max_score = 0 and max_id = -1; traversing the N types of document images one by one; for the i-th type of document image, if both the document image to be classified and the i-th type of document image have titles, assigning the title similarity score_t as the cosine distance between the title word frequency u of the document image to be classified and the registered title word frequency of the i-th type, and otherwise assigning score_t as 0; if both the document image to be classified and the i-th type of document image have keys, assigning the key similarity score_k as the cosine distance between the key word frequency v of the document image to be classified and the registered key word frequency of the i-th type, and otherwise assigning score_k as 0; calculating the similarity score of the document image to be classified and the i-th type of document image in four cases: score = (score_t + score_k) / 2 if score_t > 0 and score_k > 0; score = score_t if score_t > 0 and score_k = 0; score = score_k if score_t = 0 and score_k > 0; score = 0 if score_t = 0 and score_k = 0; if the similarity score of the document image to be classified and the i-th type of document image is larger than max_score, updating max_score to score and recording the current type index max_id = i; and iteratively processing each type of document image until the type of the document image to be classified is obtained.
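The scoring loop of step 24 can be sketched as follows. This is only an illustrative sketch, not the patented implementation. The names `cosine` and `classify` are hypothetical, a missing title or key is represented by `None`, and the "cosine distance" of the claim is assumed to be computed as cosine similarity, since the claim selects the type with the largest score.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(
    u: Optional[Vector],
    v: Optional[Vector],
    reg_titles: List[Optional[Vector]],
    reg_keys: List[Optional[Vector]],
) -> Tuple[int, float]:
    """Return (max_id, max_score) following the four-case logic of step 24.

    u, v: title and key word frequencies of the image to classify (None if absent).
    reg_titles[i], reg_keys[i]: registered frequencies of type i (None if absent).
    """
    max_score, max_id = 0.0, -1
    for i, (reg_t, reg_k) in enumerate(zip(reg_titles, reg_keys)):
        # A similarity is only computed when both sides have a title (or key).
        score_t = cosine(u, reg_t) if u is not None and reg_t is not None else 0.0
        score_k = cosine(v, reg_k) if v is not None and reg_k is not None else 0.0

        if score_t > 0 and score_k > 0:
            score = (score_t + score_k) / 2
        elif score_t > 0:
            score = score_t
        elif score_k > 0:
            score = score_k
        else:
            score = 0.0

        if score > max_score:
            max_score, max_id = score, i
    return max_id, max_score
```

In this sketch a result of `max_id == -1` means no registered type scored above zero. How such an image should be handled is not stated in the claim.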
2. The word frequency based document image classification method according to claim 1, wherein: step 11 comprises the sub-steps of:
Step 111: obtaining the title of each example document image to compose the title set T, whose i-th element is the title of the i-th example document image;
Step 112: forming a title character set from the characters in each title in the title set T, where P is the number of characters in the title character set.
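Steps 111 and 112 can be illustrated with a short sketch. The function names `build_charset` and `char_frequency` are hypothetical. Skipping whitespace and using raw character counts as the word frequency are assumptions made for illustration, since this claim does not specify how the frequency is computed.

```python
from collections import Counter
from typing import Iterable, List


def build_charset(titles: Iterable[str]) -> List[str]:
    """Collect the distinct characters of all titles in first-seen order.

    Whitespace is skipped (an assumption, not stated in the claim).
    """
    seen = {}
    for title in titles:
        for ch in title:
            if not ch.isspace():
                seen.setdefault(ch, len(seen))
    return list(seen)


def char_frequency(text: str, charset: List[str]) -> List[int]:
    """Count how often each character of the character set occurs in text."""
    counts = Counter(text)
    return [counts[ch] for ch in charset]


# Example: a title set T with two example titles.
T = ["INVOICE", "CONTRACT"]
charset = build_charset(T)   # P = len(charset)
freq = char_frequency("CONTRACT", charset)
```

The same two helpers would apply unchanged to a key set when building a key character set.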
3. The word frequency based document image classification method according to claim 2, wherein: the title of each example document image is obtained by manually reviewing each example document image.
4. The word frequency based document image classification method according to claim 1, wherein: in step 13, when there is a newly added document image type, an example document image of the newly added document image type is acquired and added to the example document image set, and steps 11 and 12 are repeated.
5. The word frequency based document image classification method according to claim 1, wherein: step 14 comprises the sub-steps of:
Step 141: obtaining keys of each example document image to form a key set K;
Step 142: forming a key character set from the characters in each key in the key set K, where Q is the number of characters in the key character set.
6. The word frequency based document image classification method according to claim 5, wherein: the key of each example document image is obtained by manually reviewing each example document image.
7. The word frequency based document image classification method according to claim 1, wherein: in step 21, classification is performed based on three features of the title, which are: the character size of the title is larger than or equal to the character size of the text; the title is positioned in the first 1-3 rows of the document image; and the title is centered.
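The three title features of claim 7 could be checked along the following lines. This is a hedged sketch: the `TextLine` fields, the use of line height as a proxy for character size, the centering tolerance `center_tol`, and the choice to require all three features at once are assumptions that the claim does not fix.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    text: str
    x0: float      # left edge of the detected line box
    x1: float      # right edge of the detected line box
    height: float  # box height, used here as a proxy for character size
    row: int       # 1-based row index, counted from the top of the page


def is_title(
    line: TextLine,
    body_height: float,
    page_width: float,
    max_row: int = 3,
    center_tol: float = 0.1,
) -> bool:
    """Return True if the line shows all three title features."""
    # Feature 1: character size is at least that of the body text.
    large_enough = line.height >= body_height
    # Feature 2: the line lies within the first 1-3 rows.
    near_top = line.row <= max_row
    # Feature 3: the line is horizontally centered, within a tolerance.
    line_center = (line.x0 + line.x1) / 2
    centered = abs(line_center - page_width / 2) <= center_tol * page_width
    return large_enough and near_top and centered
```

Lines for which `is_title` returns False would be treated as text in step 21.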
8. A computer-readable storage medium, characterized in that: the storage medium stores a document image classification program designed using the word frequency based document image classification method according to any one of claims 1 to 7.
9. A document image classification device based on word frequency, characterized in that: the device comprises a memory and a processor, wherein the memory stores a document image classification program designed using the word frequency based document image classification method according to any one of claims 1 to 7, and the processor is in communication connection with the memory, runs the document image classification program, and outputs a document image classification result.
10. The word frequency based document image classification device according to claim 9, wherein: the device further comprises a display, wherein the display is in communication connection with the processor, and the processor controls the display to display the document image classification result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410794321.7A CN118366175B (en) | 2024-06-19 | 2024-06-19 | Document image classification method based on word frequency |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118366175A CN118366175A (en) | 2024-07-19 |
CN118366175B CN118366175B (en) | 2024-09-24 |
Family
ID=91882342
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410794321.7A Active CN118366175B (en) | 2024-06-19 | 2024-06-19 | Document image classification method based on word frequency |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118366175B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114138979A (en) * | 2021-10-29 | 2022-03-04 | 中南民族大学 | Cultural relic safety knowledge map creation method based on word expansion unsupervised text classification |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101249183B1 (en) * | 2006-08-22 | 2013-04-03 | 에스케이커뮤니케이션즈 주식회사 | Method for extracting subject and sorting document of searching engine, computer readable record medium on which program for executing method is recorded |
CN110298338B (en) * | 2019-06-20 | 2021-08-24 | 北京易道博识科技有限公司 | Document image classification method and device |
CN110750995B (en) * | 2019-10-29 | 2023-06-02 | 上海德拓信息技术股份有限公司 | File management method based on custom map |
CN111639181A (en) * | 2020-04-30 | 2020-09-08 | 深圳壹账通智能科技有限公司 | Paper classification method and device based on classification model, electronic equipment and medium |
CN115422125B (en) * | 2022-09-29 | 2023-05-19 | 浙江星汉信息技术股份有限公司 | Electronic document automatic archiving method and system based on intelligent algorithm |
CN118113837A (en) * | 2024-02-28 | 2024-05-31 | 中海石油(中国)有限公司海南分公司 | Dialogue model for searching intellectual property document by semantic matching of sentence vectors |
Also Published As
Publication number | Publication date |
---|---|
CN118366175A (en) | 2024-07-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210124919A1 (en) | System and Methods for Authentication of Documents | |
Van Beusekom et al. | Text-line examination for document forgery detection | |
US20160092730A1 (en) | Content-based document image classification | |
Ahmed et al. | Forgery detection based on intrinsic document contents | |
CN113255642B (en) | Medical bill information integration method for human injury claim settlement | |
WO2022126978A1 (en) | Invoice information extraction method and apparatus, computer device and storage medium | |
CN111598099B (en) | Image text recognition performance testing method, device, testing equipment and medium | |
US20230147685A1 (en) | Generalized anomaly detection | |
Kumar et al. | Forged character detection datasets: passports, driving licences and visa stickers | |
CN114511866A (en) | Data auditing method, device, system, processor and machine-readable storage medium | |
CN117523586A (en) | Check seal verification method and device, electronic equipment and medium | |
CN118366175B (en) | Document image classification method based on word frequency | |
Joren et al. | Ocr graph features for manipulation detection in documents | |
US20240144204A1 (en) | Systems and methods for check fraud detection | |
US20230069960A1 (en) | Generalized anomaly detection | |
Tornés et al. | Receipt Dataset for Document Forgery Detection | |
Van Beusekom et al. | Document signature using intrinsic features for counterfeit detection | |
CN114359918A (en) | Bill information extraction method and device and computer equipment | |
Hamido et al. | The Use of Background Features, Template Synthesis and Deep Neural Networks in Document Forgery Detection | |
BJ et al. | Identification of Seal, Signature and Fingerprint from Malayalam Agreement Documents using Connected Component Analysis | |
Markham et al. | Open-Set: ID Card Presentation Attack Detection Using Neural Style Transfer | |
Bogahawatte et al. | Online Digital Cheque Clearance and Verification System using Block Chain | |
EP4361971A1 (en) | Training images generation for fraudulent document detection | |
CN114820211B (en) | Method, device, computer equipment and storage medium for checking and verifying quality of claim data | |
US20230316795A1 (en) | Auto-Document Detection & Capture |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||