CN116758561A - Document image classification method and device based on multi-mode structured information fusion - Google Patents

Document image classification method and device based on multi-mode structured information fusion Download PDF

Info

Publication number
CN116758561A
Authority
CN
China
Prior art keywords
text
key
document image
key information
image classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311033101.4A
Other languages
Chinese (zh)
Inventor
申意萍
陈友斌
张志坚
徐一波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Micropattern Technology Development Co ltd
Original Assignee
Hubei Micropattern Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Micropattern Technology Development Co ltd filed Critical Hubei Micropattern Technology Development Co ltd
Priority to CN202311033101.4A
Publication of CN116758561A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/42 Document-oriented image-based pattern recognition based on the type of document

Abstract

The invention discloses a document image classification method and device based on multi-mode structured information fusion, relating to the technical field of document image classification. Layout analysis is first performed on a document image to locate its key regions, such as titles, body text, figures, figure titles, tables, table titles and seals; text key information is then extracted from each region according to its type, word segmentation and word-vector extraction are applied to that key information, and finally all word vectors are fused for classification. The invention enables fast and accurate classification and filing of large volumes of electronic material, and effectively avoids classification difficulty and misclassification caused by varying capture environments. It also copes with classification problems caused by deformation of paper material, text partially occluded by seals or other objects, incomplete documents, documents with or without titles, and complex, variable material content.

Description

Document image classification method and device based on multi-mode structured information fusion
Technical Field
The invention relates to the technical field of document image classification, in particular to a document image classification method and device based on multi-mode structured information fusion.
Background
As digital transformation progresses across industries, the number of electronic document images keeps growing. In the financial field (banks, insurance, securities, tax, etc.), a wide variety of paper materials must be digitized for long-term preservation, forming huge electronic document image collections. In recent years, remote financial activities such as remote account opening and online reimbursement have become increasingly common, driven in part by the pandemic; in these activities the paper material is digitized, typically with the user's mobile phone or tablet. The resulting mass of electronic material must be sorted, archived and recognized. Electronic documents carry a large amount of industry-related image and text information, and processing it manually is time-consuming and costly, so automatic classification of electronic document images is highly desirable. However, classifying these document images faces the following difficulties:
(1) Capture environments (illumination, angle, background) and capture devices (e.g. resolution) differ widely, so the resulting document images vary greatly and are hard to unify and normalize;
(2) Paper material is non-rigid and deforms easily, distorting the text and reducing text-recognition accuracy;
(3) Text may be partially occluded by a seal or other objects; for example, the titles of many receipts are covered by seals;
(4) Documents may be incomplete;
(5) Documents are of many kinds; some have titles while others do not;
(6) Document content is complex and variable, layouts of documents of the same kind are not uniform, and intra-class differences are large. Taking medical documents as an example, examination reports with different names have no fixed form: some contain only images taken by medical cameras, some only tables, some only textual descriptions, and some contain at least two of these three. Case records exist in both printed and handwritten form; printed case records generally have no uniform format, while handwritten case records are usually written on record books pre-printed with keywords, and the handwritten content is very difficult to recognize;
(7) Even for the same type of document, documents produced by different institutions differ.
Disclosure of Invention
To solve the above technical problems, the invention provides a document image classification method and device based on multi-mode structured information fusion, adopting the following technical scheme:
the document image classification method based on multi-mode structured information fusion comprises the following steps:
step 1, performing layout analysis on a document image to be classified, and locating a key area;
step 2, extracting text key information from the key areas according to types;
step 3, word segmentation and word vector extraction are respectively carried out on the text key information;
and 4, classifying the documents based on the word vectors and the types to which the word vectors belong.
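For orientation, the four steps above can be organized as a single pipeline. The sketch below is illustrative only (Python); the concrete models for layout analysis, key-text extraction, word vectors and classification are left as pluggable components, and all function names are assumptions rather than the patent's reference implementation.

    # Illustrative pipeline skeleton (assumed names, not the patent's reference code).
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Region:
        kind: str           # "title" | "text" | "figure" | "figure_title" | "table" | "table_title" | "seal"
        image_crop: object  # cropped sub-image of the region

    def analyze_layout(document_image) -> List[Region]:
        """Step 1: locate key regions with a layout-analysis model."""
        raise NotImplementedError

    def extract_key_text(region: Region) -> List[str]:
        """Step 2: per-region key text (OCR, key-value keys, caption, or generated description)."""
        raise NotImplementedError

    def to_word_vectors(texts: List[str]) -> list:
        """Step 3: word segmentation + word vectors (e.g. jieba + word2vec)."""
        raise NotImplementedError

    def classify(word_vectors, region_kinds) -> str:
        """Step 4: fuse word vectors with their source types and classify (e.g. TextCNN)."""
        raise NotImplementedError

    def classify_document(document_image) -> str:
        vectors, kinds = [], []
        for region in analyze_layout(document_image):    # step 1
            texts = extract_key_text(region)             # step 2
            vecs = to_word_vectors(texts)                # step 3
            vectors.extend(vecs)
            kinds.extend([region.kind] * len(vecs))
        return classify(vectors, kinds)                  # step 4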
Optionally, in step 1, the key regions are titles, body text, figures, figure titles, tables, table titles and seals.
With this scheme, layout analysis is first performed on the document image to locate its key regions, such as titles, body text, figures, figure titles, tables, table titles and seals; text key information is then extracted from each region according to its type, word segmentation and word-vector extraction are applied to that information, and finally all word vectors are fused for classification.
The method enables fast and accurate classification, filing and recognition of large volumes of electronic material, and effectively avoids classification difficulty and misclassification caused by varying capture environments. It also copes with text-recognition problems caused by deformation of paper material, with text partially occluded by seals or other objects, with incomplete documents, with documents that may or may not have titles, and with complex material content.
Optionally, in step 1, the key regions are located by a layout analysis algorithm.
With this scheme, the key regions can be located quickly, for example using the LayoutLM algorithm.
Optionally, in step 2, the text key information is extracted as follows:
for a title region, text detection and text recognition are performed, and the resulting text content is used as key information;
for a body-text region, text detection, text recognition, semantic entity recognition and relation extraction are performed to obtain a number of key-value pairs, and the keys are used as key information;
for a figure, if a corresponding figure title exists, text detection and text recognition are performed on the figure title and the resulting text content is used as key information; if no figure title exists, a text description is generated from the figure and used as key information;
for a table, if a table title exists, text detection and text recognition are performed on the table title and the resulting text content is used as key information; if no title exists, the table is processed as a body-text region to obtain key information;
for a seal, text recognition is used to extract the text content inside the seal as key information.
Optionally, in extracting the text key information of a title region, text detection is performed with DBNet or FCENet and text recognition with CRNN+CTC;
in extracting the text key information of a body-text region, semantic entity recognition classifies each detected text with a LayoutXLM model into types including key, value and title;
relation extraction pairs keys with values, also based on a LayoutXLM model;
in extracting the text key information of a figure, ViT or CNN+LSTM is used to generate the text description.
With this scheme, a body-text region undergoes text detection, text recognition, semantic entity recognition and relation extraction, yielding a series of key-value pairs whose keys are taken as key information. Semantic entity recognition classifies each detected text into types such as key, value and title, and can use a LayoutXLM model; relation extraction pairs keys with values and can likewise be trained with LayoutXLM. For a figure, if a corresponding figure title exists, text detection and recognition on that title yield text content used as key information; if not, a text description is generated from the figure itself, for example with ViT or CNN+LSTM, and used as key information. In this way a figure is converted into text content, achieving fusion of structured information from different modalities. For a table, if a table title exists, text detection and recognition on the title yield the key information; if not, the table is processed as a body-text region to obtain key information.
For a seal, text recognition extracts the text content inside the seal as key information. Seal information strongly influences document classification: for example, a valid invoice bears two seals and a medical expense list bears one, while among medical document images a case record may or may not carry a seal, and examination reports and card-type documents such as second-generation ID cards and bank cards carry none.
Optionally, in step 3, word vectors are extracted with word2vec or GloVe.
With this scheme, word2vec or GloVe extracts word vectors quickly and accurately.
Optionally, in step 4, the documents are classified with the TextCNN algorithm based on the word vectors and the types to which they belong, where the type of a word vector indicates whether it originates from a title, body text, a figure, a table or a seal.
With this scheme, the documents are classified from the word vectors and their types, for example with the TextCNN algorithm. The word vectors obtained from the five types of text key information (title, body text, figure, table, seal) can be combined by feature fusion, decision fusion or hybrid fusion for classification.
The document image classification device based on multi-mode structured information fusion comprises a buffer, a processor and a memory, wherein images to be classified are stored in the buffer, a document image classification program is preloaded in the memory, and the processor runs the document image classification program in the memory to complete classification of the images to be classified.
Optionally, the system further comprises a shooting device, wherein the shooting device is in communication connection with the buffer and is used for shooting images of the documents to be classified and storing the images in the buffer.
Optionally, the system further comprises a display, wherein the display is in communication connection with the processor and displays the classification result of the image to be classified under the control of the processor.
In summary, the invention provides at least the following beneficial technical effects:
The invention provides a document image classification method and device based on multi-mode structured information fusion. Layout analysis is first performed on the document image to locate its key regions; text key information is then extracted from each region according to its type, word segmentation and word-vector extraction are applied, and finally all word vectors are fused for classification. Large volumes of electronic material can thus be classified and filed quickly and accurately, classification difficulty and misclassification caused by varying capture environments are effectively avoided, and the method copes with classification problems caused by paper deformation, text partially occluded by seals or other objects, incomplete documents, documents with or without titles, and complex material content.
Drawings
FIG. 1 is a flow diagram of a document image classification method based on multimodal structured information fusion of the present invention;
FIG. 2 is a schematic diagram of the connection principle of the document image classification device based on multi-modal structured information fusion;
fig. 3 is a schematic representation of an embodiment of the present invention.
Reference numerals illustrate: 1. a buffer; 2. a processor; 3. a memory; 4. a photographing device; 5. a display.
Description of the embodiments
The present invention will be described in further detail with reference to the accompanying drawings.
The embodiment of the invention discloses a document image classification method and device based on multi-mode structured information fusion.
Referring to fig. 1 to 3, the document image classification method based on multi-modal structured information fusion includes the steps of:
step 1, performing layout analysis on a document image to be classified, and locating a key area;
step 2, extracting text key information from the key areas according to types;
step 3, word segmentation and word vector extraction are respectively carried out on the text key information;
and 4, classifying the documents based on the word vectors and the types to which the word vectors belong.
In step 1, the key regions are titles, body text, figures, figure titles, tables, table titles and seals.
Layout analysis is first performed on the document image to locate these key regions; text key information is then extracted from each region according to its type, word segmentation and word-vector extraction are applied to that information, and finally all word vectors are fused for classification.
The method enables fast and accurate classification, filing and recognition of large volumes of electronic material, and effectively avoids classification difficulty and misclassification caused by varying capture environments. It also copes with text-recognition problems caused by paper deformation, with text partially occluded by seals or other objects, with incomplete documents, with documents that may or may not have titles, and with complex material content.
In step 1, the key regions are located by a layout analysis algorithm.
In particular, the LayoutLM algorithm can locate the key regions quickly.
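The patent names the LayoutLM family for this step. As one concrete, openly available stand-in (an assumption about toolchain, not the patent's implementation), the layoutparser toolkit with a PubLayNet detection model can locate title, text, figure and table regions; the seal class is not covered by PubLayNet and would need a separately trained detector. This requires detectron2 to be installed.

    # Hedged sketch: layout analysis with layoutparser + a PubLayNet Detectron2 model.
    # Stand-in for the LayoutLM-based analysis named in the text; seals need a custom detector.
    import layoutparser as lp
    import cv2

    model = lp.Detectron2LayoutModel(
        "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
        extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
        label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
    )

    image = cv2.imread("document.jpg")[..., ::-1]   # BGR -> RGB
    layout = model.detect(image)
    for block in layout:
        x1, y1, x2, y2 = map(int, block.coordinates)
        print(block.type, round(block.score, 2), (x1, y1, x2, y2))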
In step 2, the text key information is extracted as follows:
for a title region, text detection and text recognition are performed, and the resulting text content is used as key information;
for a body-text region, text detection, text recognition, semantic entity recognition and relation extraction are performed to obtain a number of key-value pairs, and the keys are used as key information;
for a figure, if a corresponding figure title exists, text detection and text recognition are performed on the figure title and the resulting text content is used as key information; if no figure title exists, a text description is generated from the figure and used as key information;
for a table, if a table title exists, text detection and text recognition are performed on the table title and the resulting text content is used as key information; if no title exists, the table is processed as a body-text region to obtain key information;
for a seal, text recognition is used to extract the text content inside the seal as key information.
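The detection-plus-recognition step shared by the title, table-title and seal regions above can be sketched with an off-the-shelf OCR toolkit. PaddleOCR is used here purely as an illustration (its default detector is DBNet-based and its recognizer uses CTC decoding), not as the patent's mandated implementation; the result structure may vary slightly across PaddleOCR versions.

    # Hedged sketch: text detection + recognition on a cropped key region with PaddleOCR.
    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=True, lang="ch")    # DBNet-style detection, CTC-decoded recognition
    result = ocr.ocr("title_region.jpg", cls=True)

    key_text = []
    for line in result[0]:                            # each line: [box, (text, confidence)]
        box, (text, score) = line
        key_text.append(text)
    print(" ".join(key_text))                         # e.g. the title text used as key information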
In extracting the text key information of a title region, text detection is performed with DBNet or FCENet and text recognition with CRNN+CTC;
in extracting the text key information of a body-text region, semantic entity recognition classifies each detected text with a LayoutXLM model into types including key, value and title;
relation extraction pairs keys with values, also based on a LayoutXLM model;
in extracting the text key information of a figure, ViT or CNN+LSTM is used to generate the text description.
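For figures without a title, the text above mentions generating a description with ViT- or CNN+LSTM-style captioning. A minimal CNN-encoder / LSTM-decoder skeleton in PyTorch is sketched below as an illustration of the architecture only; the vocabulary, training loop and pretrained weights are assumptions and are omitted.

    # Hedged sketch: a tiny CNN+LSTM image-captioning skeleton (illustrative architecture only).
    import torch
    import torch.nn as nn
    import torchvision.models as models

    class CaptionNet(nn.Module):
        def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
            super().__init__()
            backbone = models.resnet18(weights=None)
            self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # global image feature
            self.fc_feat = nn.Linear(512, embed_dim)
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, images, captions):
            feats = self.encoder(images).flatten(1)          # (B, 512)
            feats = self.fc_feat(feats).unsqueeze(1)         # (B, 1, E), used as the first "token"
            tokens = self.embed(captions)                    # (B, T, E)
            seq = torch.cat([feats, tokens], dim=1)
            hidden, _ = self.lstm(seq)
            return self.out(hidden)                          # logits over the vocabulary

    model = CaptionNet(vocab_size=5000)
    logits = model(torch.randn(2, 3, 224, 224), torch.zeros(2, 10, dtype=torch.long))
    print(logits.shape)                                      # torch.Size([2, 11, 5000])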
A body-text region undergoes text detection, text recognition, semantic entity recognition and relation extraction, yielding a series of key-value pairs whose keys are taken as key information. Semantic entity recognition classifies each detected text into types such as key, value and title, and can use a LayoutXLM model; relation extraction pairs keys with values and can likewise be trained with LayoutXLM. For a figure, if a corresponding figure title exists, text detection and recognition on that title yield text content used as key information; if not, a text description is generated from the figure itself, for example with ViT or CNN+LSTM, and used as key information. In this way a figure is converted into text content, achieving fusion of structured information from different modalities. For a table, if a table title exists, text detection and recognition on the title yield the key information; if not, the table is processed as a body-text region to obtain key information.
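The semantic entity recognition step (labelling each detected text as key, value or title) can be sketched by loading the LayoutXLM weights into a LayoutLMv2 token-classification head from the transformers library. This is an assumed toolchain, it requires the detectron2-based visual backbone that LayoutLMv2 models depend on, fine-tuning on labelled form data is needed before the labels are meaningful, and the key-value pairing (relation extraction) head is omitted here.

    # Hedged sketch: semantic entity recognition with LayoutXLM weights (assumed toolchain).
    from transformers import (LayoutLMv2FeatureExtractor, LayoutXLMTokenizer,
                              LayoutXLMProcessor, LayoutLMv2ForTokenClassification)
    from PIL import Image
    import torch

    labels = ["OTHER", "KEY", "VALUE", "TITLE"]
    feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)   # OCR is done upstream
    tokenizer = LayoutXLMTokenizer.from_pretrained("microsoft/layoutxlm-base")
    processor = LayoutXLMProcessor(feature_extractor, tokenizer)
    model = LayoutLMv2ForTokenClassification.from_pretrained(
        "microsoft/layoutxlm-base", num_labels=len(labels))

    image = Image.open("text_region.jpg").convert("RGB")
    words = ["姓名", "张三", "性别", "男"]                             # OCR words (illustrative)
    boxes = [[60, 40, 160, 80], [170, 40, 270, 80],
             [60, 100, 160, 140], [170, 100, 270, 140]]               # boxes on a 0-1000 scale
    encoding = processor(image, words, boxes=boxes, return_tensors="pt")

    with torch.no_grad():
        logits = model(**encoding).logits                             # (1, sequence_length, num_labels)
    predictions = logits.argmax(-1).squeeze(0).tolist()
    # In practice, map token predictions back to words and then pair KEY entities with
    # VALUE entities (relation extraction); the classification head is random until fine-tuned.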
For a seal, text recognition extracts the text content inside the seal as key information. Seal information strongly influences document classification: for example, a valid invoice bears two seals and a medical expense list bears one, while among medical document images a case record may or may not carry a seal, and examination reports and card-type documents such as second-generation ID cards and bank cards carry none.
In step 3, word vectors are extracted with word2vec or GloVe.
word2vec or GloVe extracts word vectors quickly and accurately.
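A minimal sketch of step 3 follows, assuming jieba for Chinese word segmentation and gensim's word2vec implementation; the sample texts, vector size and other hyper-parameters are placeholders, and in practice the embedding model would be trained on a large domain corpus or loaded from pretrained vectors.

    # Hedged sketch: word segmentation + word-vector extraction with jieba and gensim word2vec.
    import jieba
    from gensim.models import Word2Vec

    key_texts = ["某市人民医院超声检查报告单", "姓名", "检查部位"]       # key information from step 2 (illustrative)
    tokenized = [jieba.lcut(t) for t in key_texts]                     # word segmentation

    # Placeholder training corpus; a pretrained domain model would normally be used instead.
    model = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1, epochs=10)

    word_vectors = [model.wv[w] for sent in tokenized for w in sent]   # one vector per word
    print(len(word_vectors), word_vectors[0].shape)                    # N vectors of dimension 100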
In step 4, the documents are classified with the TextCNN algorithm based on the word vectors and the types to which they belong, where the type of a word vector indicates whether it originates from a title, body text, a figure, a table or a seal.
The documents are classified from the word vectors and their types, for example with the TextCNN algorithm. The word vectors obtained from the five types of text key information (title, body text, figure, table, seal) can be combined by feature fusion, decision fusion or hybrid fusion for classification.
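A minimal sketch of the step-4 classifier is given below, assuming feature-level fusion: each word vector is concatenated with a learned embedding of its source type before the TextCNN convolutions. The class count, dimensions and training loop are placeholders; decision- or hybrid-fusion variants would instead combine per-type classifiers or their scores.

    # Hedged sketch: TextCNN over word vectors fused with source-type embeddings (feature fusion).
    import torch
    import torch.nn as nn

    TYPES = ["title", "text", "figure", "table", "seal"]

    class FusionTextCNN(nn.Module):
        def __init__(self, word_dim=100, type_dim=16, num_classes=10, kernel_sizes=(2, 3, 4), channels=64):
            super().__init__()
            self.type_embed = nn.Embedding(len(TYPES), type_dim)
            in_dim = word_dim + type_dim
            self.convs = nn.ModuleList([nn.Conv1d(in_dim, channels, k) for k in kernel_sizes])
            self.fc = nn.Linear(channels * len(kernel_sizes), num_classes)

        def forward(self, word_vecs, type_ids):
            # word_vecs: (B, T, word_dim); type_ids: (B, T) indices into TYPES
            x = torch.cat([word_vecs, self.type_embed(type_ids)], dim=-1)   # feature fusion
            x = x.transpose(1, 2)                                           # (B, C_in, T)
            pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
            return self.fc(torch.cat(pooled, dim=1))                        # document-class logits

    model = FusionTextCNN()
    logits = model(torch.randn(4, 32, 100), torch.randint(0, len(TYPES), (4, 32)))
    print(logits.shape)                                                     # torch.Size([4, 10])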
The document image classification device based on multi-mode structured information fusion comprises a buffer 1, a processor 2 and a memory 3, wherein images to be classified are stored in the buffer 1, a document image classification program is preloaded in the memory 3, and the processor 2 runs the document image classification program in the memory 3 to finish classification of the images to be classified.
The system further comprises a shooting device 4, wherein the shooting device 4 is in communication connection with the buffer 1 and is used for shooting images of the documents to be classified and storing the images in the buffer 1.
The system also comprises a display 5, wherein the display 5 is in communication connection with the processor 2 and displays the classification result of the images to be classified under the control of the processor 2.
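An illustrative wiring of the apparatus (buffer, processor and memory running the classification program, optional photographing device and display) is sketched below with OpenCV; classify_document stands in for the document image classification program and is an assumption, not the patent's code.

    # Hedged sketch: apparatus wiring, photographing device -> buffer -> program -> display.
    import cv2

    def classify_document(image):
        # Placeholder for the document image classification program held in memory.
        return "examination report"

    buffer = []                                     # the buffer holding images to be classified

    camera = cv2.VideoCapture(0)                    # photographing device (optional)
    ok, frame = camera.read()
    if ok:
        buffer.append(frame)
    camera.release()

    for image in buffer:                            # processor runs the program on buffered images
        label = classify_document(image)
        cv2.putText(image, label, (20, 40), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 255), 2)
        cv2.imshow("classification result", image)  # display shows the classification result
        cv2.waitKey(0)
    cv2.destroyAllWindows()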
The implementation principle of the document image classification method and device based on multi-mode structured information fusion in the embodiment of the invention is as follows:
A batch of case documents is to be digitally filed. Photographs of the documents are taken with the photographing device 4 and stored in the buffer 1, and the processor 2 runs the document image classification program on each photograph to obtain its classification result. Referring to fig. 3, layout analysis yields a title region, a body-text region and an image region. For the title region, the title text, an ultrasound examination report form of a municipal people's hospital, is extracted as key information. For the image region no figure title is detected, so a description of the region, "ultrasound image", is generated from the image and used as key information. The body-text region undergoes text detection, text recognition, semantic entity recognition and relation extraction, finally yielding a series of key-value pairs with 11 keys: name, gender, age, examination number, report date, referring department, hospitalization number, bed number, examination equipment, examination site and description. Word segmentation and word-vector extraction are applied to these three types of key information; finally the word vectors and their types (title, image and body text) are fed to TextCNN for classification, and the document is classified as an examination report.
The above embodiments are not intended to limit the scope of the present invention; all equivalent changes in structure, shape and principle of the invention shall fall within its scope of protection.

Claims (8)

1. The document image classification method based on multi-mode structured information fusion is characterized by comprising the following steps of:
step 1, performing layout analysis on a document image to be classified, and locating a key area;
step 2, extracting text key information from the key areas according to types;
step 3, word segmentation and word vector extraction are respectively carried out on the text key information;
step 4, classifying the documents based on the word vectors and the types to which the word vectors belong;
in the step 2, the specific method for respectively extracting the text key information comprises the following steps:
for the title area, text detection and text recognition are carried out to obtain text content which is used as key information;
for a text region, text detection, text recognition, semantic entity recognition and relation extraction are carried out to obtain a plurality of key-value pairs, and the keys are taken as key information;
for a figure, if a corresponding figure title exists, text detection and text recognition are carried out on the figure title to obtain text content which is used as key information; if no figure title exists, a text description is generated from the figure and used as key information;
for a table, if a table title exists, text detection and text recognition are carried out on the table title to obtain text content serving as key information; if no title exists, the table is processed as a text region to acquire key information;
for the seal, text recognition is adopted to extract text content in the seal as key information;
in the text key information extraction of the title area, text detection is carried out based on DBNet or FCENet, and text recognition is carried out based on CRNN+CTC;
in the text key information extraction of the text region, semantic entity identification is to classify each detected text based on a LayoutXLM model, and the types comprise keys, values and titles;
the relation extraction is to pair the key and the value based on a LayoutXLM model;
in the text key information extraction of the figure, ViT or CNN+LSTM is adopted to generate the text description.
2. The document image classification method based on multi-modal structured information fusion according to claim 1, wherein: in step 1, the key regions are titles, body text, figures, figure titles, tables, table titles and seals.
3. The document image classification method based on multi-modal structured information fusion according to claim 2, wherein:
in step 1, a key area is located based on a layout analysis algorithm.
4. The document image classification method based on multi-modal structured information fusion according to claim 3, wherein: in step 3, word vectors are extracted based on word2vec or GloVe.
5. The document image classification method based on multi-modal structured information fusion according to claim 4, wherein: in step 4, the documents are classified with the TextCNN algorithm based on the word vectors and the types to which they belong, the type of a word vector indicating whether it originates from a title, body text, a figure, a table or a seal.
6. Document image classification device based on multi-mode structured information fusion, characterized in that: it comprises a buffer (1), a processor (2) and a memory (3), wherein the buffer (1) stores images to be classified, the memory (3) is preloaded with a document image classification program implementing the method of claim 5, and the processor (2) runs the document image classification program in the memory (3) to complete classification of the images to be classified.
7. The document image classification apparatus based on multi-modal structured information fusion as set forth in claim 6, wherein: the system further comprises a shooting device (4), wherein the shooting device (4) is in communication connection with the buffer (1) and is used for shooting images of the documents to be classified and storing the images in the buffer (1).
8. The document image classification apparatus based on multi-modal structured information fusion as set forth in claim 7, wherein: the system also comprises a display (5), wherein the display (5) is in communication connection with the processor (2) and displays the classification result of the images to be classified under the control of the processor (2).
CN202311033101.4A 2023-08-16 2023-08-16 Document image classification method and device based on multi-mode structured information fusion Pending CN116758561A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311033101.4A CN116758561A (en) 2023-08-16 2023-08-16 Document image classification method and device based on multi-mode structured information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311033101.4A CN116758561A (en) 2023-08-16 2023-08-16 Document image classification method and device based on multi-mode structured information fusion

Publications (1)

Publication Number Publication Date
CN116758561A (en) 2023-09-15

Family

ID=87951774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311033101.4A Pending CN116758561A (en) 2023-08-16 2023-08-16 Document image classification method and device based on multi-mode structured information fusion

Country Status (1)

Country Link
CN (1) CN116758561A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733658A (en) * 2020-12-31 2021-04-30 北京华宇信息技术有限公司 Electronic document filing method and device
CN113849657A (en) * 2021-08-11 2021-12-28 杭州云嘉健康管理有限公司 Structured data processing method of intelligent supervision black box
CN114241501A (en) * 2021-12-20 2022-03-25 北京中科睿见科技有限公司 Image document processing method and device and electronic equipment
CN114299528A (en) * 2021-12-27 2022-04-08 万达信息股份有限公司 Information extraction and structuring method for scanned document
CN115880704A (en) * 2023-02-16 2023-03-31 中国人民解放军总医院第一医学中心 Automatic case cataloging method, system, equipment and storage medium
CN116543404A (en) * 2023-05-09 2023-08-04 重庆师范大学 Table semantic information extraction method, system, equipment and medium based on cell coordinate optimization
CN116434028A (en) * 2023-06-15 2023-07-14 上海蜜度信息技术有限公司 Image processing method, system, model training method, medium and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QQ_16952303: "OCR基于图像数据的信息抽取任务" [OCR-based information extraction from image data], retrieved from the Internet <URL:https://blog.csdn.net/qq_16952303/article/details/127083237> *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination