CN114708595A

CN114708595A - Image document structured analysis method, system, electronic device, and storage medium

Info

Publication number: CN114708595A
Application number: CN202210255581.8A
Authority: CN
Inventors: 王则远; 刘鹏
Original assignee: Lingxi Quantum Beijing Medical Technology Co ltd
Current assignee: Lingxi Quantum Beijing Medical Technology Co ltd
Priority date: 2022-03-15
Filing date: 2022-03-15
Publication date: 2022-07-05

Abstract

The invention provides an image document structured analysis method, system, electronic device and storage medium. The method comprises the following steps: splicing the images of all pages of an image document in sequence to obtain a composite image; covering the corresponding parts of the composite image based on preset text content information to be eliminated; performing layout integration on the composite image to obtain an image to be analyzed; and inputting the image to be analyzed into an LX-BioLayoutLM model to obtain a parsed document with structured tags. The LX-BioLayoutLM model is based on a BERT model and a LayoutLM model and aligns the image information with the text information in the image to be analyzed. The method enables structured analysis of image documents in batches and facilitates structured extraction of document data in complex scenes.

Description

Image document structured analysis method, system, electronic device, and storage medium

Technical Field

The invention relates to the technical field of document processing, in particular to an image document structured analysis method, an image document structured analysis system, electronic equipment and a storage medium.

Background

Extracting and analyzing the text content of images is widely needed in many practical production scenarios. With the rapid development of artificial-intelligence-based computer vision algorithms, using AI techniques to analyze image text content has become a direction of great value and significance.

An image document usually contains several kinds of information, such as a title, keywords, an abstract, body text, charts and references. How to accurately identify and extract each part, that is, the page layout analysis of the image document, is an important research topic in image document analysis.

At present there are many ways to analyze the text in an image document. For example, the text can be read and output directly with a Python toolkit, but this approach only reads the text in the image: it cannot classify and identify each piece of text by category, and it handles text content only, not non-text content such as charts. Other approaches use image analysis tools that split the image and recognize its content after the image file is uploaded; these tools are often effective only for specific types of image, and likewise cannot produce structured output of the image content.

Therefore, how to extract the content of an image document in a structured manner according to fixed rules is a problem worth exploring.

Disclosure of Invention

In view of the problems in the prior art, the invention provides an image document structured analysis method, system, electronic device and storage medium.

The invention provides an image document structured analysis method, which comprises the following steps:

splicing images of all pages in the image document in sequence to obtain a composite image;

covering the corresponding part in the composite image based on the preset text content information to be eliminated;

performing layout integration on the composite image to obtain an image to be analyzed;

inputting the image to be analyzed into an LX-BioLayoutLM model to obtain a parsed document with structured tags;

wherein the LX-BioLayoutLM model is based on a BERT model and a LayoutLM model, and aligns the image information with the text information in the image to be analyzed.

According to the image document structured analysis method provided by the invention, after the image to be analyzed is input into the LX-BioLayoutLM model to obtain the parsed document with structured tags, the method further comprises:

comparing the length of the text in the parsed document with the length of the text in the image to be analyzed to obtain the completeness of the text structuring;

and if the completeness exceeds a preset threshold, confirming the parsed document as the structured parsing result of the image document.

According to the image document structured analysis method provided by the invention, after the parsed document is confirmed as the structured parsing result of the image document because the completeness exceeds the preset threshold, the method further comprises:

selecting a plurality of required tags based on the structured tags in the parsed document, and extracting in batches the paragraph texts of the image documents corresponding to those tags.

According to the image document structured analysis method provided by the invention, the training data set of the LX-BioLayoutLM model consists of structurally labeled image documents.

According to the image document structured analysis method provided by the invention, the BERT model part of the LX-BioLayoutLM model takes the text in the image to be analyzed and the position information corresponding to the text as input, and outputs a text vector expressing the semantic understanding of the text and a position embedding vector representing the mapping relation between text paragraphs and the image.

According to the image document structured analysis method provided by the invention, the LayoutLM model part of the LX-BioLayoutLM model takes the image to be analyzed, the text in the image to be analyzed and the position information corresponding to the text as input, and outputs a character-level 2D position embedding vector and an image embedding vector reflecting image feature information.

According to the image document structured analysis method provided by the invention, the LX-BioLayoutLM model comprises an image alignment layer, wherein the image alignment layer takes as input the image to be analyzed, the text vector expressing the semantic understanding of the text, the position embedding vector representing the mapping relation between text paragraphs and the image, the character-level 2D position embedding vector and the image embedding vector expressing the image feature information, and outputs the parsed document with structured tags.

The invention also provides an image document structured analysis system, which comprises:

the image synthesis module sequentially splices images of all pages in the image document to obtain a synthesis image;

the user-defined information removing module is used for covering a corresponding part in the composite image based on preset text content information to be removed;

the image layout resetting module is used for performing layout integration on the composite image to obtain an image to be analyzed;

the analysis module inputs the image to be analyzed into an LX-BioLayoutLM model to obtain a parsed document with structured tags;

The invention also provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the image document structured analysis method according to any one of the above.

The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the image document structured analysis method according to any one of the above.

The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the image document structured analysis method according to any one of the above.

The image document structured analysis method, the image document structured analysis system, the electronic equipment and the storage medium provided by the invention realize the structured analysis of image documents in batches, and are convenient for structured extraction of document data in complex scenes.

Drawings

To illustrate the technical solutions of the present invention or the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of a method for structured image document analysis according to the present invention;

FIG. 2 is a schematic structural diagram of an image document structured analysis system according to the present invention;

FIG. 3 is a schematic physical structure diagram of an electronic device according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The image document structured analysis method provided by the embodiment of the present application is described in detail below with reference to the accompanying drawings by specific embodiments and application scenarios thereof.

Fig. 1 is a schematic flow chart of the image document structured analysis method provided by the present invention. As shown in Fig. 1, the method includes:

Step 100, splicing the images of all pages in the image document in sequence to obtain a composite image.

In the present embodiment, the image document refers to a PDF document in the medical field.

Preferably, the page images are spliced in sequence into one large image; this is done with the PIL and cv2 packages for Python together with the image splicing and synthesis logic.
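As a minimal sketch of the splicing logic (not the implementation disclosed here), the following assumes each page has already been rasterized into a grid of grayscale pixel values. Plain Python lists stand in for PIL or cv2 image objects so that the example has no dependencies, and the names `stitch_pages` and `BACKGROUND` are hypothetical. Pages are stacked top to bottom in page order, and narrower pages are padded on the right.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixels, 0 = black, 255 = white
BACKGROUND = 255  # assumed padding value (white); not specified in the source


def stitch_pages(pages: List[Image]) -> Image:
    """Stack page images vertically, in page order, into one composite image."""
    # The composite is as wide as the widest row of any page.
    width = max((len(row) for page in pages for row in page), default=0)
    composite: Image = []
    for page in pages:
        for row in page:
            # Pad narrower rows on the right so every row has equal width.
            composite.append(row + [BACKGROUND] * (width - len(row)))
    return composite


page_1 = [[0, 0, 0], [255, 255, 255]]
page_2 = [[10, 20], [30, 40]]
composite = stitch_pages([page_1, page_2])
```

Stacking vertically is one plausible reading of "splicing in sequence"; a side-by-side arrangement would follow the same pattern with rows concatenated instead of appended.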

Step 200, covering the corresponding part in the composite image based on the preset text content information to be eliminated.

During image content extraction, image documents in the same batch may share identical, unimportant text content that risks interfering with subsequent recognition. The negligible text content for the current batch is therefore preset, and the matching regions of the composite image are covered automatically. When no text content needs to be eliminated, the preset is set to a null value.
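The covering step might look roughly like the sketch below. It assumes that the positions of text fragments are already known from an earlier recognition pass, supplied here as a hypothetical `ocr_items` list of (text, bounding box) pairs; the source does not say how the regions are located. An empty `to_eliminate` list plays the role of the null value.

```python
from typing import List, Tuple

Image = List[List[int]]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive
COVER_VALUE = 255  # assumed: cover with white pixels


def cover_regions(image: Image, ocr_items: List[Tuple[str, Box]],
                  to_eliminate: List[str]) -> Image:
    """Return a copy of image with matching text boxes painted over."""
    covered = [row[:] for row in image]  # never modify the caller's image
    if not to_eliminate:  # null value: nothing needs to be eliminated
        return covered
    height = len(covered)
    for text, (left, top, right, bottom) in ocr_items:
        if any(pattern in text for pattern in to_eliminate):
            # Clamp the box to the image bounds before painting.
            for y in range(max(top, 0), min(bottom, height)):
                row = covered[y]
                for x in range(max(left, 0), min(right, len(row))):
                    row[x] = COVER_VALUE
    return covered


image = [[0] * 4 for _ in range(3)]
items = [("Downloaded from example.org", (0, 0, 4, 1)),
         ("Abstract", (0, 1, 4, 2))]
masked = cover_regions(image, items, ["Downloaded from"])
```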

Step 300, performing layout integration on the composite image to obtain the image to be analyzed.

After the synthesis and covering steps, the layout of the input image may contain redundant structures. These redundant structures are cropped so that the image is integrated into a form closer to the training data of the subsequent model.
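The source does not state which cropping rule is used. One simple possibility, shown here purely as an assumption, is to collapse long runs of blank rows such as those left behind by a covered banner. The function name `collapse_blank_rows` and the `max_gap` parameter are hypothetical.

```python
from typing import List

Image = List[List[int]]
BLANK = 255  # assumed background value


def collapse_blank_rows(image: Image, max_gap: int = 1) -> Image:
    """Keep at most max_gap consecutive blank rows and drop the rest."""
    result: Image = []
    run = 0  # length of the current run of blank rows
    for row in image:
        if all(pixel == BLANK for pixel in row):
            run += 1
            if run > max_gap:
                continue  # redundant blank row, skip it
        else:
            run = 0
        result.append(row)
    return result


image = [[0, 0], [255, 255], [255, 255], [255, 255], [7, 7]]
trimmed = collapse_blank_rows(image, max_gap=1)
```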

Step 400, inputting the image to be analyzed into an LX-BioLayoutLM model to obtain a parsed document with structured tags.

Preferably, the LX-BioLayoutLM model is a simple and effective pre-trained model, trained on medical image understanding tasks, that mainly consists of a BERT model and a LayoutLM model. The BERT model handles the semantic understanding of the image document, while the LayoutLM model captures information such as the visual features of the image document and the relative positions of the text, addressing the visual information level.

The method and the device realize the structured analysis of the image documents in batches, and are convenient for structured extraction of document data in complex scenes.

Further, in another embodiment, this embodiment provides an image document structured analysis method in which, after the image to be analyzed is input into the LX-BioLayoutLM model to obtain the parsed document with structured tags, the method further includes:

comparing the length of the text in the parsed document with the length of the text in the image to be analyzed to obtain the completeness of the text structuring;

It should be noted that the LX-BioLayoutLM model is generally used to process a plurality of image documents in batch, and the LX-BioLayoutLM model finally generates a plurality of html-like code segment files, each corresponding to an original image document.

The completeness of the recognition result is judged from the length of the image text entering the LX-BioLayoutLM model and the length of the text in the final output file; experimental data indicate that 75% is an appropriate threshold.

This embodiment discloses how to determine whether the parsing of an image document meets the preset requirement. For a parsing result that fails to reach the standard, the structured parsed text it produced is returned together with the original image document.
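The completeness check can be sketched as a ratio of text lengths. The example below assumes that tags in the html-like output follow a plain `<name>...</name>` syntax and that whitespace counts toward the length; neither detail is stated in the source, and the function names are hypothetical.

```python
import re
from typing import Tuple

TAG_PATTERN = re.compile(r"<[^>]+>")  # assumed html-like tag syntax
THRESHOLD = 0.75  # the 75% threshold mentioned above


def completeness(parsed_document: str, source_text: str) -> float:
    """Ratio of parsed text length (tags stripped) to input text length."""
    if not source_text:
        return 0.0
    parsed_text = TAG_PATTERN.sub("", parsed_document)
    return len(parsed_text) / len(source_text)


def check(parsed_document: str, source_text: str) -> Tuple[bool, float]:
    """Return (accepted, score); accepted only if the score exceeds the threshold."""
    score = completeness(parsed_document, source_text)
    return score > THRESHOLD, score


source = "Aspirin study" + "Aspirin reduces risk."
parsed = "<title>Aspirin study</title><abstract>Aspirin reduces</abstract>"
accepted, score = check(parsed, source)
```

A caller that receives `accepted == False` would return both the parsed text and the original image document, as described above.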

Further, in another embodiment, this embodiment provides an image document structured analysis method in which, after the parsed document is confirmed as the structured parsing result of the image document because the completeness exceeds the preset threshold, the method further includes:

selecting a plurality of required tags based on the structured tags in the parsed document, and extracting in batches the paragraph texts of the image documents corresponding to those tags.

It should be noted that a typical task specifies a batch of image documents and requires the content of specific modules (for example, the title and keywords) to be extracted from them. The parsed documents are obtained through the model, the structured tags corresponding to the specified modules are located, and the content enclosed by those tags in the parsed documents is extracted to complete the task.
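A batch extraction of this kind could be sketched with regular expressions, assuming the parsed documents are strings in which each module is wrapped as `<tag>...</tag>` without nesting. The function name `extract_by_tags` and the sample documents are illustrative only.

```python
import re
from typing import Dict, List


def extract_by_tags(parsed_documents: Dict[str, str],
                    tags: List[str]) -> Dict[str, Dict[str, List[str]]]:
    """For each document, collect the text enclosed by each requested tag."""
    results: Dict[str, Dict[str, List[str]]] = {}
    for name, document in parsed_documents.items():
        per_tag: Dict[str, List[str]] = {}
        for tag in tags:
            # Non-greedy match so that repeated tags yield separate entries.
            pattern = re.compile(
                r"<{0}>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL)
            per_tag[tag] = [match.strip() for match in pattern.findall(document)]
        results[name] = per_tag
    return results


batch = {
    "doc1": "<title>Trial A</title><k>aspirin; stroke</k><text>Body</text>",
    "doc2": "<title>Trial B</title><text>Body only</text>",
}
extracted = extract_by_tags(batch, ["title", "k"])
```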

In this embodiment, information such as titles, abstracts, keywords, body text, charts and references in the image documents is extracted and output by module attribute as required, which greatly reduces the difficulty of manually analyzing image documents and improves the precision and efficiency of image data processing.

Further, in another embodiment, this embodiment provides an image document structured analysis method in which the training data set of the LX-BioLayoutLM model consists of structurally labeled image documents.

Preferably, 8000 medical image documents are structurally labeled: the title, abstract, keywords, body text, charts and references of each image document are marked with html-like tags such as <title></title>, <abstract>, <k>, <text>, <table></table>, <r></r> and the like, respectively, to construct the training data set.

The training data are obtained by manually labeling data from a biomedical literature database, so the trained model performs better on biomedical image documents.
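To make the labeling format concrete, the sketch below serializes one manually labeled document into the html-like form. The tag names come from the list above; the section names used as dictionary keys, the function `annotate` and the assumption that every tag has a matching closing tag are all illustrative.

```python
from typing import List, Tuple

# Tag names follow the html-like tags listed above; the keys are
# hypothetical section names chosen for this sketch.
SECTION_TAGS = {
    "title": "title", "abstract": "abstract", "keywords": "k",
    "body": "text", "chart": "table", "references": "r",
}


def annotate(sections: List[Tuple[str, str]]) -> str:
    """Wrap each (section_name, content) pair in its html-like tag."""
    parts = []
    for name, content in sections:
        tag = SECTION_TAGS[name]
        parts.append("<{0}>{1}</{0}>".format(tag, content))
    return "".join(parts)


label = annotate([("title", "Trial A"),
                  ("keywords", "aspirin"),
                  ("body", "Results...")])
```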

This example discloses a method for constructing a training dataset for the LX-BioLayoutLM model.

Further, in another embodiment, this embodiment provides an image document structured analysis method in which the BERT model part of the LX-BioLayoutLM model takes the text in the image to be analyzed and the position information corresponding to the text as input, and outputs a text vector expressing the semantic understanding of the text and a position embedding vector representing the mapping relation between text paragraphs and the image; the LayoutLM model part of the LX-BioLayoutLM model takes the image to be analyzed, the text in the image to be analyzed and the position information corresponding to the text as input, and outputs a character-level 2D position embedding vector and an image embedding vector reflecting image feature information; and the LX-BioLayoutLM model comprises an image alignment layer, which takes the image to be analyzed, the text vector, the position embedding vector, the character-level 2D position embedding vector and the image embedding vector as input, and outputs the parsed document with structured tags.

It should be noted that the LX-BioLayoutLM model itself has an OCR function, so when it receives an image to be analyzed it can also obtain the text in the image and the position information corresponding to the text. The 2D position embedding vector represents relative positions within the document and is used to capture the relationships between symbols in the document; the image embedding vector captures appearance features such as the orientation, type and color of the characters. To align the image feature information of the document with the text information, an image embedding vector layer is added to the model to represent the image features in the language representation. Aligning the image feature information with the text information means matching image features, such as font and color, to the text content they belong to.
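As an illustration of how position information can be prepared for a 2D position embedding, the sketch below rescales pixel bounding boxes to a fixed 0 to 1000 coordinate range, which is the convention of the publicly available LayoutLM model. Whether LX-BioLayoutLM uses the same range is an assumption, and the helper names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
SCALE = 1000  # coordinate range of the public LayoutLM; assumed here


def normalize_box(box: Box, width: int, height: int) -> Box:
    """Rescale a pixel box so coordinates are independent of image size."""
    x0, y0, x1, y1 = box
    return (SCALE * x0 // width, SCALE * y0 // height,
            SCALE * x1 // width, SCALE * y1 // height)


def position_inputs(words: List[Tuple[str, Box]], width: int,
                    height: int) -> List[Tuple[str, Box]]:
    """Pair each recognized text fragment with its normalized box."""
    return [(text, normalize_box(box, width, height)) for text, box in words]


words = [("Abstract", (100, 50, 300, 80))]
inputs = position_inputs(words, width=2000, height=1000)
```

Each normalized coordinate can then index an embedding table, so that fragments at similar page positions receive similar position vectors regardless of the resolution of the composite image.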

The LX-BioLayoutLM model can be subdivided into a BERT model and a LayoutLM model. The BERT model is a natural language processing model; here it processes the text information to solve the semantic understanding problem in the task. The LayoutLM model captures information such as the visual features of the document image and the relative positions of the text, addressing the visual information level. The LX-BioLayoutLM model is obtained by combining the two parts and pre-training on medical image and text data, so that the layout of the document image can be recognized, information such as titles, abstracts, body text and charts in the image document can be distinguished, and structured labeled output is produced.

The image alignment layer takes as input the image to be analyzed, the text vector expressing the semantic understanding of the text, the position embedding vector representing the mapping relation between text paragraphs and the image, the character-level 2D position embedding vector and the image embedding vector expressing the image feature information. The four vectors are aligned with the image to be analyzed, that is, the semantic understanding of the content and the captured image features are normalized together, and an html structured file is finally synthesized.

The image alignment layer is a convolutional neural network layer. The image to be analyzed is converted into a matrix of pixel values, and this matrix is scanned line by line (the kernel matrix is multiplied element by element with the matrix of the scanned region, the products are summed, and the average is taken) to obtain a new feature matrix. The new feature matrix and the matrix formed by splicing the four vectors are combined by weighted summation and then normalized with an activation function, and the normalized matrix is decoded to obtain the html structured file. In this embodiment the vectors are obtained through Transformer network encoding.
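The arithmetic of the scan and the weighted summation can be illustrated on a toy example. The sketch below is not the disclosed layer: the kernel values, the equal weights, the choice of a sigmoid as the activation function and the assumption that the spliced vector matrix has the same shape as the feature matrix are all placeholders, and decoding to html is omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def scan(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide kernel over image; each output is the mean of the elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out: Matrix = []
    for top in range(len(image) - kh + 1):
        row = []
        for left in range(len(image[0]) - kw + 1):
            total = sum(kernel[i][j] * image[top + i][left + j]
                        for i in range(kh) for j in range(kw))
            row.append(total / (kh * kw))
        out.append(row)
    return out


def align(features: Matrix, embeddings: Matrix,
          w_image: float = 0.5, w_text: float = 0.5) -> Matrix:
    """Weighted sum of the two matrices followed by a sigmoid activation."""
    return [[1.0 / (1.0 + math.exp(-(w_image * f + w_text * e)))
             for f, e in zip(f_row, e_row)]
            for f_row, e_row in zip(features, embeddings)]


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
kernel = [[1.0, 1.0], [1.0, 1.0]]
features = scan(image, kernel)
# Placeholder standing in for the matrix spliced from the four vectors.
embeddings = [[0.0, 0.0], [0.0, 0.0]]
aligned = align(features, embeddings)
```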

In addition, multi-task learning objectives are added when training the LX-BioLayoutLM model, including a Masked Visual-Language Model (MVLM) loss and a Multi-label Document Classification (MDC) loss, to further drive the joint pre-training of text and layout.
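How the two losses are combined is not described in the source. A common arrangement, assumed here, is a cross-entropy over masked token positions for MVLM, a binary cross-entropy over document labels for MDC, and a weighted sum of the two; the weight `mdc_weight` and all function names are hypothetical.

```python
import math
from typing import List


def mvlm_loss(token_probs: List[List[float]], targets: List[int],
              masked: List[bool]) -> float:
    """Mean cross-entropy over masked token positions only."""
    losses = [-math.log(probs[target])
              for probs, target, is_masked in zip(token_probs, targets, masked)
              if is_masked]
    return sum(losses) / len(losses) if losses else 0.0


def mdc_loss(label_probs: List[float], labels: List[int]) -> float:
    """Mean binary cross-entropy over independent document labels."""
    terms = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
             for p, y in zip(label_probs, labels)]
    return sum(terms) / len(terms)


def total_loss(mvlm: float, mdc: float, mdc_weight: float = 1.0) -> float:
    """Assumed combination: a simple weighted sum of the two objectives."""
    return mvlm + mdc_weight * mdc


token_probs = [[0.5, 0.5], [0.25, 0.75]]
loss = total_loss(
    mvlm_loss(token_probs, targets=[0, 1], masked=[True, False]),
    mdc_loss([0.5, 0.5], labels=[1, 0]),
)
```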

This example discloses the training and application process of the LX-BioLayoutLM model.

The image document structured analysis system provided by the invention is described below, and the image document structured analysis system described below and the image document structured analysis method described above can be referred to correspondingly.

Fig. 2 is a schematic structural diagram of the image document structured analysis system provided by the present invention. As shown in Fig. 2, the system includes:

the image synthesis module, used for splicing the images of all pages in the image document in sequence to obtain a composite image;

the user-defined information removing module, used for covering the corresponding part in the composite image based on the preset text content information to be removed;

the image layout resetting module, used for performing layout integration on the composite image to obtain the image to be analyzed;
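The way the modules chain together can be sketched as a small pipeline object. In this illustration strings stand in for images and trivial lambdas stand in for the real modules, including the analysis module that wraps the LX-BioLayoutLM model; the class `ParsingSystem` and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ParsingSystem:
    """Four modules applied in order; each is any callable with this shape."""
    synthesize: Callable[[List[str]], str]
    remove_custom_info: Callable[[str], str]
    reset_layout: Callable[[str], str]
    parse: Callable[[str], str]

    def run(self, pages: List[str]) -> str:
        composite = self.synthesize(pages)
        covered = self.remove_custom_info(composite)
        to_parse = self.reset_layout(covered)
        return self.parse(to_parse)


# Stand-in modules: one line of text per "page", blank lines as redundant layout.
system = ParsingSystem(
    synthesize=lambda pages: "\n".join(pages),
    remove_custom_info=lambda image: image.replace("WATERMARK", ""),
    reset_layout=lambda image: "\n".join(
        line for line in image.split("\n") if line),
    parse=lambda image: "<text>" + image + "</text>",
)
result = system.run(["Page 1", "WATERMARK", "Page 2"])
```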

With fine-tuning during training, the accuracy of the system can reach 96.5%, and manual verification puts the accuracy of actual inference results at 93%. Using the system provided by the invention for structured analysis of image documents therefore effectively addresses the problem that manual data extraction is time-consuming and expensive.

Fig. 3 is a schematic physical structure diagram of an electronic device provided by the present invention. As shown in Fig. 3, the electronic device may include a processor 810, a communication interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may call logic instructions in the memory 830 to perform an image document structured analysis method, the method comprising:

covering the corresponding part in the composite image based on the preset text content information to be eliminated;

In addition, the logic instructions in the memory 830 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the image document structured analysis method provided above, the method comprising:

In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the image document structured analysis method provided above, the method comprising:

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for structured interpretation of image documents, the method comprising:

2. The image document structured analysis method according to claim 1, wherein after inputting the image to be analyzed into the LX-BioLayoutLM model to obtain a parsed document with structured tags, the method further comprises:

and if the completeness exceeds a preset threshold, confirming the parsed document as the structured parsing result of the image document.

3. The method according to claim 2, wherein after confirming, if the completeness exceeds the preset threshold, the parsed document as the structured parsing result of the image document, the method further comprises:

selecting a plurality of required tags based on the structured tags in the parsed document, and extracting in batches the paragraph texts of the image documents corresponding to the tags.

4. The method of claim 1, wherein the training dataset of the LX-BioLayoutLM model is a structurally labeled image document.

5. The image document structured analysis method of claim 1, wherein the BERT model part in the LX-BioLayoutLM model takes as input the text in the image to be analyzed and the position information corresponding to the text, and takes as output a text vector representing semantic understanding of the text and a position embedding vector representing the mapping relationship between text paragraphs and the image.

6. The image document structured analysis method of claim 5, wherein the LayoutLM model part in the LX-BioLayoutLM model takes the image to be analyzed, the text in the image to be analyzed and the position information corresponding to the text as input, and takes the character-level 2D position embedding vector and the image embedding vector embodying the image feature information as output.

7. The image document structured analysis method of claim 6, wherein the LX-BioLayoutLM model comprises an image alignment layer, the image alignment layer takes the image to be analyzed, the text vector representing semantic understanding of the text, the position embedding vector representing the mapping relationship between text paragraphs and the image, the character-level 2D position embedding vector, and the image embedding vector representing image feature information as input, and takes the parsed document with structured tags as output.

8. An image document structured interpretation system, the system comprising:

the image synthesis module sequentially splices images of all pages in the image document to obtain a synthetic image;

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the image document structured parsing method according to any of claims 1-7.

10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the image document structured interpretation method according to any of claims 1 to 7.