CN114782965A - Visual rich document information extraction method, system and medium based on layout relevance - Google Patents

Visual rich document information extraction method, system and medium based on layout relevance

Info

Publication number
CN114782965A
CN114782965A
Authority
CN
China
Prior art keywords
information extraction
field
document information
visual
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210223134.4A
Other languages
Chinese (zh)
Inventor
唐国志
薛洋
金连文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210223134.4A priority Critical patent/CN114782965A/en
Publication of CN114782965A publication Critical patent/CN114782965A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Character Input (AREA)

Abstract

The invention discloses a visual rich document information extraction method, system and medium based on layout relevance. The method comprises the following steps: acquiring visual rich document images and labeling them to obtain a data set; constructing a document information extraction model directed at layout relevance and training it with the data set; and acquiring a visual rich document picture, inputting it into the trained document information extraction model, and obtaining an information extraction result. Layout relevance refers to the positional relationship between static fields and dynamic fields, where a static field is a field whose characters are fixed within the same template and a dynamic field is a field whose content varies with the actual contents within the same template. The invention provides a scheme for extracting visual rich document information by exploiting document layout relevance, which achieves high-precision visual rich document information extraction with small sample data and can be widely applied in the field of visual information extraction.

Description

Visual rich document information extraction method, system and medium based on layout relevance
Technical Field
The invention relates to the field of visual information extraction, and in particular to a visual rich document information extraction method, system and medium based on layout relevance.
Background
Visual rich document pictures are ubiquitous in daily life; cards, bills, contracts and the like all fall into the category of visual rich documents. Such pictures play an important role in many aspects of daily life; for example, identity cards, bank cards and invoices all record important information. Their common characteristic is that the semantic hierarchy is determined not only by the meaning of the characters but also by the font, color, size, position and layout of the text. Conventional text detection and recognition technology can only transcribe all of this text into electronic form. This is far from sufficient: daily work and life require not only the characters on the pictures but also the information behind them, so information extraction for visual rich documents is very important; for example, automatic form filling, automatic proofreading and automatic resume extraction all depend on information extraction technology. Key information extraction for visual rich documents aims to extract named entities of interest from the existing OCR results of the pictures and form key-value pairs.
In recent years, with the development of deep learning, OCR text detection and recognition based on deep learning has advanced greatly, and mature products have entered many aspects of daily life. At the same time, beyond knowing what the characters in a document image are, people care more about the meaning behind these characters, such as dates, names and amounts. Therefore, visual rich document information extraction based on deep learning has gradually become a research focus. Building on BiLSTM-CRF-style methods, related research has introduced graph convolutional networks and incorporated layout and other features into a document graph structure as reference. Another line of work constructs a raster image and fuses word embedding features along the channel dimension. These methods perform well when trained and tested on labeled data sets. However, problems remain: existing methods often carry over the problem-solving ideas of natural language processing and model two-dimensional text with spatial layout relationships by flattening it into a sequence; in the process, the relationships between layout elements are ignored and each field is analyzed in isolation. This effectively discards the spatial relationships inherent in a visual rich document. In addition, most existing methods design, train and test models for specific scenarios; such models suit only fixed scenes, and their generalization performance degrades greatly. These difficulties seriously hinder the development of information extraction technology.
Disclosure of Invention
To solve at least one of the technical problems in the prior art to some extent, an object of the present invention is to provide a method, a system, and a medium for extracting visual rich document information based on layout relevance.
The technical scheme adopted by the invention is as follows:
a visual rich document information extraction method based on layout relevance comprises the following steps:
acquiring a visual rich document image, and labeling the visual rich document image to obtain a data set;
constructing a document information extraction model aiming at layout relevance, and training the document information extraction model by adopting a data set;
acquiring a visual rich document picture, inputting the visual rich document picture into a trained document information extraction model, and acquiring an information extraction result;
the layout relevance refers to the position relation between a static field and a dynamic field, the static field is a field with fixed characters in the same template, and the dynamic field is a field which changes according to actual contents in the same template.
Further, the training process of the document information extraction model comprises the following steps:
representing each distinct field semantic with a unique 1024-dimensional embedding vector, encoding the positional characteristics of the field, such as the relative position of its center-point coordinates, as numbers, and feeding the processed semantic and positional characteristics together as input features;
calculating the mean value of all input features in each category in the preset model to serve as the category center of each category;
the classification is done by measuring the distance of the sample from the center of the class.
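The class-center training scheme above can be sketched as follows; `class_centers` and `classify` are hypothetical helper names, and the 2-D toy features stand in for the high-dimensional embeddings described above (a sketch under those assumptions, not the patented implementation):

```python
import numpy as np

def class_centers(features, labels):
    """Compute the mean feature vector (class center) of each category.

    features: (N, D) array of input features; labels: length-N class ids.
    """
    centers = {}
    for c in set(labels):
        idx = [i for i, l in enumerate(labels) if l == c]
        centers[c] = features[idx].mean(axis=0)  # mean of all features in the class
    return centers

def classify(sample, centers):
    """Assign a sample to the class whose center is nearest (Euclidean distance)."""
    return min(centers, key=lambda c: np.linalg.norm(sample - centers[c]))
```

A usage example: two tight clusters yield two centers, and a new sample is labeled by whichever center it lies closest to.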
Further, the document information extraction model performs the following processing on the input visual rich document picture:
acquiring each independent field in the visual rich document picture as a node in the picture;
acquiring edge connection relations among the nodes, where the edge connection relation is {|X_{i-j}|, |Y_{i-j}|, W_i/W_j, H_i/H_j}: |X_{i-j}| denotes the distance between the two field nodes on the abscissa, |Y_{i-j}| the distance between them on the ordinate, W_i/W_j the ratio of the widths of the two rectangular boxes corresponding to the two field nodes, and H_i/H_j the ratio of their heights;
and acquiring the connection relation between all the static fields and all the dynamic fields, and acquiring the matching relation between the static fields and the dynamic fields according to the connection relation.
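The edge connection relation above can be computed directly from the two fields' bounding boxes; the `(cx, cy, w, h)` box layout and the function name `edge_features` are assumptions for illustration:

```python
def edge_features(box_i, box_j):
    """Edge relation {|X_i-j|, |Y_i-j|, W_i/W_j, H_i/H_j} between two field nodes.

    Each box is given as (cx, cy, w, h): center point, width, height.
    Returns (horizontal distance, vertical distance, width ratio, height ratio).
    """
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    return (abs(xi - xj), abs(yi - yj), wi / wj, hi / hj)
```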
Further, the matching relationship is obtained by:
acquiring matching probability values between a preset field and all fields;
and selecting the pairs whose matching probability value exceeds the threshold value, indicating that the two fields are in a matching relationship.
Further, the matching relationship comprises a one-to-one matching relationship, a one-to-many matching relationship and a many-to-one matching relationship;
and when the one-to-many matching relation and the many-to-one matching relation occur, acquiring the optimal probability matching according to the probability value.
Further, the obtaining of the optimal probability match according to the probability value includes:
obtaining the probability value R of each dynamic field with respect to the classification result;
sorting the probability values R of each dynamic field from largest to smallest;
traversing each element i in the set of probability values R;
adding the elements i whose probability values rank in the top three to the set Q;
traversing the other elements j in the set of probability values R besides element i;
if the accumulated probability sum including the current probability value is larger than the existing accumulated sum in the set, adding the current probability value to the set Q and removing the old value;
updating the set Q.
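A simplified version of this filtering can be sketched as a greedy pass over candidate pairs sorted by probability; `resolve_matches` and its pair-dictionary input format are hypothetical, and the top-three bookkeeping of the original algorithm is reduced here to keeping the single best match per dynamic field:

```python
def resolve_matches(pair_probs, threshold=0.5):
    """Resolve one-to-many / many-to-one matches greedily by probability.

    pair_probs: dict mapping (static_field, dynamic_field) -> match probability.
    Pairs above `threshold` are candidates; when a dynamic field matches
    several static fields, only the highest-probability pairing is kept.
    Returns a dict mapping each dynamic field to its matched static field.
    """
    best = {}
    # Visit candidate pairs from most to least probable.
    for (s, d), p in sorted(pair_probs.items(), key=lambda kv: -kv[1]):
        if p > threshold and d not in best:
            best[d] = (s, p)  # first (i.e. best) match wins for this dynamic field
    return {d: s for d, (s, p) in best.items()}
```

The design choice is the usual greedy resolution: sorting once and taking the first hit per dynamic field guarantees the optimal per-field probability without an explicit accumulation step.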
Further, the visual rich document images include bill images, invoice images, card images and certificate images.
The invention adopts another technical scheme that:
a visual rich document information extraction system based on layout relevance comprises:
the data acquisition module is used for acquiring the visual rich document image and labeling the visual rich document image to obtain a data set;
the model training module is used for constructing a document information extraction model aiming at the layout relevance and training the document information extraction model by adopting a data set;
the information extraction module is used for acquiring the visual rich document picture, inputting the visual rich document picture into the trained document information extraction model and obtaining an information extraction result;
the layout relevance refers to the position relation between a static field and a dynamic field, the static field is a field with fixed characters in the same template, and the dynamic field is a field which changes according to actual contents in the same template.
The invention adopts another technical scheme that:
a visual rich document information extraction system based on layout relevance, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The other technical scheme adopted by the invention is as follows:
a computer readable storage medium storing a program executable by a processor, the program, when executed by the processor, performing the method described above.
The beneficial effects of the invention are as follows: the invention provides a scheme for extracting visual rich document information by exploiting document layout relevance, which achieves high-precision visual rich document information extraction with small sample data and has a positive effect on information extraction in scenarios with similar layout information, such as bank cards, bills and contract documents.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments are described below. It should be understood that the drawings in the following description show only some embodiments of the technical solutions of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating steps of a visual rich document information extraction method based on layout relevance according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of samples from the visual rich document information extraction data set used in an embodiment of the present invention;
FIG. 3 is a model diagram of an information extraction matching method according to an embodiment of the present invention;
fig. 4 is a schematic diagram of the information extraction matching effect in the embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention and are not to be construed as limiting the present invention. For the step numbers in the following embodiments, they are set for convenience of illustration only, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
In the description of the present invention, it should be understood that the orientation or positional relationship referred to in the description of the orientation, such as upper, lower, front, rear, left, right, etc., is based on the orientation or positional relationship shown in the drawings only for the convenience of description of the present invention and simplification of the description, and does not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.
In the description of the present invention, "several" means one or more and "a plurality of" means two or more; "greater than", "less than", "exceeding", etc. are understood to exclude the stated number, while "above", "below", "within", etc. are understood to include it. Where "first" and "second" are described, they serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of technical features indicated, or their precedence.
In the description of the present invention, unless otherwise specifically limited, terms such as set, installation, connection and the like should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention by combining the specific contents of the technical solutions.
As shown in fig. 1, the present embodiment provides a visual rich document information extraction method based on layout relevance, including the following steps:
and S1, acquiring the visual rich document image, labeling the visual rich document image, and acquiring a data set.
And S2, constructing a document information extraction model aiming at the layout relevance, and training the document information extraction model by adopting a data set.
And S3, obtaining the visual rich document picture, inputting the visual rich document picture into the trained document information extraction model, and obtaining an information extraction result.
The layout relevance refers to the position relation between a static field and a dynamic field, the static field is a field with fixed characters in the same template, and the dynamic field is a field which changes according to actual contents in the same template.
The training and use of the document information extraction model is explained in detail below in conjunction with FIGS. 2-4.
Firstly, visual rich document data sets to be extracted are collected, covering samples with both complex and simple layouts. The samples are drawn from the EATEN, SROIE, CORD and WildReceipt data sets, comprising 1000, 300, 400 and 300 images respectively, 2000 in total. These 2000 images are then labeled with three main elements: the text content, the text position, and the named-entity attribute. Sample pictures of the data set are shown in fig. 2; the pictures exhibit relevance in layout, and the attribute of some fields can be determined by the semantics of adjacent fields. According to the layouts in this data set, the information can be divided into several categories. The specific statistics are shown in table 1 below:
TABLE 1 statistics of visual rich document data set distribution
                    EATEN                   SROIE    CORD     WildReceipt
Type of document    Ticket, business card   Bill     Bill     Bill
Number of samples   1000                    300      400      300
Number of layouts   4                       4        2        2
The model diagram of the visual rich document information extraction model is shown in fig. 3; the query set in the diagram is the set of samples on which information extraction is to be performed, and the support set is the set of templates used for reference.
Training the visual rich document information extraction model: the training and test sets are randomly partitioned at a 1:1 ratio, i.e., 1600 pictures for training and 400 for testing. The layouts are divided into 12 types, of which 8 appear in the training set and 4 in the test set; the specific split can be determined by the actual use case. Fields are defined as two types. The first type, called static fields, are usually invariant words used to interpret other fields, such as "date", "name" and "amount"; they have fixed positions in the image and a certain similarity of layout distribution across samples. The other type, called dynamic fields, vary with each instance, such as each person's name or a specific date. The overall idea of the method draws on the prototype network and matching network from few-shot learning; the difference from traditional methods, which mainly model image features, is that what is modeled here is the above-mentioned positional relationship between static and dynamic fields together with their own semantic features.
Adopting the idea of a document graph, each independent field is treated as a node in the graph, and the edge between two nodes is defined by the spatial positional relationship between the two fields. Taking nodes i and j as an example, their connection relation is {|X_{i-j}|, |Y_{i-j}|, W_i/W_j, H_i/H_j}, denoting respectively the horizontal and vertical distances between their center points and the ratios of their widths and heights. The semantic feature of a field, i.e., the node feature, is the sentence-level semantic representation extracted by a large-scale pre-trained language model, represented as a 768-dimensional feature; specifically, each word in the vocabulary is represented by one and only one unique 768-dimensional vector, which captures the degree of similarity between words. For feature extraction, the edge and node features between all static fields and dynamic fields are first computed and organized as node-edge-node triples, and the feature value of each category center is computed according to the number of dynamic-field categories; specifically, the feature values between a dynamic field and the static fields under all its connection relations are summed and averaged. Each field to be queried is then combined with the computed category-center feature values using an attention mechanism and fed into a two-layer multilayer perceptron. Following the idea of the matching network from meta-learning, a classification probability matrix of size (number of dynamic-field categories) x (total number of categories) is obtained; its difference from the target value is computed with a cross-entropy loss function, and the model parameters are updated by gradient back-propagation.
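The attention-plus-perceptron scoring step described above can be sketched as follows; the weight shapes, the way the query is combined with each category center, and the function names are illustrative assumptions rather than the patented architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def match_scores(query, centers, w1, b1, w2, b2):
    """Score a query field against category centers with attention + a 2-layer MLP.

    query: (D,) feature of the field to classify; centers: (C, D) category-center
    features; w1/b1, w2/b2: weights of the two perceptron layers.
    Returns a length-C probability vector over the categories.
    """
    attn = softmax(centers @ query)            # attention weights of the query over the centers
    mixed = attn[:, None] * centers + query    # combine each weighted center with the query
    hidden = np.maximum(mixed @ w1 + b1, 0.0)  # first perceptron layer with ReLU
    logits = (hidden @ w2 + b2).ravel()        # second layer: one score per category
    return softmax(logits)                     # classification probabilities
```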
It is worth mentioning that, in order to make the whole parameter computation differentiable, the method adopts a DD-ILP solver. Finally, the designed model achieves the accuracy shown in table 2:
TABLE 2 accuracy of the proposed visual rich document information extraction method
                                               F1-score (%)
Method (input: node features)                  78.69
Method (input: node features, edge features)   83.65
Furthermore, the matching combinations obtained may be many-to-one, one-to-many, etc. In that case, corresponding filtering rules need to be designed. To solve this problem, the following algorithm is designed:
(Algorithm listing provided as an image in the original filing.)
After the model is trained, a visual rich document picture is input and the final result is obtained through the few-sample visual rich document information extraction method based on document layout relevance, as shown in fig. 4.
In summary, this embodiment solves the problem of learning transferable layout features from the relevance of text layout between samples under small-sample conditions. High-precision visual rich document information extraction can therefore be achieved with only a small number of labeled samples. The method has a positive effect on information extraction in scenarios with similar layout information, such as bank cards, bills and contract documents, and offers strong generalization, high precision and fast processing.
The embodiment also provides a visual rich document information extraction system based on layout relevance, which comprises:
the data acquisition module is used for acquiring the visual rich document image and marking the visual rich document image to obtain a data set;
the model training module is used for constructing a document information extraction model aiming at layout relevance and training the document information extraction model by adopting a data set;
the information extraction module is used for acquiring the visual rich document pictures, inputting the visual rich document pictures into the trained document information extraction model and acquiring an information extraction result;
the layout relevance refers to the position relation between a static field and a dynamic field, the static field is a field with fixed characters in the same template, and the dynamic field is a field which changes according to actual contents in the same template.
The visual rich document information extraction system based on layout relevance can execute the visual rich document information extraction method based on layout relevance provided by the method embodiment of the invention, can execute any combination of the implementation steps of the method embodiment, and has corresponding functions and beneficial effects of the method.
The embodiment also provides a visual rich document information extraction system based on layout relevance, which comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method illustrated in fig. 1.
The visual rich document information extraction system based on layout relevance can execute the visual rich document information extraction method based on layout relevance provided by the method embodiment of the invention, can execute any combination of the implementation steps of the method embodiment, and has corresponding functions and beneficial effects of the method.
The embodiment of the application also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor, causing the computer device to perform the method illustrated in fig. 1.
The embodiment also provides a storage medium, which stores an instruction or a program capable of executing the visual rich document information extraction method based on layout relevance provided by the method embodiment of the invention, and when the instruction or the program is run, the method embodiment can be executed in any combination of implementation steps, and the method has corresponding functions and beneficial effects.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in a separate physical device or software module. It will also be understood that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer given the nature, function, and interrelationships of the modules. Accordingly, those of ordinary skill in the art will be able to practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention or a part thereof which substantially contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the foregoing description of the specification, reference to the description of "one embodiment/example," "another embodiment/example," or "certain embodiments/examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes, modifications, substitutions, and alterations in form and detail may be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims (10)

1. A visual rich document information extraction method based on layout relevance is characterized by comprising the following steps:
acquiring a visual rich document image, and labeling the visual rich document image to obtain a data set;
constructing a document information extraction model aiming at layout relevance, and training the document information extraction model by adopting a data set;
acquiring a visual rich document picture, inputting the visual rich document picture into a trained document information extraction model, and acquiring an information extraction result;
the layout relevance refers to the positional relationship between static fields and dynamic fields, wherein a static field is a field whose text is fixed within the same template, and a dynamic field is a field whose content varies with the actual document under the same template.
2. The visual rich document information extraction method based on layout relevance according to claim 1,
the training process of the document information extraction model comprises the following steps:
representing the different semantics of a field as a 1024-dimensional embedding vector, quantizing the positional characteristics of the field into numbers, and taking the processed semantic and positional characteristics together as the input features;
calculating, within the preset model, the mean of all input features in each category as that category's center;
completing classification by measuring the distance between a sample and each category center.
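As an illustrative sketch of the nearest-class-center classification described in claim 2 (the function names and toy 2-D features below are hypothetical; the claim itself uses 1024-dimensional embeddings combined with quantized position features), the category centers can be computed as per-class means and a sample assigned to the class whose center is nearest:

```python
import numpy as np

def class_centers(features, labels):
    """Mean feature vector per class: the 'category center' of claim 2."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(sample, centers):
    """Assign the sample to the class whose center is nearest (Euclidean)."""
    return min(centers, key=lambda c: np.linalg.norm(sample - centers[c]))
```

In this reading, training reduces to estimating the per-class mean vectors, and inference is a distance comparison against those means.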
3. The method for extracting visually rich document information based on layout relevance according to claim 1,
the document information extraction model performs the following processing on the input visual rich document picture:
acquiring each independent field in the visually rich document picture as a node of a graph;
acquiring the edge connection relationship between nodes; wherein the edge connection relationship is {|Xi − Xj|, |Yi − Yj|, Wi/Wj, Hi/Hj}, where |Xi − Xj| denotes the distance between two field nodes along the abscissa, |Yi − Yj| denotes the distance between two field nodes along the ordinate, Wi/Wj denotes the ratio of the widths of the rectangular boxes corresponding to the two field nodes, and Hi/Hj denotes the ratio of the heights of the rectangular boxes corresponding to the two field nodes;
and acquiring the connection relation between all the static fields and all the dynamic fields, and acquiring the matching relation between the static fields and the dynamic fields according to the connection relation.
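The edge attributes of claim 3 can be sketched as follows (an illustrative helper, not claim language; the box representation `(x, y, w, h)` is an assumption about how the rectangular boxes are stored):

```python
def edge_features(box_i, box_j):
    """Edge attributes between two field nodes, per claim 3:
    horizontal distance, vertical distance, width ratio, height ratio.
    Each box is (x, y, w, h)."""
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    return (abs(xi - xj), abs(yi - yj), wi / wj, hi / hj)
```

The two distances capture the relative placement of the fields on the page, while the two ratios capture their relative size, which together describe the layout relevance of a field pair.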
4. The visual rich document information extraction method based on layout relevance according to claim 3,
the matching relationship is obtained by:
acquiring the matching probability values between a given field and all other fields;
and retaining those pairs whose matching probability value is greater than a threshold, each such pair being treated as a matching relationship.
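A minimal sketch of the thresholding step in claim 4 (the dictionary layout and field names are hypothetical, chosen only for illustration):

```python
def match_pairs(probs, threshold=0.5):
    """Keep the field pairs whose matching probability exceeds the
    threshold, per claim 4.  probs maps
    (static_field, dynamic_field) -> probability."""
    return [pair for pair, p in probs.items() if p > threshold]
```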
5. The method for extracting visually rich document information based on layout relevance according to claim 3,
the matching relationship comprises a one-to-one matching relationship, a one-to-many matching relationship and a many-to-one matching relationship;
and when the one-to-many matching relation and the many-to-one matching relation occur, acquiring the optimal probability matching according to the probability value.
6. The method for extracting visually rich document information based on layout relevance according to claim 4,
the obtaining of the optimal probability matching according to the probability values includes:
obtaining the probability value R of each dynamic field with respect to the classification result;
sorting the probability values R of the dynamic fields in descending order;
traversing each element i in the set of probability values R;
adding the elements ranked in the top three by probability value to a set Q;
traversing the other elements j in the set of probability values R apart from element i;
if the cumulative probability sum including the current probability value is greater than the cumulative probability sum already in the set, adding the current probability value to the set Q and removing the old value;
and updating the set Q.
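One plausible reading of the conflict resolution in claims 5 and 6 is a greedy pass in descending probability order, where each dynamic field keeps only its highest-probability match. This is an assumption about the procedure, not the patented algorithm itself, and the field names are hypothetical:

```python
def resolve_matches(candidates):
    """Greedy one-match-per-dynamic-field resolution.  candidates is a
    list of (static_field, dynamic_field, probability) triples; pairs
    are considered in descending probability, and a dynamic field keeps
    only the first (best) static field that claims it."""
    taken = {}
    for static, dynamic, prob in sorted(candidates, key=lambda t: -t[2]):
        if dynamic not in taken:
            taken[dynamic] = static
    return taken
```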
7. The visual rich document information extraction method based on layout relevance according to claim 1,
the visually rich document image includes a ticket image, an invoice image, an identity document image, or a certificate image.
8. A visual rich document information extraction system based on layout relevance is characterized by comprising:
the data acquisition module is used for acquiring the visual rich document image and marking the visual rich document image to obtain a data set;
the model training module is used for constructing a document information extraction model aiming at layout relevance and training the document information extraction model by adopting a data set;
the information extraction module is used for acquiring the visual rich document pictures, inputting the visual rich document pictures into the trained document information extraction model and acquiring an information extraction result;
the layout relevance refers to the positional relationship between static fields and dynamic fields, wherein a static field is a field whose text is fixed within the same template, and a dynamic field is a field whose content varies with the actual document under the same template.
9. A visual rich document information extraction system based on layout relevance is characterized by comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a program executable by a processor is stored, wherein the program executable by the processor is adapted to perform the method according to any one of claims 1 to 7 when executed by the processor.
CN202210223134.4A 2022-03-07 2022-03-07 Visual rich document information extraction method, system and medium based on layout relevance Pending CN114782965A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210223134.4A CN114782965A (en) 2022-03-07 2022-03-07 Visual rich document information extraction method, system and medium based on layout relevance

Publications (1)

Publication Number Publication Date
CN114782965A 2022-07-22

Family

ID=82423630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210223134.4A Pending CN114782965A (en) 2022-03-07 2022-03-07 Visual rich document information extraction method, system and medium based on layout relevance

Country Status (1)

Country Link
CN (1) CN114782965A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116029279A (en) * 2023-03-28 2023-04-28 深圳前海环融联易信息科技服务有限公司 Method, device, equipment and medium for analyzing log-in attachment based on multi-mode model
CN116029279B (en) * 2023-03-28 2023-07-07 深圳前海环融联易信息科技服务有限公司 Method, device, equipment and medium for analyzing log-in attachment based on multi-mode model
CN116152841A (en) * 2023-04-20 2023-05-23 中国科学院自动化研究所 Document entity and relation extraction method, device and storage medium

Similar Documents

Publication Publication Date Title
CN112801010B (en) Visual rich document information extraction method for actual OCR scene
CN107766324B (en) Text consistency analysis method based on deep neural network
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
CN112464781B (en) File image key information extraction and matching method based on graphic neural network
CN106845358B (en) Method and system for recognizing image features of handwritten characters
Yuan et al. Chinese text in the wild
CN109446423B (en) System and method for judging sentiment of news and texts
CN112633431A (en) Tibetan-Chinese bilingual scene character recognition method based on CRNN and CTC
Roy et al. Word retrieval in historical document using character-primitives
CN114443855A (en) Knowledge graph cross-language alignment method based on graph representation learning
CN105159917A (en) Generalization method for converting unstructured information of electronic medical record to structured information
Wei et al. Representing word image using visual word embeddings and RNN for keyword spotting on historical document images
Le et al. Stroke order normalization for improving recognition of online handwritten mathematical expressions
Rothacker et al. Bag-of-features HMMs for segmentation-free Bangla word spotting
CN114782965A (en) Visual rich document information extraction method, system and medium based on layout relevance
CN116343237A (en) Bill identification method based on deep learning and knowledge graph
Alfaro-Contreras et al. Few-shot symbol classification via self-supervised learning and nearest neighbor
Wei et al. Word image representation based on visual embeddings and spatial constraints for keyword spotting on historical documents
CN110674293A (en) Text classification method based on semantic migration
Wilkinson et al. Neural word search in historical manuscript collections
CN115359486A (en) Method and system for determining custom information in document image
CN114418014A (en) Test paper generation system for avoiding test question similarity
CN113792545A (en) News event activity name extraction method based on deep learning
CN114067343A (en) Data set construction method, model training method and corresponding device
Akhter et al. Semantic segmentation of printed text from marathi document images using deep learning methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination