CN118036597A

CN118036597A - Structured information extraction method, system and medium based on self-adaptive space measurement

Info

Publication number: CN118036597A
Application number: CN202410203859.6A
Authority: CN
Inventors: 石芳; 覃勋辉; 刘科; 邓金玉
Original assignee: Chongqing Sign Digital Technology Co ltd
Current assignee: Chongqing Sign Digital Technology Co ltd
Priority date: 2024-02-23
Filing date: 2024-02-23
Publication date: 2024-05-14

Abstract

The invention relates to a structured information extraction method, system and medium based on an adaptive spatial metric, comprising the following steps. Offline template construction: the position information of the keywords in a template is labeled; the labeled template is a set K of elements k. Character recognition and sorting: character recognition is performed on the document image to be processed to obtain a set S of elements s. Keyword extraction: for each element s in the set S, the minimum number of units that the text box of element s must be moved to overlap the text box of keyword i in the template is calculated and denoted D_1i, and the keyword x with the minimum D_1i is found; it is then judged whether the character string of element s contains the keyword x to be identified; if so, [x, s-x, position information of s] is stored in a result set R as an element r; otherwise s, with its position information, is stored in a candidate set C as an element c. Corresponding value reconstruction: for every element c in the candidate set C, the minimum number of units that the text box of element c must be moved to be adjacent to the text box of element r_j in the result set R is calculated and denoted D_2j, the element r with the minimum D_2j is found, and the character string of element c is appended to the "value" of that element r. The method is highly portable and greatly improves the accuracy of information extraction.

Description

Structured information extraction method, system and medium based on self-adaptive space measurement

Technical Field

The invention relates to the field of information technology, and in particular to a structured information extraction method, system and medium based on an adaptive spatial metric.

Background

As text recognition has been applied more and more widely in recent years, relation extraction methods for structured document recognition have attracted growing attention from researchers, but many problems remain unsolved. First, most common structured information extraction methods are coupled to a specific text recognition method, which reduces the portability of the algorithm. Second, most information extraction algorithms consider only the content of the text recognition results and do not make full use of the spatial position of the text, so the extraction results can be inaccurate when keywords are recognized imprecisely, characters are misrecognized, or many types of information are present. Finally, conventional template matching methods tend to label as much prior information as possible; although this adds prior references, the excess labeled information in the template can interfere with itself and cause errors in the final key information matching.

Therefore, there is a need to develop a new method, system and medium for extracting structured information based on adaptive spatial metrics.

Disclosure of Invention

The invention aims to provide a structured information extraction method, system and medium based on an adaptive spatial metric which can extract the key information a user cares about from unordered text lines, can be applied to any text recognition algorithm with a certain recognition precision, offers higher portability, and can greatly improve the accuracy of information extraction.

In a first aspect, the present invention provides a structured information extraction method based on adaptive spatial metric, including the following steps:

Offline template construction: labeling the position information of the keywords in a template, wherein the labeled template is a set K of elements k, and each element k comprises a keyword and text box coordinate set information;

Character recognition: performing text recognition on the document image to be processed to obtain a set S of elements s, wherein each element s comprises a character string and text box position information;

Keyword extraction: for each element s in the set S, calculating the minimum number of units that the text box of element s must be moved to overlap the text box of keyword i in the template, denoted D_1i, where i = 1, 2, ..., m and m is the number of labeled keywords in the template, and finding the keyword x with the minimum D_1i; judging whether the character string of element s contains the keyword x to be identified; if so, storing [x, s-x, position information of s] in a result set R as an element r, where s-x denotes the character string that remains after the keyword x is removed from the character string of element s; otherwise, storing s, with its position information, in a candidate set C as an element c;

Corresponding value reconstruction: for every element c in the candidate set C, calculating the minimum number of units that the text box of element c must be moved to be adjacent to the text box of element r_j in the result set R, denoted D_2j, where j = 1, 2, ..., n and n is the number of elements r in the result set R, finding the element r with the minimum D_2j, and appending the character string of element c to the "value" of that element r.

Optionally, during offline template construction, element k is defined as [keyword, text box coordinate set, type of the keyword's corresponding value];

after corresponding value reconstruction is completed, type checking and correction are also performed, specifically:

filtering out of the result set R, according to the type of each keyword's corresponding value, the elements r whose value does not match the type specified for its keyword.

Optionally, D_1i is calculated as follows:

D_1i(A,B) = α(A,B)·min(|A_x1-B_x1|, |A_x2-B_x2|) + min(|A_y1-B_y1|, |A_y2-B_y2|)

where A denotes the text box of keyword i in the template; B denotes the text box of element s; (A_x1, A_y1) and (A_x2, A_y2) are the upper-left and lower-right corner coordinates of text box A; (B_x1, B_y1) and (B_x2, B_y2) are the upper-left and lower-right corner coordinates of text box B; and α(A,B) is a coefficient reflecting the relative position of text box A and text box B.

Optionally, D_2j is calculated as follows:

D_2j(E,F) = α(E,F)·min(|E_x1-F_x1|, |E_x1-F_x2|, |E_x2-F_x2|, |E_x2-F_x1|) + min(|E_y1-F_y1|, |E_y1-F_y2|, |E_y2-F_y1|, |E_y2-F_y2|)

where E denotes the text box of element r_j; F denotes the text box of element c; (E_x1, E_y1) and (E_x2, E_y2) are the upper-left and lower-right corner coordinates of text box E; (F_x1, F_y1) and (F_x2, F_y2) are the upper-left and lower-right corner coordinates of text box F; and α(E,F) is a coefficient reflecting the relative position of text box E and text box F.

Optionally, α(A,B) is calculated as follows:

where k is a scale parameter; A_x1 > B_x1 indicates that text box A is to the right of text box B, and A_x1 < B_x1 indicates that text box A is to the left of text box B.

Optionally, α(E,F) is calculated as follows:

where k is a scale parameter; E_x1 > F_x1 indicates that text box E is to the right of text box F, and E_x1 < F_x1 indicates that text box E is to the left of text box F.

Optionally, before keyword extraction is performed, the elements S in the set S are ordered in a left-to-right, top-to-bottom order according to "text box positions".

Optionally, the string of element c is added to the position of the "value" of element r, specifically:

The character string of element c is appended to the end of the "value" of element r, and the "position information" of element r is overwritten with the "position information" of element c.

Optionally, whether the character string of the element s contains the keyword x to be identified is determined specifically as follows:

The edit distance between the keyword x and element s is calculated; if the edit distance is smaller than a preset threshold, the character string of element s is considered to contain the keyword x to be identified; otherwise it is considered not to contain it.

In a second aspect, the present invention provides a structured information extraction system based on an adaptive spatial metric, comprising a memory and a controller, wherein the memory stores a computer readable program which, when called by the controller, performs the steps of the structured information extraction method based on an adaptive spatial metric according to the present invention.

In a third aspect, the present invention provides a medium storing a computer readable program which, when called by a controller, performs the steps of the structured information extraction method based on an adaptive spatial metric according to the present invention.

The application has the following beneficial effects. First, the method can be applied to any text recognition algorithm with a certain precision, and extracts structural relations from recognition results that contain relative positions and contents. Second, the method provides an adaptive spatial metric based on template matching which, by using the position information of the text boxes, adaptively gives a similarity measure according to the distance and direction between text boxes, further improving the accuracy of structured information extraction. Finally, during template labeling only the keywords are labeled, and a dedicated metric is designed for matching the content of each keyword's corresponding value, which effectively reduces the influence of excess labeling information in the template on the precision of the final matching result. The method is mainly used for recognizing key information on cards and documents, and compared with traditional structuring methods it is more accurate and faster.

Drawings

FIG. 1 is a flow chart of a method for extracting structured information based on adaptive spatial metrics according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a template according to an embodiment of the present application;

FIG. 3 is a schematic block diagram of a structured information extraction system based on adaptive spatial metrics in an embodiment of the present application.

Detailed Description

Further advantages and effects of the present invention will be readily apparent to those skilled in the art from the following description of embodiments, given with reference to the accompanying drawings and preferred examples. The invention may also be practiced or applied in other embodiments, and the details of this description may be modified or varied without departing from the spirit and scope of the present invention. It should be understood that the preferred embodiments are presented by way of illustration only and not by way of limitation.

As shown in fig. 1, in an embodiment of the present application, a method for extracting structured information based on adaptive spatial metric includes the following steps:

Step 1: constructing an offline template;

Offline template construction mainly consists of labeling the position information of each keyword for the type of document from which information is to be extracted. The labeled template is a set K of elements k, where each element k comprises a keyword and text box coordinate set information.

In the embodiment of the application, element k has the form [keyword, text box coordinate set, other information]. The text box coordinate set determines the position of one text box, so the coordinates of at least two points are required. For example, the position of a text box can be represented by [x1, y1, x2, y2], the coordinates of its upper-left and lower-right corners.

As shown in FIG. 2, taking an identity card as an example, the pixel coordinates of keywords such as "name", "gender" and "ethnicity" need to be labeled on the card. If the template image has height h = 400 pixels and width w = 600 pixels, and one corner of the text box containing the keyword "name" is at pixel coordinates (50, 70), then its normalized coordinates are (50/w, 70/h) = (50/600, 70/400) ≈ (0.083, 0.175). The coordinates of the other keywords are calculated in the same way.
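The normalization above can be sketched in a few lines of Python. This is only an illustration: the function name `normalize_box`, the dictionary layout of the template element, and the lower-right corner (110, 100) are assumptions made for the example; only the image size and the point (50, 70) come from the text.

```python
def normalize_box(box_px, width, height):
    """Convert a pixel box [x1, y1, x2, y2] to coordinates in the range 0..1."""
    x1, y1, x2, y2 = box_px
    return [x1 / width, y1 / height, x2 / width, y2 / height]


# Hypothetical template element k: only the corner (50, 70) and the
# 600 x 400 image size are taken from the description; the lower-right
# corner (110, 100) is invented for illustration.
template_k = {
    "keyword": "name",
    "box": normalize_box([50, 70, 110, 100], width=600, height=400),
}

print([round(v, 3) for v in template_k["box"]])
```

Working in normalized coordinates lets the same template be compared with recognition results from images of a different resolution, provided those boxes are normalized in the same way.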

In the embodiment of the application, after the positions of the labeled keywords have been calculated, the type of the corresponding value can also be labeled in the template for each key-value pair. For example, a name can only be text, and a citizen identification number can only consist of digits, except for its last character.

Because template position matching in the embodiment of the application is mainly based on distance measurement, an overly densely labeled template may introduce deviations into the subsequent information extraction. Therefore only the positions of the keywords are labeled in the template, and the positions of their corresponding values are not.

Step 2: character recognition;

Text recognition is performed on the document image to be processed. The recognition result is a set S of elements s, where each element s is a pair [character string, text box position].

Before keyword extraction, each element S in the set S is ordered in a left-to-right, top-to-bottom order according to the "text box position".
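A minimal sketch of this ordering step follows. It assumes that each element s is a pair (string, [x1, y1, x2, y2]) and that sorting by the upper-left corner, first by y and then by x, is an acceptable reading order; the description does not specify how boxes on the same visual line with slightly different y values are grouped, so that refinement is left out.

```python
def sort_reading_order(elements):
    """Sort OCR elements top-to-bottom, then left-to-right.

    Each element is (text, [x1, y1, x2, y2]) with (x1, y1) the upper-left
    corner. Using (y1, x1) as the sort key is an assumption of this sketch.
    """
    return sorted(elements, key=lambda s: (s[1][1], s[1][0]))


S = [
    ("Male", [0.25, 0.30, 0.30, 0.35]),
    ("Name", [0.10, 0.10, 0.20, 0.15]),
    ("Gender", [0.10, 0.30, 0.20, 0.35]),
]
print([text for text, _ in sort_reading_order(S)])
```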

In the embodiment of the application, after the character recognition and sorting of step 2 are completed, processing proceeds in two steps: keyword extraction in step 3 and corresponding value reconstruction in step 4. The purpose of both steps is to determine the positions and contents of the keywords and their corresponding values by reference to the pre-constructed template; the key question is which metric to use for matching the keywords and the corresponding values.

The embodiment of the application provides an adaptive spatial metric based on template matching. It is an improvement on the classical Manhattan distance that adapts the traditional distance measure to the distribution characteristics of keywords and their corresponding values and to the correspondence between them. Its advantage is that it gives a relative distance score adaptively from both the direction relation and the distance relation of any two text boxes, so that during template matching, text boxes at a similar distance but in the wrong direction are screened out automatically and efficiently on the basis of that score.

Based on the labeling information of the offline template and on the distribution characteristics of keywords and corresponding values, two measures are defined: D_1i for the keyword extraction step and D_2j for the corresponding value reconstruction step. D_1i represents the minimum number of units that two text boxes must be moved to overlap, and is used for keyword position matching in the keyword extraction step. The design rationale is that the text boxes labeled in the template constructed in step 1 contain only keyword information, so the keyword positions in the template can be referred to directly, and the corresponding keyword positions in the recognition result can be confirmed one by one from the simple relative distance of the text boxes. D_2j represents the minimum number of units that two text boxes must be moved to be adjacent, and is used for matching the positions of the corresponding values in the corresponding value reconstruction step. Using D_1i and D_2j to match keywords and values separately simplifies the content of the template and reduces the influence of complex information in it.

Step 3: extracting keywords;

In keyword extraction, for each element s in the sorted set S, the minimum number of units that the text box of element s must be moved to overlap the text box of keyword i in the template is calculated, namely D_1i, where i = 1, 2, ..., m and m is the number of labeled keywords in the template, and the keyword x with the minimum D_1i is found. The edit distance between the keyword x and element s is then calculated. If the edit distance is smaller than a preset threshold, the character string of element s is considered to contain the keyword x to be identified, and [x, s-x, position information of s] is stored in the result set R as an element r, where s-x denotes the character string that remains after the keyword x is removed from the character string of element s. Otherwise, s, with its position information, is stored in the candidate set C as an element c.

In this embodiment, the edit distance is a measure of the difference between two character strings. It is defined as the minimum number of editing operations required to convert a source character string into a target character string. The editing operations are insertion, deletion and replacement, where:

Insertion: one character is inserted into the source string.

Deletion: one character is deleted from the source string.

Replacement: one character in the source string is replaced with one character from the target string.
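The definition above corresponds to the standard Levenshtein distance. A compact dynamic-programming sketch is given below for illustration; the function name `edit_distance` is chosen for this example, and the description does not prescribe any particular implementation.

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum number of insertions, deletions and replacements that
    turn `source` into `target` (Levenshtein distance)."""
    # prev[j] is the distance between the processed prefix of `source`
    # and the first j characters of `target`.
    prev = list(range(len(target) + 1))
    for i, sc in enumerate(source, start=1):
        curr = [i]  # turning i source characters into "" takes i deletions
        for j, tc in enumerate(target, start=1):
            cost = 0 if sc == tc else 1
            curr.append(min(
                prev[j] + 1,         # delete sc from the source
                curr[j - 1] + 1,     # insert tc into the source
                prev[j - 1] + cost,  # replace sc with tc (or keep a match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

The rolling-row form keeps memory proportional to the length of the target string, which is ample for short keywords such as field labels.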

After this step is completed, two sets are obtained, namely a result set R and a candidate set C.

The result set R stores the identified keywords and their position information, and may also already contain identified values. For example, if a recognized character string is "Name Zhang San", the result set R will store the element [Name, Zhang San, position information] after the above procedure. If the recognition result is split into two character strings, "Name" and "Zhang San", the result set R will store [Name, position information] with an empty value, and the candidate set C will store [Zhang San, position information] as one element.
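The split into R and C can be sketched as follows. To keep the example short, both the distance and the containment test are passed in as functions, and the two stand-ins used in the demo (`corner_distance`, a plain Manhattan distance between upper-left corners, and `contains_exact`, an exact substring test) are hypothetical simplifications: they take the place of D_1i and of the thresholded edit distance and are not the method described here.

```python
def extract_keywords(S, K, distance, contains):
    """Split recognized elements into a result set R and a candidate set C.

    S: list of (text, box); K: list of (keyword, box).
    distance(box_k, box_s) plays the role of D_1i, and contains(text, x)
    plays the role of the edit-distance test.
    """
    R, C = [], []
    for text, box in S:
        # Keyword x whose template box is closest to the box of s.
        x, _ = min(K, key=lambda k: distance(k[1], box))
        if contains(text, x):
            # s - x: what is left of the string once the keyword is removed.
            R.append([x, text.replace(x, "", 1).strip(), box])
        else:
            C.append([text, box])
    return R, C


def corner_distance(a, b):
    """Hypothetical stand-in for D_1i: Manhattan distance of upper-left corners."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def contains_exact(text, x):
    """Hypothetical stand-in for the edit-distance threshold test."""
    return x in text


K = [("Name", [0.1, 0.1, 0.2, 0.15]), ("Gender", [0.1, 0.3, 0.2, 0.35])]
S = [
    ("Name Zhang San", [0.1, 0.1, 0.4, 0.15]),
    ("Male", [0.25, 0.3, 0.3, 0.35]),
    ("Gender", [0.1, 0.3, 0.2, 0.35]),
]
R, C = extract_keywords(S, K, corner_distance, contains_exact)
print(R)
print(C)
```

With this input, "Name Zhang San" is stored in R with the value "Zhang San", "Gender" is stored in R with an empty value, and "Male" goes to C to be attached to a keyword later.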

In the embodiment of the present application, D_1i is calculated as follows:

D_1i(A,B) = α(A,B)·min(|A_x1-B_x1|, |A_x2-B_x2|) + min(|A_y1-B_y1|, |A_y2-B_y2|)

In most structured information, the keyword is usually to the left of its value, so targets at the same distance but in different directions need to be given different metric scores. In the embodiment of the application, an adjusted exponential function is used to constrain text boxes lying in different directions, and α(A,B) is calculated as follows:

If A_x1 > B_x1, text box A is to the right of text box B. This layout does not conform to typical structured text, and because the exponential function grows quickly on the positive half-axis, α(A,B) is large and the effect of the corresponding term is amplified. Conversely, if A_x1 < B_x1, text box A is to the left of text box B. This layout conforms to typical structured text, so α(A,B) is close to 1 and the overall calculated distance is relatively small. Here k is a scale parameter used to scale the range of α in practice; experiments show that k = 4 gives good results.
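The D_1i formula can be sketched as below. The exact expression for α(A,B) is not reproduced in this text, so `placeholder_alpha` is a hypothetical stand-in that only mimics the behaviour described (exponential growth when A is to the right of B, a value of 1 otherwise, with k = 4); it should be replaced by the actual α. Boxes are assumed to be [x1, y1, x2, y2] in normalized coordinates.

```python
import math


def placeholder_alpha(a, b, k=4.0):
    """HYPOTHETICAL direction weight, not the formula of the patent.

    Grows exponentially when box a starts to the right of box b and is
    clamped to 1 otherwise, matching only the qualitative description.
    """
    return max(1.0, math.exp(k * (a[0] - b[0])))


def d1(a, b, alpha=placeholder_alpha):
    """D_1(A, B) = alpha * min(|dx of left edges|, |dx of right edges|)
                 + min(|dy of top edges|, |dy of bottom edges|)."""
    dx = min(abs(a[0] - b[0]), abs(a[2] - b[2]))
    dy = min(abs(a[1] - b[1]), abs(a[3] - b[3]))
    return alpha(a, b) * dx + dy


left = [0.1, 0.1, 0.2, 0.15]
right = [0.3, 0.1, 0.4, 0.15]
print(d1(left, right))  # A to the left of B: unweighted horizontal gap
print(d1(right, left))  # A to the right of B: same gap, amplified by alpha
```

Passing α as an argument keeps the distance itself independent of whichever direction weight is finally used.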

Step 4: reconstructing corresponding values;

Corresponding value reconstruction is based mainly on the result set R and the candidate set C. An element r in the result set R has the form r = [keyword, value, position information]; an element c in the candidate set C has the form c = [character string, position information].

For every element c in the candidate set C, the minimum number of units that the text box of element c must be moved to be adjacent to the text box of element r_j in the result set R is calculated, namely D_2j, where j = 1, 2, ..., n and n is the number of elements r in the result set R; the element r with the minimum D_2j is found, and the character string of element c is appended to the "value" of that element r.

In the embodiment of the present application, D_2j is calculated as follows:

D_2j(E,F) = α(E,F)·min(|E_x1-F_x1|, |E_x1-F_x2|, |E_x2-F_x2|, |E_x2-F_x1|) + min(|E_y1-F_y1|, |E_y1-F_y2|, |E_y2-F_y1|, |E_y2-F_y2|)

In most structured information, the keyword is usually to the left of its value, so targets at the same distance but in different directions need to be given different metric scores. In the embodiment of the application, an adjusted exponential function is used to constrain text boxes lying in different directions; specifically, α(E,F) is calculated as follows:

If E_x1 > F_x1, text box E is to the right of text box F. This layout does not conform to typical structured text, and because the exponential function grows quickly on the positive half-axis, the effect of the corresponding term is amplified. Conversely, if E_x1 < F_x1, text box E is to the left of text box F. This layout conforms to typical structured text, so α(E,F) is close to 1 and the overall calculated distance is relatively small. Here k is a scale parameter used to scale the range of α in practice; experiments show that k = 4 gives good results.
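A sketch of D_2j follows, under the same caveat as for any illustration of this metric: the expression for α(E,F) is not reproduced in this text, so `placeholder_alpha` is a hypothetical stand-in that is clamped to 1 when E is to the left of F and grows exponentially otherwise. The helper `nearest_result` is likewise a name invented for the example.

```python
import math


def placeholder_alpha(e, f, k=4.0):
    """HYPOTHETICAL direction weight, not the formula of the patent."""
    return max(1.0, math.exp(k * (e[0] - f[0])))


def d2(e, f, alpha=placeholder_alpha):
    """D_2(E, F): smallest gap between any pair of vertical edges, weighted
    by alpha, plus the smallest gap between any pair of horizontal edges."""
    dx = min(abs(e[0] - f[0]), abs(e[0] - f[2]),
             abs(e[2] - f[2]), abs(e[2] - f[0]))
    dy = min(abs(e[1] - f[1]), abs(e[1] - f[3]),
             abs(e[3] - f[1]), abs(e[3] - f[3]))
    return alpha(e, f) * dx + dy


def nearest_result(c_box, R):
    """Index j of the element r_j = [keyword, value, box] with minimum D_2j."""
    return min(range(len(R)), key=lambda j: d2(R[j][2], c_box))


R = [
    ["Name", "", [0.1, 0.1, 0.2, 0.15]],
    ["Gender", "", [0.1, 0.3, 0.2, 0.35]],
]
value_box = [0.2, 0.3, 0.3, 0.35]  # touches the right edge of "Gender"
print(R[nearest_result(value_box, R)][0])
```

Because all four edge pairings are compared, a value box that directly touches its keyword box on the same line scores 0, which is what "moved to be adjacent" suggests.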

In the embodiment of the present application, the character string of the element c is added to the position of the "value" of the element r, specifically:

If the "value" of element r is not null, the character string of element c is appended directly to the end of the "value" of element r, and the "position information" of element r is overwritten with the "position information" of element c. If the "value" of element r is null, the character string of element c is placed directly in the "value" of element r, and the "position information" of element r is likewise overwritten with that of element c.

For example, suppose the result set R already stores an element r = [Name, Zhang, position information 1] and the candidate set C contains an element c = [San, position information 2]. If, after D_2j is calculated, element c is closest to element r, then element r is modified to [Name, Zhang San, position information 2].
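The update rule can be sketched as a small function. The name `attach_value` and the `sep` parameter are assumptions of this example; `sep` is included only because Latin-script fragments usually need a space between them, whereas Chinese text would be concatenated directly.

```python
def attach_value(r, c, sep=""):
    """Return r = [keyword, value, box] updated with c = [text, box].

    The text of c is appended to the value of r (or becomes the value if
    r has none), and the position of r is overwritten with that of c.
    """
    keyword, value, _ = r
    text, box = c
    new_value = text if not value else value + sep + text
    return [keyword, new_value, box]


r = ["Name", "Zhang", [0.2, 0.1, 0.3, 0.15]]
c = ["San", [0.3, 0.1, 0.4, 0.15]]
print(attach_value(r, c, sep=" "))
```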

Step 5: type checking and correction;

In general, after keyword extraction and value reconstruction are completed, most of the structured information can be extracted accurately as key-value pairs. However, some information still cannot be extracted, either because it is inherently difficult or because of errors in the detection and recognition algorithms. In such cases, type checking can be used for correction to improve accuracy. For example, in the initial definition of the template, element k is defined as [keyword, text box coordinate set, other information], and predefining the type of the value corresponding to a given keyword allows some invalid inputs to be filtered out. For instance, "name" can only consist of Chinese characters, while "citizen identification number" has a fixed number of characters, all of which are digits except possibly the last. The elements of the result set R can be further processed with this information to improve the final accuracy. The result after type checking and correction is the final structured extraction result.
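A sketch of such a type filter is shown below. The type names, the regular expressions and the 18-character format (17 digits followed by a digit or X) are assumptions made for illustration; the description only states that names are Chinese characters and that the identification number has a fixed length with a possibly non-digit last character.

```python
import re

# Hypothetical value types; neither the names nor the patterns are
# specified in the description.
TYPE_PATTERNS = {
    "chinese_text": re.compile(r"[\u4e00-\u9fff·]+"),
    "id_number": re.compile(r"\d{17}[\dXx]"),
}


def filter_by_type(R, key_types):
    """Keep only elements r = [keyword, value, box] whose value matches the
    type declared for its keyword; keywords without a type are kept."""
    kept = []
    for keyword, value, box in R:
        pattern = TYPE_PATTERNS.get(key_types.get(keyword))
        if pattern is None or pattern.fullmatch(value or ""):
            kept.append([keyword, value, box])
    return kept


key_types = {"姓名": "chinese_text", "公民身份号码": "id_number"}
R = [
    ["姓名", "张三", [0.2, 0.1, 0.3, 0.15]],
    ["公民身份号码", "11010519491231002X", [0.3, 0.8, 0.9, 0.85]],
    ["公民身份号码", "1101O5", [0.3, 0.8, 0.9, 0.85]],  # misread, wrong length
]
print([r[1] for r in filter_by_type(R, key_types)])
```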

In addition, for some special and difficult fields, the template can be used to take the positions of adjacent fields as boundaries and to search precisely for the text boxes delimited by those boundaries, thereby completing the information extraction. Still taking the identity card of FIG. 2 as an example, assume that "address" is a field that is difficult to match. Then, in step 4, any character string whose nearest element by D_2j is "address" may be skipped. In this step, the candidate set C is searched again, and the contents of the text boxes located above "citizen identification number" and below "birth" are added in order to the "address" field, which completes the key information extraction.
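The boundary search can be sketched as follows. The sketch assumes image coordinates with y increasing downward (consistent with (x1, y1) being the upper-left corner), candidates of the form [text, [x1, y1, x2, y2]], and a hypothetical helper name `collect_between`; the sample address strings are invented.

```python
def collect_between(C, upper_box, lower_box):
    """Join, in reading order, the candidate strings whose boxes lie
    below `upper_box` and above `lower_box`."""
    top, bottom = upper_box[3], lower_box[1]
    inside = [c for c in C if c[1][1] >= top and c[1][3] <= bottom]
    inside.sort(key=lambda c: (c[1][1], c[1][0]))
    return "".join(text for text, _ in inside)


birth_box = [0.1, 0.40, 0.2, 0.45]  # field above the address
id_box = [0.1, 0.80, 0.3, 0.85]     # field below the address
C = [
    ["Building 3", [0.2, 0.60, 0.5, 0.65]],
    ["1 Example Road, ", [0.2, 0.52, 0.5, 0.57]],
    ["Male", [0.2, 0.30, 0.3, 0.35]],
]
print(collect_between(C, birth_box, id_box))
```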

As shown in FIG. 3, in an embodiment of the present application, a structured information extraction system based on an adaptive spatial metric comprises a memory and a controller, wherein the memory stores a computer readable program which, when called by the controller, performs the steps of the structured information extraction method based on an adaptive spatial metric according to the embodiment of the present application.

In an embodiment of the present application, a medium stores a computer readable program which, when called by a controller, performs the steps of the structured information extraction method based on an adaptive spatial metric according to the embodiment of the present application.

Program code for carrying out the methods of the present application may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of embodiments of the present application, a medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The medium may be a machine-readable signal medium or a machine-readable storage medium. The medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The embodiments described above are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them. Any other change, modification, substitution, combination or simplification that does not depart from the spirit and principles of the present invention is an equivalent replacement and is included in the scope of protection of the present invention.

Claims

1. The structured information extraction method based on the adaptive space measurement is characterized by comprising the following steps of:

2. The method for extracting structured information based on adaptive spatial metric according to claim 1, wherein: when an offline template is constructed, defining an element k as [ a keyword, a text box coordinate set and a type of a keyword corresponding value ];

3. The method for extracting structured information based on adaptive spatial metric according to claim 1, wherein: the calculation formula of D_1i is as follows:

D_1i(A,B) = α(A,B)·min(|A_x1-B_x1|, |A_x2-B_x2|) + min(|A_y1-B_y1|, |A_y2-B_y2|)

4. The method for extracting structured information based on adaptive spatial metric according to claim 1, wherein: the calculation formula of D_2j is as follows:

5. A structured information extraction method based on adaptive spatial metrics according to claim 3, characterized in that: the method for calculating alpha (A, B) comprises the following steps:

6. The method for structured information extraction based on adaptive spatial metrics according to claim 4, wherein: the method for calculating alpha (E, F) comprises the following steps:

7. The method for extracting structured information based on adaptive spatial metric according to claim 2, wherein: before keyword extraction is performed, the elements S in the set S are ordered in a left-to-right, top-to-bottom order according to the "text box position".

8. The method for extracting structured information based on adaptive spatial metric according to claim 1, wherein: the character string of the element c is added to the position of the "value" of the element r, specifically:

9. The method for extracting structured information based on adaptive spatial metric according to claim 1, wherein: judging whether the character string of the element s contains a keyword x to be identified, specifically:

10. A structured information extraction system based on adaptive spatial metrics, characterized by: a memory and a controller, the memory having a computer readable program stored therein, the computer readable program when invoked by the controller being capable of performing the steps of the structured information extraction method based on adaptive spatial metrics as claimed in any one of claims 1 to 9.

11. A medium, characterized by: in which a computer readable program is stored, which, when being called by a controller, is capable of performing the steps of the structured information extraction method based on adaptive spatial metrics as claimed in any one of claims 1 to 9.