CN115937887A - Method and device for extracting document structured information, electronic equipment and storage medium - Google Patents

Info

Publication number
CN115937887A
CN115937887A (application CN202211222617.9A)
Authority
CN
China
Prior art keywords
text
position information
information
text line
document
Prior art date
Legal status
Pending
Application number
CN202211222617.9A
Other languages
Chinese (zh)
Inventor
王雷
张睿
周元剑
Current Assignee
Shanghai Hongji Information Technology Co Ltd
Original Assignee
Shanghai Hongji Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Hongji Information Technology Co Ltd filed Critical Shanghai Hongji Information Technology Co Ltd
Priority to CN202211222617.9A
Publication of CN115937887A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a method and a device for extracting document structured information, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring a document picture to be processed; extracting the position information and attribute information of each text region in the document picture through a trained structured model; extracting the position information and text content of each text line in the document picture through an OCR engine; for each text line, determining the attribute information of the text line according to the position information of the text line and the position information and attribute information of each text region; and obtaining the structured information of the document picture according to the attribute information and text content of each text line. With this scheme, no template needs to be customized separately for each document, any OCR engine can be adapted, the degree of automation of document structured-information extraction is increased, and generalization and robustness are improved.

Description

Method and device for extracting document structured information, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a method and an apparatus for extracting document structured information, an electronic device, and a storage medium.
Background
In recent years, with the continued deepening of digitalization in China, converting paper documents to electronic form has become a trend. However, because paper documents come in complex and varied formats, their contents still need to be extracted manually and entered into electronic systems in structured form. Manual entry has serious drawbacks: on the one hand, omissions and wrong entries are unavoidable; on the other hand, large amounts of manpower must be devoted to highly repetitive work, which puts great pressure on entry personnel and is inefficient.
To electronize a paper document, the character information in the image is generally first recognized by OCR (Optical Character Recognition) technology, and the recognition result is then structured. Existing structuring approaches are generally rule-based or learning-based, but both have low robustness and generalization.
Disclosure of Invention
The embodiment of the application provides a method for extracting document structured information, which is used for improving generalization and robustness.
The embodiment of the application provides a method for extracting document structured information, which comprises the following steps:
acquiring a document picture to be processed;
extracting the position information and the attribute information of each text area in the document picture through a trained structured model;
extracting the position information and the text content of each text line in the document picture through an OCR engine;
for each text line, determining attribute information of the text line according to the position information of the text line and the position information and attribute information of each text region;
and obtaining the structured information of the document picture according to the attribute information and the text content of each text line.
In one embodiment, the structured model comprises a feature extraction module adopting a Swin-Transformer architecture, a first decoding module, a second decoding module and a third decoding module; the first decoding module, the second decoding module and the third decoding module all adopt a Transformer architecture;
the output of the feature extraction module is used as the input of the first decoding module, and the output of the first decoding module is respectively used as the input of the second decoding module and the input of the third decoding module.
In an embodiment, the extracting, by the trained structured model, the position information and the attribute information of each text region in the document picture includes:
extracting the image characteristics of the document picture through the characteristic extraction module;
decoding the image features through the first decoding module to obtain initial decoding features;
and inputting the initial decoding features into the second decoding module and the third decoding module respectively to obtain the attribute information of each text region output by the second decoding module and the position information of each text region output by the third decoding module.
In an embodiment, the determining, for each text line, attribute information of the text line according to the position information of the text line and the position information and attribute information of each text region includes:
for each text line, determining a target text region corresponding to the text line according to the position information of the text line and the position information of each text region;
and obtaining the attribute information of the text line according to the attribute information of the target text area.
In an embodiment, the determining, for each text line, a target text region corresponding to the text line according to the position information of the text line and the position information of each text region includes:
for each text line, determining the overlapping degree between the text line and each text region according to the position information of the text line and the position information of each text region;
and determining a target text region corresponding to the text line according to the overlapping degree between the text line and each text region.
In an embodiment, the determining, according to the degree of overlap between the text line and each text region, a target text region corresponding to the text line includes:
and according to the overlapping degree between the text line and each text region, taking the text region with the maximum overlapping degree as a target text region corresponding to the text line.
In an embodiment, before the extracting, by the trained structured model, the position information and the attribute information of each text region in the document picture, the method further includes:
and performing deep learning according to sample document pictures annotated with the position information and the attribute information of each text box, and training to obtain the structured model.
The embodiment of the present application further provides an apparatus for extracting document structured information, including:
the image acquisition module is used for acquiring a document image to be processed;
the information extraction module is used for extracting the position information and the attribute information of each text area in the document picture through the trained structured model;
the character recognition module is used for extracting the position information and the text content of each text line in the document picture through an OCR engine;
the attribute determining module is used for determining the attribute information of each text line according to the position information of the text line and the position information and the attribute information of each text area;
and the result output module is used for obtaining the structured information of the document picture according to the attribute information and the text content of each text line.
An embodiment of the present application further provides an electronic device, where the electronic device includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the above method for extracting document structured information.
The embodiment of the application also provides a computer readable storage medium, wherein the storage medium stores a computer program, and the computer program can be executed by a processor to complete the method for extracting the document structured information.
According to the technical scheme provided by the embodiments of the application, the position information and attribute information of each text region in the document picture are extracted through a trained structured model; the position information and text content of each text line are extracted through an OCR engine; and the attribute information of each text line is determined according to the position information of the text line and the position information and attribute information of each text region. With this scheme, no template needs to be customized separately for each document, which saves money, manpower and time. The structured model and the OCR engine are mutually independent, and the output of the structured model does not depend on the output of the OCR engine, so any OCR engine can be adapted. Compared with the prior art, this increases the degree of automation of document structured-information extraction and improves generalization and robustness.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required to be used in the embodiments of the present application will be briefly described below.
FIG. 1 is a diagram illustrating comparison of recognition results of two OCR engines in the prior art;
fig. 2 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIG. 3 is a flowchart illustrating a method for extracting document structured information according to an embodiment of the present application;
FIG. 4 is an architectural diagram of a structured model provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a Swin-Transformer architecture provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of the detailed structure of the Swin-Transformer architecture provided in an embodiment of the present application;
FIG. 7 is a diagram illustrating a plurality of text regions extracted from a document picture according to an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating a principle of overlap calculation provided by an embodiment of the present application;
fig. 9 is a block diagram of an apparatus for extracting document structured information according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Robotic Process Automation (RPA) technology can simulate the keyboard-and-mouse operations that an employee performs on a computer in daily work, and can replace a human in operations such as logging into systems, operating software, reading and writing data, downloading files and reading mails. As the virtual workforce of an enterprise, such software robots free employees from repetitive, low-value work so that their energy can go into high-value-added work, enabling the enterprise to reduce costs and increase benefits while carrying out digital, intelligent transformation.
RPA is a software robot that takes over manual tasks in business processes and interacts with a computer's front-end systems the way a human does. It can therefore be regarded as a software program robot running on a personal computer or server which, by imitating the operations a user performs on the computer, automatically repeats activities such as retrieving mail, downloading attachments, logging into systems, and processing and analyzing data, quickly, accurately and reliably. Like a traditional physical robot, it follows specific rules to solve the speed and accuracy problems of human work; but whereas a traditional physical robot combines software and hardware and can only work with the support of dedicated hardware, an RPA robot lives purely in the software layer and can be deployed to any PC or server to complete its assigned work as long as the corresponding software is installed.
In other words, RPA is a way of using "digital staff" instead of people to perform business operations, together with its related technologies. In essence, RPA uses software automation to imitate human operation of systems, software, web pages, documents and other objects on a computer, acquiring business information and executing business actions without human intervention, ultimately automating the process, saving labor cost and improving processing efficiency. As this description implies, implementing RPA first requires identifying documents, screens and the like and determining the content at their different positions so that the content can be operated on automatically. Extracting structured information from document pictures has therefore become one of the technologies of interest for implementing RPA.
For extracting structured information from document pictures, the related art can be roughly divided into two categories: rule-based approaches and learning-based approaches. A rule-based approach is implemented by configuring a template, but its robustness is low. For example, even for the same kind of document, such as an admission receipt, the receipts of different hospitals each require a separately customized set of rules, which inevitably limits application. In addition, such template-rule implementations place high demands on picture quality; in particular, new ticket pictures to be structurally recognized must stay as consistent as possible with the template. The existing method therefore cannot support a fully automatic structuring process.
A learning-based approach learns a structured model from a large amount of labeled data. Its robustness is clearly better, but its data annotation must rely on the output of the front-end OCR engine. As shown in fig. 1, (1) is the detection result of OCR engine A and (2) is the detection result of OCR engine B. With OCR engine A, the structured model must be labeled in the format "key = operator, value = Li Chunxiang"; with OCR engine B, in the format "value = operator Li Chunxiang". This means that once the OCR engine is updated, the structured model must be updated synchronously. Moreover, because the value range of the keys to be acquired cannot be represented, the generalization of current learning-based approaches is limited.
Fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 200 may be configured to execute the method for extracting document structured information provided in the embodiment of the present application. As shown in fig. 2, the electronic device 200 includes: one or more processors 202, and one or more memories 204 storing processor-executable instructions. Wherein, the processor 202 is configured to execute the method for extracting document structured information provided by the following embodiments of the present application.
The processor 202 may be a device containing a Central Processing Unit (CPU), a Graphics Processing Unit (GPU) or other form of processing unit having data processing and/or instruction execution capabilities, may process data for other components in the electronic device 200, and may control other components in the electronic device 200 to perform desired functions.
The memory 204 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The non-volatile memory may include, for example, Read-Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 202 to implement the document structured information extraction method described below. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
In one embodiment, the electronic device 200 shown in FIG. 2 may also include an input device 206, an output device 208, and a data collection device 210, which are interconnected via a bus system 212 and/or other form of connection mechanism (not shown). It should be noted that the components and configuration of the electronic device 200 shown in fig. 2 are exemplary only, and not limiting, and the electronic device 200 may have other components and configurations as desired.
The input device 206 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like. The output device 208 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like. The data acquisition device 210 may acquire an image of a subject and store the acquired image in the memory 204 for use by other components. Illustratively, the data acquisition device 210 may be a camera.
In an embodiment, the components in the example electronic device 200 for implementing the method for extracting document structured information according to the embodiment of the present application may be integrally disposed or may be separately disposed, such as integrally disposing the processor 202, the memory 204, the input device 206, and the output device 208, and separately disposing the data acquisition device 210.
In an embodiment, the example electronic device 200 for implementing the method for extracting document structured information of the embodiment of the present application may be implemented as an intelligent device such as a notebook computer, a desktop computer, a smart phone, a server, and the like.
Fig. 3 is a flowchart illustrating an extraction method of document structured information according to an embodiment of the present application. As shown in fig. 3, the method includes:
step S310: and acquiring a document picture to be processed.
The document picture can be obtained by directly photographing a paper document or by converting a PDF into a picture format. The document picture may be captured by a local camera, stored locally in advance, or obtained from an external device; for example, a server may obtain the document picture from a client. The embodiments described below extract structured information from this document picture. A minimal sketch of this acquisition step is given below.
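By way of illustration, the following Python sketch covers the two acquisition paths mentioned above (photograph/scan versus PDF). It assumes the pdf2image and Pillow packages are available; the helper name acquire_document_picture is illustrative and not part of the patent.

# A minimal sketch of step S310: obtaining the document picture to be
# processed, either from a photograph/scan or by rendering a PDF page.
from PIL import Image
from pdf2image import convert_from_path  # requires the poppler utilities

def acquire_document_picture(path: str) -> Image.Image:
    if path.lower().endswith(".pdf"):
        # Convert the PDF into pictures; only the first page is used here.
        pages = convert_from_path(path, dpi=200)
        return pages[0]
    # Photographs or locally stored pictures are loaded directly.
    return Image.open(path).convert("RGB")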
Step S320: and extracting the position information and the attribute information of each text region in the document picture through the trained structured model.
The structured model may be a deep learning model trained in advance for extracting location information and attribute information.
As shown in fig. 4, the structured model includes a feature extraction module adopting Swin-Transformer architecture, a first decoding module, a second decoding module and a third decoding module. The feature extraction module is used for extracting image features of the document pictures. The first decoding module, the second decoding module and the third decoding module all adopt a transformer architecture. The output of the feature extraction module is used as the input of the first decoding module, and the output of the first decoding module is respectively used as the input of the second decoding module and the input of the third decoding module.
In an embodiment, the step S320 specifically includes: extracting the image features of the document picture through the feature extraction module; decoding the image features through the first decoding module to obtain initial decoding features; and inputting the initial decoding features into the second decoding module and the third decoding module respectively, to obtain the attribute information of each text region output by the second decoding module and the position information of each text region output by the third decoding module.
Theoretically, a CNN (Convolutional Neural Network) could be used to extract the image features of the document picture. However, since prediction needs to output a sequence and the decoding modules adopt a Transformer architecture, a self-attention operation would have to be performed on the last-layer output of the CNN. Such a self-attention operation over CNN output has high time and space complexity, which limits its practical application. For these reasons, this embodiment adopts the Swin-Transformer architecture to extract the image features of the document picture.
Figure 5 is a schematic diagram of the Swin-Transformer architecture. It adopts a hierarchical design: as shown in fig. 5, each layer reduces the resolution of the input feature map, enlarging the receptive field layer by layer as a CNN does. The Swin-Transformer architecture, shown in detail in fig. 6 (a), comprises 4 stages. At the input, a Patch Partition slices the picture into individual blocks. Each Stage then consists of a Patch Merging module and multiple Swin Transformer blocks, where the Patch Merging module reduces the picture resolution at the beginning of the Stage. The structure of each stage's Swin Transformer block is shown in fig. 6 (b); it mainly comprises LayerNorm, MLP, Window Attention and Shifted Window Attention. As shown in fig. 6, no matter how large the input document picture is, a feature sequence with dimension (B, 16, C) is finally output, where B is the batch size (the number of features), 16 = 4 × 4 is the length of the sequence, and C is the dimension of each feature.
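The following PyTorch sketch mirrors the data flow of fig. 4 under stated assumptions: the Swin backbone is taken from the timm library (version 0.9 or later assumed), and the query count, hidden size, layer counts and the learned-query (DETR-style) decoding scheme are illustrative choices, since the patent specifies the modules and their connections but not these details.

# A schematic sketch of the structured model of fig. 4: a Swin-Transformer
# feature extractor feeds a first Transformer decoder, whose output goes to
# a second decoder (attribute sequence) and a third decoder (box sequence).
import torch
import torch.nn as nn
import timm

NUM_QUERIES = 100   # assumed upper bound on text regions per document picture
NUM_ATTRS = 3       # key / value / other

class StructuredModel(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        # Feature extraction module (Swin-Transformer backbone).
        self.backbone = timm.create_model(
            "swin_tiny_patch4_window7_224", pretrained=True,
            num_classes=0, global_pool="")
        self.proj = nn.Linear(self.backbone.num_features, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        # First, second and third decoding modules; nn.TransformerDecoder
        # clones the layer internally, so each decoder gets its own weights.
        self.decoder1 = nn.TransformerDecoder(layer, num_layers=2)
        self.decoder2 = nn.TransformerDecoder(layer, num_layers=2)
        self.decoder3 = nn.TransformerDecoder(layer, num_layers=2)
        self.queries = nn.Embedding(NUM_QUERIES, d_model)  # one query per region
        self.attr_head = nn.Linear(d_model, NUM_ATTRS)  # attribute logits
        self.box_head = nn.Linear(d_model, 4)           # (x1, y1, x2, y2)

    def forward(self, pictures: torch.Tensor):
        # pictures are assumed resized to the backbone's expected input size.
        f = self.backbone.forward_features(pictures)
        f = self.proj(f.reshape(f.shape[0], -1, f.shape[-1]))  # (B, L, C)
        tgt = self.queries.weight.unsqueeze(0).expand(f.shape[0], -1, -1)
        initial = self.decoder1(tgt, f)           # initial decoding features
        attrs = self.attr_head(self.decoder2(initial, f))            # upper branch
        boxes = self.box_head(self.decoder3(initial, f)).sigmoid()   # lower branch
        return attrs, boxes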
The attribute information may include three attributes: key, value and other. The attribute other indicates that the text region can be ignored.
As shown in fig. 7, each box (text box) represents a text region: the attribute corresponding to box1 is other, box2 is other, box3 is key, box4 is value, box5 is key, box6 is value, box7 is key, box8 is value, …, box11 is key and box12 is value, so the second decoding module can output an attribute sequence of other, other, key, value, key, value, …, key, value.
The position information of the text regions corresponds one-to-one with the attribute information, so the third decoding module can output the position information of box1, box2, box3, …, box12. The position information of each text region may be represented as (x1, y1, x2, y2), i.e., composed of the upper-left vertex coordinates (x1, y1) and the lower-right vertex coordinates (x2, y2) of the text region. As can be seen from fig. 7, if multiple lines of words belong to the same key, they may share one text box; for example, the value corresponding to the business place contains two lines of words, and both lines are in one text box.
Step S330: and extracting the position information and the text content of each text line in the document picture through an OCR engine.
It should be noted that the order of step S320 and step S330 is not limited: step S320 may be executed before step S330, step S330 may be executed before step S320, or, to save time, step S320 and step S330 may be executed simultaneously.
An OCR (Optical Character Recognition) engine is existing software with text detection and recognition functions. Generally, an OCR engine includes a text detection model and a text recognition model, so it can analyze the document picture, automatically detect the position information of each text line and recognize the text content of each text line. The internal processing of the OCR engine can follow prior-art implementations; a usage sketch is given below.
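By way of illustration, the following sketch realizes step S330 with PaddleOCR as an example engine. The scheme itself is engine-agnostic, the exact result layout varies between PaddleOCR versions, and the helper name extract_text_lines is illustrative.

# One way to obtain the position information and text content of each text
# line (step S330). Each detected quadrilateral is reduced to an axis-aligned
# box (x1, y1, x2, y2) so it can later be matched against the text regions.
from paddleocr import PaddleOCR

ocr_engine = PaddleOCR(lang="ch")  # any detection + recognition engine works

def extract_text_lines(image_path: str):
    lines = []
    for page in ocr_engine.ocr(image_path):  # one entry per page/image
        for quad, (text, _score) in page:    # quad: four corner points
            xs = [p[0] for p in quad]
            ys = [p[1] for p in quad]
            lines.append({"box": (min(xs), min(ys), max(xs), max(ys)),
                          "text": text})
    return lines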
Step S340: and determining the attribute information of each text line according to the position information of the text line and the position information and attribute information of each text region.
Referring to fig. 1, in the output of one OCR engine, "operator" may be one text line (i.e., one rectangular box) and "Li Chunxiang" another text line (i.e., another rectangular box); in the output of a different OCR engine, "operator Li Chunxiang" may be a single text line (i.e., within the same rectangular box). Referring to the text regions output by the structured model in fig. 7, box3 is one text region, box4 is another, and each box represents one text region. Therefore, depending on the OCR engine, a text line (e.g., "operator") may cover one text region, e.g., text region box3 in fig. 7; a text line (e.g., "operator Li Chunxiang") may also cover two text regions, e.g., text regions box3 and box4 in fig. 7.
In an embodiment, for each text line, a target text region corresponding to the text line may be determined according to the position information of the text line and the position information of each text region; and obtaining the attribute information of the text line according to the attribute information of the target text area.
The target text region is the text region whose position information matches the position information of the text line. In an embodiment, for each text line, the degree of overlap (IoU, Intersection over Union) between the text line and each text region is determined according to the position information of the text line and the position information of each text region; the target text region corresponding to the text line is then determined from these degrees of overlap. In an embodiment, the text region with the largest degree of overlap may be taken as the target text region corresponding to the text line.
The degree of overlap between a text line and a text region may be the overlapping area between them divided by the total area they occupy. For example, as shown in fig. 8, the position information of the text line includes the coordinates (x3, y3) of its upper-left vertex and (x4, y4) of its lower-right vertex, so the area of the text line can be calculated from its position information. The position information of the text region includes the coordinates (x5, y5) of its upper-left vertex and (x6, y6) of its lower-right vertex, so the area of the text region can be calculated likewise. The overlapping area between the text line and the text region is the area of the shaded portion in fig. 8, so the degree of overlap may be the area of the shaded portion divided by the total area occupied by the text line and the text region together.
By traversing all text regions, the degree of overlap between the text line and each text region can be calculated, and the text region with the maximum overlap is selected as the target text region corresponding to that text line. In this way, for each text line, the target text region with the largest degree of overlap can be found, and the attribute information of the target text region can be used directly as the attribute information of the text line, thereby obtaining the attribute information of every text line. For example, assuming the text line "Li Chunxiang" has its maximum IoU with box4, and the attribute of box4 is value, the attribute of "Li Chunxiang" is value. A direct implementation of this matching is sketched below.
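In this sketch, boxes are (x1, y1, x2, y2) tuples, the text-line dicts follow the OCR sketch above, and the region dicts carrying an "attribute" field are an illustrative representation of the structured model's output.

# IoU between a text line and a text region, then argmax over all regions.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # shaded area in fig. 8
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter                    # total occupied area
    return inter / union if union > 0 else 0.0

def assign_attributes(text_lines, regions):
    # Give each OCR text line the attribute of its best-overlapping region.
    for line in text_lines:
        target = max(regions, key=lambda r: iou(line["box"], r["box"]))
        line["attribute"] = target["attribute"]  # key / value / other
    return text_lines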
Step S350: and obtaining the structural information of the document picture according to the attribute information and the text content of each text line.
The structured information may include the attribute information and text content of each text line. For example, for the document picture shown in fig. 7, the following structured information may be obtained:
"other = business license (corresponding to box1)
…
key = composition form (corresponding to box11), value = personal business (corresponding to box12)".
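A minimal sketch of step S350 follows: it emits the (attribute, text) pairs in reading order, matching the example above. Sorting top-to-bottom then left-to-right is an assumed heuristic; the patent leaves the final packaging of required keys to the application.

# Assemble the structured information from the attributed text lines.
def build_structured_info(text_lines):
    ordered = sorted(text_lines, key=lambda l: (l["box"][1], l["box"][0]))
    return [(l["attribute"], l["text"]) for l in ordered]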
According to the technical scheme provided by the embodiments of the application, the position information and attribute information of each text region in the document picture are extracted through a trained structured model; the position information and text content of each text line are extracted through an OCR engine; and the attribute information of each text line is determined according to the position information of the text line and the position information and attribute information of each text region. This scheme increases the degree of automation of document structured-information extraction: no specific template needs to be customized separately for each document, saving money, labor and time. The structured model and the OCR engine are mutually independent, and the output of the structured model does not depend on the output of the OCR engine, so training the structured model does not need the OCR engine's output, and the scheme can be adapted to any OCR engine. The value range of the keys does not need to be known in advance; the required key values are specified only after the structured-information extraction is completed. Compared with the prior art, generalization and robustness are improved.
In an embodiment, before step S320, the method provided in the embodiments of the present application may further include: performing deep learning on sample document pictures annotated with the position information and attribute information of each text box, and training to obtain the structured model.
The sample document pictures serve as training data, in which the position information and attribute information of each text box are known. As described above, the position information of a text box may include the coordinates of its upper-left vertex and lower-right vertex, and the attribute information may include the three attributes other, key and value. If multiple lines of words belong to the same key, only one text box needs to be annotated.
Specifically, referring to the model architecture shown in fig. 4, a sample document picture can be used as input, the attribute information of each text box as the correct output sequence of the upper branch, and the position information of each text box as the correct output sequence of the lower branch; the parameters of the Swin-Transformer, Decoder1, Decoder2 and Decoder3 are then trained to minimize the difference between the predicted and correct output sequences of the upper branch and the difference between the predicted and correct output sequences of the lower branch.
It should be noted that training the structured model does not rely on the output of any OCR engine, and the annotation of the sample document pictures (i.e., the position information and attribute information of the text boxes) does not need to be adjusted to any OCR engine's output format, so the annotation cost of the training data is low and the annotation specification is uniform. A simplified training step is sketched below.
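The following is a highly simplified sketch of this training step for the StructuredModel sketch above: cross-entropy supervises the upper (attribute) branch and an L1 loss supervises the lower (position) branch. Padding the annotations to the query count and summing the two losses unweighted are assumptions; the patent only states that both sequence differences are minimized.

# One optimization step. gt_attrs: (B, NUM_QUERIES) class ids padded with -1;
# gt_boxes: (B, NUM_QUERIES, 4) normalized boxes padded with zeros.
import torch.nn.functional as F

def training_step(model, optimizer, pictures, gt_attrs, gt_boxes):
    attr_logits, pred_boxes = model(pictures)
    loss = (F.cross_entropy(attr_logits.transpose(1, 2), gt_attrs,
                            ignore_index=-1)      # upper-branch difference
            + F.l1_loss(pred_boxes, gt_boxes))    # lower-branch difference
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()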
The following is an embodiment of the apparatus of the present application, which can be used to execute the above-mentioned embodiment of the method for extracting document structured information of the present application. For details that are not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method for extracting structured information in the present application.
Fig. 9 is a block diagram of an apparatus for extracting document structured information according to an embodiment of the present application, where the apparatus includes, as shown in fig. 9:
the picture acquiring module 910 is configured to acquire a document picture to be processed.
The information extraction module 920 is configured to extract, through the trained structured model, the position information and attribute information of each text region in the document picture.
And the character recognition module 930 is used for extracting the position information and the text content of each text line in the document picture through an OCR engine.
An attribute determining module 940 is configured to determine, for each text line, attribute information of the text line according to the position information of the text line and the position information and attribute information of each text region.
And the result output module 950 is configured to obtain the structured information of the document picture according to the attribute information and text content of each text line.
The implementation and effects of the functions of each module in the apparatus are described in detail in the implementation of the corresponding steps of the above extraction method for document structured information, and are not repeated here.
In the embodiments provided in the present application, the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the portions thereof that substantially contribute over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Claims (10)

1. A method for extracting document structured information is characterized by comprising the following steps:
acquiring a document picture to be processed;
extracting the position information and the attribute information of each text area in the document picture through a trained structured model;
extracting the position information and the text content of each text line in the document picture through an OCR engine;
for each text line, determining attribute information of the text line according to the position information of the text line and the position information and attribute information of each text region;
and obtaining the structured information of the document picture according to the attribute information and the text content of each text line.
2. The method of claim 1, wherein the structured model comprises a feature extraction module employing a Swin-Transformer architecture, a first decoding module, a second decoding module, and a third decoding module; the first decoding module, the second decoding module and the third decoding module all adopt a Transformer architecture;
the output of the feature extraction module is used as the input of the first decoding module, and the output of the first decoding module is respectively used as the input of the second decoding module and the input of the third decoding module.
3. The method according to claim 2, wherein the extracting, through the trained structured model, the position information and the attribute information of each text region in the document picture comprises:
extracting the image characteristics of the document picture through the characteristic extraction module;
decoding the image features through the first decoding module to obtain initial decoding features;
and inputting the initial decoding characteristics into the second decoding module and the third decoding module respectively to obtain the attribute information of each text region output by the second decoding module and the position information of each text region output by the third decoding module.
4. The method according to claim 1, wherein the determining, for each text line, the attribute information of the text line according to the position information of the text line and the position information and the attribute information of each text region comprises:
for each text line, determining a target text region corresponding to the text line according to the position information of the text line and the position information of each text region;
and obtaining the attribute information of the text line according to the attribute information of the target text area.
5. The method according to claim 4, wherein the determining, for each text line, a target text region corresponding to the text line according to the position information of the text line and the position information of each text region comprises:
for each text line, determining the overlapping degree between the text line and each text region according to the position information of the text line and the position information of each text region;
and determining a target text region corresponding to the text line according to the overlapping degree between the text line and each text region.
6. The method of claim 5, wherein determining the target text region corresponding to the text line according to the overlap between the text line and each text region comprises:
and according to the overlapping degree between the text line and each text region, taking the text region with the maximum overlapping degree as a target text region corresponding to the text line.
7. The method according to claim 1, wherein before the extracting the position information and the attribute information of each text region in the document picture through the trained structured model, the method further comprises:
and performing deep learning according to sample document pictures annotated with the position information and the attribute information of each text box, and training to obtain the structured model.
8. An apparatus for extracting document structured information, comprising:
the image acquisition module is used for acquiring a document image to be processed;
the information extraction module is used for extracting the position information and the attribute information of each text area in the document picture through the trained structured model;
the character recognition module is used for extracting the position information and the text content of each text line in the document picture through an OCR engine;
the attribute determining module is used for determining the attribute information of each text line according to the position information of the text line and the position information and the attribute information of each text area;
and the result output module is used for obtaining the structured information of the document picture according to the attribute information and the text content of each text line.
9. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the method of extracting document structured information of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the storage medium stores a computer program executable by a processor to perform the method of extracting document structured information according to any one of claims 1 to 7.
CN202211222617.9A 2022-10-08 2022-10-08 Method and device for extracting document structured information, electronic equipment and storage medium Pending CN115937887A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211222617.9A CN115937887A (en) 2022-10-08 2022-10-08 Method and device for extracting document structured information, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115937887A 2023-04-07

Family

ID=86654713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211222617.9A Pending CN115937887A (en) 2022-10-08 2022-10-08 Method and device for extracting document structured information, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115937887A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116092087A (en) * 2023-04-10 2023-05-09 上海蜜度信息技术有限公司 OCR (optical character recognition) method, system, storage medium and electronic equipment
CN116092087B (en) * 2023-04-10 2023-08-08 上海蜜度信息技术有限公司 OCR (optical character recognition) method, system, storage medium and electronic equipment
CN116824609A (en) * 2023-06-29 2023-09-29 北京百度网讯科技有限公司 Document format detection method and device and electronic equipment
CN116824609B (en) * 2023-06-29 2024-05-24 北京百度网讯科技有限公司 Document format detection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination