CN114973286A - Document element extraction method, device, equipment and storage medium - Google Patents

Document element extraction method, device, equipment and storage medium

Info

Publication number
CN114973286A
CN114973286A (application CN202210679246.0A)
Authority
CN
China
Prior art keywords
text line
document
text
word
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210679246.0A
Other languages
Chinese (zh)
Inventor
王超凡
宋时德
梅林海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202210679246.0A
Publication of CN114973286A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/1444 Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
    • G06V30/148 Segmentation of character regions
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147 Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V30/19173 Classification techniques
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Character Input (AREA)

Abstract

The embodiment of the application discloses a document element extraction method, device, equipment and storage medium, wherein the method comprises: obtaining layout structure information of a document; coding each word in the document according to the layout structure information of the document; and determining the element label to which each word belongs according to the coding result of each word. When each word in the document is coded, the layout structure information of the document is fused in, and the element label to which each word belongs is determined based on the word coding result fused with the document layout structure information, which improves the accuracy of document element extraction.

Description

Document element extraction method, device, equipment and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting document elements.
Background
Element extraction mainly extracts structured information from unstructured text, and is a very important sub-field of natural language processing. Existing document element extraction methods are mainly based on deep learning models, but their accuracy is poor.
Disclosure of Invention
In view of this, the present application provides a method, an apparatus, a device and a storage medium for extracting document elements, so as to improve accuracy of document element extraction.
In order to achieve the above object, the following solutions are proposed:
a document element extraction method includes:
obtaining layout structure information of the document;
coding each character in the document according to the layout structure information;
and determining the element label of each word according to the encoding result of each word.
In the above method, preferably, the obtaining layout structure information of the document includes:
processing the picture containing the document to obtain semantic features of each word in the document and position features corresponding to each text line;
for each text line, fusing the semantic features and the corresponding position features of each word in the text line to obtain the coding features of the text line;
and decoding the coding characteristics of each text line to obtain the layout structure information of the document.
In the above method, preferably, the process of obtaining the coding features of each text line and the layout structure information of the document includes:
inputting the semantic features and corresponding position features of each word in each text line into a layout analysis model in a document element extraction model, so that the layout analysis model fuses, for each text line, the semantic features and corresponding position features of the words in the text line to obtain the coding features of the text line, and decodes the coding features of each text line to output the layout structure information;
the layout analysis model is trained by taking the semantic features and corresponding position features of each word in each text line of a sample picture as input, taking the labeled layout structure information of the sample picture as the sample label, and taking as the training target that the layout structure information output by the layout analysis model approaches the sample label.
Preferably, the processing the picture including the document to obtain the semantic features of each word in the document and the position features corresponding to each text line includes:
carrying out optical character recognition on the picture through a character recognition model in the document element extraction model to obtain each text line in the document and the coordinates of each text line;
carrying out first coding on each word in each text line through a context representation model in the document element extraction model to obtain semantic features of each word;
extracting the features of the picture through a first feature extraction module of a text line position feature extraction model in the document element extraction model to obtain a feature map; extracting the position feature corresponding to each text line in the feature map according to the coordinate of each text line through a second feature extraction module of the text line position feature extraction model; the first feature extraction module is a feature extraction module of a pre-trained text line boundary detection model.
In the above method, preferably, the text line boundary detection model is obtained by training in the following manner:
inputting a sample picture into the text line boundary detection model, and performing feature extraction on the input sample picture through a feature extraction module of the text line boundary detection model to obtain a feature map of the sample picture;
processing the characteristic graph of the sample picture through an output module of the text line boundary detection model to obtain text line boundary coordinates in the sample picture;
updating the parameters of the text line boundary detection model with the goal that the text line boundary coordinates output by the text line boundary detection model approach the label of the sample picture;
the labels of the sample pictures are: and marking boundary coordinates of each text line for the sample picture.
In the above method, preferably, the text line boundary detection model is obtained by training in the following manner:
inputting a sample picture into the text line boundary detection model, and performing feature extraction on the input sample picture through a feature extraction module of the text line boundary detection model to obtain a feature map of the sample picture;
processing the feature map of the sample picture through an output module of the text line boundary detection model to obtain text line boundary coordinates in the sample picture and the category of a corresponding area of each text line boundary coordinate;
updating the parameters of the text line boundary detection model with the goal that the text line boundary coordinates output by the text line boundary detection model, and the category of the region corresponding to each text line boundary coordinate, approach the label of the sample picture;
the labels of the sample pictures are: and marking boundary coordinates of each text line aiming at the sample picture, and the category of a corresponding area of each text line boundary coordinate.
Preferably, in the method, the encoding each word in the document according to the layout structure information, and determining the element tag to which each word belongs according to the encoding result of each word includes:
coding each word in the document according to the layout structure information through an extraction model in the document element extraction model, and determining an element label of each word according to a coding result of each word; the extraction model is obtained by training in the following way:
inputting the layout structure information and each text line in the document into the extraction model, so that the extraction model codes each word in the input text lines according to the input layout structure information and determines the element label to which each word belongs according to the coding result of each word;
updating the parameters of the extraction model with the goal that the element label to which each word belongs, as output by the extraction model, approaches the label of the sample picture;
the labels of the sample pictures are: and aiming at the element label to which each word labeled by the sample picture belongs.
In the above method, preferably, the layout structure information at least includes: paragraph division, title hierarchy, header, footer.
Preferably, in the method, the encoding each word in the document according to the layout structure information includes:
constructing a heterogeneous graph based on the document according to the layout structure information, wherein the nodes in the heterogeneous graph include word nodes, title nodes and text segment nodes; the edges in the heterogeneous graph include: word-to-word relationships, word-to-text-segment relationships, and text-segment-to-title relationships;
carrying out graph convolution on the heterogeneous graph to obtain the coding result of each node;
and fusing the coding result of each word with the coding result of the corresponding title node to obtain the coding result of each word.
In the above method, preferably, the title nodes in the heterogeneous graph are all the titles in the document, and the text segment nodes in the heterogeneous graph are all the text segments in the document;
alternatively,
the title nodes in the heterogeneous graph are target titles in the document, where the hierarchy of a target title is higher than a target hierarchy; the text segment nodes in the heterogeneous graph include non-target titles and the text segments in the document, where the hierarchy of a non-target title is lower than or equal to the target hierarchy.
In the above method, preferably, the initial value of each node in the heterogeneous graph is determined as follows:
taking each text unit in the document as a unit, carrying out second coding on each word in the text unit to obtain the context feature representation of each word within its text unit, as the initial value of the corresponding word node in the heterogeneous graph, each text unit being a title or a text segment;
for any title node, fusing the context feature representations of the words in the title of that title node to obtain the initial value of the title node;
and for any text segment node, fusing the context feature representations of the words in the title or text segment corresponding to that text segment node to obtain the initial value of the text segment node.
A document element extraction device comprising:
an obtaining unit configured to obtain layout structure information of the document;
the coding unit is used for coding each word in the document according to the layout structure information;
and the extraction unit is used for determining the element label of each character according to the coding result of each character.
A document element extraction device includes a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the document element extraction method according to any one of the above.
A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the document element extraction method according to any one of the above.
It can be seen from the foregoing technical solutions that the document element extraction method, apparatus, device and storage medium provided in the embodiments of the present application obtain layout structure information of a document, code each word in the document according to the layout structure information, and determine the element label to which each word belongs according to the coding result of each word. When each word in the document is coded, the layout structure information of the document is fused in, and the element label to which each word belongs is determined based on the word coding result fused with the document layout structure information, which improves the accuracy of document element extraction.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flowchart of an implementation of a document element extraction method disclosed in an embodiment of the present application;
FIG. 2 is a diagram illustrating an example of the recognition result of elements disclosed in the embodiment of the present application;
FIG. 3 is a flowchart of an implementation of obtaining layout structure information of a document according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a document element extraction model disclosed in an embodiment of the present application;
FIG. 5 is a flowchart of an implementation of encoding words in a document according to layout structure information, as disclosed in an embodiment of the present application;
FIG. 6 is another schematic diagram of a document element extraction model disclosed in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a document element extraction apparatus disclosed in an embodiment of the present application;
FIG. 8 is a block diagram of a hardware configuration of a document element extraction device disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The current document element extraction method based on a deep learning model obtains a vector representation of each word in the document, inputs the vector representation of each word into a pre-trained neural network, and obtains the element label to which each word belongs. The accuracy of this element extraction method is poor. The solution of the present application is provided to improve the accuracy of document element extraction.
As shown in fig. 1, an implementation flowchart of a document element extraction method provided in an embodiment of the present application may include:
step S101: and obtaining the layout structure information of the document.
Optionally, the document in the present application may be a document in any field, and may be a document in a financial field (e.g., a contract text) as an example, or a document in other fields, such as a document in a legal field (e.g., a decision document), or a document in a medical field (e.g., a medical record), and so on.
As an example, the layout structure information of a document can be obtained by processing a picture (i.e. a document in picture format, for convenience of description, referred to as a picture document) containing the document.
By way of example, the layout structure information may include, but is not limited to: paragraph division, title hierarchy, header, footer, etc.
Step S102: coding each word in the document according to the layout structure information of the document.
According to the method and the device, each word in the document is coded based on the layout structure information of the document, so that the coding result of each word is fused with the layout structure information of the document.
Step S103: determining the element label to which each word belongs according to the coding result of each word.
As an example, the encoded result of each word may be decoded by a Conditional Random Field (CRF) model to obtain an element tag to which each word belongs.
As shown in fig. 2, an exemplary diagram of an element identification result provided in the embodiment of the present application, in this example the element <first party> is identified, that is, "Zhang San" is identified as the element: first party (Party A).
The position of an element in the document has a certain relationship with the layout structure of the document; for example, if a title reads "1. Basic information of the first party to the contract", there is a high probability that the element <first party> appears in the paragraph under that title. Therefore, determining the element label to which each word belongs based on a word coding result fused with the layout structure information of the document can improve the accuracy of document element extraction.
In an alternative embodiment, a flowchart of an implementation of obtaining the layout structure information of the document is shown in fig. 3, and may include:
step S301: and processing the pictures containing the document to obtain the semantic features of each word in the document and the position features corresponding to each text line.
For an English document, each "word" or "character" referred to herein is a word.
As an example, Optical Character Recognition (OCR) may be performed on the picture document to obtain each text line in the document and the coordinates of each text line; the coordinates of a text line (for convenience of description and distinction, denoted as the ith text line) may be represented by the coordinates of the four vertices of a rectangular box covering the ith text line.
Each word in the ith text line is coded (for convenience of description and distinction, referred to as first coding) to obtain the semantic features of each word in the ith text line, where i = 1, 2, 3, …, N and N is the total number of text lines contained in the document. Optionally, the ith text line may be input into a pre-trained context representation model to obtain a context feature representation of each word in the ith text line. As an example, the pre-trained context representation model may be a pre-trained BERT model.
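The following is a minimal sketch of this first coding step, provided for illustration only. It assumes a Hugging Face BERT checkpoint serves as the pre-trained context representation model; the checkpoint name, tokenizer settings and function name are illustrative and not prescribed by this application.

```python
# Sketch: first coding of the words in one text line with a pre-trained BERT
# model (assumed implementation; any context representation model could be used).
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")  # illustrative checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

def first_encode(text_line: str) -> torch.Tensor:
    """Return one semantic feature vector per word/character of the text line."""
    inputs = tokenizer(text_line, return_tensors="pt", return_offsets_mapping=True)
    offsets = inputs.pop("offset_mapping")[0]
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state[0]        # (num_tokens, 768)
    # Keep only tokens that cover actual characters (drop [CLS]/[SEP]).
    keep = [i for i, (s, e) in enumerate(offsets.tolist()) if e > s]
    return hidden[keep]                                      # (num_words, 768)

line_features = first_encode("甲方:张三")  # semantic features of each word in the line
```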
And determining the position characteristic of the ith text line based on the coordinates of the ith text line. As an example, the position feature of the ith text line may be determined in any one of the following two ways:
in the first mode, the coordinates of the ith text line are determined as the position characteristics of the ith text line.
In the second mode, feature extraction is carried out on the picture document through a first feature extraction module of the text line position feature extraction model to obtain a feature map; and extracting the position characteristic corresponding to the ith text line in the characteristic diagram according to the coordinate of the ith text line through a second characteristic extraction module of the text line position characteristic extraction model.
The first feature extraction module is a feature extraction module of a pre-trained text line boundary detection model. As an example, the text line boundary detection model may be implemented by a cascade region-based convolutional neural network (Cascade R-CNN).
In an alternative implementation, the input of the text line boundary detection model is a picture containing a document (i.e. a picture document), and the output of the text line boundary detection model is the coordinates of the text line boundary in the input picture document (which may be the coordinates of the vertices of a rectangular box covering the text line). The sample used for training the text line boundary detection model is a picture containing a document (which may be referred to as a sample picture), and the sample labels are: and marking boundary coordinates of each text line for the sample picture. The text line boundary detection model may be trained by:
the sample picture is input into the text line boundary detection model, so that the text line boundary detection model performs feature extraction on the input sample picture through its feature extraction module to obtain a feature map, and processes the feature map through its output module to output text line boundary coordinates; the parameters of the text line boundary detection model are updated with the goal that the output text line boundary coordinates approach the sample label, until a training end condition is met.
In another alternative implementation, the input of the text line boundary detection model is a picture containing a document (i.e., a picture document), and the output of the text line boundary detection model is the text line boundary coordinates (which may be the vertex coordinates of a rectangular box covering the text line) in the input picture document, and the category of the corresponding area of each text line boundary coordinate (e.g., a text line, a table, etc.). The sample used for training the text line boundary detection model is a picture containing a document, and the sample label is as follows: the boundary coordinates of each text line labeled for the sample picture, and the category of each text line region. The text line boundary detection model may be trained by:
the sample picture is input into the text line boundary detection model, so that the text line boundary detection model performs feature extraction on the input sample picture through its feature extraction module to obtain a feature map, and processes the feature map through its output module to output text line boundary coordinates and the category of the region corresponding to each text line boundary coordinate; the parameters of the text line boundary detection model are updated with the goal that the output text line boundary coordinates and region categories approach the sample label, until a training end condition is met.
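A hedged sketch of a single training step for the text line boundary detection model is given below. A torchvision Faster R-CNN is used as a stand-in detector (the embodiment above mentions Cascade R-CNN, which is not shown here); the number of region categories and the label format are assumptions.

```python
# Sketch of one training step for a text line boundary detector.
# A torchvision Faster R-CNN stands in for the Cascade R-CNN mentioned above;
# box coordinates and (optionally) region categories serve as the sample label.
import torch
import torchvision

num_classes = 3  # assumption: background, text line, table
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=num_classes)
optimizer = torch.optim.SGD(model.parameters(), lr=5e-3, momentum=0.9)

def train_step(images, targets):
    """images: list of (3,H,W) tensors; targets: list of {"boxes": (n,4), "labels": (n,)}."""
    model.train()
    loss_dict = model(images, targets)      # detection losses against the sample label
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```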
Optionally, in a second manner, one implementation manner that the second feature extraction module extracts the position feature corresponding to the ith text line in the feature map according to the coordinate of the ith text line may be:
and determining a bounding box corresponding to the ith text line according to the coordinates of the ith text line.
The scaling of the feature map relative to the picture (i.e., the picture containing the document) is obtained.
And scaling the bounding box of the ith text line according to the obtained scaling ratio, so that the scaling ratio of the scaled bounding box relative to the bounding box before scaling is equal to the scaling ratio of the feature map relative to the picture.
And extracting the features in the zoomed bounding box area in the feature map as the position features corresponding to the ith text line. That is, the feature in the feature map located in the scaled bounding box region is taken as the position feature corresponding to the ith text line.
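A minimal sketch of this second manner of extracting the position feature of a text line is given below, under the assumption that the feature map is a single tensor and that average pooling is used to obtain a fixed-size feature; both choices are illustrative.

```python
# Sketch: extract the position feature of the i-th text line from the feature map
# by scaling its bounding box to feature-map coordinates and pooling the region.
import torch
import torch.nn.functional as F

def text_line_position_feature(feature_map, box, image_size):
    """
    feature_map: (C, Hf, Wf) tensor produced by the first feature extraction module.
    box: (x1, y1, x2, y2) bounding box of the text line in image coordinates.
    image_size: (H, W) of the picture containing the document.
    """
    C, Hf, Wf = feature_map.shape
    H, W = image_size
    sx, sy = Wf / W, Hf / H                      # scaling of feature map relative to picture
    x1, y1, x2, y2 = box
    fx1, fy1 = int(x1 * sx), int(y1 * sy)        # scaled bounding box
    fx2, fy2 = max(int(x2 * sx), fx1 + 1), max(int(y2 * sy), fy1 + 1)
    region = feature_map[:, fy1:fy2, fx1:fx2]    # features inside the scaled box
    # Pool the region to a fixed-size position feature (pooling choice is an assumption).
    return F.adaptive_avg_pool2d(region.unsqueeze(0), 1).flatten()   # (C,)
```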
Step S302: and for each text line, fusing the semantic features and the corresponding position features of each word in the text line to obtain the coding features of the text line.
As an example, the semantic features and the corresponding position features of each word in the text line may be fused by a feature fusion module in a pre-trained layout analysis model. In particular, the method comprises the following steps of,
the layout analysis model can calculate the mean value of the semantic features of each word in the ith text line through the feature fusion module to obtain the semantic features of the ith text line (i.e. summing the semantic features of each word, dividing the sum value by the number of the words in the ith text line to obtain the semantic features of the ith text line), splicing the semantic features of the ith text line with the position features of the ith text line, and coding the spliced features through a bidirectional Short-Term Memory Network (LSTM) or a Recurrent Neural Network (RNN) to obtain the coding features of the ith text line.
Step S303: and decoding the coding characteristics of each text line to obtain the layout structure information of the document.
The coding features of each text line can be decoded by a decoding module (for the sake of distinction, referred to as the first decoding module) in a pre-trained layout analysis model. As an example, the first decoding module may be implemented based on a two-layer unidirectional GRU network, in which two unidirectional GRUs are connected in series. The parent GRU fuses the coding feature of the current text line to be decoded (for convenience of distinction and description, denoted as the ith text line) with the hidden layer feature of the previous text line (i.e., the (i-1)th text line) to obtain a target fusion feature (denoted as the first target fusion feature), and decodes the first target fusion feature to obtain a partial decoding result (denoted as the first decoding result) of the current text line to be decoded (i.e., the ith text line). The child GRU fuses the coding feature of the current text line (i.e., the ith text line) with the first target fusion feature output by the parent GRU to obtain another target fusion feature (denoted as the second target fusion feature), and decodes the second target fusion feature to obtain another partial decoding result (denoted as the second decoding result) of the current text line to be decoded.
Wherein, the first decoding result is the attribute of the ith text line. The attribute of the text line is one of the following attributes: title level, text paragraph, header, footer.
The second decoding result is the associated text line of the ith text line and the relationship between the ith text line and the associated text line.
That is, in the present application, for each text line, there are three outputs of the first decoding module, which are: attributes of a text line, an associated text line of a text line, and a relationship of a text line to an associated text line.
The associated text line of the ith text line may be a previous line of the ith text line or a line of a title before the ith text line.
The relationship between the ith text line and its associated text line is one of the following: a parallel relationship (for example, under a certain title of a first hierarchy there are two subtitles belonging to a second hierarchy; text lines belonging to these two different subtitles are in a parallel relationship, and the last line of the previous text segment of two adjacent text segments is in a parallel relationship with the first line of the next text segment), a progressive relationship (for example, a text line of a subtitle of the second hierarchy is in a progressive relationship with a text line under a certain title of the first hierarchy), or a connection relationship (for example, two adjacent text lines in the same text segment are in a connection relationship).
Optionally, one implementation manner of the above-mentioned merging the coding features of the current text line to be decoded (i.e. the ith text line) and the hidden layer features of the previous text line (i.e. the (i-1) th text line) may be: and splicing the coding features of the current text line to be decoded and the hidden layer features of the previous text line, and performing dimension transformation on the spliced features to obtain a first target fusion feature.
Optionally, one implementation manner of the above-mentioned merging the encoding feature of the current text line (i.e. the ith text line) and the first target merging feature output by the parent GRU may be: and splicing the coding features of the current text line to be decoded with the first target fusion features, and performing dimension transformation on the spliced features to obtain second target fusion features.
Optionally, the hidden layer feature of the (i-1)th text line is the second target fusion feature of the previous step, that is, the second target fusion feature obtained when the child GRU decodes the associated text line of the (i-1)th text line and the relationship between the (i-1)th text line and its associated text line.
Optionally, in the case that the ith text line is the first text line, the hidden layer feature of the previous text line may be the average of the encoding features of the respective text lines in the document.
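A minimal sketch of the first decoding module described above, assuming illustrative dimensions and classification heads; for brevity, the prediction of the associated text line itself is simplified here to a relation classification head.

```python
# Sketch of the first decoding module: two serial unidirectional GRU cells.
# The parent GRU consumes the first target fusion feature (current line coding
# fused with the previous line's hidden feature) and yields the line attribute;
# the child GRU consumes the second target fusion feature (current line coding
# fused with the parent output) and yields the relation to the associated line.
import torch
import torch.nn as nn

class LayoutDecoder(nn.Module):
    def __init__(self, d=512, n_attr=4, n_rel=3):
        super().__init__()
        self.fuse1 = nn.Linear(2 * d, d)          # dimension transform after concatenation
        self.parent = nn.GRUCell(d, d)
        self.attr_head = nn.Linear(d, n_attr)     # title level / text paragraph / header / footer
        self.fuse2 = nn.Linear(2 * d, d)
        self.child = nn.GRUCell(d, d)
        self.rel_head = nn.Linear(d, n_rel)       # parallel / progressive / connection

    def forward(self, line_codes):                # (N, d) coding features of the text lines
        h_prev = line_codes.mean(dim=0)           # hidden feature used for the first line
        attrs, rels = [], []
        for code in line_codes:
            t1 = torch.tanh(self.fuse1(torch.cat([code, h_prev])))   # first target fusion feature
            p = self.parent(t1.unsqueeze(0)).squeeze(0)
            attrs.append(self.attr_head(p))                          # first decoding result
            t2 = torch.tanh(self.fuse2(torch.cat([code, t1])))       # second target fusion feature
            c = self.child(t2.unsqueeze(0)).squeeze(0)
            rels.append(self.rel_head(c))                            # second decoding result
            h_prev = t2                            # hidden-layer feature for the next line
        return torch.stack(attrs), torch.stack(rels)
```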
Optionally, the input of the layout analysis model is semantic features and corresponding position features of each word in each text line obtained by processing the picture including the document in step S301. The output of the layout analysis model is the attributes of the individual text lines (text paragraph, title, header or footer), and the associated text line for each text line and the relationship of each text line to the associated text line.
The sample used for training the layout analysis model is a picture containing a document, and the sample label is: the attributes of the text lines labeled for the sample picture, the associated text line of each text line, and the relationship between each text line and its associated text line. Wherein:
if the attribute of one text line is a text segment, a header or a footer, the associated text line of the text line is the previous line of the text line, and the relationship between the text line and the associated text line is parallel relationship or connection relationship.
If the attribute of a text line is a first hierarchical heading, the associated text line of the text line belongs to the text line of the heading. If the previous title of the first hierarchy is also the title of the first hierarchy, the associated text line of the text line belongs to the text line of the previous title, and the relationship between the text line and the associated text line is parallel; if the previous title of the first level is the title of the second level, and the second level is higher than the first level, the associated text line of the text line belongs to the text line of the previous title, and the relationship between the text line and the associated text line is a progressive relationship; and if the previous title of the first level is the title of the second level, and the second level is lower than the first level, the associated text line of the text line is the text line of the title of the first level closest to the text line before the text line, and the relationship between the text line and the associated text line is parallel.
The training process of the layout analysis model may include:
processing the sample picture through step S301 to obtain the semantic features and corresponding position features of each word in each text line of the sample picture; inputting the semantic features and corresponding position features of each word in each text line into the layout analysis model to obtain the attributes of each text line, the associated text line of each text line, and the relationship between each text line and its associated text line output by the layout analysis model; and updating the parameters of the layout analysis model with the goal that the attributes of the text lines, the associated text line of each text line, and the relationship between each text line and its associated text line output by the layout analysis model approach the sample label, until the training end condition is met.
Optionally, the document element extraction method provided in the embodiment of the present application may be implemented by a document element extraction model. As shown in fig. 4, a schematic structural diagram of a document element extraction model provided in an embodiment of the present application may include:
a character recognition model 401, a context representation model 402, a text line position feature extraction model 403, a layout analysis model 404 and an extraction model 405; wherein:
the character recognition model 401 is used to perform optical character recognition on a picture (i.e., a picture document) containing a document, so as to obtain each text line in the document and coordinates of the text line. As an example, character recognition model 401 may be an OCR model.
The context representation model 402 is used to perform a first encoding on each word in each text line output by the character recognition model 401, so as to obtain semantic features of each word. The context representation model 402 may be a pre-trained BERT model.
The text line position feature extraction model 403 is used for performing feature extraction on the image containing the document through a first feature extraction module to obtain a feature map; and extracting the position feature corresponding to each text line in the feature map by a second feature extraction module according to the coordinate of each text line output by the character recognition model 401. The first feature extraction module is a feature extraction module of a pre-trained text line boundary detection model. The training process of the text line boundary detection model refers to the foregoing embodiments, and is not described herein again.
The layout analysis model 404 is configured to fuse, by using the feature fusion module, semantic features of each word in each text line and corresponding position features output by the text line position feature extraction model 403 for each text line output by the character recognition model 401, so as to obtain coding features of the text line; and decoding the coding characteristics of each text line through a first decoding module to obtain the layout structure information of the document.
Obviously, step S301 is implemented by the character recognition model 401, the context representation model 402, and the text line position feature extraction model 403. Steps S302-S303 are implemented by layout analysis model 404. That is, step S101 is implemented by the character recognition model 401, the context representation model 402, the text line position feature extraction model 403, and the layout analysis model 404.
The extraction model 405 is used to encode each word in each text line output by the character recognition model 401 according to the layout structure information output by the layout analysis model 404, and determine the element label to which each word belongs according to the coding result of each word.
It is clear that steps S102-S103 are implemented by the extraction model 405.
Optionally, the character recognition model 401, the context representation model 402, the text line boundary detection model, the layout analysis model 404, and the extraction model 405 may be obtained by independent training respectively.
The training process of the character recognition model 401 and the context representation model 402 can refer to the existing implementation scheme, and is not described in detail here.
The training process of the text line boundary detection model and the layout analysis model 404 can refer to the foregoing embodiments, and will not be described herein.
The sample for training the extraction model 405 is a document and layout structure information of the document, and the layout structure information in the sample may be artificially labeled or obtained by processing a picture including the document through the character recognition model 401, the context representation model 402, the text line position feature extraction model 403 and the layout analysis model 404; the sample label is: and aiming at the element label of each character marked in the document.
The input of the extraction model 405 is a document and layout structure information of the document, and the output of the extraction model 405 is an element tag to which each word in the input document belongs.
The extraction model 405 may be trained by:
the document as a sample and the layout structure information of the document are input to the extraction model 405, and the element tag to which each word output by the extraction model 405 belongs is obtained.
And updating the parameters of the extraction model 405 by taking the goal that the element labels to which the words output by the extraction model 405 are close to the sample labels until the training end condition is met.
Optionally, the character recognition model 401, the context representation model 402, and the text line boundary detection model may be trained separately, and the layout analysis model 404 and the extraction model 405 may be trained jointly.
The sample for performing the joint training on the layout analysis model 404 and the extraction model 405 is a picture containing a document (i.e., a picture document), and the sample label is: the attribute of each text line of the picture document, the associated text line of each text line, the relationship between each text line and its associated text line, and the element label to which each word in the document belongs, all labeled for the sample.
The process of jointly training layout analysis model 404 and extraction model 405 may include:
and performing optical character recognition on the sample picture through the character recognition model 401 to obtain each text line in the document contained in the sample picture and the coordinates of the text line.
The words in each text line output by the character recognition model 401 are first encoded by the context representation model 402 to obtain the semantic features of each word.
Extracting the characteristics of the sample picture through a text line position characteristic extraction model 403 to obtain a characteristic diagram; and extracting the position feature corresponding to each text line in the feature map according to the coordinates of each text line output by the character recognition model 401.
The semantic features of each word in each text line in the sample picture and the position features corresponding to the text lines are input into the layout analysis model 404, and layout structure information (including the attributes of each text line in the sample, the associated text line of each text line, and the relationship between each text line and the associated text line) output by the layout analysis model 404 is obtained.
The layout structure information output by the layout analysis model 404 and the text line in the sample output by the character recognition model 401 are input into the extraction model 405, and the element label of each character in the text line output by the extraction model 405 is obtained.
With the layout structure information output by the layout analysis model 404 and the element labels of the characters output by the extraction model 405 approaching the sample labels as targets, the parameters of the layout analysis model 404 and the extraction model 405 are updated until the training end condition is met.
In an alternative embodiment, an implementation flowchart of encoding each word in the document according to the layout structure information is shown in fig. 5, and may include:
step S501: constructing an abnormal composition based on the document according to the layout structure information of the document, wherein nodes in the abnormal composition comprise word nodes, title nodes and text segment nodes; the edges in the heteromorphic graph include: word-to-word relationships, word-to-text segment relationships, text segment-to-title relationships.
According to the method and the device, the title and the text section are determined from the document according to the layout structure information, and then the abnormal composition is constructed according to the characters, the title and the text section in the document.
In determining the title, the titles of the respective hierarchies are determined.
The heterogeneous graph may be constructed based on the document by a graph convolution module in the extraction model 405 according to the layout structure information of the document.
In the present application, the nodes in the heterogeneous graph include three types: word nodes, title nodes and text segment nodes; the edges in the heterogeneous graph, which represent relationships between nodes, also include three types: word-to-word relationships, word-to-text-segment relationships, and text-segment-to-title relationships. The relationship between two words may be their co-occurrence frequency or co-occurrence count; the relationship between a word and a text segment may be the importance of the word to the text segment, such as the word's tf-idf (term frequency-inverse document frequency); and the relationship between a text segment and a title may be whether the text segment is related to the title, where a text segment is related to a title if it belongs to the content under that title, and is otherwise unrelated.
As an example, the title nodes in the heterogeneous graph are all the titles in the document, that is, each title is taken as a title node; the text segment nodes in the heterogeneous graph are the respective text segments in the document.
As another example, the title nodes in the heterogeneous graph are target titles in the document, where the hierarchy of a target title is higher than a target hierarchy; the text segment nodes in the heterogeneous graph include non-target titles and the text segments in the document, where the hierarchy of a non-target title is lower than or equal to the target hierarchy. That is, in the present application, only part of the titles (titles whose hierarchy is higher than the target hierarchy) serve as title nodes in the heterogeneous graph, and the other part of the titles (titles whose hierarchy is lower than or equal to the target hierarchy) serve as part of the text segment nodes, so the text segment nodes in the heterogeneous graph include two parts: some text segment nodes are text segments of the document that do not belong to the titles, and some text segment nodes are titles.
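A minimal sketch of constructing the heterogeneous graph as a weighted adjacency structure is given below. The concrete edge-weight formulas (a simple tf-idf for word-to-text-segment edges, raw co-occurrence counts for word-to-word edges, and membership indicators for text-segment-to-title edges) are assumptions consistent with, but not prescribed by, the description above.

```python
# Sketch: build the heterogeneous graph over word, text segment and title nodes
# as a sparse adjacency dict {(i, j): weight}.
import math
from collections import Counter

def build_graph(segments, seg_titles, vocab):
    """
    segments: list of text segments, each a list of words (all words assumed in vocab).
    seg_titles: seg_titles[j] = index of the title node governing segment j.
    vocab: list of distinct words; segment nodes follow word nodes, title nodes follow segments.
    """
    n_words, n_segs = len(vocab), len(segments)
    word_idx = {w: i for i, w in enumerate(vocab)}
    adj = {}
    # word-to-text-segment edges: tf-idf of the word within the segment
    df = Counter(w for seg in segments for w in set(seg))
    for j, seg in enumerate(segments):
        tf = Counter(seg)
        for w, c in tf.items():
            idf = math.log(n_segs / (1 + df[w]))
            adj[(word_idx[w], n_words + j)] = (c / len(seg)) * idf
    # word-to-word edges: co-occurrence count within a segment (PMI could be used instead)
    for seg in segments:
        for a in set(seg):
            for b in set(seg):
                if a != b:
                    key = (word_idx[a], word_idx[b])
                    adj[key] = adj.get(key, 0) + 1
    # text-segment-to-title edges: 1 if the segment belongs to the content under the title
    for j, t in enumerate(seg_titles):
        adj[(n_words + j, n_words + n_segs + t)] = 1.0
    return adj
```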
Step S502: performing graph convolution on the heterogeneous graph to obtain the coding result of each node.
The heterogeneous graph can be processed by a graph convolutional network to perform graph convolution and obtain the coding result of each node. The specific implementation is not the focus of the present application; reference may be made to existing implementations, which are not described in detail herein.
The heterogeneous graph may be graph-convolved by a graph convolution module (i.e., a graph convolutional network) in the extraction model 405.
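A minimal sketch of such a graph convolution module, assuming a plain two-layer graph convolutional network with simple row normalisation of the adjacency matrix; dimensions are illustrative.

```python
# Sketch of the graph convolution module: a two-layer GCN over the heterogeneous
# graph's adjacency matrix and the initial values of its nodes.
import torch
import torch.nn as nn

class GCN(nn.Module):
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden)
        self.w2 = nn.Linear(hidden, hidden)

    def forward(self, A, X):
        # A: (num_nodes, num_nodes) adjacency with self-loops; X: (num_nodes, dim) initial values
        d = A.sum(dim=1, keepdim=True).clamp(min=1)
        A_hat = A / d                                  # simple row normalisation
        H = torch.relu(self.w1(A_hat @ X))
        return self.w2(A_hat @ H)                      # coding result of each node
```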
Step S503: fusing the coding result of each word with the coding result of the corresponding title node to obtain the coding result of each word.
The title node corresponding to a word W is the title node to which the text segment where the word W is located belongs. For example,
Suppose there are two secondary titles T11 and T12 under a primary title T1, each having a text segment under it; suppose the text segment under the secondary title T11 is D1 and the text segment under the secondary title T12 is D2. Then:
if the primary title T1, the secondary title T11, and T12 are title nodes and W belongs to the text passage D1, the title node corresponding to the word W is T11.
If only the primary title T1 is a title node and the secondary titles T11 and T12 are both text segment nodes, then the title node for word W is T1.
As an example, the coding result of the title node may be linearly transformed so that the dimension of the transformed coding result is the same as the dimension of the coding result of the word W, and then the transformed coding result of the title node and the coding result of the word W are spliced or averaged to obtain the coding result of the word W.
The coding result of each word and the coding result of its corresponding title node may be fused by the graph convolution module in the extraction model 405 to obtain the coding result of each word.
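A minimal sketch of this fusion step, assuming the splicing (concatenation) variant and illustrative dimensions:

```python
# Sketch: fuse a word's coding result with the coding result of its corresponding
# title node, by linearly transforming the title encoding to the word dimension
# and concatenating (averaging is the other option mentioned above).
import torch
import torch.nn as nn

class WordTitleFusion(nn.Module):
    def __init__(self, title_dim=256, word_dim=256):
        super().__init__()
        self.proj = nn.Linear(title_dim, word_dim)   # dimension alignment

    def forward(self, word_codes, title_codes, word_to_title):
        # word_codes: (num_words, word_dim); title_codes: (num_titles, title_dim)
        # word_to_title: LongTensor, word_to_title[i] = index of word i's title node
        titles = self.proj(title_codes)[word_to_title]        # (num_words, word_dim)
        return torch.cat([word_codes, titles], dim=-1)        # fused coding of each word
```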
After the coding result of each word is obtained, the decoding module of the extraction model 405 (for ease of distinction, referred to as the second decoding module) may decode the coding result of each word output by the graph convolution module to obtain the element label to which each word belongs. The second decoding module may decode the coding result of each word through a Conditional Random Field (CRF) model to obtain the element label to which each word belongs.
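A minimal sketch of the second decoding module, assuming the pytorch-crf package provides the CRF layer and that element labels follow a BIO tagging scheme; both are assumptions made for illustration.

```python
# Sketch of the second decoding module: a CRF layer over per-word emission scores.
import torch
import torch.nn as nn
from torchcrf import CRF   # assumed dependency: the pytorch-crf package

class ElementTagger(nn.Module):
    def __init__(self, word_dim=512, num_tags=9):     # e.g. BIO tags for element types
        super().__init__()
        self.emission = nn.Linear(word_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, word_codes, tags, mask):
        # word_codes: (batch, seq_len, word_dim) fused coding results of the words
        return -self.crf(self.emission(word_codes), tags, mask=mask)

    def decode(self, word_codes, mask):
        # Returns the element tag index sequence of each word.
        return self.crf.decode(self.emission(word_codes), mask=mask)
```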
In an alternative embodiment, the initial value of each node in the heterogeneous graph may be determined as follows:
Taking each text unit in the document as a unit (for convenience of description and distinction, denoted as the jth text unit), each word in the text unit is coded (for convenience of description and distinction, referred to as second coding) to obtain the context feature representation of each word within its text unit, as the initial value of the corresponding word node in the heterogeneous graph, where j = 1, 2, 3, …, M and M is the total number of text units. Each text unit is a title or a text segment.
As an example, the jth text unit may be processed by a context representation module in the extraction model 405, so as to obtain a context feature representation of each word in the jth text unit.
For any title node, the context feature representations of all words in the title of the title node are fused to obtain the initial value of the title node.
As an example, the context feature representation of each word in the title of the title node may be averaged to obtain the initial value of the title node.
For any text segment node, the context feature representations of the words in the title or text segment corresponding to that text segment node are fused to obtain the initial value of the text segment node. That is, if the text segment node is a title, the context feature representations of the words in that title are fused to obtain the initial value of the text segment node; if the text segment node is a text segment, the context feature representations of the words in that text segment are fused to obtain the initial value of the text segment node.
As an example, the context feature representation of each word in the title of a text segment node may be averaged to obtain an initial value of the text segment node. Or, averaging the context feature representations of the words in the text segment of the text segment node to obtain the initial value of the text segment node.
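A minimal sketch of computing these initial values, assuming averaging is used as the fusion operation and that the per-unit context features have already been obtained by the second coding:

```python
# Sketch: initial values of the graph nodes. Word nodes take their in-unit context
# feature representation (second coding); title and text segment nodes take the
# average of the context features of the words they contain.
import torch

def node_initial_values(unit_word_feats, unit_is_title):
    """
    unit_word_feats: list over text units (title or text segment); each entry is a
                     (n_words_j, dim) tensor of second-coding context features.
    unit_is_title:   list of bools marking which units act as title nodes.
    """
    word_init = torch.cat(unit_word_feats, dim=0)               # word node initial values
    unit_init = torch.stack([f.mean(dim=0) for f in unit_word_feats])
    is_title = torch.tensor(unit_is_title)
    title_init = unit_init[is_title]                            # title node initial values
    seg_init = unit_init[~is_title]                             # text segment node initial values
    return word_init, title_init, seg_init
```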
Fig. 6 is another schematic structural diagram of the document element extraction model provided in the embodiment of the present application. The difference from fig. 4 is mainly that a schematic structure of the extraction model 405 is given. In this example,
the extraction model 405 may first perform second encoding on each word in each text unit by using a text unit in a document as a unit through a context representation module according to layout structure information to obtain context feature representations of each word in each text unit in the text unit, and fuse the context feature representations of each word belonging to the same text unit to obtain feature representations of the text unit;
then the extraction model 405 constructs a heterogeneous graph according to the layout structure information through a graph convolution module, uses the context feature representations of the words obtained by the context representation module and the feature representation of each text unit as the initial values of the nodes of the heterogeneous graph, and carries out graph convolution on the heterogeneous graph to obtain the coding result of each node; and fuses the coding result of each word with the coding result of the corresponding title node to obtain the coding result of each word.
Finally, the extraction model 405 decodes the coding result of each word through a conditional random field decoding module to obtain the element tag of each word.
Corresponding to the method embodiment, the present application further provides a document element extraction device, and a schematic structural diagram of the document element extraction device provided in the embodiment of the present application is shown in fig. 7, and may include:
an obtaining unit 701, an encoding unit 702, and an extraction unit 703; wherein
An obtaining unit 701 is configured to obtain layout structure information of the document;
the encoding unit 702 is configured to encode each word in the document according to the layout structure information;
the extraction unit 703 is configured to determine an element tag to which each word belongs according to an encoding result of each word.
According to the document element extraction device provided by the embodiment of the application, when each word in a document is coded, the layout structure information of the document is fused, the element label of each word is determined based on the word coding result fused with the document coding structure information, and the accuracy of document element extraction is improved.
In an optional embodiment, the obtaining unit 701 includes:
the feature extraction unit is used for processing the picture containing the document to obtain the semantic features of each word in the document and the position features corresponding to each text line;
the first fusion unit is used for fusing, for each text line, the semantic features and corresponding position features of the words in the text line to obtain the coding features of the text line;
and the decoding unit is used for decoding the coding characteristics of each text line to obtain the layout structure information of the document.
In an optional embodiment, when the first fusion unit obtains the coding features of each text line and the decoding unit obtains the layout structure information of the document, they are configured to:
inputting the semantic features and the corresponding position features of each character in each text line into a layout analysis model in the document element extraction model, so that the layout analysis model fuses, for each text line, the semantic features and the corresponding position features of the characters in the text line to obtain the coding features of the text line, and decodes the coding features of the text lines to output the layout structure information;
the layout analysis model is obtained by training with the semantic features and the corresponding position features of each character in each text line of a sample picture as input, the labeled layout structure information of the sample picture as a sample label, and with the goal that the layout structure information output by the layout analysis model approaches the sample label. An illustrative sketch of such a layout analysis model is given below.
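For illustration only, the following is a minimal sketch of such a layout analysis model, assuming PyTorch; the label set LAYOUT_LABELS and all module names are assumptions. Character features of each text line are fused with the line's position feature, the resulting line encodings are processed by a sequence encoder, and a layout label is predicted for every text line.

    import torch
    import torch.nn as nn

    LAYOUT_LABELS = ["paragraph", "title-1", "title-2", "header", "footer"]   # assumed label set

    class LayoutAnalysisSketch(nn.Module):
        def __init__(self, char_dim, pos_dim, hidden_dim):
            super().__init__()
            self.line_proj = nn.Linear(char_dim + pos_dim, hidden_dim)
            self.line_encoder = nn.LSTM(hidden_dim, hidden_dim // 2,
                                        batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(hidden_dim, len(LAYOUT_LABELS))

        def forward(self, char_feats, line_pos_feats):
            # char_feats: list over text lines, each (num_chars, char_dim) semantic features
            # line_pos_feats: (num_lines, pos_dim) position feature of each text line
            line_vecs = []
            for chars, pos in zip(char_feats, line_pos_feats):
                pooled = chars.mean(dim=0)                               # pool characters in the line
                line_vecs.append(self.line_proj(torch.cat([pooled, pos], dim=-1)))
            lines = torch.stack(line_vecs).unsqueeze(0)                  # (1, num_lines, hidden_dim)
            encoded, _ = self.line_encoder(lines)                        # coding features of text lines
            return self.classifier(encoded).squeeze(0)                   # layout label scores per line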
In an optional embodiment, when the feature extraction unit processes the picture including the document to obtain the semantic features of each word in the document and the position features corresponding to each text line, the feature extraction unit is configured to:
carrying out optical character recognition on the picture through a character recognition model in the document element extraction model to obtain each text line in the document and coordinates of the text line;
carrying out first coding on each word in each text line through a context representation model in the document element extraction model to obtain semantic features of each word;
performing feature extraction on the picture through a first feature extraction module of a text line position feature extraction model in the document element extraction model to obtain a feature map; extracting, through a second feature extraction module of the text line position feature extraction model, the position feature corresponding to each text line from the feature map according to the coordinates of that text line; the first feature extraction module is a feature extraction module of a pre-trained text line boundary detection model. An illustrative sketch of this position-feature extraction follows.
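Purely as an illustration, the following sketch shows one way to obtain a position feature for each text line from the page feature map, assuming torchvision's roi_align is used to pool the feature-map region covered by the text line's coordinates; the coordinate format, scale handling and pooled output size are assumptions.

    import torch
    from torchvision.ops import roi_align

    def text_line_position_features(feature_map, line_boxes, spatial_scale):
        # feature_map: (1, C, H, W) produced by the pre-trained boundary-detection backbone
        # line_boxes: (N, 4) text-line coordinates (x1, y1, x2, y2) in page pixels
        batch_idx = torch.zeros(line_boxes.shape[0], 1)
        rois = torch.cat([batch_idx, line_boxes], dim=1)         # (N, 5) format expected by roi_align
        pooled = roi_align(feature_map, rois, output_size=(1, 4),
                           spatial_scale=spatial_scale)           # (N, C, 1, 4)
        return pooled.flatten(start_dim=1)                        # one position feature per text line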
In an optional embodiment, the system further includes a training module, configured to train the text line boundary detection model, including:
inputting a sample picture into the text line boundary detection model, and performing feature extraction on the input sample picture through a feature extraction module of the text line boundary detection model to obtain a feature map of the sample picture;
processing the characteristic graph of the sample picture through an output module of the text line boundary detection model to obtain text line boundary coordinates in the sample picture;
updating parameters of the text line boundary detection model with the goal that the text line boundary coordinates output by the text line boundary detection model approach the label of the sample picture;
wherein the label of the sample picture is: the boundary coordinates of each text line annotated for the sample picture. An illustrative training sketch follows.
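As a non-limiting illustration, the following is a minimal training sketch, assuming PyTorch and assuming the text line boundary detection model exposes a features (feature extraction) submodule and a head (output) submodule; the model, loader and loss choices are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def train_boundary_detector(boundary_model, loader, epochs=10, lr=1e-4):
        optimizer = torch.optim.Adam(boundary_model.parameters(), lr=lr)
        for _ in range(epochs):
            for sample_picture, gt_boxes in loader:              # gt_boxes: annotated line boundaries
                fmap = boundary_model.features(sample_picture)   # feature extraction module
                pred_boxes = boundary_model.head(fmap)           # output module: (N, 4) coordinates
                loss = F.smooth_l1_loss(pred_boxes, gt_boxes)    # coordinates approach the label
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()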
In an optional embodiment, the system further includes a training module, configured to train the text line boundary detection model, including:
inputting a sample picture into the text line boundary detection model, and performing feature extraction on the input sample picture through a feature extraction module of the text line boundary detection model to obtain a feature map of the sample picture;
processing the feature map of the sample picture through an output module of the text line boundary detection model to obtain text line boundary coordinates in the sample picture and the category of a corresponding area of each text line boundary coordinate;
updating parameters of the text line boundary detection model with the goal that the text line boundary coordinates output by the text line boundary detection model, and the category of the region corresponding to each text line boundary coordinate, approach the label of the sample picture;
wherein the label of the sample picture is: the boundary coordinates of each text line annotated for the sample picture, and the category of the region corresponding to each text line boundary coordinate. An illustrative sketch of the joint objective follows.
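For illustration only, the joint objective of this variant can be sketched as the sum of a coordinate-regression loss and a region-classification loss; the particular loss functions are assumptions.

    import torch.nn.functional as F

    def boundary_and_category_loss(pred_boxes, pred_logits, gt_boxes, gt_classes):
        box_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)       # boundary coordinates approach the label
        cls_loss = F.cross_entropy(pred_logits, gt_classes)     # region categories approach the label
        return box_loss + cls_loss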
In an optional embodiment, when the encoding unit 702 encodes each word in the document according to the layout structure information and the extraction unit 703 determines the element tag to which each word belongs according to the encoding result of each word, they are configured to:
coding each word in the document according to the layout structure information through an extraction model in the document element extraction model, and determining an element label of each word according to a coding result of each word; the extraction model is obtained by training in the following way:
inputting the layout structure information and the text lines of the document into the extraction model, so that the extraction model encodes each character in the input text lines according to the input layout structure information and determines the element label of each character according to the encoding result of that character;
updating parameters of the extraction model with the goal that the element label of each word output by the extraction model approaches the label of the sample picture;
wherein the label of the sample picture is: the element label to which each word belongs, annotated for the sample picture. An illustrative training sketch follows.
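Purely as an illustration, the following is a minimal training sketch for the extraction model, assuming PyTorch. Note that the patent decodes element tags with a conditional random field; for brevity this sketch substitutes a per-word cross-entropy over tag scores, and a faithful implementation would replace it with a CRF negative log-likelihood. All names are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def train_extraction_model(extraction_model, loader, epochs=10, lr=1e-4):
        optimizer = torch.optim.Adam(extraction_model.parameters(), lr=lr)
        for _ in range(epochs):
            for layout_info, text_lines, gt_word_tags in loader:
                tag_scores = extraction_model(text_lines, layout_info)    # (num_words, num_tags)
                loss = F.cross_entropy(tag_scores, gt_word_tags)          # tags approach the labels
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()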
In an optional embodiment, the layout structure information at least includes: paragraph division, title hierarchy, header, footer.
In an alternative embodiment, the encoding unit 702 includes:
the extraction unit is used for extracting text units from the document according to the layout structure information; each text unit is a title or a text segment;
the composition unit is used for constructing a heterogeneous graph based on the document, nodes in the heterogeneous graph comprising word nodes, title nodes and text segment nodes; edges in the heterogeneous graph comprise: relations between words, relations between words and text segments, and relations between text segments and titles;
the graph convolution unit is used for carrying out graph convolution on the heterogeneous graph to obtain the coding result of each node;
and the second fusion unit is used for fusing the coding result of each word with the coding result of the corresponding title node to obtain the coding result of each word. An illustrative sketch of the graph construction and convolution follows.
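As a non-limiting illustration, the following sketch shows a simple way to build the adjacency of such a heterogeneous graph from the listed relations and to apply one graph-convolution step, assuming PyTorch; the symmetric normalisation is a common choice and an assumption here.

    import torch

    def build_adjacency(num_nodes, edges):
        # edges: (i, j) node-index pairs for word-word, word-segment and segment-title relations
        adj = torch.eye(num_nodes)
        for i, j in edges:
            adj[i, j] = adj[j, i] = 1.0                      # undirected edges plus self-loops
        deg_inv_sqrt = torch.diag(adj.sum(dim=1).pow(-0.5))
        return deg_inv_sqrt @ adj @ deg_inv_sqrt             # symmetrically normalised adjacency

    def graph_convolve(node_feats, adj_norm, weight):
        # one propagation step: each node aggregates the features of its neighbours
        return torch.relu(adj_norm @ node_feats @ weight)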
In an optional embodiment, the title nodes in the heterogeneous graph are all the titles in the document, and the text segment nodes in the heterogeneous graph are all the text segments in the document;
alternatively,
the title nodes in the heterogeneous graph are target titles in the document, the hierarchy level of a target title being higher than a target level; the text segment nodes in the heterogeneous graph comprise non-target titles and text segments in the body text, the level of a non-target title being lower than or equal to the target level. A brief sketch of this node partition follows.
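For illustration only, the partition of text units into title nodes and text segment nodes under a target level can be sketched as follows; the dictionary fields and the target level value are assumptions.

    def split_nodes(text_units, target_level=2):
        # text_units: dicts with "kind" ("title" or "segment") and, for titles, a "level"
        title_nodes, segment_nodes = [], []
        for unit in text_units:
            if unit["kind"] == "title" and unit["level"] < target_level:
                title_nodes.append(unit)       # target titles: hierarchy above the target level
            else:
                segment_nodes.append(unit)     # non-target titles and text segments
        return title_nodes, segment_nodes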
In an alternative embodiment, the graph convolution unit determines the initial value of each node in the heterogeneous graph by:
performing second encoding on the words of each text unit, taking the text unit in the document as the encoding unit, to obtain a context feature representation of each word within its text unit as the initial value of the corresponding word node in the heterogeneous graph; each text unit is a title or a text segment;
for any title node, fusing the context feature representations of the words in the title of the title node to obtain an initial value of the title node;
and for any text segment node, fusing the context feature representations of the words in the title or text segment of the text segment node to obtain an initial value of the text segment node. An illustrative sketch of this initialisation follows.
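As a non-limiting illustration, the following sketch initialises the node values, assuming PyTorch: word nodes take their in-unit context representations, and each title or text segment node takes a fused representation of the words it contains; mean pooling and the bidirectional LSTM encoder are assumed choices.

    import torch

    def init_node_values(text_units, word_encoder):
        # text_units: list of (kind, word_embeddings), word_embeddings: (num_words, dim)
        # word_encoder: e.g. a bidirectional nn.LSTM with batch_first=True (an assumed choice)
        word_values, unit_values = [], []
        for _, word_embs in text_units:
            ctx, _ = word_encoder(word_embs.unsqueeze(0))    # second encoding within the unit
            ctx = ctx.squeeze(0)                             # (num_words, hidden_dim)
            word_values.append(ctx)                          # initial values of the word nodes
            unit_values.append(ctx.mean(dim=0))              # initial value of the title/segment node
        return torch.cat(word_values, dim=0), torch.stack(unit_values)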
The document element extraction device provided by the embodiments of the application can be applied to document element extraction equipment, such as a PC terminal, a cloud platform, a server cluster and the like. Optionally, fig. 8 shows a block diagram of a hardware structure of the document element extraction equipment; referring to fig. 8, the hardware structure of the document element extraction equipment may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, etc.;
the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory or the like, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
obtaining layout structure information of the document;
coding each character in the document according to the layout structure information;
and determining the element label of each word according to the encoding result of each word.
Optionally, for the detailed functions and extended functions of the program, reference may be made to the description above.
Embodiments of the present application further provide a storage medium, where a program suitable for execution by a processor may be stored, where the program is configured to:
obtaining layout structure information of the document;
coding each character in the document according to the layout structure information;
and determining the element label of each word according to the encoding result of each word.
Optionally, the detailed functions and extended functions of the program may be as described above.
Those of skill in the art would appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. A document element extraction method is characterized by comprising the following steps:
obtaining layout structure information of the document;
coding each character in the document according to the layout structure information;
and determining the element label of each word according to the encoding result of each word.
2. The method of claim 1, wherein obtaining layout structure information for the document comprises:
processing the picture containing the document to obtain semantic features of each word in the document and position features corresponding to each text line;
for each text line, fusing the semantic features and the corresponding position features of each word in the text line to obtain the coding features of the text line;
and decoding the coding characteristics of each text line to obtain the layout structure information of the document.
3. The method of claim 2, wherein obtaining the coding features of each text line and the layout structure information of the document comprises:
inputting the semantic features and the corresponding position features of each character in each text line into a layout analysis model in a document element extraction model, so that the layout analysis model fuses, for each text line, the semantic features and the corresponding position features of the characters in the text line to obtain the coding features of the text line, and decodes the coding features of the text lines to output the layout structure information;
wherein the layout analysis model is obtained by training with the semantic features and the corresponding position features of each character in each text line of a sample picture as input, the labeled layout structure information of the sample picture as a sample label, and with the goal that the layout structure information output by the layout analysis model approaches the sample label.
4. The method according to claim 2, wherein the processing the picture containing the document to obtain the semantic features of each word in the document and the position features corresponding to each text line comprises:
carrying out optical character recognition on the picture through a character recognition model in the document element extraction model to obtain each text line in the document and coordinates of the text line;
carrying out first coding on each word in each text line through a context representation model in the document element extraction model to obtain semantic features of each word;
extracting the features of the picture through a first feature extraction module of a text line position feature extraction model in the document element extraction model to obtain a feature map; extracting the position feature corresponding to each text line in the feature map according to the coordinate of each text line through a second feature extraction module of the text line position feature extraction model; the first feature extraction module is a feature extraction module of a pre-trained text line boundary detection model.
5. The method of claim 4, wherein the text line boundary detection model is trained by:
inputting a sample picture into the text line boundary detection model, and performing feature extraction on the input sample picture through a feature extraction module of the text line boundary detection model to obtain a feature map of the sample picture;
processing the characteristic graph of the sample picture through an output module of the text line boundary detection model to obtain text line boundary coordinates in the sample picture;
updating parameters of the text line boundary detection model with the goal that the text line boundary coordinates output by the text line boundary detection model approach a label of the sample picture;
wherein the label of the sample picture is: the boundary coordinates of each text line annotated for the sample picture.
6. The method of claim 4, wherein the text line boundary detection model is trained by:
inputting a sample picture into the text line boundary detection model, and performing feature extraction on the input sample picture through a feature extraction module of the text line boundary detection model to obtain a feature map of the sample picture;
processing the feature map of the sample picture through an output module of the text line boundary detection model to obtain text line boundary coordinates in the sample picture and the category of a corresponding area of each text line boundary coordinate;
updating parameters of the text line boundary detection model with the goal that the text line boundary coordinates output by the text line boundary detection model, and the category of the region corresponding to each text line boundary coordinate, approach a label of the sample picture;
wherein the label of the sample picture is: the boundary coordinates of each text line annotated for the sample picture, and the category of the region corresponding to each text line boundary coordinate.
7. The method of claim 4, wherein the encoding each word in the document according to the layout structure information, and determining the element tag to which each word belongs according to the encoding result of each word comprises:
coding each word in the document according to the layout structure information through an extraction model in the document element extraction model, and determining an element label of each word according to a coding result of each word; the extraction model is obtained by training in the following way:
inputting the layout structure information and the text lines of the document into the extraction model, so that the extraction model encodes each character in the input text lines according to the input layout structure information and determines the element label of each character according to the encoding result of that character;
updating parameters of the extraction model with the goal that the element label of each word output by the extraction model approaches a label of the sample picture;
wherein the label of the sample picture is: the element label to which each word belongs, annotated for the sample picture.
8. The method according to any one of claims 1 to 7, wherein the layout structure information includes at least: paragraph division, title hierarchy, header, footer.
9. The method of claim 8, wherein encoding each word in the document according to the layout structure information comprises:
constructing a heterogeneous graph based on the document according to the layout structure information, wherein nodes in the heterogeneous graph comprise word nodes, title nodes and text segment nodes, and edges in the heterogeneous graph comprise: relations between words, relations between words and text segments, and relations between text segments and titles;
performing graph convolution on the heterogeneous graph to obtain a coding result of each node;
and fusing the coding result of each word with the coding result of the corresponding title node to obtain the coding result of each word.
10. The method of claim 9, wherein the title nodes in the heterogeneous graph are all the titles in the document, and the text segment nodes in the heterogeneous graph are all the text segments in the document;
or,
the title nodes in the heterogeneous graph are target titles in the document, a hierarchy level of a target title being higher than a target level; the text segment nodes in the heterogeneous graph comprise non-target titles and text segments in the body text, a level of a non-target title being lower than or equal to the target level.
11. The method of claim 9, wherein the initial value of each node in the heterogeneous graph is determined by:
performing second encoding on the words of each text unit, taking the text unit in the document as the encoding unit, to obtain a context feature representation of each word within its text unit as the initial value of the corresponding word node in the heterogeneous graph; each text unit is a title or a text segment;
for any title node, fusing the context feature representations of the words in the title of the title node to obtain an initial value of the title node;
and for any text segment node, fusing the context feature representations of the words in the title or text segment of the text segment node to obtain an initial value of the text segment node.
12. A document element extraction device, comprising:
an obtaining unit configured to obtain layout structure information of the document;
the coding unit is used for coding each word in the document according to the layout structure information;
and the extraction unit is used for determining the element label of each character according to the coding result of each character.
13. A document element extraction device characterized by comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to realize the steps of the document element extraction method according to any one of claims 1 to 11.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the document element extraction method according to any one of claims 1 to 11.
CN202210679246.0A 2022-06-16 2022-06-16 Document element extraction method, device, equipment and storage medium Pending CN114973286A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210679246.0A CN114973286A (en) 2022-06-16 2022-06-16 Document element extraction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210679246.0A CN114973286A (en) 2022-06-16 2022-06-16 Document element extraction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114973286A true CN114973286A (en) 2022-08-30

Family

ID=82963351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210679246.0A Pending CN114973286A (en) 2022-06-16 2022-06-16 Document element extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114973286A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115661847A (en) * 2022-09-14 2023-01-31 北京百度网讯科技有限公司 Table structure recognition and model training method, device, equipment and storage medium
CN115661847B (en) * 2022-09-14 2023-11-21 北京百度网讯科技有限公司 Table structure recognition and model training method, device, equipment and storage medium
CN117095422A (en) * 2023-10-17 2023-11-21 企查查科技股份有限公司 Document information analysis method, device, computer equipment and storage medium
CN117095422B (en) * 2023-10-17 2024-02-09 企查查科技股份有限公司 Document information analysis method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination