CN114297987B

CN114297987B - Document information extraction method and system based on text classification and reading understanding

Info

Publication number: CN114297987B
Application number: CN202210221913.0A
Authority: CN
Inventors: 闫凯峰; 孙林君
Original assignee: Hangzhou Real Intelligence Technology Co ltd
Current assignee: Hangzhou Real Intelligence Technology Co ltd
Priority date: 2022-03-09
Filing date: 2022-03-09
Publication date: 2022-07-19
Anticipated expiration: 2042-03-09
Also published as: CN114297987A

Abstract

The invention belongs to the technical field of information content processing, and particularly relates to a document information extraction method and system based on text classification and reading understanding. The method includes: S1, converting the document into a plain text format; S2, preprocessing the document to obtain input data; S3, generating the corresponding character vectors, word vectors and context vectors, and splicing them to obtain spliced vectors; S4, if the spliced vector is of an answerable type, using the entity text question corresponding to it as the input of the next step; S5, obtaining the position of the best-matching long label data; and S6, finally outputting the long entity field to be extracted. The system comprises a text information intelligent extraction module, a data preprocessing module, a feature extraction module, a text classification module, a reading understanding module, a long entity label data generation module and a data post-processing module. The method greatly shortens training and prediction time and improves the precision and speed of the document extraction model in field extraction.

Description

Document information extraction method and system based on text classification and reading understanding

Technical Field

The invention belongs to the technical field of information content processing, and particularly relates to a document information extraction method and system based on text classification and reading understanding.

Background

In today's highly digitized offices, enterprise employees spend on average about one third of each working day processing text: legal staff review large numbers of contracts and draft agreements, and finance and accounting staff review large numbers of reports. Such work is highly repetitive and high in volume; manual processing is inefficient, and errors can easily cause large, irrecoverable losses. In recent years, as machine learning and deep learning have been applied to natural language processing, intelligent document review systems have entered a stage of rapid development.

The intelligent document review system needs to quickly process documents and provide review functions of automatic text extraction, comparison, error correction and the like for enterprises. The system can replace manual work to complete the automation of the business process, greatly improve the working efficiency and reduce the business risk.

The model framework of a common existing document information extraction method is shown in FIG. 1:

1. The document input by the user is preprocessed and converted into samples that the model can process.

2. The samples are processed into the input format required by the model to obtain the model input.

3. The hidden layer output of the model is obtained through BERT/LSTM and used as the input of the next model layer.

4. A CRF model with a BIOS labeling scheme is used, and the output long sequence labels are obtained by processing the output of the CRF model; alternatively, a span model classifies each input token, judging whether it marks the start position or the end position of a long entity.

The weakness of current information extraction techniques in document review lies in extracting long entity information. It mainly involves the following problems.

Firstly, the extraction precision for long entities in document information is low, and traditional methods have great difficulty extracting text such as contract termination condition fields. The text to be extracted is too long and exceeds the field extraction range of named entity methods in the traditional sense; extracting long text requires incorporating a large amount of semantic information in order to extract the target field more accurately.

Secondly, the field to be extracted is itself long, and the CRF model most frequently used at present has difficulty producing an accurate long label field (here defined as a field more than 20 words in length).

Thirdly, the problem of entity nesting and entity overlapping cannot be solved by adopting a span mode, namely a mode of acquiring the entity to be extracted through the start mark and the end mark.

Fourthly, existing methods use a RoBERTa-based reading understanding method to extract long entities. On the one hand, the task is difficult to define: in the training stage, a targeted question must be posed for each different label, and in the prediction stage, because the label types contained in a document are unknown, the questions for all labels must be asked in order to obtain all possible long entity labels. This greatly increases the time consumed by the whole training stage and makes the time consumed by the test stage N times that of a single-label prediction (N being the total number of labels). On the other hand, for data in which a label does not actually exist, that is, when the document to be predicted is a negative example, a label predicted by the model may cause cascading errors.

Therefore, it is necessary to design a document information extraction method and system based on text classification and reading understanding, which can greatly shorten training and prediction time and improve the precision and speed of a document extraction model in extracting fields.

For example, Chinese patent application No. CN202110353610.X describes a method for extracting key information from documents in the field of artificial intelligence, which includes the following steps: S1, collecting document data in the artificial intelligence field and annotating it for key information extraction; S2, further pre-training the pre-trained model RoBERTa; S3, constructing an information extraction model; S4, initializing the parameters of the backbone network with the RoBERTa model obtained by further pre-training; S5, training with the annotated data, applying random replacement and data enhancement to the annotated data during training, and calculating the back-propagated error using the square cross entropy loss; and S6, extracting information from unstructured text in the field of artificial intelligence with the trained information extraction model to obtain result triples. That method treats information extraction as a machine reading understanding task and predicts the start point and end point of each piece of key information in the text, which addresses the sharp drop in performance of sequence labeling models on long-span knowledge text; however, it still suffers from difficult model training and increased time consumption.

Disclosure of Invention

The invention aims to solve the problems of difficult model training, rapid increase of time consumption and low extraction precision of the existing document information extraction method in the prior art, and provides a document information extraction method and a document information extraction system based on text classification and reading understanding, which can greatly shorten training and prediction time and improve the precision and speed of a document extraction model when extracting fields.

In order to achieve the purpose, the invention adopts the following technical scheme:

The document information extraction method based on text classification and reading understanding comprises the following steps:

S1, inputting a document, parsing and recognizing the document, and converting the document into a plain text format;

S2, preprocessing the text content in the document to obtain input data;

S3, generating the corresponding character vectors, word vectors and context vectors from the input data of step S2, and splicing the character vectors, the word vectors and the context vectors to obtain spliced vectors;

S4, if the spliced vector is of an answerable type, taking the entity text question corresponding to the spliced vector as the input of the next step;

S5, using a reading understanding model to calculate the position of the best-matching long label data corresponding to the entity text question;

and S6, acquiring the long label data according to its position, performing post-processing correction on the long label data, and finally outputting it as the long entity field to be extracted.

Preferably, the preprocessing in step S2 includes:

performing regular-expression preprocessing on the text content;

removing blank characters in the text content, the blank characters comprising space characters, tab characters and line feed characters;

and segmenting the text content in the document according to a preset maximum length.
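The three preprocessing operations can be sketched in Python as follows. This is only an illustrative sketch: the separator character `SEP`, the specific regular expression and the function name `preprocess` are assumptions made for the example, since the text only states that regular-expression preprocessing, blank removal and length-based segmentation are performed.

```python
import re
from typing import List

SEP = ";"       # assumed separator; the text does not name the character used
MAX_LEN = 256   # example segment length mentioned in the embodiment


def preprocess(text: str, max_len: int = MAX_LEN) -> List[str]:
    """Regex split, blank removal and fixed-length segmentation (sketch)."""
    # 1. Regex step: keep a boundary where a line break separates two digit
    #    runs, e.g. a phone number followed by the section number "1.2".
    text = re.sub(r"(?<=\d)\s*\n\s*(?=\d)", SEP, text)
    # 2. Remove the remaining blanks: spaces, tabs and line feeds.
    text = re.sub(r"[ \t\r\n]+", "", text)
    # 3. Cut the text into segments of at most max_len characters.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


segments = preprocess("Tel: 13800000008\n1.2 Termination terms", max_len=16)
```

Without step 1, removing the line feed would fuse the phone number and the section number into a single digit string.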

Preferably, step S3 includes the steps of:

S31, vectorizing the input data by constructing a character table and a word table, and respectively generating the character vector and the word vector corresponding to the input data;

S32, generating the context vector corresponding to the input data through a BiLSTM model;

and S33, splicing the character vector, the word vector and the context vector to obtain a spliced vector.
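A minimal sketch of the table construction and splicing is given below, assuming one table built over characters and one over segmented words. It deliberately simplifies: one-hot vectors stand in for learned embeddings, the input is assumed to be already word-segmented, and the context vectors are passed in as placeholders rather than computed by a BiLSTM. All function names are hypothetical.

```python
def build_table(tokens):
    """Map each distinct token to an integer id; id 0 is reserved for unknowns."""
    table = {"<unk>": 0}
    for tok in tokens:
        table.setdefault(tok, len(table))
    return table


def one_hot(index, size):
    """One-hot vector used here in place of a learned embedding."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def splice(char_vec, word_vec, context_vec):
    """Concatenate the three feature vectors of one position."""
    return char_vec + word_vec + context_vec


def features(words, char_table, word_table, context_vectors):
    """One spliced vector per character: its own vector, its word's, its context."""
    rows, pos = [], 0
    for w in words:
        wv = one_hot(word_table.get(w, 0), len(word_table))
        for c in w:
            cv = one_hot(char_table.get(c, 0), len(char_table))
            rows.append(splice(cv, wv, context_vectors[pos]))
            pos += 1
    return rows


words = ["contract", "ends"]                      # assumed segmenter output
char_table = build_table(c for w in words for c in w)
word_table = build_table(words)
context = [[0.0] * 4 for _ in range(12)]          # placeholder for BiLSTM output
rows = features(words, char_table, word_table, context)
```

In a real system the one-hot vectors would be replaced by trained embeddings and `context` by the hidden states of a bidirectional LSTM.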

Preferably, step S4 further includes the steps of:

and if the spliced vector is of an unanswerable type, the spliced vector has no corresponding entity text question, and the operation ends directly.

Preferably, step S5 includes the steps of:

S51, obtaining, through a formula that is not reproduced in this text, the probability distribution P_start of each position being long label data corresponding to the entity text question, wherein E is a matrix representing the hidden layer output of the reading understanding model, and the weight term of the formula represents a first learnable weight;

S52, obtaining P_end by the same calculation method as step S51;

S53, obtaining, through formulas that are not reproduced in this text, the maximum probability matrices of the long label data corresponding to the entity text question, namely the start position matrix of the most probable labels and the end position matrix of the most probable labels;

S54, predicting, through a binary classification model, the maximum-probability matching of the start position and the end position of the long label data in the matrix E, and thereby obtaining the specific position information of the long label data; the formula is not reproduced in this text, and in it i and j denote the i-th and j-th rows of the matrix E, and m represents a second learnable weight.
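The formulas of steps S51 to S54 did not survive in this text. Purely as an assumption, if the steps follow the usual pointer-style reading comprehension formulation, they might take a form such as the one below. Only E, m, i, j and the start/end roles come from the text; the symbols T_start, T_end, I_start, I_end, n and d are introduced here for illustration.

```latex
% Assumed shapes: E \in \mathbb{R}^{n \times d},\; T_{start}, T_{end} \in \mathbb{R}^{d \times 2},\; m \in \mathbb{R}^{1 \times 2d}
\begin{align}
P_{start} &= \operatorname{softmax}_{\text{each row}}\left(E \, T_{start}\right) \in \mathbb{R}^{n \times 2} \\
P_{end}   &= \operatorname{softmax}_{\text{each row}}\left(E \, T_{end}\right) \in \mathbb{R}^{n \times 2} \\
I_{start} &= \left\{\, i \mid \arg\max\left(P_{start}^{(i)}\right) = 1,\; i = 1, \dots, n \,\right\} \\
I_{end}   &= \left\{\, j \mid \arg\max\left(P_{end}^{(j)}\right) = 1,\; j = 1, \dots, n \,\right\} \\
P_{i,j}   &= \operatorname{sigmoid}\left(m \cdot \operatorname{concat}\left(E_i, E_j\right)\right), \quad i \in I_{start},\; j \in I_{end}
\end{align}
```

Under this reading, T_start plays the role of the first learnable weight, m that of the second learnable weight, and E_i and E_j are the i-th and j-th rows of E.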

Preferably, step S6 includes the steps of:

and dynamically modifying the label type and the content of the long label data by using a regular expression.

Preferably, step S1 includes the steps of:

parsing documents in txt and Word formats through the back end of the document review platform, and converting them into a plain text format;

converting documents in jpg and pdf formats into a plain text format through OCR character recognition.
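The routing between back-end parsing and OCR can be sketched as a dispatch on the file extension. This is a hypothetical sketch: the function name `choose_route`, the returned route names and the inclusion of `.doc`, `.docx` and `.jpeg` alongside the formats named in the text are all assumptions, and the actual parsing and OCR calls are left out.

```python
from pathlib import Path

PARSE_FORMATS = {".txt", ".doc", ".docx"}   # parsed directly by the back end
OCR_FORMATS = {".jpg", ".jpeg", ".pdf"}     # converted through OCR


def choose_route(filename: str) -> str:
    """Return which conversion path a document should take."""
    suffix = Path(filename).suffix.lower()
    if suffix in PARSE_FORMATS:
        return "parse"
    if suffix in OCR_FORMATS:
        return "ocr"
    raise ValueError(f"unsupported document format: {suffix!r}")
```

For example, `choose_route("contract.DOCX")` selects the parsing path and `choose_route("scan.pdf")` selects OCR; either path ends in plain text for step S2.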

The invention also provides a document information extraction system based on text classification and reading understanding, which comprises the following modules:

the intelligent extraction module of text information is used for analyzing and identifying the input document and converting the document into a plain text format;

the data preprocessing module is used for preprocessing the text content in the document to obtain input data;

the feature extraction module is used for generating the corresponding character vectors, word vectors and context vectors from the input data, and splicing the character vectors, the word vectors and the context vectors to obtain spliced vectors;

the text classification module is used for judging the spliced vectors, and taking the entity text questions corresponding to the spliced vectors as the input of the next step if the spliced vectors are of an answerable type;

the reading understanding module is used for obtaining the position of the most matched long label data corresponding to the entity text problem through calculation by utilizing a reading understanding model;

the long entity tag data generating module is used for acquiring long tag data according to the position of the long tag data;

and the data post-processing module is used for carrying out post-processing correction on the long label data and finally outputting the long entity field to be extracted.

Preferably, the feature extraction module comprises:

the character vector feature extraction module is used for vectorizing the input data by constructing a character table to generate the character vector corresponding to the input data;

the word vector feature extraction module is used for vectorizing the input data by constructing a word table to generate the word vector corresponding to the input data;

and the context vector feature extraction module generates the context vector corresponding to the input data through a BiLSTM model.

Preferably, the intelligent extraction module for text information includes:

the back end analysis module is used for analyzing the txt and word format documents through the back end of the document review platform and converting the txt and word format documents into a plain text format;

and the OCR character recognition module is used for converting the jpg and pdf format documents into a plain text format through OCR character recognition.

Compared with the prior art, the invention has the following beneficial effects: (1) The invention combines text classification with reading understanding to solve the long label extraction problem in document information extraction, and can effectively capture the semantic association between the questions related to long entities and the context. The target text is first input into a single text classification model whose task is multi-label, multi-class classification. On the one hand, this narrows down the question types associated with the target text; on the other hand, it speeds up prediction, because most input data carries only a few labels and questions need not be asked for all labels. Adding the classification model as a filter therefore improves the prediction efficiency of the whole framework and reduces prediction time. The question types corresponding to the classification results are then input, together with the target text, into the reading understanding model, so that the extracted long label field is more accurate. (2) In the data processing stage, the problem of entity overlapping is solved by merging entities during preprocessing and separating them during post-processing; the problem of entity nesting is solved by the start and end pointers constructed in the reading understanding model. (3) The method effectively solves the inaccurate long label extraction of prior methods, improving both model prediction accuracy and prediction efficiency; it consumes less time, and its prediction results are very user-friendly. (4) The invention takes more factors into account, is more comprehensive in its considerations, more reasonable in design, more efficient and more broadly applicable.

Drawings

FIG. 1 is a frame diagram of a conventional method model for extracting common document information;

FIG. 2 is a flowchart of a document information extraction method based on text classification and reading understanding according to the present invention;

FIG. 3 is a block diagram of a matrix E according to the present invention;

FIG. 4 is a block diagram of a document information extraction system based on text classification and reading understanding according to the present invention;

FIG. 5 is a flowchart of an exemplary service provided by an embodiment of the present invention.

Detailed Description

In order to more clearly illustrate the embodiments of the present invention, the following description will explain the embodiments of the present invention with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.

Example 1:

As shown in FIG. 2, the invention provides a document information extraction method based on text classification and reading understanding, comprising the following steps:

S2, preprocessing the text content in the document to obtain input data;

S4, if the spliced vector is of an answerable type, taking the entity text question corresponding to the spliced vector as the input of the next step;

In order to accelerate training and prediction, an N+1 classification model is used to determine the answerable (i.e. extractable) data, and the word vectors formed from the pre-constructed question and the document are input into the reading understanding model (such as a BERT, RoBERTa or ALBERT reading understanding model), so that a more complete context information output can be obtained, denoted as matrix E, as shown in FIG. 3.

Further, step S1 includes the following steps:

Further, the preprocessing in step S2 includes:

performing regular-expression preprocessing on the text content;

For example, where a mobile phone number is immediately followed by a text section number (1××8\n1.2), the text must be split at '\n'; blank characters (including space characters, tab characters and line feed characters) are then removed, and finally the document is segmented according to the preset maximum length.

Step S3 includes the following steps:

In order to enable the model to obtain sufficient information from the input data, three levels of vector features are mainly used, namely the character vector feature, the word vector feature and the context vector feature. For the word vector, the text must first be segmented with a word segmentation tool; the character vector and the word vector are obtained by constructing a character table and a word table and then vectorizing; and the context vector is obtained through the BiLSTM model.

Further, step S4 includes the following steps:

if the spliced vector is of an unanswerable type, the spliced vector does not have a corresponding entity text question, and the operation is directly ended.

In step S4, the possible labels are obtained through an N+1 multi-label classification model; the output is the question corresponding to each possible label, which is combined with the classified text as the input of the reading understanding model.

If a label is present, the reading understanding model is then applied to obtain the entity label position (start, end) corresponding to that label.
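The filtering role of the classifier can be sketched as follows. This is a sketch under stated assumptions: the label names, the question wording, the 0.5 threshold and the function names are all hypothetical, and the per-label probabilities are taken as given rather than produced by a trained model.

```python
# Hypothetical label -> question table; names and wording are illustrative only.
QUESTIONS = {
    "termination_condition": "What are the conditions for terminating the contract?",
    "payment_terms": "What are the payment terms?",
}


def select_questions(label_probs, questions=QUESTIONS, threshold=0.5):
    """Keep the question of every label whose binary score passes the threshold.

    An empty result plays the role of the extra "unanswerable" class: the
    segment is dropped and the reading understanding model is not called.
    """
    return [questions[label] for label, p in label_probs.items()
            if label in questions and p >= threshold]


def build_mrc_inputs(segment, label_probs):
    """Pair each selected question with the segment text."""
    return [(q, segment) for q in select_questions(label_probs)]


inputs = build_mrc_inputs(
    "Either party may terminate this contract upon 30 days' notice.",
    {"termination_condition": 0.91, "payment_terms": 0.08},
)
```

Here only one (question, segment) pair is passed on, instead of one pair per label.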

Further, step S5 includes the following steps:

S51, (formula not reproduced in this text) in which the weight term represents a first learnable weight;

S52, obtaining P_end through the same calculation as in step S51;

S53, obtaining, through formulas not reproduced in this text, the maximum probability matrices of the long label data corresponding to the entity text question, namely the start position matrix of the most probable labels and the corresponding end position matrix;

S54, predicting, through a binary classification model, the maximum-probability matching of the start position and the end position of the long label data in the matrix E (formula not reproduced in this text), wherein i and j denote the i-th and j-th rows of the matrix E, and m denotes the second learnable weight.

The matching probability of start and end is predicted through a binary model, i and j represent the ith and jth rows of the matrix (namely row information obtained by formulas (4) and (5)), and the specific position information of the long label data can be finally obtained through the series of calculation, so that the model prediction precision is favorably improved.
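The start/end matching described here can be illustrated with a small pure-Python sketch. It assumes a pointer-style formulation in which each position has a two-way start distribution and a two-way end distribution, and a candidate pair (i, j) is scored by a sigmoid over the weight m applied to the concatenated rows i and j of E. The function names, the 0.5 threshold and the toy numbers are assumptions for illustration, not values from the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def candidate_positions(probs):
    """Indices where the 'is a boundary' probability (column 1) wins."""
    return [i for i, (p0, p1) in enumerate(probs) if p1 > p0]


def match_spans(E, p_start, p_end, m, threshold=0.5):
    """Score every (start, end) candidate pair and keep the likely matches."""
    spans = []
    for i in candidate_positions(p_start):
        for j in candidate_positions(p_end):
            if j < i:
                continue  # an entity cannot end before it starts
            score = sigmoid(sum(w * x for w, x in zip(m, E[i] + E[j])))
            if score > threshold:
                spans.append((i, j, score))
    spans.sort(key=lambda s: s[2], reverse=True)  # best match first
    return spans


# Toy data: 4 positions, hidden size 2 (all numbers are made up).
E = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
p_start = [(0.1, 0.9), (0.8, 0.2), (0.7, 0.3), (0.9, 0.1)]
p_end = [(0.9, 0.1), (0.8, 0.2), (0.2, 0.8), (0.9, 0.1)]
m = [2.0, 0.0, 0.0, 2.0]
spans = match_spans(E, p_start, p_end, m)
```

Because every start candidate is scored against every end candidate, nested entities can be recovered as separate (start, end) pairs.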

Further, step S6 includes the following steps:

Step S6 reverses the data preprocessing, that is, the label is obtained by parsing the question; at the same time, regular expressions are used to dynamically modify the label type and the label content.
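A sketch of such a post-processing step is shown below. It is hypothetical throughout: the separator `SEP`, the `date` label and the eight-digit rule are assumptions chosen to mirror the 20211030 / 1.2 example in the embodiment, not rules stated in the text.

```python
import re

SEP = ";"                          # assumed separator added during preprocessing
DATE_RE = re.compile(r"\d{8}")     # hypothetical rule: a date label is 8 digits


def postprocess(label, text):
    """Undo the preprocessing separator, then correct the content by label."""
    # Reverse of preprocessing: restore the line break between digit runs.
    text = re.sub(r"(?<=\d)" + re.escape(SEP) + r"(?=\d)", "\n", text)
    # Regex correction: for a date label keep only the date itself.
    if label == "date":
        match = DATE_RE.search(text)
        if match:
            text = match.group(0)
    return label, text.strip()
```

For instance, `postprocess("date", "20211030;1.2")` keeps only `20211030`, while a label without a correction rule passes through unchanged.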

As shown in fig. 4, the present invention also provides a document information extraction system based on text classification and reading understanding, including:

the text classification module is used for judging the spliced vector, and if the spliced vector is of an answerable type, taking an entity text question corresponding to the spliced vector as the input of the next step;

Further, the feature extraction module comprises:

Further, the intelligent extraction module for text information comprises:

Based on the technical scheme of the invention, in the concrete implementation and operation process, the concrete implementation flow of the invention is illustrated by a flow chart of a typical service shown in fig. 5.

As shown in fig. 5, the specific implementation flow is as follows:

1. The user inputs a document in a format such as Word, pdf, jpg or txt, and selects the document type.

2. The text information intelligent extraction module parses txt and Word format documents using the back end of the document review platform and converts them into a plain text format; jpg and pdf format documents are converted into a plain text format using OCR character recognition.

3. The data preprocessing module mainly deals with the segmentation of numeric data. For example, in 202110301.2 the first part, 20211030, is the correct label, but the trailing 1.2 would interfere with the model's prediction, so the two are split apart by regular-expression preprocessing. In addition, the text is segmented according to a set length, such as a length of 256.

4. The feature extraction module generates the word vectors and character vectors corresponding to the input data by constructing a word table and a character table. If word2vec is used, the input data is segmented into words, the corresponding word table is obtained from the segmented data, and the word vectors are then constructed. The context vector on which the input text depends is generated through a BiLSTM (bidirectional long short-term memory network), and the three vectors are spliced for the next step.

5. The text classification module inputs the spliced vector into an N+1 multi-label classification model (specifically, a binary classification model is adopted). If the unanswerable type is obtained, the input has no corresponding entity and processing ends directly; otherwise, the entity corresponding to the input is obtained, and the corresponding reading understanding question is spliced on as the input of the next layer.

6. The reading understanding module, specifically using the RoBERTa model, obtains the hidden layer matrix E of the reading understanding model and then obtains the position of the long entity by calculating the maximum matching probability of the (start, end) marker pairs; for the calculation process, refer to step S5.

7. The data post-processing module corrects the labels, fixing some obvious errors to improve prediction accuracy; its other function is to reverse step 3, handling the invalid punctuation added earlier.

8. The final output is the long entity field to be extracted.

The text classification model is creatively added in front of the reading understanding model and used for classifying the label types corresponding to the target documents, so that the entity prediction precision is improved, and the problem of low speed caused by independent use of the reading understanding model is solved.
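The speed argument can be made concrete by counting calls to the reading understanding model. The sketch below uses made-up numbers purely for illustration; the function name `mrc_calls` and its parameters are hypothetical, and no timing figures from the source are implied.

```python
def mrc_calls(num_segments, total_labels, predicted_label_counts=None):
    """Number of reading understanding calls needed for a document.

    Without a classifier every segment is queried once per label; with the
    classifier only the labels it predicts for each segment are queried.
    """
    if predicted_label_counts is None:
        return num_segments * total_labels
    return sum(predicted_label_counts)


without_filter = mrc_calls(3, 20)            # 3 segments, N = 20 labels
with_filter = mrc_calls(3, 20, [1, 0, 2])    # classifier keeps 1, 0 and 2 labels
```

In this toy case the number of calls drops from 60 to 3, at the cost of one classifier pass per segment.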

In the data processing stage, the problem of entity overlapping is solved by merging entities during preprocessing and separating them during post-processing; the problem of entity nesting is solved by constructing pointer-style entity start and end markers in the reading understanding model.

By combining the Chinese character vector, the word vector and the context vector, the context information is fully represented, which facilitates real-time interaction of the text information and further improves the accuracy with which the reading understanding framework extracts information.

The method not only improves the accuracy of information extraction, but also greatly improves the long entity prediction speed, and the prediction result is extremely friendly to users.

The foregoing has outlined rather broadly the preferred embodiments and principles of the present invention and it will be appreciated that those skilled in the art may devise variations of the present invention that are within the spirit and scope of the appended claims.

Claims

1. A document information extraction method based on text classification and reading understanding, characterized by comprising the following steps:

S2, preprocessing the text content in the document to obtain input data;

S5, using a reading understanding model to calculate the position of the best-matching long label data corresponding to the entity text question;

S6, obtaining the long label data according to its position, performing post-processing correction on the long label data, and finally outputting it as the long entity field to be extracted;

the preprocessing in step S2 includes:

performing regular-expression preprocessing on the text content;

removing blank characters in the text content, wherein the blank characters comprise space characters, tab characters and line feed characters;

segmenting text contents in the document according to a preset maximum length;

step S3 includes the following steps:

S33, splicing the character vector, the word vector and the context vector to obtain a spliced vector;

step S4 further includes the steps of:

if the spliced vector is of an unanswerable type, the spliced vector has no corresponding entity text question, and the operation ends directly;

step S5 includes the following steps:

S51, obtaining, through a formula not reproduced in this text, the probability distribution P_start representing that each position is long label data corresponding to the entity text question, wherein the weight term of the formula represents a first learnable weight;

S52, obtaining P_end in the same calculation manner as step S51;

S53 (formulas not reproduced in this text), wherein the first matrix is the start position matrix of the most probable labels, and the second is the corresponding end position matrix;

obtaining the specific position information of the long label data, the specific formula not being reproduced in this text.

2. The document information extraction method based on text classification and reading understanding according to claim 1, wherein the step S6 includes the steps of:

3. The document information extraction method based on text classification and reading understanding according to any one of claims 1-2, wherein the step S1 includes the steps of:

and converting the jpg and pdf format documents into a pure text format through OCR character recognition.

4. A document information extraction system based on text classification and reading understanding, which applies the document information extraction method based on text classification and reading understanding of any one of claims 1-2, wherein the document information extraction system based on text classification and reading understanding comprises:

5. The system of claim 4, wherein the feature extraction module comprises:

6. The system for extracting document information based on text classification and reading understanding according to claim 5, wherein the intelligent extracting module of text information comprises: