CN114297987B - Document information extraction method and system based on text classification and reading understanding - Google Patents

Info

Publication number
CN114297987B
Authority
CN
China
Prior art keywords
text
word
document
vectors
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210221913.0A
Other languages
Chinese (zh)
Other versions
CN114297987A (en
Inventor
闫凯峰
孙林君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Real Intelligence Technology Co ltd
Original Assignee
Hangzhou Real Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Real Intelligence Technology Co ltd
Priority to CN202210221913.0A
Publication of CN114297987A
Application granted
Publication of CN114297987B

Abstract

The invention belongs to the technical field of information content processing, and particularly relates to a document information extraction method and system based on text classification and reading understanding. The method includes: S1, converting the document into a plain text format; S2, preprocessing the document to obtain input data; S3, generating corresponding word vectors, character vectors and context vectors, and splicing them to obtain spliced vectors; S4, if the spliced vector is of an answerable type, using the spliced vector as the input of the next step; S5, obtaining the position of the best-matching long label data; and S6, finally outputting the long entity field to be extracted. The system comprises a text information intelligent extraction module, a data preprocessing module, a feature extraction module, a text classification module, a reading understanding module, a long entity label data generation module and a data post-processing module. The method greatly shortens training and prediction time and improves the precision and speed of the document extraction model in field extraction.

Description

Document information extraction method and system based on text classification and reading understanding
Technical Field
The invention belongs to the technical field of information content processing, and particularly relates to a document information extraction method and system based on text classification and reading understanding.
Background
Today, with office work highly digitized, enterprise office employees spend on average about one third of each day dealing with text; for example, legal personnel review large numbers of contracts and draft agreements, and financial and accounting personnel review large numbers of reports. This work is highly repetitive and high in volume; manual processing is inefficient, and errors can easily cause large, irrecoverable losses. In recent years, as machine learning and deep learning have been applied to the field of natural language processing, intelligent document review systems have begun to enter a stage of rapid development.
The intelligent document review system needs to quickly process documents and provide review functions of automatic text extraction, comparison, error correction and the like for enterprises. The system can replace manual work to complete the automation of the business process, greatly improve the working efficiency and reduce the business risk.
The model framework of a commonly used existing document information extraction method is shown in fig. 1:
1. Data preprocessing is performed on the document input by the user, converting it into samples that the model can process.
2. The samples are processed into the input format required by the model to obtain the model input.
3. The hidden-layer output of the model is obtained through BERT/LSTM and used as the input of the next model layer.
4. A CRF model with a BIOS labeling scheme is used, and the output long sequence labels are obtained by processing the output of the CRF model; alternatively, a span model classifies each input word, judging whether the word is the start-position mark or the end-position mark of a long entity.
A weakness of current information extraction techniques in document review is the extraction of long entity information, which mainly involves the following problems.
Firstly, the extraction precision of long entities in document information is low, and traditional methods have great difficulty extracting text such as contract termination condition fields. The text to be extracted is too long and exceeds the field extraction range of named entity methods in the traditional sense; extracting long text requires incorporating a large amount of semantic information in order to extract the target field more accurately.
Secondly, the field to be extracted is also long, and the CRF model, currently the most frequently used, has difficulty producing an accurate long label field (defined here as a field more than 20 words in length).
Thirdly, the span approach, that is, acquiring the entity to be extracted through a start mark and an end mark, cannot solve the problems of entity nesting and entity overlapping.
Fourthly, existing methods use a RoBERTa-based reading understanding method to extract long entities. First, the task is difficult to define: in the training stage, a targeted question must be written for each different label, and in the prediction stage, because the label types contained in a document are unknown, the questions for all labels must be asked in order to obtain all possible long entity labels. This greatly increases the time consumed by the whole training stage and makes the test stage N times slower than the original single-label prediction (N being the total number of labels). Second, for data in which a label does not originally exist in the document, that is, the document to be predicted is a negative example, a label predicted by the model may cause cascading errors.
Therefore, it is necessary to design a document information extraction method and system based on text classification and reading understanding, which can greatly shorten training and prediction time and improve the precision and speed of a document extraction model in extracting fields.
For example, Chinese patent application No. CN202110353610.X describes a method for extracting key information from documents in the field of artificial intelligence, comprising the following steps: S1, collecting document data in the artificial intelligence field and annotating it for key information extraction; S2, further pre-training the pre-trained model RoBERTa; S3, constructing an information extraction model; S4, initializing the parameters of the backbone network with the RoBERTa model obtained by further pre-training; S5, training with the annotated data, applying random replacement and data enhancement to the annotated data during training, and calculating the back-propagated error with a squared cross-entropy loss; and S6, using the trained information extraction model to extract information from unstructured text in the field of artificial intelligence to obtain result triples. Although that method treats information extraction as a machine reading understanding task, predicting the start point and end point of each piece of key information in the text and thereby addressing the sharp performance drop of sequence labeling models on long-span knowledge text, it still suffers from difficult model training and increased time consumption.
Disclosure of Invention
The invention aims to solve the problems of difficult model training, rapid increase of time consumption and low extraction precision of the existing document information extraction method in the prior art, and provides a document information extraction method and a document information extraction system based on text classification and reading understanding, which can greatly shorten training and prediction time and improve the precision and speed of a document extraction model when extracting fields.
In order to achieve the purpose, the invention adopts the following technical scheme:
The document information extraction method based on text classification and reading understanding comprises the following steps:
S1, inputting a document, analyzing and identifying the document, and converting the document into a plain text format;
S2, preprocessing the text content in the document to obtain input data;
S3, generating corresponding word vectors, character vectors and context vectors according to the input data in step S2, and splicing the word vectors, the character vectors and the context vectors to obtain spliced vectors;
S4, if the spliced vector is of an answerable type, taking the entity text question corresponding to the spliced vector as the input of the next step;
S5, obtaining, by calculation using a reading understanding model, the position of the best-matching long label data corresponding to the entity text question;
and S6, acquiring the long label data according to its position, performing post-processing correction on the long label data, and finally outputting it as the long entity field to be extracted.
Preferably, the preprocessing in step S2 includes:
performing regularization preprocessing on the text content;
removing blank characters in the text content, wherein the blank characters comprise space characters, tab characters and line feed characters;
and segmenting the text content in the document according to a preset maximum length.
Preferably, step S3 includes the steps of:
S31, vectorizing the input data by constructing a word list and a character list, and respectively generating the word vector and the character vector corresponding to the input data;
S32, generating the context vector corresponding to the input data through a BiLSTM model;
and S33, splicing the word vector, the character vector and the context vector to obtain a spliced vector.
Preferably, step S4 further includes the steps of:
and if the spliced vector is of an unanswerable type, the spliced vector has no corresponding entity text question, and the operation is ended directly.
Preferably, step S5 includes the steps of:
S51, obtaining the probability distribution of the long label data corresponding to the entity text question at each position through the following formula:
[Formula image: not available in this text]
wherein E is a matrix representing the hidden-layer output of the reading understanding model, and [symbol image: not available in this text] represents a first learnable weight;
S52, obtaining, by the same calculation as in step S51, the following result:
[Formula image: not available in this text]
S53, obtaining the two maximum probability matrices of the long label data corresponding to the entity text question, one being the start position matrix of the most probable labels and the other being the end position matrix of the most probable labels, calculated as follows:
[Formula images: not available in this text]
S54, predicting, through a binary classification model, the maximum-probability match between the start position (start) and the end position (end) of the long label data in the matrix E, and thereby obtaining the specific position information of the long label data, with the following formula:
[Formula image: not available in this text]
wherein [symbol images: not available in this text] denote the i-th and j-th rows of the matrix E, and m represents a second learnable weight.
Preferably, step S6 includes the steps of:
and dynamically modifying the label type and the content of the long label data by using a regular expression.
Preferably, step S1 includes the steps of:
analyzing the txt and word format documents through the back end of the document review platform, and converting them into a plain text format;
converting the jpg and pdf format documents into a plain text format through OCR character recognition.
The invention also provides a document information extraction system based on text classification and reading understanding, comprising:
the text information intelligent extraction module, used for analyzing and identifying the input document and converting the document into a plain text format;
the data preprocessing module, used for preprocessing the text content in the document to obtain input data;
the feature extraction module, used for generating corresponding word vectors, character vectors and context vectors for the input data, and splicing the word vectors, the character vectors and the context vectors to obtain spliced vectors;
the text classification module, used for judging the spliced vectors and, if a spliced vector is of an answerable type, taking the entity text question corresponding to the spliced vector as the input of the next step;
the reading understanding module, used for obtaining, by calculation using a reading understanding model, the position of the best-matching long label data corresponding to the entity text question;
the long entity label data generation module, used for acquiring the long label data according to the position of the long label data;
and the data post-processing module, used for performing post-processing correction on the long label data and finally outputting the long entity field to be extracted.
Preferably, the feature extraction module comprises:
the word vector feature extraction module, used for vectorizing the input data by constructing a word list to generate the word vector corresponding to the input data;
the character vector feature extraction module, used for vectorizing the input data by constructing a character list to generate the character vector corresponding to the input data;
and the context vector feature extraction module, used for generating the context vector corresponding to the input data through a BiLSTM model.
Preferably, the intelligent extraction module for text information includes:
the back end analysis module is used for analyzing the txt and word format documents through the back end of the document review platform and converting the txt and word format documents into a plain text format;
and the OCR character recognition module is used for converting the jpg and pdf format documents into a plain text format through OCR character recognition.
Compared with the prior art, the invention has the following beneficial effects: (1) the invention combines text classification and reading understanding to solve the problem of long label extraction in document information extraction, and can effectively obtain the associated semantic information between the questions relating to long entities and the context. The target text is first input into a single text classification model, whose task is a multi-label, multi-class task. On the one hand, this screens out a smaller range of question types associated with the target text; on the other hand, it accelerates prediction, because most input data has only a few labels and questions need not be asked for all labels. Adding the classification model as a filter therefore improves the prediction efficiency of the whole framework and reduces prediction time. The question types corresponding to the classification results are then input into the reading understanding model together with the target text, so that the extracted long label field is more accurate; (2) in the data processing stage, the problem of entity overlapping is solved by merging entities during preprocessing and separating them during post-processing, and the problem of entity nesting is solved by the start and end pointers constructed in reading understanding; (3) the method effectively solves the inaccuracy of long label extraction in prior methods, improving both model prediction accuracy and prediction efficiency, with less time consumption and prediction results that are friendly to users; (4) the design considers more factors, is more comprehensive and reasonable, and is more efficient and more general.
Drawings
FIG. 1 is a frame diagram of a conventional method model for extracting common document information;
FIG. 2 is a flowchart of a document information extraction method based on text classification and reading understanding according to the present invention;
FIG. 3 is a block diagram of a matrix E according to the present invention;
FIG. 4 is a block diagram of a document information extraction system based on text classification and reading understanding according to the present invention;
fig. 5 is a flowchart of an exemplary service provided by an embodiment of the present invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention, the following description will explain the embodiments of the present invention with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
Example 1:
As shown in FIG. 2, the invention provides a document information extraction method based on text classification and reading understanding, comprising the following steps:
S1, inputting a document, analyzing and identifying the document, and converting the document into a plain text format;
S2, preprocessing the text content in the document to obtain input data;
S3, generating corresponding word vectors, character vectors and context vectors according to the input data in step S2, and splicing the word vectors, the character vectors and the context vectors to obtain spliced vectors;
S4, if the spliced vector is of an answerable type, taking the entity text question corresponding to the spliced vector as the input of the next step;
S5, obtaining, by calculation using a reading understanding model, the position of the best-matching long label data corresponding to the entity text question;
and S6, acquiring the long label data according to its position, performing post-processing correction on the long label data, and finally outputting it as the long entity field to be extracted.
In order to accelerate training and prediction, an N+1 classification model is used to determine which data is answerable (i.e. extractable); the pre-constructed question and the document are then converted into word vectors and input into the reading understanding model (such as a BERT, RoBERTa or ALBERT reading understanding model), yielding a more complete contextual output, denoted as matrix E, as shown in fig. 3.
Further, step S1 includes the following steps:
analyzing the txt and word format documents through the back end of the document review platform, and converting them into a plain text format;
converting the jpg and pdf format documents into a plain text format through OCR character recognition.
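The routing in step S1 can be sketched as a dispatch on file extension. This is a minimal illustration only: `parse_backend` and `run_ocr` are hypothetical placeholders for the back-end parser of the document review platform and for the OCR engine, neither of which is specified in the text, and the exact extension sets are assumptions.

```python
from pathlib import Path
from typing import Callable

# Assumed extension groups; the text names only txt/word and jpg/pdf.
PARSED_FORMATS = {".txt", ".doc", ".docx"}
OCR_FORMATS = {".jpg", ".jpeg", ".pdf"}


def to_plain_text(path: str,
                  parse_backend: Callable[[str], str],
                  run_ocr: Callable[[str], str]) -> str:
    """Route a document to back-end parsing or OCR based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix in PARSED_FORMATS:
        return parse_backend(path)   # hypothetical back-end parser
    if suffix in OCR_FORMATS:
        return run_ocr(path)         # hypothetical OCR engine
    raise ValueError(f"unsupported document format: {suffix!r}")
```

Passing the two converters in as callables keeps the sketch independent of any particular parser or OCR library.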
Further, the preprocessing in step S2 includes:
performing regularization preprocessing on the text content;
removing blank characters in the text content, wherein the blank characters comprise space characters, tab characters and line feed characters;
and segmenting the text content in the document according to a preset maximum length.
For example, where a mobile phone number is immediately followed by a text section number (1××8\n1.2), the text must be split at '\n'; blank characters (including space characters, tab characters and line feed characters) are then removed, and finally the document is segmented according to the preset maximum length.
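A minimal sketch of this preprocessing order, assuming that "regularization preprocessing" means regular-expression-based cleaning and that splitting happens at line feeds before blank characters are removed. The function name `preprocess` and the default length of 256 are illustrative choices, not taken from the claims.

```python
import re
from typing import List


def preprocess(text: str, max_len: int = 256) -> List[str]:
    """Split at line feeds, strip blank characters, then cut to max_len."""
    chunks: List[str] = []
    # Split first, so that a number ending one line is not fused with the
    # section number starting the next line once blanks are removed.
    for line in text.split("\n"):
        cleaned = re.sub(r"[ \t\r\n]+", "", line)  # drop spaces, tabs, line feeds
        # Cut each non-empty segment into pieces of at most max_len characters.
        chunks.extend(cleaned[i:i + max_len]
                      for i in range(0, len(cleaned), max_len))
    return chunks
```

With `max_len=4`, the input `"tel 138\t0000\n1.2 term"` yields `["tel1", "3800", "00", "1.2t", "erm"]`: the phone digits and the section number `1.2` stay in separate segments.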
Step S3 includes the following steps:
S31, vectorizing the input data by constructing a word list and a character list, and respectively generating the word vector and the character vector corresponding to the input data;
S32, generating the context vector corresponding to the input data through a BiLSTM model;
and S33, splicing the word vector, the character vector and the context vector to obtain a spliced vector.
In order to enable the model to obtain sufficient information from the data, three levels of vector features are mainly used, namely a word vector feature, a character vector feature and a context vector feature. The word vector requires the text to be segmented first with a word segmentation tool; both the word vector and the character vector are obtained by constructing a vocabulary (a word list and a character list, respectively) and then vectorizing; and the context vector is obtained through the BiLSTM model.
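The splicing in step S33 can be sketched as a per-position concatenation. The sketch below makes several assumptions that the text does not state: each character position receives the vector of the word it belongs to, the embedding tables are randomly initialised stand-ins for trained embeddings, and the context vectors are passed in as a ready-made array (for example, BiLSTM outputs computed elsewhere). The names `build_vocab` and `splice_features` are hypothetical.

```python
import numpy as np


def build_vocab(tokens):
    """Map each distinct token to an integer id, in first-seen order."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def splice_features(chars, words_per_char, context, dim=4, seed=0):
    """Concatenate word, character and context vectors position by position.

    chars:          list of characters, one per position
    words_per_char: the word each character belongs to (from a segmenter)
    context:        array of shape (len(chars), c), e.g. BiLSTM outputs
    """
    rng = np.random.default_rng(seed)
    char_vocab = build_vocab(chars)
    word_vocab = build_vocab(words_per_char)
    # Randomly initialised lookup tables stand in for trained embeddings.
    char_table = rng.normal(size=(len(char_vocab), dim))
    word_table = rng.normal(size=(len(word_vocab), dim))
    char_vecs = char_table[[char_vocab[c] for c in chars]]
    word_vecs = word_table[[word_vocab[w] for w in words_per_char]]
    # Result shape: (len(chars), dim + dim + c)
    return np.concatenate([word_vecs, char_vecs, context], axis=-1)
```

Characters that share a word share the same word-vector slice, while their character-vector slices differ.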
Further, step S4 includes the following steps:
if the spliced vector is of an unanswerable type, the spliced vector has no corresponding entity text question, and the operation is ended directly.
In step S4, the possible labels are obtained through an N+1 multi-label classification model; the questions corresponding to those labels are output and combined with the classified text as the input of the reading understanding model.
If a label is present, the reading understanding model is then applied to obtain the entity label position (start, end) corresponding to the label.
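The filtering role of the N+1 classifier can be sketched as follows. This is an illustration under assumptions: index 0 is taken to be the extra "unanswerable" class, a label fires when its probability exceeds a threshold of 0.5, and `select_questions` and the `questions` mapping are hypothetical names.

```python
from typing import Dict, List, Sequence


def select_questions(label_probs: Sequence[float],
                     questions: Dict[int, str],
                     threshold: float = 0.5) -> List[str]:
    """Return the questions for labels whose probability exceeds threshold.

    label_probs has N + 1 entries; index 0 is assumed to be the extra
    'unanswerable' class.
    """
    candidates = [i for i in range(1, len(label_probs))
                  if label_probs[i] > threshold]
    # End early if no label fires or the unanswerable class dominates.
    if not candidates or label_probs[0] >= max(label_probs[i] for i in candidates):
        return []
    return [questions[i] for i in candidates]
```

Only the returned questions are passed to the reading understanding model, so a text with two likely labels costs two model calls instead of N.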
Further, step S5 includes the following steps:
S51, obtaining the probability distribution of the long label data corresponding to the entity text question at each position through the following formula:
[Formula image: not available in this text]
wherein E is a matrix representing the hidden-layer output of the reading understanding model, and [symbol image: not available in this text] represents a first learnable weight;
S52, obtaining, by the same calculation as in step S51, the following result:
[Formula image: not available in this text]
S53, obtaining the two maximum probability matrices of the long label data corresponding to the entity text question, one being the start position matrix of the most probable labels and the other being the end position matrix of the most probable labels, calculated as follows:
[Formula images: not available in this text]
S54, predicting, through a binary classification model, the maximum-probability match between the start position (start) and the end position (end) of the long label data in the matrix E, and thereby obtaining the specific position information of the long label data, with the following formula:
[Formula image: not available in this text]
wherein [symbol images: not available in this text] denote the i-th and j-th rows of the matrix E, and m denotes the second learnable weight.
The matching probability of a start position and an end position is predicted by a binary classification model, where i and j denote the i-th and j-th rows of the matrix E (namely the row information obtained by formulas (4) and (5)). Through this series of calculations, the specific position information of the long label data is finally obtained, which helps improve the prediction precision of the model.
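The formulas of step S5 are not available in this text, so the sketch below follows a common pointer-style reading-comprehension formulation rather than the exact formulas. It assumes a two-way softmax over E·T for start and end candidates, and a sigmoid over m applied to the concatenated rows E[i] and E[j] for matching. The names `T_start`, `T_end`, `extract_spans` and the 0.5 threshold are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def extract_spans(E, T_start, T_end, m, threshold=0.5):
    """Return (start, end, score) triples for matched start/end positions.

    E:       (n, d) hidden-layer output of the reading understanding model
    T_start: (d, 2) assumed weight of the per-position start classifier
    T_end:   (d, 2) assumed weight of the per-position end classifier
    m:       (2 * d,) assumed weight of the start/end matching classifier
    """
    p_start = softmax(E @ T_start)  # (n, 2): is each position a start?
    p_end = softmax(E @ T_end)      # (n, 2): is each position an end?
    starts = np.flatnonzero(p_start.argmax(axis=1) == 1)
    ends = np.flatnonzero(p_end.argmax(axis=1) == 1)
    spans = []
    for i in starts:
        for j in ends:
            if j < i:
                continue  # an entity cannot end before it starts
            logit = m @ np.concatenate([E[i], E[j]])
            score = 1.0 / (1.0 + np.exp(-logit))  # binary match probability
            if score > threshold:
                spans.append((int(i), int(j), float(score)))
    return spans
```

Because every candidate start is scored against every later candidate end, nested spans can both be returned, which is the property the pointer construction is credited with in the text.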
Further, step S6 includes the following steps:
and dynamically modifying the label type and the content of the long label data by using a regular expression.
Step S6 reverses the data preprocessing, that is, the label is obtained by parsing the question; meanwhile, the regular expression is used for dynamically modifying the label type and the label content.
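A minimal sketch of regular-expression-based correction follows. The rule table, the label names and the choice of trimmed punctuation are all hypothetical; the text states only that regular expressions modify the label type and content.

```python
import re

# Hypothetical rules: (pattern the content must fully match, corrected label).
RELABEL_RULES = [
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "date"),
]


def postprocess(label: str, content: str):
    """Trim stray punctuation from the span and re-label it if a rule matches."""
    # Remove whitespace and separator punctuation left at the span boundaries.
    content = re.sub(r"^[\s,;:，；：、]+|[\s,;:，；：、]+$", "", content)
    for pattern, new_label in RELABEL_RULES:
        if pattern.fullmatch(content):
            label = new_label  # correct the label type
            break
    return label, content
```

For instance, a span predicted as `"，2021-10-30； "` under some other label would be trimmed to `"2021-10-30"` and re-labelled `"date"` by the rule above.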
As shown in fig. 4, the present invention also provides a document information extraction system based on text classification and reading understanding, comprising:
the text information intelligent extraction module, used for analyzing and identifying the input document and converting the document into a plain text format;
the data preprocessing module, used for preprocessing the text content in the document to obtain input data;
the feature extraction module, used for generating corresponding word vectors, character vectors and context vectors for the input data, and splicing the word vectors, the character vectors and the context vectors to obtain spliced vectors;
the text classification module, used for judging the spliced vector and, if the spliced vector is of an answerable type, taking the entity text question corresponding to the spliced vector as the input of the next step;
the reading understanding module, used for obtaining, by calculation using a reading understanding model, the position of the best-matching long label data corresponding to the entity text question;
the long entity label data generation module, used for acquiring the long label data according to the position of the long label data;
and the data post-processing module, used for performing post-processing correction on the long label data and finally outputting the long entity field to be extracted.
Further, the feature extraction module comprises:
the word vector feature extraction module, used for vectorizing the input data by constructing a word list to generate the word vector corresponding to the input data;
the character vector feature extraction module, used for vectorizing the input data by constructing a character list to generate the character vector corresponding to the input data;
and the context vector feature extraction module, used for generating the context vector corresponding to the input data through a BiLSTM model.
Further, the intelligent extraction module for text information comprises:
the back end analysis module is used for analyzing the txt and word format documents through the back end of the document review platform and converting the txt and word format documents into a plain text format;
and the OCR character recognition module is used for converting the jpg and pdf format documents into a plain text format through OCR character recognition.
In concrete implementation and operation, the flow of the technical scheme of the invention is illustrated by the flowchart of a typical service shown in fig. 5.
As shown in fig. 5, the specific implementation flow is as follows:
1. The user inputs a document in a format such as word, pdf, jpg or txt, and selects the document type.
2. Through the text information intelligent extraction module, txt and word format documents are analyzed by the back end of the document review platform and converted into a plain text format; jpg and pdf format documents are converted into a plain text format by OCR character recognition.
3. The data preprocessing module mainly handles the segmentation of numeric data. For example, in 202110301.2, the first part 20211030 is a correct label, but the trailing 1.2 interferes with the model's prediction, so the data are split by regularization preprocessing; in addition, the text needs to be segmented according to a set length, such as a sentence length of 256.
4. The feature extraction module generates the word vectors and character vectors corresponding to the input data by constructing a word list and a character list; for example, with word2vec, the input data is segmented into words, the corresponding word list is obtained from the segmented data, and the word vectors are then constructed. The context vector on which the input text depends is generated through a BiLSTM (bidirectional long short-term memory network), and the three vectors are spliced for the next step.
5. The text classification module inputs the spliced vector into an N+1 multi-label classification model (specifically, a binary classification model is adopted); if the unanswerable type is obtained, the input has no corresponding entity and processing ends directly; otherwise, the entity corresponding to the input is obtained, and the corresponding reading understanding question is spliced on as the input of the next layer.
6. The reading understanding module, specifically using the RoBERTa model, obtains the hidden-layer matrix E of the reading understanding model, and then obtains the position of the long entity by calculating the maximum probability of a matching (start, end) marker pair; for the calculation process, refer to step S5.
7. The data post-processing module corrects the labels, fixing some obvious errors so as to improve prediction accuracy; its other role is to reverse step 3 in order to handle the invalid punctuation added earlier.
8. The final output is the long entity field to be extracted.
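The numeric splitting in step 3 and its reversal in step 7 can be sketched as a round trip. The pattern below is an assumption: it treats a date as eight digits (YYYYMMDD) directly followed by a section number such as 1.2, and it uses a hypothetical separator character `SEP` that post-processing removes again.

```python
import re

SEP = "¦"  # hypothetical separator inserted during preprocessing

# Assumes an 8-digit date (YYYYMMDD) directly followed by a section number.
_DATE_THEN_SECTION = re.compile(r"(?<!\d)((?:19|20)\d{6})(?=\d+\.\d)")


def split_date_and_section(text: str) -> str:
    """Insert SEP between an 8-digit date and the section number after it."""
    return _DATE_THEN_SECTION.sub(lambda mo: mo.group(1) + SEP, text)


def undo_split(text: str) -> str:
    """Reverse step for post-processing: remove the separators added above."""
    return text.replace(SEP, "")
```

Applied to `202110301.2`, `split_date_and_section` yields `20211030¦1.2`, and `undo_split` restores the original string.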
A text classification model is added in front of the reading understanding model to classify the label types corresponding to the target document, which improves entity prediction precision and resolves the low speed caused by using the reading understanding model alone.
In the data processing stage, the problem of entity overlapping is solved by merging entities during preprocessing and separating them during post-processing; the problem of entity nesting is solved by constructing pointer-style start and end markers for entities in the reading understanding model.
By combining Chinese word vectors, character vectors and context vectors, the context information is fully represented, which facilitates real-time interaction of the text information and further improves the accuracy of information extraction by the reading understanding framework.
The method improves both the accuracy of information extraction and the long entity prediction speed, and the prediction result is friendly to users.
The foregoing describes the preferred embodiments and principles of the present invention; those skilled in the art may devise variations of the present invention that are within the spirit and scope of the appended claims.

Claims (6)

1. The document information extraction method based on text classification and reading understanding is characterized by comprising the following steps:
S1, inputting a document, analyzing and identifying the document, and converting the document into a plain text format;
S2, preprocessing the text content in the document to obtain input data;
S3, generating corresponding word vectors, character vectors and context vectors according to the input data in step S2, and splicing the word vectors, the character vectors and the context vectors to obtain spliced vectors;
S4, if the spliced vector is of an answerable type, taking the entity text question corresponding to the spliced vector as the input of the next step;
S5, obtaining, by calculation using a reading understanding model, the position of the best-matching long label data corresponding to the entity text question;
S6, obtaining the long label data according to its position, performing post-processing correction on the long label data, and finally outputting it as the long entity field to be extracted;
the preprocessing in step S2 includes:
carrying out regularization pretreatment on text content;
removing blank characters in the text content, wherein the blank characters comprise space characters, tab characters and line feed characters;
segmenting text contents in the document according to a preset maximum length;
step S3 includes the following steps:
s31, vectorizing the input data by constructing a word list and a word list, and respectively generating a word vector and a word vector corresponding to the input data;
s32, generating a context vector corresponding to the input data through a BilSTM model;
s33, splicing the word vector, the word vector and the context vector to obtain a spliced vector;
step S4 further includes the steps of:
if the spliced vector is of an unanswerable type, the spliced vector has no corresponding entity text problem, and the operation is directly finished;
step S5 includes the following steps:
S51, obtaining, through the following formula, a probability distribution P_start representing whether each position is the start of the long label data corresponding to the entity text question:
[formula image FDA0003673543720000021]
wherein E is a matrix representing the hidden layer output of the reading understanding model, and [symbol image FDA0003673543720000022] represents a first learnable weight;
S52, obtaining P_end in the same calculation manner as step S51:
[formula image FDA0003673543720000023]
S53, obtaining, through the following formulas, the maximum probability matrices [symbol image FDA0003673543720000024] and [symbol image FDA0003673543720000025] of the long label data corresponding to the entity text question, wherein [symbol image FDA0003673543720000026] is the start position matrix of the most probable labels and [symbol image FDA0003673543720000027] is the end position matrix of the most probable labels, calculated as follows:
[formula image FDA0003673543720000028]
[formula image FDA0003673543720000029]
S54, predicting, through a binary classification model, the maximum-probability match [symbol image FDA00036735437200000210] between the start position and the end position of the long label data in the matrix E, so as to obtain the specific position information of the long label data, specifically represented as follows:
[formula image FDA00036735437200000211]
wherein [symbol image FDA00036735437200000212] denote the i-th and j-th rows of the matrix E, and m denotes a second learnable weight.
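The formula images of steps S51 to S54 are not reproduced in this text. One plausible reading, assuming the claim follows the span-prediction formulation of the cited non-patent reference "A Unified MRC Framework for Named Entity Recognition" (Xiaoya Li et al.), is sketched below. The symbols T_start and T_end for the first learnable weights, the index sets written with a hat, and the dimension n (sequence length) are borrowed from that paper and are assumptions, not the claim's own wording.

```latex
\begin{align}
P_{\text{start}} &= \operatorname{softmax}_{\text{each row}}\!\left(E \cdot T_{\text{start}}\right) \in \mathbb{R}^{n \times 2} \\
P_{\text{end}} &= \operatorname{softmax}_{\text{each row}}\!\left(E \cdot T_{\text{end}}\right) \in \mathbb{R}^{n \times 2} \\
\hat{I}_{\text{start}} &= \{\, i \mid \arg\max\!\left(P_{\text{start}}^{(i)}\right) = 1,\ i = 1, \dots, n \,\} \\
\hat{I}_{\text{end}} &= \{\, j \mid \arg\max\!\left(P_{\text{end}}^{(j)}\right) = 1,\ j = 1, \dots, n \,\} \\
P_{i_{\text{start}},\, j_{\text{end}}} &= \operatorname{sigmoid}\!\left(m \cdot \operatorname{concat}\!\left(E_{i_{\text{start}}},\, E_{j_{\text{end}}}\right)\right)
\end{align}
```

Under this reading, the first two lines correspond to steps S51 and S52, the two index sets to step S53, and the last line to the binary start/end matching of step S54, with E_i and E_j the i-th and j-th rows of E and m the second learnable weight.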
2. The document information extraction method based on text classification and reading understanding according to claim 1, wherein step S6 includes the following step:
dynamically modifying the label type and the content of the long label data by using regular expressions.
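A hedged sketch of what regular-expression correction of a label and its content could look like. The two rules below (stripping trailing punctuation, and relabelling date-like content as `date`) are invented for illustration; the claim does not specify which rules are used.

```python
import re
from typing import Tuple

# Hypothetical rules, not taken from the patent.
TRAILING_PUNCT = re.compile(r"[\s,;:.，；：。]+$")
DATE_LIKE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")


def correct_label(label: str, content: str) -> Tuple[str, str]:
    """Clean the extracted content and adjust the label if the content says so."""
    # Content fix: drop whitespace and punctuation left at the end of the span.
    content = TRAILING_PUNCT.sub("", content)
    # Label fix: content that is entirely a date is relabelled as a date.
    if DATE_LIKE.fullmatch(content):
        label = "date"
    return label, content
```

In practice the rule set would be specific to the document type and label inventory being extracted.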
3. The document information extraction method based on text classification and reading understanding according to any one of claims 1-2, wherein step S1 includes the following steps:
parsing documents in txt and Word formats through the back end of the document review platform, and converting them into a plain text format;
and converting documents in jpg and pdf formats into a plain text format through OCR character recognition.
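The format-based routing in this claim can be sketched as a simple dispatch on the file extension. The names `to_plain_text`, `parse_native` and `run_ocr` are hypothetical, the two callables stand in for the back-end parser and the OCR engine, and the exact extension sets are an assumption (the claim names only txt, Word, jpg and pdf).

```python
from pathlib import Path
from typing import Callable

NATIVE_SUFFIXES = {".txt", ".doc", ".docx"}  # assumed extensions for txt/Word
OCR_SUFFIXES = {".jpg", ".jpeg", ".pdf"}     # assumed extensions for jpg/pdf


def to_plain_text(
    path: str,
    parse_native: Callable[[str], str],
    run_ocr: Callable[[str], str],
) -> str:
    """Route a document to direct parsing or OCR based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix in NATIVE_SUFFIXES:
        return parse_native(path)
    if suffix in OCR_SUFFIXES:
        return run_ocr(path)
    raise ValueError(f"unsupported document format: {suffix!r}")
```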
4. A document information extraction system based on text classification and reading understanding, which applies the document information extraction method based on text classification and reading understanding of any one of claims 1-2, wherein the document information extraction system based on text classification and reading understanding comprises:
a text information intelligent extraction module, used for analyzing and identifying the input document and converting the document into a plain text format;
a data preprocessing module, used for preprocessing the text content in the document to obtain input data;
a feature extraction module, used for generating corresponding character vectors, word vectors and context vectors for the input data, and splicing the character vectors, the word vectors and the context vectors to obtain spliced vectors;
a text classification module, used for judging the spliced vectors and, if a spliced vector is of an answerable type, taking the entity text question corresponding to the spliced vector as the input of the next step;
a reading understanding module, used for obtaining, by calculation using a reading understanding model, the position of the best-matching long label data corresponding to the entity text question;
a long entity label data generation module, used for acquiring the long label data according to the position of the long label data;
and a data post-processing module, used for carrying out post-processing correction on the long label data and finally outputting the long entity field to be extracted.
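As an illustration of the data preprocessing module, the sketch below covers the two concrete operations recited for step S2 in claim 1: removing spaces, tabs and line feeds, and segmenting the text by a preset maximum length. The function name `preprocess` and the choice of non-overlapping segments are assumptions; the regularization step is left out because the text does not specify it.

```python
import re
from typing import List

# Whitespace as enumerated in claim 1: space, tab and line feed.
WHITESPACE = re.compile(r"[ \t\n]+")


def preprocess(text: str, max_len: int) -> List[str]:
    """Strip whitespace, then cut the text into segments of at most max_len characters."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    cleaned = WHITESPACE.sub("", text)
    # Non-overlapping fixed-length segments (an assumption of this sketch).
    return [cleaned[i:i + max_len] for i in range(0, len(cleaned), max_len)]
```

Fixed-length cuts can split an entity across two segments; whether and how the patented method handles that case is not stated in this section.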
5. The system of claim 4, wherein the feature extraction module comprises:
a character vector feature extraction module, used for vectorizing the input data by constructing a character vocabulary to generate a character vector corresponding to the input data;
a word vector feature extraction module, used for vectorizing the input data by constructing a word vocabulary to generate a word vector corresponding to the input data;
and a context vector feature extraction module, used for generating a context vector corresponding to the input data through a BiLSTM model.
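Vocabulary construction of the kind described in this claim can be sketched as below. `build_vocab`, `to_ids` and the `<unk>` placeholder are hypothetical names, and the sketch assumes the input has already been split into characters or words; it produces integer ids, which an embedding table would then map to vectors.

```python
from typing import Dict, Iterable, List


def build_vocab(items: Iterable[str], unk: str = "<unk>") -> Dict[str, int]:
    """Assign an integer id to each distinct item, reserving id 0 for unknowns."""
    vocab = {unk: 0}
    for item in items:
        if item not in vocab:
            vocab[item] = len(vocab)
    return vocab


def to_ids(items: Iterable[str], vocab: Dict[str, int], unk: str = "<unk>") -> List[int]:
    """Map items to ids, falling back to the unknown id for unseen items."""
    return [vocab.get(item, vocab[unk]) for item in items]
```

The same two functions serve for both vocabularies: pass individual characters to build the character vocabulary and segmented words to build the word vocabulary.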
6. The document information extraction system based on text classification and reading understanding according to claim 5, wherein the text information intelligent extraction module comprises:
a back-end parsing module, used for parsing documents in txt and Word formats through the back end of the document review platform and converting them into a plain text format;
and an OCR character recognition module, used for converting documents in jpg and pdf formats into a plain text format through OCR character recognition.
CN202210221913.0A 2022-03-09 2022-03-09 Document information extraction method and system based on text classification and reading understanding Active CN114297987B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210221913.0A CN114297987B (en) 2022-03-09 2022-03-09 Document information extraction method and system based on text classification and reading understanding

Publications (2)

Publication Number Publication Date
CN114297987A CN114297987A (en) 2022-04-08
CN114297987B true CN114297987B (en) 2022-07-19

Family

ID=80978618

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210221913.0A Active CN114297987B (en) 2022-03-09 2022-03-09 Document information extraction method and system based on text classification and reading understanding

Country Status (1)

Country Link
CN (1) CN114297987B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115995087B (en) * 2023-03-23 2023-06-20 杭州实在智能科技有限公司 Document catalog intelligent generation method and system based on fusion visual information
CN117076596B (en) * 2023-10-16 2023-12-26 微网优联科技(成都)有限公司 Data storage method, device and server applying artificial intelligence

Citations (5)

Publication number Priority date Publication date Assignee Title
CN112270604A (en) * 2020-10-14 2021-01-26 招商银行股份有限公司 Information structuring processing method and device and computer readable storage medium
WO2021027533A1 (en) * 2019-08-13 2021-02-18 平安国际智慧城市科技股份有限公司 Text semantic recognition method and apparatus, computer device, and storage medium
CN112989831A (en) * 2021-03-29 2021-06-18 华南理工大学 Entity extraction method applied to network security field
CN113158674A (en) * 2021-04-01 2021-07-23 华南理工大学 Method for extracting key information of document in field of artificial intelligence
CN113822026A (en) * 2021-09-10 2021-12-21 神思电子技术股份有限公司 Multi-label entity labeling method

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN110929515B (en) * 2019-11-21 2023-04-18 中国民航大学 Reading understanding method and system based on cooperative attention and adaptive adjustment
CN112101027A (en) * 2020-07-24 2020-12-18 昆明理工大学 Chinese named entity recognition method based on reading understanding
US11734510B2 (en) * 2020-08-27 2023-08-22 Bayerische Motoren Werke Aktiengesellschaft Natural language processing of encoded question tokens and encoded table schema based on similarity
CN113033203A (en) * 2021-02-05 2021-06-25 浙江大学 Structured information extraction method oriented to medical instruction book text
CN113220768A (en) * 2021-06-04 2021-08-06 杭州投知信息技术有限公司 Resume information structuring method and system based on deep learning
CN113609859A (en) * 2021-08-04 2021-11-05 浙江工业大学 Special equipment Chinese named entity recognition method based on pre-training model

Non-Patent Citations (3)

Title
A Unified MRC Framework for Named Entity Recognition; Xiaoya Li et al.; arXiv:1910.11476v6; 2020-05-23; pp. 1-11 *
Based BERT-BiLSTM-ATT Model of Commodity Commentary on The Emotional Tendency Analysis; Hao Ge et al.; 2021 IEEE 4th International Conference on Big Data and Artificial Intelligence (BDAI); 2021-11-13; pp. 130-133 *
BiLSTM-CRF military entity recognition model incorporating ontology features; Qi Yudong et al.; Journal of Ordnance Equipment Engineering; 2020-05-25 (No. 05); pp. 124-129 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant