CN116127986A

CN116127986A - Method for extracting key information of bidding documents based on pre-training model and BiLatticeLSTM

Info

Publication number: CN116127986A
Application number: CN202310165102.8A
Authority: CN
Inventors: 涂著刚; 汤双明; 周鸿章
Original assignee: Guiyang Gaoxin Ston Information Co ltd
Current assignee: Guiyang Gaoxin Ston Information Co ltd
Priority date: 2023-02-24
Filing date: 2023-02-24
Publication date: 2023-05-16

Abstract

The invention relates to the technical field of information extraction, and in particular to a method for extracting key information of bidding documents based on a pre-training model and BiLatticeLSTM. The method comprises the following steps: S100: acquiring a plurality of bidding documents and preprocessing them to generate a data set; S200: inputting the data set into a Bert model for pre-training, and learning the semantic information of the bidding documents to obtain a BidBiert pre-training model; S300: labeling the key information in the data set and inputting the labeled data into the BidBiert model to obtain a character vector for each character in the bidding documents and a word vector for each word related to the key information; S400: extracting, from the character vectors and the word vectors, the feature vectors required to identify key information in the bidding documents, and decoding the feature vectors through a conditional random field to obtain an optimal parameter model; S500: performing iterative training to obtain a final model for extracting the key information of bidding documents. The method can improve the accuracy and efficiency of extracting key information from bidding documents.

Description

Method for extracting key information of bidding documents based on pre-training model and BiLatticeLSTM

Technical Field

The invention relates to the technical field of information extraction, and in particular to a method for extracting key information of bidding documents based on a pre-training model and BiLatticeLSTM.

Background

A bidding document is a document compiled by the tendering unit, or by a design unit it entrusts, that sets out for bidders the main technical, quality and construction-period requirements of a project. Bidding documents contain more than 30 items of key information that are of great interest, such as the project name, the bidding unit, the winning bid amount and the bidding deadline. At present, key information in bidding documents is found mainly by manual copy-and-paste and by rule-based extraction. However, tenders for engineering projects or goods purchases are usually published on many websites, with no fixed template, unstructured data and varied document forms (Word, PDF, HTML, scanned pictures and the like). The manual approach is time-consuming and labor-intensive and can only be carried out by experienced staff. Rule-based extraction requires dedicated personnel to configure a large number of rules, and the boundaries of the extracted results are fuzzy, so the extraction quality is unsatisfactory and the approach adapts poorly to different documents. Moreover, the prior art cannot learn semantic information from a large number of bidding documents, so key information that is semantically ambiguous is difficult to extract correctly.

Disclosure of Invention

The technical problem solved by the invention is to provide a method for extracting the key information of the bidding document based on a pre-training model and BiLatticeLSTM, which can improve the accuracy and efficiency of extracting the key information of the bidding document.

The basic scheme provided by the invention is as follows: a method for extracting key information of a bidding document based on a pre-training model and BiLatticeLSTM comprises the following steps:

S100: acquiring a plurality of bidding documents, preprocessing them, extracting text information and generating a data set;

S200: inputting the data set into a Bert model for pre-training, and learning the semantic information of the bidding documents to obtain a BidBiert pre-training model;

S300: labeling the key information in the data set and inputting the labeled data into the BidBiert model to obtain a character vector for each character in the bidding documents and a word vector for each word related to the key information;

S400: extracting, from the character vectors and the word vectors, the feature vectors required to identify key information in the bidding documents, and decoding the feature vectors through a conditional random field to obtain an optimal parameter model;

S500: performing iterative training to obtain a final model for extracting the key information of bidding documents.

The principle of the invention is as follows. First, a large volume of bidding documents is collected as a data set and input into a Bert model for pre-training. By learning the semantic information in these documents, a pre-training model specific to the bidding domain, the BidBiert model, is obtained, and this model yields more accurate vectors for input data. Next, the key information in the bidding documents of the data set is labeled, the labeled data are input into the BidBiert model, and character vectors and word vectors are extracted and trained. The feature vectors required to identify key information are then extracted from the character vectors and word vectors and decoded to obtain an optimal model. After repeated iterative training, the final bidding information extraction model is obtained, and key information can be extracted directly by inputting a bidding document into the model.

Compared with the prior art, the following advantages exist:

Compared with the traditional manual approach, key information only needs to be labeled during model training. Afterwards, a target document is input directly into the model and its key information is obtained directly, which reduces labor, material and time costs.

Compared with extraction based on rules and word libraries, the method identifies key information accurately by learning the semantic information in bidding documents, achieves higher coverage and accuracy, is applicable to bidding documents in various formats, and requires no maintenance of word libraries or identification rules.

Further, the step S100 includes the steps of:

S110: acquiring bidding documents published on the network through a crawler;

S120: extracting the text information in the bidding documents;

S130: splitting long sentences in the text information to a preset sentence length.

A large volume of bidding documents published on the network is acquired through a crawler, the text information in the documents is extracted as training samples, and long sentences are split to a preset sentence length so that the semantic information of each sentence can be learned in subsequent steps. Splitting long sentences into short ones reduces the amount of computation during recognition and avoids ambiguity in semantic recognition.
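Step S130 can be illustrated with a minimal sketch. The patent does not say how the cut points are chosen or what the preset length is, so the function name `split_to_max_length`, the punctuation set and the default `max_len=128` are all assumptions made for illustration only.

```python
import re


def split_to_max_length(text: str, max_len: int = 128) -> list[str]:
    """Split text at sentence punctuation, then hard-wrap pieces longer than max_len.

    max_len and the punctuation set are illustrative assumptions,
    not values taken from the patent.
    """
    # Split right after Chinese or Western sentence terminators, keeping them.
    sentences = [s for s in re.split(r"(?<=[。！？；.!?;])", text) if s.strip()]
    chunks = []
    for sentence in sentences:
        sentence = sentence.strip()
        # Any sentence still longer than max_len is cut into fixed-size pieces.
        for start in range(0, len(sentence), max_len):
            chunks.append(sentence[start:start + max_len])
    return chunks
```

Splitting at punctuation first keeps most chunks aligned with natural sentence boundaries; the fixed-size cut is only a fallback for sentences that exceed the limit.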

Further, the step S200 includes the steps of:

S210: segmenting the input sentence into words and then randomly masking several of the words;

S220: obtaining the word vector of each word through the Embedding layer;

S230: predicting the masked words from the word vectors through the Encoder;

S240: repeating S210-S230, and obtaining the BidBiert model through iterative learning.

After a sentence is input into the Bert model, it is segmented and several of the resulting words are masked; the word vector of each word is obtained through the Embedding layer, and the masked words are predicted through the Encoder. The Bert model comprises two unsupervised prediction tasks, Masked LM and Next Sentence Prediction. Here, a large volume of bidding documents is acquired and the Masked LM task of the Bert model is used. The working logic of Masked LM is as follows: given a sentence, one or more words in it are randomly erased, and the erased word at each position is predicted from the remaining words. The pre-training model is refined through iterative learning so that it learns the semantic information of the bidding domain.
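The random masking of step S210 can be sketched as follows. This is a simplified illustration, not the patent's implementation: the helper name `mask_tokens` and the 15% masking rate are assumptions, and the sketch always substitutes `[MASK]`, whereas the original BERT recipe sometimes keeps or randomly replaces the selected token.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace some tokens with [MASK] and record the originals.

    Returns (masked_tokens, labels), where labels maps each masked position
    to the token the model would be trained to predict there.
    mask_prob is an assumed value; the patent does not specify one.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token, and never more than the sentence contains.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_prob)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = MASK
    return masked, labels
```

During pre-training, the masked sequence would be fed through the Embedding and Encoder layers, and the prediction at each position in `labels` would be compared with the stored original token.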

Further, the step S300 includes the steps of:

S310: manually labeling the key information in the data set;

S320: inputting the labeled data set into the BidBiert model to obtain a character vector for each character in the data set;

S330: segmenting the data set into words according to a preset self-built word library for the bidding domain and a word segmentation tool, and performing word vector training on the segmentation result to obtain a word vector for each word.

After the key information in the bidding documents has been labeled manually, the labeled documents are input into the BidBiert model to obtain a character vector for each character in the data set. Word segmentation is performed with the self-built bidding-domain word library combined with a word segmentation tool, and word vector training is carried out on the segmentation result to obtain a word vector for each word.
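To make the role of the self-built word library concrete, the sketch below finds every library word inside a character sequence and reports its start and end positions. It uses plain dictionary lookup rather than the word segmentation tool mentioned in the text, which is a deliberate simplification; the function name `match_lexicon_words`, the `max_word_len` limit and the example sentence are hypothetical.

```python
def match_lexicon_words(chars, lexicon, max_word_len=6):
    """Return (start, end, word) for each lexicon word found in chars.

    Positions are 1-based and inclusive. Only words of two or more
    characters are considered; max_word_len is an assumed search limit.
    """
    spans = []
    for start in range(len(chars)):
        for length in range(2, max_word_len + 1):
            end = start + length
            if end > len(chars):
                break
            word = "".join(chars[start:end])
            if word in lexicon:
                spans.append((start + 1, end, word))
    return spans


# Hypothetical example: two bidding-domain words in a short sentence.
chars = list("贵阳市政工程中标")
lexicon = {"市政工程", "中标"}
spans = match_lexicon_words(chars, lexicon)
```

Each returned span identifies the characters a word covers, which is the information a lattice-structured model needs in order to attach a word vector to the right stretch of the character sequence.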

Further, the step S400 includes the steps of:

S410: inputting the character vectors and word vectors obtained in S300 into a BiLatticeLSTM model, and extracting the feature vectors required to identify the key information of the bidding project data;

S420: inputting the feature vectors into a CRF model, computing the optimal label sequence, and fitting it to the manually labeled sequence to obtain the optimal model parameters.

The character vectors and word vectors are input into the BiLatticeLSTM model to extract feature vectors. The BiLatticeLSTM model is an LSTM model with a bidirectional Lattice structure: it extracts features of the text sequence in both the forward and backward directions and fuses the bidding-domain word vectors in both directions, so that entity boundary information becomes clearer and the entity ambiguity problem is addressed. After the feature vectors of the text sequence have been extracted, they are input into a CRF model for decoding. A CRF (conditional random field) is a model of the conditional probability distribution of one set of output random variables given another set of input random variables; decoding with it yields the optimal model parameters.
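Computing the optimal label sequence in a linear-chain CRF is commonly done with the Viterbi algorithm. The patent does not name the decoding algorithm, so the following is a minimal sketch under that assumption; `emissions` stands for per-position tag scores derived from the feature vectors and `transitions` for learned tag-to-tag scores, both hypothetical names.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path and its score.

    emissions[t][k]: score of tag k at position t.
    transitions[i][j]: score of moving from tag i to tag j.
    """
    n_tags = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at this position.
            best_prev = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_prev] + transitions[best_prev][j] + emit[j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final tag to recover the full path.
    best_last = max(range(n_tags), key=lambda i: score[i])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[best_last]
```

The transition scores are what let the CRF penalize invalid tag sequences, such as an inside tag that does not follow a begin tag, instead of choosing each tag independently.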

Further, the step S500 includes the following step: S510: repeating S200-S400, and performing iterative training to obtain a BidBiert+BiLatticeLSTM+CRF model.

Drawings

FIG. 1 is a schematic flow chart of BidBiert training in an embodiment of the method for extracting key information of bidding documents based on a pre-training model and BiLatticeLSTM;

FIG. 2 is a schematic diagram of the training process of the BidBiert+BiLatticeLSTM+CRF model in an embodiment of the method for extracting key information of bidding documents based on a pre-training model and BiLatticeLSTM;

FIG. 3 is a schematic diagram of the BidBiert model in an embodiment of the method for extracting key information of bidding documents based on a pre-training model and BiLatticeLSTM;

FIG. 4 is a schematic diagram of the framework of the BidBiert+BiLatticeLSTM+CRF model in an embodiment of the method for extracting key information of bidding documents based on a pre-training model and BiLatticeLSTM.

Detailed Description

The following is a further detailed description of the embodiments:

An embodiment is substantially as shown in FIG. 1 and FIG. 2:

A method for extracting key information of bidding documents based on a pre-training model and BiLatticeLSTM comprises the following steps:

S100: acquiring a plurality of bidding documents and preprocessing them to generate a data set. S100 specifically comprises the following steps:

S110: acquiring bidding documents published on the network through a crawler;

S120: extracting the text information in the bidding documents;

S130: splitting long sentences in the text information to a preset sentence length.

Specifically, in this embodiment, a plurality of published bidding documents are obtained from the network with a Python crawler tool, and the text information in them is extracted. The bidding documents come in various formats, including Word, PDF, HTML and scanned pictures. For Word and PDF bidding documents, data enhancement processing is applied directly and the text information is then extracted; for HTML bidding documents, the HTML tags are removed first and data enhancement processing is then performed; for scanned documents, the text in the pictures is extracted with existing image text recognition technology. Long sentences in the text information are then cut into short sentences according to the preset sentence length.
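The tag-removal step for HTML bidding documents can be sketched with Python's standard `html.parser` module. The embodiment only states that HTML tags are removed, so the class and function names here, and the choice to discard `script` and `style` content, are illustrative assumptions.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text and ignore the contents of script/style elements."""

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return "\n".join(self._parts)


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The resulting plain text would then go through the same sentence-length splitting as text extracted from the other formats.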

S200: inputting the data set into a Bert model for pre-training, and learning the semantic information in the bidding documents to obtain a BidBiert pre-training model. S200 specifically comprises the following steps:

S230: predicting the masked words from the word vectors through the Encoder;

The Bert model comprises two unsupervised prediction tasks, Masked LM and Next Sentence Prediction. A large volume of bidding documents is acquired and the Masked LM task of the Bert model is used. The working logic of Masked LM is as follows: given a sentence, one or more words in it are randomly erased, and the erased words are predicted from the remaining words. Specifically, as shown in FIG. 3, the sentence "a tornado project steel bid announcement" is input into the Bert model and segmented into words 1 to n, some of which are randomly replaced with MASK. The Embedding layer then produces the word vectors E1 to En of the n words, and finally several Encoder layers predict the words that were masked. This process is repeated iteratively to obtain the BidBiert model. Through iterative training, the pre-training of the BidBiert model in the bidding domain is completed and the semantic information of the bidding domain is learned.
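The Masked LM task is usually trained by minimizing the negative log-likelihood of the erased words. The patent does not state its loss function, so the following is the standard formulation given as an assumption, with $M$ the set of masked positions, $x_i$ the original word at position $i$, $\tilde{x}$ the sentence after masking, and $\theta$ the model parameters:

```latex
\mathcal{L}_{\mathrm{MLM}}(\theta) = -\sum_{i \in M} \log p_{\theta}\left(x_i \mid \tilde{x}\right)
```

Each round of iterative learning lowers this loss on bidding-domain sentences, which is how the model absorbs domain-specific semantics.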

S300: labeling the key information in the data set and inputting the labeled data into the BidBiert model to obtain a character vector for each character in the bidding documents and a word vector for each word related to the key information. S300 specifically comprises the following steps:

S310: manually labeling the key information in the data set;

Specifically, the key information in the data set is first labeled manually; it includes the project name, the bidding unit, the bid amount, the bidding deadline and the like. The labeled data are input into the BidBiert model, which outputs the character vector (Char Embedding) of each character. The data set is also segmented into words using a pre-configured self-built bidding-domain word library combined with a word segmentation tool, and word vector training is performed on the segmentation result to obtain the word vector (Word Embedding) of each word.

The step S400 includes the following steps:

The character vectors (Char Embedding) and word vectors (Word Embedding) are input into the BiLatticeLSTM model to extract the feature vectors (Feature Vector) required to identify the key information of a bidding document. The BiLatticeLSTM model is an LSTM model with a bidirectional Lattice structure: it extracts features of the text sequence in both the forward and backward directions and fuses the bidding-domain word vectors in both directions, so that entity boundary information becomes clearer and the entity ambiguity problem is addressed.

S510: repeating S200-S400, and performing iterative training to obtain a BidBiert+BiLatticeLSTM+CRF model.

The BidBiert+BiLatticeLSTM+CRF model is shown in FIG. 4. The input sentence, "bid-winning results in a bid section of Guiyang municipal engineering service", has 15 characters in the original Chinese. The 15 characters are encoded by the BidBiert model to obtain the character vector of each character, CE_1 to CE_15. The two words "municipal engineering" and "winning bid" are in the self-built bidding-domain word library, and their word vectors are WE_{3,6} and WE_{12,13} respectively. The character vectors CE_1 to CE_15 and the word vectors WE_{3,6} and WE_{12,13} are passed through the BiLatticeLSTM to extract key information features, giving the feature vectors FV_1 to FV_15. The CRF computes a label for each character from the feature vectors. Characters belonging to key information receive the corresponding labels; for example, the project name in FIG. 4 is labeled with B-PN, I-PN and E-PN, while characters that are not key information are labeled O. In this way, the key information in the bidding document is extracted.
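Once every character carries a label, the key information can be read off by grouping runs that open with a B- label and close with the matching E- label. The sketch below assumes exactly the B/I/E/O scheme named above; the function name `extract_entities` and the example sentence are hypothetical, and single-character entities are not handled because the text defines no label for them.

```python
def extract_entities(chars, tags):
    """Group B-/I-/E- tagged characters into (label, text) pairs.

    A span is emitted only when a B-X tag is closed by a matching E-X tag;
    malformed sequences are dropped rather than guessed at.
    """
    entities = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "B":
            start, label = i, tag_label
        elif prefix == "I" and start is not None and tag_label == label:
            continue
        elif prefix == "E" and start is not None and tag_label == label:
            entities.append((label, "".join(chars[start:i + 1])))
            start, label = None, None
        else:
            # An O tag or an inconsistent tag resets any open span.
            start, label = None, None
    return entities


# Hypothetical example: a six-character project name (PN) followed by two O tags.
chars = list("贵阳市政工程中标")
tags = ["B-PN", "I-PN", "I-PN", "I-PN", "I-PN", "E-PN", "O", "O"]
entities = extract_entities(chars, tags)
```

Requiring a matching closing label makes the extracted boundaries follow the model's own begin and end decisions, so an incomplete span is never reported as key information.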

The foregoing is merely an embodiment of the present invention. Specific structures and features that are common knowledge in the art are not described in detail here. A person skilled in the art knows the common technical knowledge in the field of the invention before the application date or the priority date and has access to all the prior art in that field. Given the teaching of this application, such a person can perfect and implement the present scheme in combination with their own abilities, and some typical known structures or known methods should not become obstacles to implementing this application. It should be noted that a person skilled in the art may make modifications and improvements without departing from the structure of the present invention; these should also be regarded as falling within the protection scope of the present invention and do not affect the effect of implementing the invention or the utility of the patent. The protection scope of the present application shall be subject to the content of the claims, and the specific embodiments and other descriptions in the specification may be used to interpret the content of the claims.

Claims

1. A method for extracting key information of bidding documents based on a pre-training model and BiLatticeLSTM, characterized by comprising the following steps:

S100: acquiring a plurality of bidding documents and preprocessing them to generate a data set;

2. The method for extracting key information of bidding documents based on a pre-training model and BiLatticeLSTM according to claim 1, wherein the step S100 comprises the following steps:

S110: acquiring bidding documents published on the network through a crawler;

S120: extracting the text information in the bidding documents;

3. The method for extracting key information of bidding documents based on a pre-training model and BiLatticeLSTM according to claim 2, wherein the step S200 comprises the following steps:

S230: predicting the masked words from the word vectors through the Encoder;

4. The method for extracting key information of bidding documents based on a pre-training model and BiLatticeLSTM according to claim 3, wherein the step S300 comprises the following steps:

S310: manually labeling the key information in the data set;

5. The method for extracting key information of bidding documents based on a pre-training model and BiLatticeLSTM according to claim 3, wherein the step S400 comprises the following steps:

6. The method for extracting key information of bidding documents based on a pre-training model and BiLatticeLSTM according to claim 5, wherein the step S500 comprises the following steps: