CN116304023A

CN116304023A - Method, system and storage medium for extracting bidding elements based on NLP technology

Info

Publication number: CN116304023A
Application number: CN202310088650.5A
Authority: CN
Inventors: 李正; 张晴晴; 徐立群; 郭海涛
Original assignee: Anhui Zhiyuxin Information Technology Co ltd
Current assignee: Anhui Zhiyuxin Information Technology Co ltd
Priority date: 2023-02-09
Filing date: 2023-02-09
Publication date: 2023-06-23

Abstract

The invention provides a bidding element extraction method based on NLP information extraction technology, comprising the following steps: S1, acquiring original bidding files; S2, obtaining a pre-training model A; S3, acquiring labeled training samples; S4, performing data augmentation on the labeled samples; S5, training a sentence-level potential element type recognition model B; S6, training an element and relation extraction model C; S7, passing the data through a standardization module and outputting the result. Also provided are a bidding element extraction system based on NLP information extraction technology, comprising a processor and a memory, and a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above method and electronic system for extracting bidding elements based on NLP technology. The advantages of the invention are that development efficiency is greatly improved, development cost is reduced, the 512-token text length limit of conventional models is overcome, nested elements can be handled efficiently, and the recall of element information extraction is high.

Description

Method, system and storage medium for extracting bidding elements based on NLP technology

Technical Field

The invention relates to the technical field of bidding, and in particular to a bidding element extraction method, system and storage medium based on NLP technology.

Background

Bidding documents are the information published by a tendering party for a given procurement requirement: the tender announcement and its contents, and the information published during subsequent bid evaluation, bid award and similar stages. The structure and writing format may differ slightly between regions and between tendering procedures. Typical documents include tender announcements, bid evaluation announcements, bid award announcements, and change or clarification announcements (hereinafter collectively referred to as bidding documents). Because these documents record important information about the bidding process and its results, that information is valuable for analysis and monitoring; examples are the names of the goods (items) tendered, budget amounts, and the winning bid amounts across a large number of bid award documents. The winning bidder and the project location can be used for profiling winning bidders, analysing enterprise operations and credit, and so on.

The current mainstream approach uses BERT+BiLSTM+CRF to identify elements and then a classification model to determine the relations between them. In practice, however, extracting key fields and relations from bidding announcements faces the following difficulties:

1) Training the model requires a large amount of high-quality manually annotated data, and obtaining such data demands considerable manpower, material and financial resources;

2) Current methods mainly extract the relation between two entities within a single sentence, a task known as sentence-level relation extraction. In bidding documents, however, many entity relations are expressed jointly across multiple sentences;

3) The maximum input length supported by BERT is 512 tokens, whereas the actual text of a bidding document is far longer than this limit. In addition, publicly available pre-trained models are trained on general-purpose corpora rather than on bidding-domain text. It is therefore desirable to provide an improved bidding element extraction scheme and to obtain a pre-trained model for the bidding domain.

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention provides a method, system and storage medium for extracting bidding elements based on NLP technology, which greatly improve development efficiency, reduce development cost, overcome the 512-token text length limit of conventional models, handle nested elements efficiently, and achieve high recall in element information extraction.

To solve the above technical problems, the invention provides the following technical solution: a bidding element extraction method based on NLP information extraction technology, comprising the following steps:

S1, acquiring original bidding files: bidding document information is obtained from the Internet;

S2, obtaining a pre-training model A: converting the original files into plain text documents, removing special character strings with regular expressions, segmenting the plain text into sets of sentences according to rules, and joining the sentences with the line feed character \n into a new text document to obtain a pre-training corpus; after word segmentation of the corpus, BERT is trained to obtain a bidding-domain pre-training model A based on the Transformer network structure;

S3, acquiring labeled training samples: part of the data is matched with regular expressions, further data is matched by a general language model using a large number of varied templates, and the final complete samples are obtained after manual checking;

S4, performing data augmentation on the labeled samples: word segmentation, clustering and screening are applied to a large corpus of bidding documents to obtain a corpus of element key fields, and more training samples are generated using data augmentation techniques;

S5, training a sentence-level potential element type recognition model B: the element types are summarized and classified, with M elements grouped into N types, and an element dictionary table mapping element labels to element types is constructed; a sentence-level NER recognition model is trained to recognize the element types a sentence may contain; the type of the sentence containing each piece of annotated data is obtained from its element labels via the element dictionary table; the CLS-layer features from the pre-training model are used as the sentence representation, each sentence representation is treated as one token, a token-pair matrix is constructed using the multi-head idea, and the element types contained in each sentence are obtained with the GlobalPointer method; finally, sentences carrying the same element type mark are spliced and combined according to an element type combination rule strategy to obtain target paragraph texts whose element information types are known;

S6, training an element and relation extraction model C: all element information of each type, and the relations among the elements, are extracted from the paragraph sets of the different types identified in S5;

S7, passing the data through a standardization module and outputting the result: each element and each group of relation pairs has a standard model for its normalization; after a newly acquired original bidding file is cleaned, its element information can be obtained using the models A, B and C obtained in the above steps; the element information, such as dates, amounts, addresses, telephone numbers and email addresses, is then normalized according to its standardized output format; because the preliminary extraction results come in many formats, they are passed through a format standardization module so that the final output is in a unified format.
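The element dictionary table and the sentence grouping described in step S5 can be illustrated with a small sketch. The label names, the type names and the rule of joining same-type sentences in document order are assumptions made for illustration; the source does not specify the element inventory or the combination rule strategy.

```python
from collections import defaultdict

# Hypothetical element dictionary table: M = 6 element labels grouped into N = 3 types.
ELEMENT_TYPES = {
    "winning_unit_name": "winning_info",
    "winning_amount": "winning_info",
    "contact": "contact_info",
    "phone": "contact_info",
    "budget_amount": "project_info",
    "project_name": "project_info",
}

def sentence_types(element_labels):
    """Derive a sentence's type labels from the element labels annotated in it."""
    return sorted({ELEMENT_TYPES[label] for label in element_labels})

def build_paragraphs(sentences, types_per_sentence):
    """Join sentences that share an element type into one paragraph per type.

    Assumed combination rule: plain concatenation in document order.
    """
    grouped = defaultdict(list)
    for sentence, types in zip(sentences, types_per_sentence):
        for element_type in types:
            grouped[element_type].append(sentence)
    return {t: "\n".join(parts) for t, parts in grouped.items()}

sentences = [
    "Winning bidder: Company A.",
    "This notice is published for 3 days.",
    "Winning amount: 1.2 million yuan.",
    "Contact: Mr. Li, phone 0551-1234567.",
]
# Element labels annotated per sentence; the second sentence carries none.
annotations = [["winning_unit_name"], [], ["winning_amount"], ["contact", "phone"]]
types = [sentence_types(labels) for labels in annotations]
paragraphs = build_paragraphs(sentences, types)
```

Because the sentence types are derived from existing element labels, training data for model B needs no second round of annotation, and sentences without any element drop out before model C sees the text.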

Further, S6 specifically comprises:

S61, acquiring the sentence set from which information is to be extracted, and constructing a schema according to the type of the sentence set: for information such as the address and contact person, a schema = { winning unit name: [ address, contact, phone, winning amount ] } is constructed;

S62, constructing the model input: a fixed prefix template is used, and the spliced form schema+text is taken as the input;

S63, passing the model input through the pre-training model to obtain token-level representation vectors, which are mapped by a fully connected layer into feature vectors whose dimension equals the number of output element types; a sigmoid is used to judge whether each token is the beginning or the end of an element, and the complete element information is obtained from the beginning and end positions.
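The begin/end decision in S63 can be sketched for a single element type as follows. The logits are made-up numbers standing in for the output of the fully connected layer, and the 0.5 threshold and the nearest-end pairing rule are assumptions, since the source does not state how starts and ends are matched.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_spans(start_logits, end_logits, threshold=0.5):
    """Pair each predicted start with the nearest predicted end at or after it."""
    starts = [i for i, z in enumerate(start_logits) if sigmoid(z) > threshold]
    ends = [i for i, z in enumerate(end_logits) if sigmoid(z) > threshold]
    spans = []
    for s in starts:
        later_ends = [e for e in ends if e >= s]
        if later_ends:
            spans.append((s, later_ends[0]))
    return spans

tokens = ["Winning", "bidder", ":", "Anhui", "Example", "Co", "."]
# Illustrative per-token logits for one element type (not model output).
start_logits = [-4.0, -3.5, -5.0, 3.2, -2.0, -1.0, -6.0]
end_logits = [-5.0, -4.0, -5.0, -2.5, -1.5, 2.8, -6.0]

spans = decode_spans(start_logits, end_logits)
elements = [" ".join(tokens[s:e + 1]) for s, e in spans]
```

Since starts and ends are scored independently for every token, two elements may overlap or contain one another, which a single BIO tag per token cannot express.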

Further, step S61 specifically comprises:

S611, first acquiring the element "winning unit name": with X denoting the text, the spliced form schema+text is "winning unit name" + X, and the start and end positions of the winning unit name are obtained through model B, giving the winning unit name, denoted Y;

S612, acquiring the address of Y: the spliced form schema+text is "address of Y" + X, and the start and end positions of the address are obtained through model B, giving the address;

S613, acquiring the contact of Y, the phone of Y and the winning amount of Y in the same way.

Further, in S613, not only the numerical value of the winning amount is obtained, but also whether its unit is yuan or ten thousand yuan.
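The two-level cascade of S611 to S613 can be sketched as a loop that first queries the head element and then queries each attribute of every head found. The `extract` callable is a hypothetical stand-in for the trained extraction model, and the "[SEP]" separator between prompt and text is an assumption; the source only says the schema is spliced in front of the text as a fixed prefix.

```python
def build_input(prompt, text):
    """Fixed-prefix input: the schema prompt is spliced in front of the text."""
    return f"{prompt}[SEP]{text}"

def cascade_extract(text, schema, extract):
    """First level finds each head element Y; second level queries 'attr of Y'.

    `extract` is a hypothetical callable: one spliced input string in,
    a list of extracted strings out.
    """
    results = {}
    for head, attributes in schema.items():
        for y in extract(build_input(head, text)):
            results[y] = {
                attr: extract(build_input(f"{attr} of {y}", text))
                for attr in attributes
            }
    return results

SCHEMA = {"winning unit name": ["address", "contact", "phone", "winning amount"]}
X = "Winning bidder: Company A, address: 1 Example Road, amount: 120 ten-thousand yuan."

def stub_extract(model_input):
    """Canned answers keyed by prompt, used here only to exercise the cascade."""
    prompt = model_input.split("[SEP]", 1)[0]
    canned = {
        "winning unit name": ["Company A"],
        "address of Company A": ["1 Example Road"],
        "winning amount of Company A": ["120 ten-thousand yuan"],
    }
    return canned.get(prompt, [])

result = cascade_extract(X, SCHEMA, stub_extract)
```

Binding each attribute query to a specific Y is what ties the extracted address or amount to the right winning unit when a document names several.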

A bidding element extraction system based on NLP information extraction technology comprises a processor and a memory, wherein the memory stores the instruction files and algorithm models used in processing; the processor is configured with a data acquisition module, a data cleaning module, an extraction module implementing the bidding element extraction method based on NLP technology, and an output module for the extraction results, wherein the data acquisition module comprises bidding documents.

A computer-readable storage medium stores a computer program which, when executed by a processor, implements the above method and electronic system for extracting bidding elements based on NLP technology.

Compared with the prior art, the invention has the advantages that:

Starting from a small amount of annotated data, a large number of annotated samples are obtained by combining multiple template models with data augmentation techniques, and a large amount of high-quality annotated data is finally obtained by manually checking the samples; the quantity and quality of the annotated data are key to the subsequent models. The current mainstream extraction model is replaced by one based on BERT+GlobalPointer, and a prompt model pre-trained on the basis of ERNIE is used for two-level cascaded information extraction, thereby achieving the following:

1. development efficiency is greatly improved and development cost is reduced, since model development can be completed with a small number of samples;

2. the 512-token text length limit of conventional models is overcome, and model B is not affected by text length;

3. the multi-level joint information extraction model adopted by the method predicts the beginning and end of each element, which solves the problem that conventional methods cannot handle nested elements efficiently;

4. the scheme involves multiple element categories, and during element recognition one category at a time is spliced as a template with the original text to form the model input; with multiple element categories, prediction is therefore performed multiple times, which greatly improves the recall of element information extraction.

Drawings

FIG. 1 is a flow chart of the present invention.

FIG. 2 is a model training flow chart of the present invention.

FIG. 3 is a schematic diagram of the structure of the extraction model B of the present invention using model A+GlobalPointer.

FIG. 4 is a schematic diagram of the structure of the extraction model C of the present invention using model A+prompt.

Detailed Description

The present invention will be described in further detail with reference to examples.

Examples

A bidding element extraction system based on NLP information extraction technology comprises a processor and a memory, wherein the memory stores the instruction files and algorithm models used in processing; the processor is configured with a data acquisition module, a data cleaning module, an extraction module implementing the bidding element extraction method based on NLP technology, and an output module for the extraction results, wherein the data acquisition module and the data cleaning module comprise bidding documents; the structure is shown in FIG. 1.

The original bidding file acquisition module obtains the text files to be extracted. The original file cleaning module uniformly cleans the various file formats of the text to be extracted and converts them into plain text strings. The bidding document element extraction module integrates the functional models of the bidding element extraction method based on NLP technology and extracts element information from the plain text strings produced by the cleaning module, yielding an element information set. The data standardization module determines the element extraction results for the text to be extracted from the element information set, based on element boundary features. The output module outputs the element extraction results in the form required by the business.
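As a rough illustration of what the cleaning module produces, the sketch below strips markup and control characters with regular expressions, splits the plain text into sentences by rule, and joins the sentences with \n. The specific patterns and the choice of sentence-ending punctuation are assumptions; the source only states that special character strings are removed and that splitting follows rules.

```python
import re

def clean_text(raw):
    """Strip HTML-like tags, control characters and redundant spaces."""
    text = re.sub(r"<[^>]+>", " ", raw)
    text = re.sub(r"[\x00-\x08\x0b-\x1f\u200b\ufeff]", "", text)
    text = re.sub(r"[ \t\u3000]+", " ", text)
    return text.strip()

def split_sentences(text):
    """Split after Chinese or Western sentence-ending punctuation and at line breaks.

    The full stop '.' is deliberately left out so decimals such as 1.2 stay intact.
    """
    pieces = re.split(r"(?<=[。！？；!?;])\s*|\n+", text)
    return [p.strip() for p in pieces if p and p.strip()]

raw = "<p>Winning bidder: Company A;</p>\n<p>Budget: 1.2 million yuan!  Contact: Mr. Li</p>"
sentences = split_sentences(clean_text(raw))
# One sentence per line, the form described for the pre-training corpus.
corpus_document = "\n".join(sentences)
```

PDF and Word inputs would first need a format-specific text extractor, which is outside this sketch.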

S1, acquiring original bidding files: bidding document information is obtained from the Internet; the documents may be in PDF, HTML, Word or other formats;

S2, obtaining a pre-training model A: converting the original files into plain text documents, removing special character strings with regular expressions, segmenting the plain text into sets of sentences according to rules, and joining the sentences with the line feed character \n into a new text document to obtain a pre-training corpus; after word segmentation of the corpus, BERT is trained to obtain a bidding-domain pre-training model A based on the Transformer network structure;

S3, acquiring labeled training samples: the samples come from several sources, namely structured data contained in some original files, data matched with regular expressions, and data matched by a general language model using a large number of varied templates; the final complete samples are obtained after manual proofreading;

S4, performing data augmentation on the labeled samples: word segmentation, clustering and screening are applied to a large corpus of bidding documents to obtain a corpus of element key fields, and more training samples are generated using data augmentation techniques;

S5, training a sentence-level potential element type recognition model B, as shown in FIG. 3: most of the text in a bidding document contains no element information. The element types are summarized and classified, with M elements grouped into N types (M > N), and an element dictionary table mapping element labels to element types is constructed. The aim is to split the target text into several target paragraph texts. A sentence-level NER recognition model is trained to recognize the element types a sentence may contain. The type of the sentence containing each piece of annotated data can be obtained from its element labels via the element dictionary table, so the training samples can be produced without further manual annotation. The CLS-layer features from the pre-training model are used as the sentence representation, each sentence representation is treated as one token, a token-pair matrix is constructed using the multi-head idea, and the element types contained in each sentence are obtained with the GlobalPointer method. Finally, sentences carrying the same element type mark are spliced and combined according to an element type combination rule strategy to obtain target paragraph texts whose element information types are known.
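The token-pair matrix in S5 can be illustrated with a toy version of GlobalPointer-style scoring, in which each sentence vector plays the role of one token and each element type has its own head. The vectors and projection weights below are hand-picked for illustration rather than learned, the rotary position encoding used by the published GlobalPointer is omitted, and reading a span (i, j) as "sentences i through j carry this element type" is an assumption about how the patent applies the method.

```python
def token_pair_scores(vectors, w_q, w_k):
    """Score every ordered pair (i, j) as the dot product q_i . k_j for one head."""
    def project(vector, weight_rows):
        return [sum(a * b for a, b in zip(vector, row)) for row in weight_rows]
    queries = [project(v, w_q) for v in vectors]
    keys = [project(v, w_k) for v in vectors]
    return [[sum(a * b for a, b in zip(q, k)) for k in keys] for q in queries]

def decode(scores_by_type, threshold=0.0):
    """Keep spans (i, j) with i <= j whose score exceeds the threshold."""
    found = []
    for type_name, scores in scores_by_type.items():
        n = len(scores)
        for i in range(n):
            for j in range(i, n):
                if scores[i][j] > threshold:
                    found.append((type_name, i, j))
    return found

# Three sentence vectors (stand-ins for CLS features), one head for one type.
sentence_vectors = [[1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
w_q = [[1.0, 0.0]]  # hand-picked query projection
w_k = [[0.0, 1.0]]  # hand-picked key projection

scores = token_pair_scores(sentence_vectors, w_q, w_k)
found = decode({"winning_info": scores})
```

The sequence length here is the number of sentences rather than the number of characters, which is why a long document does not run into the 512-token limit at this stage.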

S6, training an element and relation extraction model C, whose structure is shown in FIG. 4:

All element information of each type, and the relations among the elements, are extracted from the paragraph sets of the different types identified in step S5. The procedure is as follows:

Take the sentence set from which information is to be extracted, and construct a schema according to the type of the sentence set. For example, for a sentence set of the winning bid information type, the winning unit and its winning amount, address, contact and other information need to be extracted, so a schema = { winning unit name: [ address, contact, phone, winning amount ] } is constructed.

Construct the model input: a fixed prefix template is used, and the spliced form schema+text is taken as the input.

The model input is passed through the pre-training model to obtain token-level representation vectors, which are mapped by a fully connected layer into feature vectors whose dimension equals the number of output element types. A sigmoid is used to judge whether each token is the beginning or the end of an element, and the complete element information is obtained from the beginning and end positions.

More specifically: let X be a text known to contain winning bid information, and let schema = { winning unit name: [ address, contact, phone, winning amount ] }.

First, acquire the start and end positions of the element "winning unit name": the spliced form schema+text is "winning unit name" + X, and model B gives the winning unit name, denoted Y.

Acquire the address of Y: the spliced form schema+text is "address of Y" + X, and the start and end positions of the address are obtained through model B, giving the address.

The contact of Y, the phone of Y and the winning amount of Y are obtained in the same way.

Note in particular that not only the numerical value of the winning amount is obtained, but also whether its unit is yuan or ten thousand yuan.

S7, passing the data through a standardization module and outputting the result: each element and each group of relation pairs has a standard model for its normalization. After a newly acquired original bidding file is cleaned, its element information can be obtained using the models A, B and C obtained in the above steps. The element information, such as dates, amounts, addresses, telephone numbers and email addresses, is then normalized according to its standardized output format. Because the preliminary extraction results come in many formats, they are passed through a format standardization module so that the final output is in a unified format.
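A minimal sketch of what such a format standardization module might do for amounts, dates and telephone numbers is given below. The target formats (amounts in yuan, ISO 8601 dates, digits-only phone numbers) and the accepted input patterns are assumptions; the source does not define the unified output format.

```python
import re
from datetime import date
from decimal import Decimal

def normalize_amount(text):
    """Return the amount in yuan, scaling by 10,000 when the unit is 万元 (ten thousand yuan)."""
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*(万元|万|元)?", text)
    if match is None:
        return None
    value = Decimal(match.group(1).replace(",", ""))
    if match.group(2) in ("万元", "万"):
        value *= 10000
    return value

def normalize_date(text):
    """Convert dates such as 2023年2月9日 or 2023/02/09 to ISO 8601."""
    match = re.search(r"(\d{4})\s*[年/.-]\s*(\d{1,2})\s*[月/.-]\s*(\d{1,2})", text)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day).isoformat()

def normalize_phone(text):
    """Keep only digits and a leading plus sign."""
    return re.sub(r"[^\d+]", "", text)

amount = normalize_amount("中标金额：120.5万元")
published = normalize_date("2023年2月9日")
phone = normalize_phone("(0551) 123-4567")
```

Carrying the unit through extraction matters here: the same digits mean amounts four orders of magnitude apart depending on whether the unit is yuan or ten thousand yuan.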

The invention and its embodiments have been described above in a non-limiting manner, and the actual construction is not limited to what is shown in the drawings. In summary, structures and embodiments similar to this technical solution that a person of ordinary skill in the art, informed by this disclosure, devises without creative effort and without departing from the gist of the invention shall fall within the scope of protection of the invention.

Claims

1. A bidding element extraction method based on NLP information extraction technology, characterized by comprising the steps of:

2. The bidding element extraction method based on NLP information extraction technology of claim 1, wherein S6 specifically comprises:

3. The bidding element extraction method based on NLP information extraction technology of claim 2, wherein step S61 specifically comprises:

4. The bidding element extraction method based on NLP information extraction technology of claim 3, wherein: in S613, not only the numerical value of the winning amount is obtained, but also whether its unit is yuan or ten thousand yuan.

5. A bidding element extraction system based on NLP information extraction technology, characterized in that: the system comprises a processor and a memory, wherein the memory stores the instruction files and algorithm models used in processing; the processor is configured with a data acquisition module, a data cleaning module, an extraction module implementing the bidding element extraction method based on NLP technology, and an output module for the extraction results, wherein the data acquisition module comprises bidding documents.

6. A computer-readable storage medium, characterized in that: a computer program is stored thereon which, when executed by a processor, implements the above method and electronic system for extracting bidding elements based on NLP technology.