CN109753660B - LSTM-based winning bid web page named entity extraction method - Google Patents

LSTM-based winning bid web page named entity extraction method

Info

Publication number
CN109753660B
Authority
CN
China
Prior art keywords
word
winning
lstm
text
bid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910013185.2A
Other languages
Chinese (zh)
Other versions
CN109753660A (en)
Inventor
陈羽中
林剑
郭昆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201910013185.2A priority Critical patent/CN109753660B/en
Publication of CN109753660A publication Critical patent/CN109753660A/en
Application granted granted Critical
Publication of CN109753660B publication Critical patent/CN109753660B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a named entity recognition method for winning bid data, which comprises the following steps: cleaning the text data of a winning bid web page to obtain the winning bid text; using a Lattice-LSTM as the coding layer to obtain the semantic features of the text data; using an LSTM as the decoding layer to tag each word and mark the entity information in the sentence sequence; performing rule correction and formatting; and finally outputting the recognized named entities of the winning bid web page. Based on the Lattice-LSTM-LSTM model, the invention can efficiently recognize the named entities in the winning bid item detail pages of bidding websites.

Description

LSTM-based winning bid web page named entity extraction method
Technical Field
The invention relates to the technical field of named entity recognition, in particular to a method for extracting named entities of a winning web page based on LSTM.
Background
Named entity recognition is one of the fundamental tasks of natural language processing. It aims to identify named entities such as person names, place names, and organization names in a corpus. Because the number of such named entities keeps growing, they usually cannot be exhaustively listed in a dictionary, and because their formation follows certain regular patterns, the recognition of these words is usually handled separately from lexical morphological processing tasks (e.g., Chinese word segmentation) and is known as named entity recognition.
As a basic task of natural language processing, named entity recognition has attracted close attention from many experts and scholars, and a number of optimization algorithms and models have been proposed. Some researchers proposed a named entity recognition algorithm based on a stacked HMM model, which first recognizes person and place names and then uses them as features for higher-level organization name recognition. Others proposed a Chinese named entity recognition algorithm based on conditional random fields, which achieves good results using characters, boundaries, parts of speech, and entity dictionaries as features. A bootstrapping-based method has been proposed that uses the bootstrapping technique to expand a seed word list and alleviate the shortage of manually annotated data. A named entity recognition algorithm based on a BLSTM neural network architecture no longer relies directly on hand-crafted features and domain knowledge, but uses context-based word vectors and character-based word vectors, the former expressing the context of named entities and the latter expressing the prefix, suffix, and domain information that make up the named entities. A named entity recognition algorithm based on the BLSTM-CRF model observes that, when sequence-labeling a sentence, the labels of adjacent words are not independent; it therefore considers the label information of the preceding words when tagging the current word, and replaces the softmax output layer with a CRF to produce the final prediction for each word. Finally, a deep neural network model based on stacked autoencoder classifiers solves the conversion from a Chinese text sequence to the model's input vector and proposes a vectorized forward-backward propagation formulation that is convenient for engineering implementation.
Most existing named entity recognition algorithms recognize only person, place, and organization names, do not subdivide these categories further, and perform poorly on long entities.
Disclosure of Invention
Therefore, the invention aims to provide an LSTM-based winning bid web page named entity extraction method that can rapidly and effectively identify named entities in the winning bid item detail pages of bidding websites.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a method for extracting a winning bid web page named entity based on LSTM specifically comprises the following steps:
step A: cleaning text data of the bid-winning web page to be extracted to obtain a bid-winning text;
and (B) step (B): taking a Lattice-LSTM model as a coding layer, and taking a winning bid text as input of the coding layer to obtain semantic information characteristics of the winning bid text;
step C: taking the LSTM model as a decoding layer, and taking the semantic information characteristics of the obtained winning bid text as the input of the decoding layer, and marking each word in the winning bid text;
step D: performing rule correction and formatting treatment on the obtained marked winning bid text;
step E: and outputting the identified named entity.
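The five steps above can be sketched end-to-end as follows. This is a minimal illustration, and every function body, label name, and regular expression in it is a hypothetical stand-in rather than the patent's implementation:

```python
import re

def clean_page(raw_html: str) -> str:
    # Step A (simplified): strip tags and collapse whitespace to get the bid text.
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return re.sub(r"\s+", " ", text).strip()

def extract_entities(raw_html: str, tagger) -> list:
    # Steps B-C are delegated to `tagger`, a stand-in for the Lattice-LSTM
    # encoder plus LSTM decoder; it returns (entity_text, label) pairs.
    text = clean_page(raw_html)
    candidates = tagger(text)
    # Step D (one sample rule): drop "amount" entities containing no numeral.
    return [(e, lab) for e, lab in candidates
            if lab != "AMOUNT" or re.search(r"[0-9一二三四五六七八九十百千万亿壹贰叁]", e)]
```

A trained model would play the role of `tagger`; here any callable returning candidate (text, label) pairs fits the interface.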
Further, the step B specifically includes:
step B1: converting the characters in the winning bid text into character vectors;
wherein the j-th character $c_j$ in the winning bid text is converted to a character vector $x_j^c$; the calculation formula is as follows:

$$x_j^c = e^c(c_j)$$

wherein $e^c$ represents the character-embedding lookup table.
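As a concrete sketch of this lookup (the character vocabulary, the 4-dimensional size, and the random embeddings below are toy assumptions, not values from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)
char2id = {ch: i for i, ch in enumerate("中标金额公司")}  # toy character vocabulary
e_c = rng.normal(size=(len(char2id), 4))                  # e^c: embedding matrix

def char_vectors(sentence: str) -> np.ndarray:
    # x_j^c = e^c(c_j): one embedding row per character of the sentence
    return e_c[[char2id[ch] for ch in sentence]]

vecs = char_vectors("中标金额")   # shape (4, 4): 4 characters, 4-dim vectors
```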
Step B2: converting words in the winning bid text into word vectors;
step B3: and inputting the word vector into a Lattice-LSTM model, and obtaining semantic information characteristics of the winning bid text by using the Lattice-LSTM model.
Further, the step B2 specifically includes:
step B21: constructing a word list D with a trie according to the large-scale corpus;
step B22: initializing an empty matching word set P of the winning bid text;
step B23: taking the first character of the winning bid text as the current character, and executing step B24;
step B24: match in the word list D every word $w_{b,e}$ that takes the current character as its first character, and add it to the set P;
wherein b represents the position in the sentence of the first character of the word, and e represents the position of its last character;
step B25, taking the next character of the current character as the current character, and iteratively executing the step B24 until the last character of the winning bid text is finished;
step B26: after the traversal is finished, convert each word $w_{b,e}$ in the set P to a word vector $x_{b,e}^w$; the calculation formula is as follows:

$$x_{b,e}^w = e^w(w_{b,e})$$

wherein $e^w$ is the word-embedding lookup table.
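Steps B21–B26 can be illustrated as follows. A real implementation would walk a trie, but a set of lexicon words plus a maximum word length (an assumption here) yields the same match set P:

```python
def match_words(sentence: str, lexicon: set, max_len: int = 4) -> list:
    # Collect every lexicon word w_{b,e} (0-based inclusive indices) that
    # starts at position b; a trie would prune this inner loop, but a set
    # plus a maximum word length gives the same match set P.
    matches = []
    for b in range(len(sentence)):
        for e in range(b + 1, min(b + max_len, len(sentence))):
            word = sentence[b:e + 1]
            if word in lexicon:
                matches.append((b, e, word))
    return matches

P = match_words("中标金额为", {"中标", "中标金额", "金额"})
# -> [(0, 1, '中标'), (0, 3, '中标金额'), (2, 3, '金额')]
```

Note that single characters are never added to P; only multi-character lexicon words feed the word-level path of the Lattice-LSTM.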
Further, the step B3 specifically includes the following steps:
for each sentence in the text, sequentially inputting the word vector sequence obtained in the step B1
Figure BDA0001938106730000034
And the word vector sequence obtained in the step B2 +.>
Figure BDA0001938106730000035
In the Lattice-LSTM model, a vector representation sequence of semantic information of each word context is output, and a specific calculation formula is as follows:
Figure BDA0001938106730000036
Figure BDA0001938106730000037
Figure BDA0001938106730000038
Figure BDA0001938106730000039
Figure BDA00019381067300000310
Figure BDA00019381067300000311
/>
Figure BDA00019381067300000312
Figure BDA00019381067300000313
Figure BDA00019381067300000314
Figure BDA00019381067300000315
Figure BDA00019381067300000316
Figure BDA00019381067300000317
Figure BDA0001938106730000041
is the word vector of the j-th word in the sentence,/>
Figure BDA0001938106730000042
Is a word vector of words ending with the j-th word in the sentence,/for example>
Figure BDA0001938106730000043
The output of the moment j; />
Figure BDA0001938106730000044
Weight matrix for word level LSTM, +.>
Figure BDA0001938106730000045
Figure BDA0001938106730000046
Bias terms for word level LSTM; />
Figure BDA0001938106730000047
Is the forget gate of the word level LSTM at the moment j; />
Figure BDA0001938106730000048
Is the input gate of the word level LSTM at the moment j; />
Figure BDA0001938106730000049
Is a candidate memory vector of the word level LSTM at the moment j; />
Figure BDA00019381067300000410
Is the memory vector of the word level LSTM at the moment j;
Figure BDA00019381067300000411
weight matrix for character level LSTM, +.>
Figure BDA00019381067300000412
Bias terms that are character level LSTM; />
Figure BDA00019381067300000413
Is the input gate of the character level LSTM at the moment j; />
Figure BDA00019381067300000414
Is a candidate memory vector of the word level LSTM at the moment j; />
Figure BDA00019381067300000415
Is the memory vector of the word level LSTM at the moment j; />
Figure BDA00019381067300000416
Is the output gate of the word level LSTM at the moment j; />
Figure BDA00019381067300000417
Figure BDA00019381067300000418
Is to calculate->
Figure BDA00019381067300000419
Weight at that time.
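The heart of this computation is the final merge, where the gate activation of the character candidate and the gates of all matched word cells are normalized and used to mix their memory vectors. A minimal numpy sketch of that merge (toy dimensions and values; the element-wise softmax mirrors the normalized weights):

```python
import numpy as np

def lattice_merge(i_char, c_char_tilde, word_gates, word_cells):
    # Normalize the char-candidate gate i_j^c against the gates i_{b,j}^c of
    # all lexicon words ending at j (element-wise softmax), then mix the
    # corresponding memory vectors to obtain c_j^c.
    gates = np.stack([i_char] + word_gates)        # (1 + n_words, d)
    alphas = np.exp(gates) / np.exp(gates).sum(axis=0)
    cells = np.stack([c_char_tilde] + word_cells)  # candidate + word memories
    return (alphas * cells).sum(axis=0)            # c_j^c

d = 3
c_j = lattice_merge(np.zeros(d), np.ones(d),
                    [np.zeros(d)], [np.full(d, 3.0)])
# equal gate values -> equal weights 0.5 -> every component is (1 + 3) / 2
```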
Further, the step C specifically includes:
step C1: aiming at the named entity recognition task of the winning web page, dividing words in the data into two types;
wherein the first class represents words that are not related to any entity, denoted by the label "O"; the second class represents words associated with an entity, and the labels of the words in this class consist of three parts;
step C2: the hidden-state sequence $h_j^c$ obtained in step B, which represents the semantic information of the text, is input to the LSTM model of the decoding layer, which calculates the output state of each character under the influence of its context characters. The specific calculation formulas are as follows:

$$i_j = \sigma \left( W^{i\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^i \right)$$

$$f_j = \sigma \left( W^{f\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^f \right)$$

$$o_j = \sigma \left( W^{o\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^o \right)$$

$$\tilde{c}_j = \tanh \left( W^{u\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^u \right)$$

$$c_j = f_j \odot c_{j-1} + i_j \odot \tilde{c}_j$$

$$h_j = o_j \odot \tanh(c_j)$$

$$y_j = h_j$$

wherein $y_j$ is the label vector of the j-th character;
step C3: the label vector $y_j$ is input to a Softmax classifier, which normalizes it and calculates the probability that each character in the text is marked with each kind of label. The specific formulas are as follows:

$$o_j = W_y \, y_j + b_y$$

$$p(t_j = k \mid x) = \frac{\exp(o_{j,k})}{\sum_{k'=1}^{N_t} \exp(o_{j,k'})}$$

wherein $W_y$ is a weight matrix, $b_y$ is a bias term, and $N_t$ is the number of kinds of labels;
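A minimal sketch of this classification step (the 3-label identity weights and zero bias are toy assumptions):

```python
import numpy as np

def label_probs(y_j, W_y, b_y):
    # o = W_y y_j + b_y, then softmax over the N_t label scores
    o = W_y @ y_j + b_y
    p = np.exp(o - o.max())          # max-shift for numerical stability
    return p / p.sum()

W_y = np.eye(3)                      # toy weights for N_t = 3 labels
p = label_probs(np.array([2.0, 0.0, 0.0]), W_y, np.zeros(3))
# p sums to 1 and puts the highest probability on label 0
```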
step C4: the negative log-likelihood is used as the loss function, and the model parameters are updated by back-propagation through a stochastic gradient descent optimization method so as to minimize the loss function and train the model. The specific calculation formula is as follows:

$$L(\Theta) = -\sum_{x \in D} \sum_{j=1}^{L_x} I(t_j) \log p(t_j \mid x; \Theta)$$

wherein D represents the training set, $L_x$ is the length of the sentence x, $t_j$ is the gold label of the j-th character in x, $p(t_j \mid x; \Theta)$ is its normalized probability, and $\Theta$ represents the model parameters; $I(\cdot)$ is a selection function that distinguishes the loss on the label 'O' from the loss on labels that indicate an entity; the specific calculation formula is as follows:

$$I(t) = \begin{cases} \lambda, & t = \text{'O'} \\ 1, & \text{otherwise} \end{cases}$$

where $0 < \lambda < 1$ is a weighting coefficient.
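A sketch of the weighted loss; the down-weighting value `lambda_o` is an assumed illustration, since the patent states only that the loss of the label 'O' is treated separately from entity labels:

```python
import numpy as np

def weighted_nll(gold_probs, gold_labels, lambda_o=0.5):
    # gold_probs[j]: model probability of the gold label of character j;
    # 'O' labels contribute lambda_o times their normal loss.
    loss = 0.0
    for p, t in zip(gold_probs, gold_labels):
        weight = lambda_o if t == "O" else 1.0
        loss -= weight * np.log(p)
    return loss

loss = weighted_nll([0.5, 0.5], ["O", "B-AMOUNT"])
# 0.5 * ln 2 for the 'O' character + 1.0 * ln 2 for the entity character
```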
further, the named entity comprises a bid-winning organization, a region where the bid-winning organization is located, a bid-winning amount, a bid-winning organization contact person and a bid-winning project name, and a bid-winning time.
Further, the step D specifically includes:
step D1: c, carrying out regular correction processing on the marked data obtained in the step C;
step D2: and formatting the corrected data.
Further, the step D1 specifically includes:
step D11: for the winning bid amount, a regular expression is used to judge whether the entity contains Arabic numerals or Chinese capital numerals; if not, the entity is judged not to be a winning bid amount and is discarded;
step D12: for the winning bid time, entities that do not conform to a date composition pattern are discarded;
step D13: for the project name, since the string of a project-name entity is generally long and an entity of only two or three characters basically does not occur, recognized project-name entities whose string length is less than 4 are discarded;
step D14: when entities of the same category are tagged multiple times, only the named entity with the longest string is retained.
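Rules D11, D13, and D14 can be sketched as follows (the regex, label names, and candidate format are illustrative assumptions, not the patent's exact patterns):

```python
import re

CN_NUM = r"[0-9〇一二三四五六七八九十百千万亿壹贰叁肆伍陆柒捌玖拾佰仟]"

def correct(entities):
    # entities: (text, label) candidates emitted by the decoding layer
    best = {}
    for text, label in entities:
        if label == "AMOUNT" and not re.search(CN_NUM, text):
            continue                        # D11: an amount must contain a numeral
        if label == "PROJECT" and len(text) < 4:
            continue                        # D13: drop project names under 4 chars
        if label not in best or len(text) > len(best[label]):
            best[label] = text              # D14: keep the longest per category
    return [(t, lab) for lab, t in best.items()]
```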
Further, in the step D2, the formatting process is performed on the named entity, which specifically includes the following steps:
step D21, judging whether the entity contains units of hundred, bai, qian, wan, yi, and U.S. dollars and Japanese, and if so, converting the units;
step D22: the winning bid time is converted to the date format YYYY-MM-DD.
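A sketch of steps D21–D22 (the unit table and date patterns are illustrative assumptions covering only a few common forms):

```python
import re

UNITS = {"百": 100, "千": 1_000, "万": 10_000, "亿": 100_000_000}

def normalize_amount(text: str) -> float:
    # "3.5万" -> 35000.0; unitless strings pass through unchanged
    m = re.match(r"([0-9.,]+)([百千万亿]?)", text)
    value = float(m.group(1).replace(",", ""))
    return value * UNITS.get(m.group(2), 1)

def normalize_date(text: str) -> str:
    # accepts 2019年1月7日, 2019-1-7, 2019.01.07 and emits YYYY-MM-DD
    m = re.match(r"(\d{4})[年./-](\d{1,2})[月./-](\d{1,2})", text)
    return f"{int(m.group(1)):04d}-{int(m.group(2)):02d}-{int(m.group(3)):02d}"
```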
Compared with the prior art, the invention has the following beneficial effects:
the invention is based on the Lattice-LSTM-LSTM model, can efficiently identify the named entity in the detail page of the bid-winning item of the bid-winning website, and can well identify the long entity.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples.
Referring to fig. 1, the invention provides a method for extracting named entities of a winning bid web page based on LSTM, which specifically includes the following steps:
step A: cleaning text data of the bid-winning web page to be extracted to obtain a bid-winning text;
and (B) step (B): taking a Lattice-LSTM model as a coding layer, and taking a winning bid text as input of the coding layer to obtain semantic information characteristics of the winning bid text;
step B1: converting the characters in the winning bid text into character vectors;
wherein the j-th character $c_j$ in the winning bid text is converted to a character vector $x_j^c$; the calculation formula is as follows:

$$x_j^c = e^c(c_j)$$

wherein $e^c$ represents the character-embedding lookup table.
Step B2: converting words in the winning bid text into word vectors;
step B21: constructing a word list D with a trie according to the large-scale corpus;
step B22: initializing an empty matching word set P of the winning bid text;
step B23: taking the first character of the winning bid text as the current character, and executing step B24;
step B24: match in the word list D every word $w_{b,e}$ that takes the current character as its first character, and add it to the set P;
wherein b represents the position in the sentence of the first character of the word, and e represents the position of its last character;
step B25, taking the next character of the current character as the current character, and iteratively executing the step B24 until the last character of the winning bid text is finished;
step B26: after the traversal is finished, convert each word $w_{b,e}$ in the set P to a word vector $x_{b,e}^w$; the calculation formula is as follows:

$$x_{b,e}^w = e^w(w_{b,e})$$

wherein $e^w$ is the word-embedding lookup table.
Step B3: and inputting the word vector into a Lattice-LSTM model, and obtaining semantic information characteristics of the winning bid text by using the Lattice-LSTM model.
For each sentence in the text, the character-vector sequence $x_1^c, \dots, x_n^c$ obtained in step B1 and the word-vector sequence $\{x_{b,e}^w\}$ obtained in step B2 are input in turn to the Lattice-LSTM model, which outputs a sequence of vector representations of the contextual semantic information of each character. The specific calculation formulas are as follows:

$$\begin{bmatrix} i_j^c \\ f_j^c \\ o_j^c \\ \tilde{c}_j^c \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} \left( W^{c\top} \begin{bmatrix} x_j^c \\ h_{j-1}^c \end{bmatrix} + b^c \right)$$

$$\begin{bmatrix} i_{b,e}^w \\ f_{b,e}^w \\ \tilde{c}_{b,e}^w \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix} \left( W^{w\top} \begin{bmatrix} x_{b,e}^w \\ h_b^c \end{bmatrix} + b^w \right)$$

$$c_{b,e}^w = f_{b,e}^w \odot c_b^c + i_{b,e}^w \odot \tilde{c}_{b,e}^w$$

$$i_{b,e}^c = \sigma \left( W^{l\top} \begin{bmatrix} x_e^c \\ c_{b,e}^w \end{bmatrix} + b^l \right)$$

$$\alpha_{b,j}^c = \frac{\exp(i_{b,j}^c)}{\exp(i_j^c) + \sum_{b' : w_{b',j} \in D} \exp(i_{b',j}^c)}$$

$$\alpha_j^c = \frac{\exp(i_j^c)}{\exp(i_j^c) + \sum_{b' : w_{b',j} \in D} \exp(i_{b',j}^c)}$$

$$c_j^c = \sum_{b : w_{b,j} \in D} \alpha_{b,j}^c \odot c_{b,j}^w + \alpha_j^c \odot \tilde{c}_j^c$$

$$h_j^c = o_j^c \odot \tanh(c_j^c)$$

wherein $x_j^c$ is the character vector of the j-th character in the sentence, $x_{b,e}^w$ is the word vector of a word in the sentence ending with the e-th character, and $h_j^c$ is the output at position j; $W^w$ and $b^w$ are the weight matrix and bias term of the word-level LSTM, whose forget gate, input gate, candidate memory vector, and memory vector are $f_{b,e}^w$, $i_{b,e}^w$, $\tilde{c}_{b,e}^w$, and $c_{b,e}^w$; $W^c$ and $b^c$ are the weight matrix and bias term of the character-level LSTM, whose input gate, forget gate, output gate, candidate memory vector, and memory vector at position j are $i_j^c$, $f_j^c$, $o_j^c$, $\tilde{c}_j^c$, and $c_j^c$; $\alpha_{b,j}^c$ and $\alpha_j^c$ are the normalized weights used when computing $c_j^c$.
Step C: taking the LSTM model as a decoding layer, and taking the semantic information characteristics of the obtained winning bid text as the input of the decoding layer, and marking each word in the winning bid text;
step C1: aiming at the named entity recognition task of the winning web page, dividing words in the data into two types;
wherein the first class represents words that are not related to any entity, denoted by the label "O"; the second class represents words associated with an entity, and the labels of the words in this class consist of three parts;
step C2: the hidden-state sequence $h_j^c$ obtained in step B, which represents the semantic information of the text, is input to the LSTM model of the decoding layer, which calculates the output state of each character under the influence of its context characters. The specific calculation formulas are as follows:

$$i_j = \sigma \left( W^{i\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^i \right)$$

$$f_j = \sigma \left( W^{f\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^f \right)$$

$$o_j = \sigma \left( W^{o\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^o \right)$$

$$\tilde{c}_j = \tanh \left( W^{u\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^u \right)$$

$$c_j = f_j \odot c_{j-1} + i_j \odot \tilde{c}_j$$

$$h_j = o_j \odot \tanh(c_j)$$

$$y_j = h_j$$

wherein $y_j$ is the label vector of the j-th character;
step C3: the label vector $y_j$ is input to a Softmax classifier, which normalizes it and calculates the probability that each character in the text is marked with each kind of label. The specific formulas are as follows:

$$o_j = W_y \, y_j + b_y$$

$$p(t_j = k \mid x) = \frac{\exp(o_{j,k})}{\sum_{k'=1}^{N_t} \exp(o_{j,k'})}$$

wherein $W_y$ is a weight matrix, $b_y$ is a bias term, and $N_t$ is the number of kinds of labels;
step C4: the negative log-likelihood is used as the loss function, and the model parameters are updated by back-propagation through a stochastic gradient descent optimization method so as to minimize the loss function and train the model. The specific calculation formula is as follows:

$$L(\Theta) = -\sum_{x \in D} \sum_{j=1}^{L_x} I(t_j) \log p(t_j \mid x; \Theta)$$

wherein D represents the training set, $L_x$ is the length of the sentence x, $t_j$ is the gold label of the j-th character in x, $p(t_j \mid x; \Theta)$ is its normalized probability, and $\Theta$ represents the model parameters; $I(\cdot)$ is a selection function that distinguishes the loss on the label 'O' from the loss on labels that indicate an entity; the specific calculation formula is as follows:

$$I(t) = \begin{cases} \lambda, & t = \text{'O'} \\ 1, & \text{otherwise} \end{cases}$$

where $0 < \lambda < 1$ is a weighting coefficient.
step D: performing rule correction and formatting treatment on the obtained marked winning bid text;
step D1: c, carrying out regular correction processing on the marked data obtained in the step C;
step D11: for the winning bid amount, a regular expression is used to judge whether the entity contains Arabic numerals or Chinese capital numerals; if not, the entity is judged not to be a winning bid amount and is discarded;
step D12: for the winning bid time, entities that do not conform to a date composition pattern are discarded;
step D13: for the project name, since the string of a project-name entity is generally long and an entity of only two or three characters basically does not occur, recognized project-name entities whose string length is less than 4 are discarded;
step D14: when entities of the same category are tagged multiple times, only the named entity with the longest string is retained.
Step D2: and formatting the corrected data.
step D21: for the winning bid amount, judging whether the entity contains Chinese magnitude units such as bai (hundred), qian (thousand), wan (ten thousand), or yi (hundred million), or currency units such as US dollars or Japanese yen, and if so, converting the unit;
step D22: the winning bid time is converted to the date format YYYY-MM-DD.
Step E: and outputting the identified bidding institutions, the regions where the bidding institutions are located, the bid amount, the bidding institution contacts, the bidding project names and the named entities of the bid time.
The foregoing description is only of the preferred embodiments of the invention, and all changes and modifications that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (6)

1. The method for extracting the named entity of the winning bid web page based on the LSTM is characterized by comprising the following steps of:
step A: cleaning text data of the bid-winning web page to be extracted to obtain a bid-winning text;
and (B) step (B): taking a Lattice-LSTM model as a coding layer, and taking a winning bid text as input of the coding layer to obtain semantic information characteristics of the winning bid text;
step C: taking the LSTM model as a decoding layer, and taking the semantic information characteristics of the obtained winning bid text as the input of the decoding layer, and marking each word in the winning bid text;
step D: performing rule correction and formatting treatment on the obtained marked winning bid text;
step E: outputting the identified named entity;
the step B specifically comprises the following steps:
step B1: converting the characters in the winning bid text into character vectors;
wherein the j-th character $c_j$ in the winning bid text is converted to a character vector $x_j^c$; the calculation formula is as follows:

$$x_j^c = e^c(c_j)$$

wherein $e^c$ represents the character-embedding lookup table;
step B2: converting words in the winning bid text into word vectors;
step B3: inputting the word vector into a Lattice-LSTM model, and obtaining semantic information characteristics of the winning bid text by using the Lattice-LSTM model;
the step B3 specifically comprises the following steps:
for each sentence in the text, sequentially inputting the word vector sequence obtained in the step B1
Figure FDA0004100483970000013
And the word vector sequence obtained in the step B2 +.>
Figure FDA0004100483970000014
In the Lattice-LSTM model, a vector representation sequence of semantic information of each word context is output, and a specific calculation formula is as follows:
Figure FDA0004100483970000015
Figure FDA0004100483970000016
Figure FDA0004100483970000017
Figure FDA0004100483970000018
Figure FDA0004100483970000021
Figure FDA0004100483970000022
Figure FDA0004100483970000023
Figure FDA0004100483970000024
Figure FDA0004100483970000025
Figure FDA0004100483970000026
/>
Figure FDA0004100483970000027
Figure FDA0004100483970000028
Figure FDA0004100483970000029
is the word vector of the j-th word in the sentence,/>
Figure FDA00041004839700000210
Is a word vector of words ending with the j-th word in the sentence,/for example>
Figure FDA00041004839700000226
The output of the moment j; />
Figure FDA00041004839700000211
Weight matrix for word level LSTM, +.>
Figure FDA00041004839700000212
Bias terms for word level LSTM; />
Figure FDA00041004839700000213
Is the forget gate of the word level LSTM at the moment j; />
Figure FDA00041004839700000214
Is the input gate of the word level LSTM at the moment j; />
Figure FDA00041004839700000215
Is a candidate memory vector of the word level LSTM at the moment j; />
Figure FDA00041004839700000216
Is the memory vector of the word level LSTM at the moment j;
Figure FDA00041004839700000217
weight matrix for character level LSTM, +.>
Figure FDA00041004839700000218
Bias terms that are character level LSTM; />
Figure FDA00041004839700000219
Is the input gate of the character level LSTM at the moment j; />
Figure FDA00041004839700000220
Is a candidate memory vector of the word level LSTM at the moment j; />
Figure FDA00041004839700000221
Is the memory vector of the word level LSTM at the moment j; />
Figure FDA00041004839700000222
Is the output gate of the word level LSTM at the moment j; />
Figure FDA00041004839700000223
Figure FDA00041004839700000224
Is to calculate->
Figure FDA00041004839700000225
Weight at time;
the step C specifically comprises the following steps:
step C1: aiming at the named entity recognition task of the winning web page, dividing words in the data into two types;
wherein the first class represents words that are not related to the entity, denoted by the label "O"; the second class represents words associated with an entity, and the labels of the words in this class consist of three parts;
step C2: the hidden-state sequence $h_j^c$ obtained in step B, which represents the semantic information of the text, is input to the LSTM model of the decoding layer, which calculates the output state of each character under the influence of its context characters. The specific calculation formulas are as follows:

$$i_j = \sigma \left( W^{i\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^i \right)$$

$$f_j = \sigma \left( W^{f\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^f \right)$$

$$o_j = \sigma \left( W^{o\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^o \right)$$

$$\tilde{c}_j = \tanh \left( W^{u\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^u \right)$$

$$c_j = f_j \odot c_{j-1} + i_j \odot \tilde{c}_j$$

$$h_j = o_j \odot \tanh(c_j)$$

$$y_j = h_j$$

wherein $y_j$ is the label vector of the j-th character;
step C3: the label vector $y_j$ is input to a Softmax classifier, which normalizes it and calculates the probability that each character in the text is marked with each kind of label. The specific formulas are as follows:

$$o_j = W_y \, y_j + b_y$$

$$p(t_j = k \mid x) = \frac{\exp(o_{j,k})}{\sum_{k'=1}^{N_t} \exp(o_{j,k'})}$$

wherein $W_y$ is a weight matrix, $b_y$ is a bias term, and $N_t$ is the number of kinds of labels;
step C4: the negative log-likelihood is used as the loss function, and the model parameters are updated by back-propagation through a stochastic gradient descent optimization method so as to minimize the loss function and train the model. The specific calculation formula is as follows:

$$L(\Theta) = -\sum_{x \in D} \sum_{j=1}^{L_x} I(t_j) \log p(t_j \mid x; \Theta)$$

wherein D represents the training set, $L_x$ is the length of the sentence x, $t_j$ is the gold label of the j-th character in x, $p(t_j \mid x; \Theta)$ is its normalized probability, and $\Theta$ represents the model parameters; $I(\cdot)$ is a selection function that distinguishes the loss on the label 'O' from the loss on labels that indicate an entity; the specific calculation formula is as follows:

$$I(t) = \begin{cases} \lambda, & t = \text{'O'} \\ 1, & \text{otherwise} \end{cases}$$

where $0 < \lambda < 1$ is a weighting coefficient.
2. the method for extracting the named entity of the winning web page based on the LSTM as recited in claim 1, wherein the step B2 specifically includes:
step B21: constructing a word list D with a trie according to the large-scale corpus;
step B22: initializing an empty matching word set P of the winning bid text;
step B23: taking the first character of the winning bid text as the current character, and executing step B24;
step B24: match in the word list D every word $w_{b,e}$ that takes the current character as its first character, and add it to the set P;
wherein b represents the position in the sentence of the first character of the word, and e represents the position of its last character;
step B25, taking the next character of the current character as the current character, and iteratively executing the step B24 until the last character of the winning bid text is finished;
step B26: after the traversal is finished, convert each word $w_{b,e}$ in the set P to a word vector $x_{b,e}^w$; the calculation formula is as follows:

$$x_{b,e}^w = e^w(w_{b,e})$$

wherein $e^w$ is the word-embedding lookup table.
3. The LSTM-based bid-winning web page named entity extraction method of claim 1, wherein: the named entities comprise a bid-winning institution, a region where the bid-winning institution is located, a bid-winning amount, a bid-winning institution contact person, a bid-winning project name and a bid-winning time.
4. The method for extracting named entities from a winning web page based on LSTM as recited in claim 3, wherein step D is specifically:
step D1: c, carrying out regular correction processing on the marked data obtained in the step C;
step D2: and formatting the corrected data.
5. The LSTM-based extraction method of named entities of a winning web page according to claim 4, wherein the step D1 specifically includes:
step D11: for the winning bid amount, a regular expression is used to judge whether the entity contains Arabic numerals or Chinese capital numerals; if not, it is judged not to be a winning bid amount and is discarded;
step D12: for the winning bid time, entities that do not conform to a date composition pattern are discarded;
step D13: for the project name, since the string of a project-name entity is generally long and an entity of only two or three characters basically does not occur, recognized project-name entities whose string length is less than 4 are discarded;
step D14: and only the named entity with the longest character string length is reserved when one label data appears for a plurality of times in the same category.
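The correction rules D11-D14 amount to a small filtering pass; the sketch below is illustrative, and the regular expressions, category labels, and the length-4 threshold check are assumptions chosen to mirror the claim text rather than the patent's exact patterns.

```python
# Illustrative sketch of correction steps D11-D14; regexes and category
# names ("amount", "time", "project") are assumptions, not the patent's.
import re

HAS_NUMERAL = re.compile(r"[0-9零一二三四五六七八九十壹贰叁肆伍陆柒捌玖拾佰仟万亿]")
DATE_LIKE = re.compile(r"\d{4}\D+\d{1,2}\D+\d{1,2}")

def correct(entities):
    """entities: list of (category, text) pairs produced by step C."""
    kept = []
    for cat, text in entities:
        if cat == "amount" and not HAS_NUMERAL.search(text):
            continue                # D11: an amount must contain a numeral
        if cat == "time" and not DATE_LIKE.search(text):
            continue                # D12: a time must fit a date pattern
        if cat == "project" and len(text) < 4:
            continue                # D13: drop project names shorter than 4
        kept.append((cat, text))
    best = {}                       # D14: keep the longest string per category
    for cat, text in kept:
        if len(text) > len(best.get(cat, "")):
            best[cat] = text
    return best

print(correct([("amount", "金额未知"), ("amount", "100万元"),
               ("project", "路桥"), ("time", "2019年1月7日"),
               ("project", "某市政道路改造工程")]))
# -> {'amount': '100万元', 'time': '2019年1月7日', 'project': '某市政道路改造工程'}
```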
6. The method for extracting named entities from a winning web page based on LSTM of claim 4, wherein in step D2, the named entities are formatted, specifically comprising the following steps:
step D21: for the winning amount, judging whether the entity contains magnitude units such as bai (hundred), qian (thousand), wan (ten thousand), or yi (hundred million), or currency units such as U.S. dollars or Japanese yen, and if so, converting the units;
step D22: for the winning time, converting it into the date format YYYY-MM-DD.
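The step-D2 formatting can be sketched as below. The unit table and regular expressions are assumptions (the claims name the units but not the exact conversion logic), and foreign-currency conversion is left out since it would require an exchange-rate source the patent does not specify.

```python
# Sketch of formatting steps D21-D22 under assumed patterns; not the
# patent's exact implementation.
import re

MAGNITUDE = {"百": 100, "千": 1_000, "万": 10_000, "亿": 100_000_000}

def normalize_amount(text):
    """D21: expand a Chinese magnitude unit into a plain number."""
    m = re.search(r"([\d.]+)([百千万亿]?)", text)
    if not m:
        return None
    return float(m.group(1)) * MAGNITUDE.get(m.group(2), 1)

def normalize_date(text):
    """D22: rewrite a recognized date into the YYYY-MM-DD format."""
    m = re.search(r"(\d{4})\D+(\d{1,2})\D+(\d{1,2})", text)
    if not m:
        return None
    return "%04d-%02d-%02d" % tuple(int(g) for g in m.groups())

print(normalize_amount("中标金额：350万元"))  # 3500000.0
print(normalize_date("2019年1月7日"))         # 2019-01-07
```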
CN201910013185.2A 2019-01-07 2019-01-07 LSTM-based winning bid web page named entity extraction method Active CN109753660B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910013185.2A CN109753660B (en) 2019-01-07 2019-01-07 LSTM-based winning bid web page named entity extraction method


Publications (2)

Publication Number Publication Date
CN109753660A CN109753660A (en) 2019-05-14
CN109753660B true CN109753660B (en) 2023-06-13

Family

ID=66404567


Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334300A (en) * 2019-07-10 2019-10-15 哈尔滨工业大学 Text aid reading method towards the analysis of public opinion
CN110738182A (en) * 2019-10-21 2020-01-31 四川隧唐科技股份有限公司 LSTM model unit training method and device for high-precision identification of bid amount
CN112017016A (en) * 2019-10-29 2020-12-01 河南拓普计算机网络工程有限公司 Method for cleaning bid amount of bid-attracting bulletin
CN110738319A (en) * 2019-11-11 2020-01-31 四川隧唐科技股份有限公司 LSTM model unit training method and device for recognizing bid-winning units based on CRF
CN111078978B (en) * 2019-11-29 2024-02-27 上海观安信息技术股份有限公司 Network credit website entity identification method and system based on website text content
CN111241832B (en) * 2020-01-15 2023-08-15 北京百度网讯科技有限公司 Core entity labeling method and device and electronic equipment
CN111738002A (en) * 2020-05-26 2020-10-02 北京信息科技大学 Ancient text field named entity identification method and system based on Lattice LSTM
CN111737969B (en) * 2020-07-27 2020-12-08 北森云计算有限公司 Resume parsing method and system based on deep learning
CN112990845A (en) * 2021-01-04 2021-06-18 江苏省测绘地理信息局信息中心 Intelligent acquisition method for mapping market project
CN112989807B (en) * 2021-03-11 2021-11-23 重庆理工大学 Long digital entity extraction method based on continuous digital compression coding
CN112948588B (en) * 2021-05-11 2021-07-30 中国人民解放军国防科技大学 Chinese text classification method for quick information editing

Citations (4)

Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN107832400A (en) * 2017-11-01 2018-03-23 山东大学 A kind of method that location-based LSTM and CNN conjunctive models carry out relation classification
CN108416058A (en) * 2018-03-22 2018-08-17 北京理工大学 A kind of Relation extraction method based on the enhancing of Bi-LSTM input informations
CN108509423A (en) * 2018-04-04 2018-09-07 福州大学 A kind of acceptance of the bid webpage name entity abstracting method based on second order HMM

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US8370128B2 (en) * 2008-09-30 2013-02-05 Xerox Corporation Semantically-driven extraction of relations between named entities


Non-Patent Citations (1)

Title
Tang Min. Research on Chinese Entity Relation Extraction Methods Based on Deep Learning. Wanfang Data Dissertation Database. 2018. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant