CN109753660A

CN109753660A - An LSTM-based method for extracting named entities from bid-winning webpages

Info

Publication number: CN109753660A
Application number: CN201910013185.2A
Authority: CN
Inventors: 陈羽中; 林剑; 郭昆
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2019-01-07
Filing date: 2019-01-07
Publication date: 2019-05-14
Anticipated expiration: 2039-01-07
Also published as: CN109753660B

Abstract

The present invention relates to a named entity recognition method for bid-winning data, comprising the following steps: cleaning the text data of a bid-winning webpage to obtain the bid-winning text; using a Lattice-LSTM as the encoding layer to obtain the semantic features of the text; using an LSTM as the decoding layer to label each character and mark the entity information in the sentence sequence; applying rule-based correction and formatting; and finally outputting the recognized named entities of the bid-winning webpage. Being based on a Lattice-LSTM-LSTM model, the invention can effectively recognize the named entities in the winning-bid detail pages of bidding websites.

Description

An LSTM-based method for extracting named entities from bid-winning webpages

Technical field

The present invention relates to the field of named entity recognition, and in particular to an LSTM-based method for extracting named entities from bid-winning webpages.

Background art

Named entity recognition is a fundamental task in natural language processing. Its purpose is to identify named entities such as person names, place names, and organization names in a corpus. Because the number of such entities keeps growing, they usually cannot be listed exhaustively in a dictionary, and each type follows its own formation patterns. Their recognition is therefore usually handled separately from lexical-analysis tasks such as Chinese word segmentation, and is called named entity recognition.

As a fundamental task in natural language processing, named entity recognition has attracted close attention from many experts and scholars, who have proposed a number of optimized algorithms and models. Some scholars proposed a named entity recognition algorithm based on cascaded HMM models, which first recognizes person names and place names and then uses them as features for higher-level organization name recognition. Others proposed a Chinese named entity recognition algorithm based on conditional random fields, which achieves good results using characters, boundaries, parts of speech, and entity dictionaries as features. Others proposed a bootstrapping-based method, which expands a seed vocabulary to address the shortage of manually labeled data. Others proposed a named entity recognition algorithm based on a BLSTM neural network, which no longer depends directly on hand-crafted features and domain knowledge but instead uses context-based word vectors and character-based word vectors; the former express the contextual information of a named entity, while the latter express the prefix, suffix, and domain information that make up the named entity. Others proposed a named entity recognition algorithm based on the BLSTM-CRF model: when a sentence is labeled as a sequence, the labels of its words are not independent, so the label information of the preceding words is taken into account when tagging the current word, and a CRF layer replaces the softmax layer to produce the final prediction for each word. Still others proposed a deep neural network model based on a stacked autoencoder classifier, which solves the conversion from Chinese text sequences to model input vectors and gives vectorized forward and back-propagation formulas that are convenient for engineering implementation.

Most current named entity recognition algorithms recognize only person names, place names, and organization names without subdividing them further, and they perform poorly on long entities.

Summary of the invention

In view of this, the purpose of the present invention is to provide an LSTM-based method for extracting named entities from bid-winning webpages, which can quickly and effectively recognize the named entities in the winning-bid detail pages of bidding websites.

To achieve the above object, the present invention adopts the following technical scheme:

An LSTM-based method for extracting named entities from bid-winning webpages, specifically comprising the following steps:

Step A: clean the text data of the bid-winning webpage to be extracted to obtain the bid-winning text;

Step B: use a Lattice-LSTM model as the encoding layer, with the bid-winning text as its input, to obtain the semantic features of the bid-winning text;

Step C: use an LSTM model as the decoding layer, with the obtained semantic features as its input, to label each character in the bid-winning text;

Step D: apply rule-based correction and formatting to the labeled bid-winning text;

Step E: output the recognized named entities.
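The text does not say how the cleaning in step A is carried out. As a minimal sketch, assuming the input is raw HTML and that "cleaning" means dropping markup, scripts, and styles and collapsing whitespace, the step could look as follows. The names `clean_page` and `_TextExtractor` are hypothetical and only illustrate the idea.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_page(html: str) -> str:
    """Return the visible text of a page with whitespace collapsed (assumed cleaning rule)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
```

A real bidding site would probably also need site-specific rules (navigation bars, footers), which are outside what the text specifies.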

Further, step B specifically comprises:

Step B1: convert the characters in the bid-winning text into character vectors;

The j-th character c_j in the bid-winning text is converted into its character vector by looking it up in e^c, where e^c denotes the character vector mapping table.

Step B2: convert the words in the bid-winning text into word vectors;

Step B3: input the word vectors into the Lattice-LSTM model, and use the Lattice-LSTM model to obtain the semantic features of the bid-winning text.

Further, step B2 specifically comprises:

Step B21: build a vocabulary D from a large-scale corpus, stored as a trie;

Step B22: initialize an empty set P of matched words for the bid-winning text;

Step B23: start the traversal with the first character of the bid-winning text as the current character, and execute step B24;

Step B24: add to the set P every word in the vocabulary D that is matched in the text with the current character as its first character;

Here, b denotes the position in the sentence of the first character of a matched word, and e denotes the position in the sentence of its last character;

Step B25: take the character after the current character as the new current character and repeat step B24, until the last character of the bid-winning text has been processed;

Step B26: after the traversal, convert each matched word in the set P into a word vector by looking it up in e^w, where e^w is the word vector mapping table.
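Steps B21 to B25 amount to scanning the text once per start position and walking a trie from there. The following is a minimal sketch under stated assumptions: the class `Trie`, the function `match_words`, 0-based positions, and the `min_len` cutoff are all illustrative choices, not details given in the text.

```python
from typing import Dict, List, Tuple


class Trie:
    """Minimal character trie holding the vocabulary D (step B21)."""

    def __init__(self) -> None:
        self.children: Dict[str, "Trie"] = {}
        self.is_word = False

    def insert(self, word: str) -> None:
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True


def match_words(text: str, trie: Trie, min_len: int = 2) -> List[Tuple[int, int, str]]:
    """Return (b, e, word) for every vocabulary word found in text.

    b and e are the 0-based positions of the word's first and last character.
    min_len=2 is an assumption: single characters are already covered by the
    character sequence, so they are skipped here.
    """
    matches = []                          # the set P (step B22)
    for b in range(len(text)):            # steps B23/B25: each character in turn
        node = trie
        for e in range(b, len(text)):     # step B24: walk the trie from position b
            node = node.children.get(text[e])
            if node is None:
                break
            if node.is_word and e - b + 1 >= min_len:
                matches.append((b, e, text[b:e + 1]))
    return matches
```

Step B26 would then be a plain table lookup, for example `[e_w[word] for _, _, word in matches]` with `e_w` a hypothetical dictionary from words to vectors.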

Further, step B3 is specifically as follows:

For each sentence in the text, the character vector sequence obtained in step B1 and the word vector sequence obtained in step B2 are input in order into the Lattice-LSTM model, which outputs a sequence of vectors representing the semantic information of each character in its context. The specific calculation formula is as follows:

The quantities in the formula are: the character vector of the j-th character in the sentence; the word vector of the word in the sentence that ends with the j-th character; the output at time j; the weight matrix and the bias term of the word-level LSTM; the forget gate, input gate, candidate memory vector, and memory vector of the word-level LSTM at time j; the weight matrix and the bias term of the character-level LSTM; the input gate, candidate memory vector, memory vector, and output gate of the character-level LSTM at time j; and the weights used when calculating the character-level memory vector.
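For orientation, the widely published Lattice-LSTM formulation (Zhang and Yang, 2018) that this step appears to follow can be written as below. All symbols here are assumptions taken from that formulation, not from the present text: $x_j^c$ and $x_{b,e}^w$ are the character and word vectors, $h_j^c$ the output at time $j$, $W^c, b^c$ and $W^w, b^w$ the character-level and word-level parameters, $W^l, b^l$ the parameters of an extra gate, and $\alpha$ the normalized weights.

```latex
% Character-level gates and candidate memory at time j
\begin{bmatrix} i_j^c \\ o_j^c \\ f_j^c \\ \tilde{c}_j^c \end{bmatrix}
= \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix}
\left( W^{c\top} \begin{bmatrix} x_j^c \\ h_{j-1}^c \end{bmatrix} + b^c \right)

% Word-level gates and memory for a matched word spanning positions b..e
\begin{bmatrix} i_{b,e}^w \\ f_{b,e}^w \\ \tilde{c}_{b,e}^w \end{bmatrix}
= \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix}
\left( W^{w\top} \begin{bmatrix} x_{b,e}^w \\ h_b^c \end{bmatrix} + b^w \right),
\qquad
c_{b,e}^w = f_{b,e}^w \odot c_b^c + i_{b,e}^w \odot \tilde{c}_{b,e}^w

% Extra gate controlling how much of each word memory flows into character e
i_{b,e}^c = \sigma\!\left( W^{l\top} \begin{bmatrix} x_e^c \\ c_{b,e}^w \end{bmatrix} + b^l \right)

% Character memory as a normalized mixture of word memories and the candidate
c_j^c = \sum_{b \in \{ b' \mid w_{b',j}^d \in D \}} \alpha_{b,j}^c \odot c_{b,j}^w
        + \alpha_j^c \odot \tilde{c}_j^c,
\qquad
\alpha_{b,j}^c = \frac{\exp(i_{b,j}^c)}{\exp(i_j^c) + \sum_{b'} \exp(i_{b',j}^c)},
\quad
\alpha_j^c = \frac{\exp(i_j^c)}{\exp(i_j^c) + \sum_{b'} \exp(i_{b',j}^c)}

% Output at time j
h_j^c = o_j^c \odot \tanh(c_j^c)
```

In that formulation, when no vocabulary word ends at position $j$, the memory update falls back to the ordinary LSTM form $c_j^c = f_j^c \odot c_{j-1}^c + i_j^c \odot \tilde{c}_j^c$.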

Further, step C specifically comprises:

Step C1: for the named entity recognition task on bid-winning webpages, the characters in the data are divided into two classes;

The first class represents characters unrelated to any entity and is denoted by the label "O"; the second class represents characters related to an entity, and the label of such characters consists of three parts:

Step C2: the hidden state information obtained in step B, which represents the semantic information of the text, is input into the LSTM model of the decoding layer, and the output state of each character under the influence of its context characters is calculated. The specific calculation formula is as follows:

The output of this calculation is the label vector;

Step C3: the label vector is input into a Softmax classifier and normalized, giving the probability that each character in the text is assigned each label. The specific formula is as follows:

Here W_y is the weight matrix, b_y is the bias term, and N_t is the number of label types;
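Step C3 can be illustrated with a small numeric sketch. It assumes the usual form of this layer: an affine map with W_y and b_y applied to the label vector, followed by a softmax over the N_t label types. The function names `label_scores` and `softmax` and the list-based representation are illustrative only.

```python
import math
from typing import List, Sequence


def label_scores(W_y: Sequence[Sequence[float]],
                 b_y: Sequence[float],
                 label_vector: Sequence[float]) -> List[float]:
    """Affine map W_y * label_vector + b_y, one score per label type (assumed form)."""
    return [sum(w * v for w, v in zip(row, label_vector)) + b
            for row, b in zip(W_y, b_y)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize scores into probabilities; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


# Example with N_t = 3 label types and a 2-dimensional label vector.
probs = softmax(label_scores([[1, 0], [0, 1], [1, 1]], [0, 0, 1], [2, 3]))
```

The predicted label for a character is then the index with the highest probability.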

Step C4: the log-likelihood function is used as the loss function; with stochastic gradient descent as the optimization method, the model parameters are updated iteratively by back-propagation, and the model is trained by minimizing the loss function. The specific calculation formula is as follows:

Here D denotes the size of the training set and L_j the length of sentence x_j; the remaining terms are the label of character t in sentence x_j and the corresponding normalized probability; Θ denotes the model parameters; and I(O) is a selection function that distinguishes the loss of the label 'O' from the loss of the labels that indicate entities. The specific calculation formula is as follows:
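One way to read the role of I(O) is as a switch between two weights in a negative log-likelihood. The sketch below assumes that entity labels are up-weighted by a factor `alpha` relative to 'O', as in bias-weighted tagging objectives; the factor, the function name `biased_nll`, and the nested-list data layout are assumptions, since the text only states that I(O) separates the two kinds of loss.

```python
import math
from typing import Sequence


def biased_nll(probs: Sequence[Sequence[Sequence[float]]],
               gold: Sequence[Sequence[int]],
               o_index: int = 0,
               alpha: float = 10.0) -> float:
    """Negative log-likelihood with entity labels weighted by alpha.

    probs[j][t][k] is the normalized probability of label k for character t
    of sentence j; gold[j][t] is the index of the correct label.
    alpha is a hypothetical hyperparameter, not a value given in the text.
    """
    loss = 0.0
    for sent_probs, sent_gold in zip(probs, gold):
        for p_t, y_t in zip(sent_probs, sent_gold):
            is_o = 1.0 if y_t == o_index else 0.0      # the selection function I(O)
            weight = is_o + alpha * (1.0 - is_o)
            loss -= weight * math.log(p_t[y_t])
    return loss
```

Minimizing this quantity with stochastic gradient descent corresponds to maximizing the weighted log-likelihood; with `alpha = 1.0` it reduces to the ordinary negative log-likelihood.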

Further, the named entities include the tendering organization, the winning organization, the location of the tendering organization, the winning bid amount, the contact person of the tendering organization, the name of the tendered project, and the bid-winning time.

Further, step D specifically comprises:

Step D1: apply rule-based correction to the labeled data obtained in step C;

Step D2: format the corrected data.

Further, step D1 specifically comprises:

Step D11: for the winning bid amount, use a regular expression to check whether the entity contains Arabic numerals or Chinese numerals; if it contains neither, it is not regarded as a winning bid amount and is discarded.

Step D12: for the bid-winning time, discard any entity that is not in a date format.

Step D13: for the project name, since project name entities are usually long strings and essentially never consist of only two or three characters, discard recognized project name entities whose string length is less than 4.

Step D14: when several entities of the same category occur in one piece of bid-winning data, keep the named entity with the longest string.
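The four correction rules can be sketched as a single filter over the candidate entities of one page. The category keys (`"amount"`, `"time"`, `"project_name"`), the exact regular expressions, and the function name `correct` are assumptions made for illustration; the text specifies only the rules themselves.

```python
import re
from typing import Dict, List

# D11: at least one Arabic or Chinese numeral (character set is an assumption).
HAS_NUMBER = re.compile(r"[0-9零〇一二三四五六七八九十百千万亿两壹贰叁肆伍陆柒捌玖拾佰仟萬億]")
# D12: a year-month-day pattern with common separators (assumed date formats).
LOOKS_LIKE_DATE = re.compile(r"\d{4}\s*[-/.年]\s*\d{1,2}\s*[-/.月]\s*\d{1,2}\s*日?")


def correct(entities: Dict[str, List[str]]) -> Dict[str, str]:
    """Apply rules D11-D14 and keep at most one entity per category."""
    kept = {}
    for label, candidates in entities.items():
        if label == "amount":                       # D11
            candidates = [c for c in candidates if HAS_NUMBER.search(c)]
        elif label == "time":                       # D12
            candidates = [c for c in candidates if LOOKS_LIKE_DATE.search(c)]
        elif label == "project_name":               # D13
            candidates = [c for c in candidates if len(c) >= 4]
        if candidates:                              # D14: longest string wins
            kept[label] = max(candidates, key=len)
    return kept
```

Categories whose candidates are all rejected are simply absent from the result, which leaves the decision on how to report a missing field to the caller.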

Further, in step D2, the named entities are formatted, specifically comprising the following steps:

Step D21: for the winning bid amount, check whether the entity contains one of the units "hundred", "thousand", "ten thousand", or "hundred million" (each in either of its written forms), "dollar", or "yen"; if so, perform unit conversion;

Step D22: for the bid-winning time, convert it to the date format YYYY-MM-DD.
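A minimal sketch of the two formatting steps follows. The Chinese unit characters, the currency words, the default currency "CNY", and the function names are assumptions (the translated text lists the units only in English). No exchange-rate conversion is attempted, because the text gives no rates; the currency is only reported.

```python
import re
from datetime import date
from decimal import Decimal
from typing import Optional, Tuple

# Assumed written forms of the magnitude units named in step D21.
UNIT_FACTORS = {"百": 100, "佰": 100, "千": 1000, "仟": 1000,
                "万": 10_000, "萬": 10_000, "亿": 100_000_000, "億": 100_000_000}
CURRENCIES = {"美元": "USD", "日元": "JPY"}   # assumed currency words
AMOUNT = re.compile(r"(\d+(?:,\d{3})*(?:\.\d+)?)\s*([百佰千仟万萬亿億]*)")
DATE = re.compile(r"(\d{4})\s*[-/.年]\s*(\d{1,2})\s*[-/.月]\s*(\d{1,2})")


def normalize_amount(text: str) -> Optional[Tuple[Decimal, str]]:
    """Step D21: expand magnitude units, e.g. '3.5万元' becomes 35000."""
    m = AMOUNT.search(text)
    if m is None:
        return None
    value = Decimal(m.group(1).replace(",", ""))
    for unit in m.group(2):
        value *= UNIT_FACTORS[unit]
    currency = next((code for word, code in CURRENCIES.items() if word in text), "CNY")
    return value, currency


def normalize_date(text: str) -> Optional[str]:
    """Step D22: return the date as YYYY-MM-DD, or None if it is not a valid date."""
    m = DATE.search(text)
    if m is None:
        return None
    year, month, day = (int(g) for g in m.groups())
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None
```

`Decimal` is used instead of `float` so that amounts such as 1,200.50 keep their exact value. Amounts written purely in Chinese numerals are not handled by this sketch.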

Compared with the prior art, the present invention has the following beneficial effects:

Being based on a Lattice-LSTM-LSTM model, the invention can effectively recognize the named entities in the winning-bid detail pages of bidding websites, and it recognizes long entities well.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention.

Specific embodiment

The present invention will be further described with reference to the accompanying drawings and embodiments.

Referring to Fig. 1, the present invention provides an LSTM-based method for extracting named entities from bid-winning webpages, specifically comprising the following steps:

Step B1: convert the characters in the bid-winning text into character vectors;

Here, e^c denotes the character vector mapping table.

Step B2: convert the words in the bid-winning text into word vectors;

Here, e^w is the word vector mapping table.

The quantities in the formula are: the character vector of the j-th character in the sentence; the word vector of the word in the sentence that ends with the j-th character; the output at time j; the weight matrix and the bias term of the word-level LSTM; the forget gate, input gate, candidate memory vector, and memory vector of the word-level LSTM at time j; the weight matrix and the bias term of the character-level LSTM; the input gate, candidate memory vector, memory vector, and output gate of the character-level LSTM at time j; and the weights used when calculating the character-level memory vector.

The output of the decoding layer is the label vector;

Step D2: format the corrected data.

Step E: output the recognized named entities: the tendering organization, the winning organization, the location of the tendering organization, the winning bid amount, the contact person of the tendering organization, the name of the tendered project, and the bid-winning time.

The foregoing is merely a preferred embodiment of the present invention; all equivalent changes and modifications made within the scope of the claims of the present invention shall fall within the scope of the present invention.

Claims

1. An LSTM-based method for extracting named entities from bid-winning webpages, characterized in that it specifically comprises the following steps:

Step B: use a Lattice-LSTM model as the encoding layer, with the bid-winning text as its input, to obtain the semantic features of the bid-winning text;

Step C: use an LSTM model as the decoding layer, with the obtained semantic features as its input, to label each character in the bid-winning text;

Step E: output the recognized named entities.

2. The LSTM-based method for extracting named entities from bid-winning webpages according to claim 1, characterized in that step B specifically comprises:

Step B1: convert the characters in the bid-winning text into character vectors;

Here, e^c denotes the character vector mapping table;

Step B2: convert the words in the bid-winning text into word vectors;

Step B3: input the word vectors into the Lattice-LSTM model, and use the Lattice-LSTM model to obtain the semantic features of the bid-winning text.

3. The LSTM-based method for extracting named entities from bid-winning webpages according to claim 2, characterized in that step B2 specifically comprises:

Step B25: take the character after the current character as the new current character and repeat step B24, until the last character of the bid-winning text has been processed;

Here, e^w is the word vector mapping table.

4. The LSTM-based method for extracting named entities from bid-winning webpages according to claim 2, characterized in that step B3 is specifically as follows:

For each sentence in the text, the character vector sequence obtained in step B1 and the word vector sequence obtained in step B2 are input in order into the Lattice-LSTM model, which outputs a sequence of vectors representing the semantic information of each character in its context. The specific calculation formula is as follows:

The quantities in the formula are: the character vector of the j-th character in the sentence; the word vector of the word in the sentence that ends with the j-th character; the output at time j; the weight matrix and the bias term of the word-level LSTM; the forget gate, input gate, candidate memory vector, and memory vector of the word-level LSTM at time j; the weight matrix and the bias term of the character-level LSTM; the input gate, candidate memory vector, memory vector, and output gate of the character-level LSTM at time j; and the weights used when calculating the character-level memory vector.

5. The LSTM-based method for extracting named entities from bid-winning webpages according to claim 4, characterized in that step C specifically comprises:

The first class represents characters unrelated to any entity and is denoted by the label "O"; the second class represents characters related to an entity, and the label of such characters consists of three parts:

Step C2: the hidden state information obtained in step B, which represents the semantic information of the text, is input into the LSTM model of the decoding layer, and the output state of each character under the influence of its context characters is calculated. The specific calculation formula is as follows:

The output of this calculation is the label vector;

Step C3: the label vector is input into a Softmax classifier and normalized, giving the probability that each character in the text is assigned each label. The specific formula is as follows:

Here W_y is the weight matrix, b_y is the bias term, and N_t is the number of label types;

Step C4: the log-likelihood function is used as the loss function; with stochastic gradient descent as the optimization method, the model parameters are updated iteratively by back-propagation, and the model is trained by minimizing the loss function. The specific calculation formula is as follows:

Here D denotes the size of the training set and L_j the length of sentence x_j; the remaining terms are the label of character t in sentence x_j and the corresponding normalized probability; Θ denotes the model parameters; and I(O) is a selection function that distinguishes the loss of the label 'O' from the loss of the labels that indicate entities. The specific calculation formula is as follows:

6. The LSTM-based method for extracting named entities from bid-winning webpages according to claim 1, characterized in that the named entities include the tendering organization, the winning organization, the location of the tendering organization, the winning bid amount, the contact person of the tendering organization, the name of the tendered project, and the bid-winning time.

7. The LSTM-based method for extracting named entities from bid-winning webpages according to claim 6, characterized in that step D specifically comprises:

Step D2: format the corrected data.

8. The LSTM-based method for extracting named entities from bid-winning webpages according to claim 7, characterized in that step D1 specifically comprises:

Step D11: for the winning bid amount, use a regular expression to check whether the entity contains Arabic numerals or Chinese numerals; if it contains neither, it is not regarded as a winning bid amount and is discarded.

Step D13: for the project name, since project name entities are usually long strings and essentially never consist of only two or three characters, discard recognized project name entities whose string length is less than 4.

Step D14: when several entities of the same category occur in one piece of bid-winning data, keep the named entity with the longest string.

9. The LSTM-based method for extracting named entities from bid-winning webpages according to claim 1, characterized in that, in step D2, the named entities are formatted, specifically comprising the following steps: