CN109753660B - LSTM-based winning bid web page named entity extraction method - Google Patents

LSTM-based winning bid web page named entity extraction method

Info

Publication number
CN109753660B
Authority
CN
China
Prior art keywords
word
winning
lstm
text
bid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910013185.2A
Other languages
Chinese (zh)
Other versions
CN109753660A (en)
Inventor
陈羽中
林剑
郭昆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201910013185.2A priority Critical patent/CN109753660B/en
Publication of CN109753660A publication Critical patent/CN109753660A/en
Application granted granted Critical
Publication of CN109753660B publication Critical patent/CN109753660B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a named entity recognition method for winning bid data, which comprises the following steps: cleaning the text data of a winning bid web page to obtain the winning bid text; using a Lattice-LSTM as the coding layer to obtain the semantic features of the text data; using an LSTM as the decoding layer to tag each word and mark the entity information in the sentence sequence; performing rule correction and formatting; and finally outputting the recognized named entities of the winning bid web page. Based on the Lattice-LSTM-LSTM model, the invention can efficiently recognize the named entities in the winning bid item detail pages of bidding websites.

Description

LSTM-based winning bid web page named entity extraction method
Technical Field
The invention relates to the technical field of named entity recognition, in particular to a method for extracting named entities of a winning web page based on LSTM.
Background
Named entity recognition is one of the fundamental tasks of natural language processing. It aims to identify named entities such as person names, place names, and organization names in a corpus. Because the number of such named entities keeps growing, they usually cannot be exhaustively listed in a dictionary, and because their formation follows certain regular patterns, the recognition of these words is usually handled separately from lexical morphological processing tasks (e.g., Chinese word segmentation) and is known as named entity recognition.
As a basic task of natural language processing, named entity recognition has attracted close attention from many experts and scholars, and a number of optimization algorithms and models have been proposed. Some researchers proposed a named entity recognition algorithm based on a stacked HMM model, which first recognizes person and place names and then uses them as features for higher-level organization name recognition. Others proposed a Chinese named entity recognition algorithm based on conditional random fields, which achieves good results using characters, boundaries, parts of speech, and entity dictionaries as features. A bootstrapping-based method has been proposed that uses the bootstrapping technique to expand a seed word list and alleviate the shortage of manually annotated data. A named entity recognition algorithm based on a BLSTM neural network architecture no longer relies directly on hand-crafted features and domain knowledge, but uses context-based word vectors and character-based word vectors, the former expressing the context of named entities and the latter expressing the prefix, suffix, and domain information that make up the named entities. A named entity recognition algorithm based on the BLSTM-CRF model observes that, when sequence-labeling a sentence, the labels of adjacent words are not independent; it therefore considers the label information of the preceding words when tagging the current word, and replaces the softmax output layer with a CRF to produce the final prediction for each word. Finally, a deep neural network model based on stacked autoencoder classifiers solves the conversion from a Chinese text sequence to the model's input vector and proposes a vectorized forward-backward propagation formulation that is convenient for engineering implementation.
Most existing named entity recognition algorithms recognize only person, place, and organization names, do not subdivide these categories further, and perform poorly on long entities.
Disclosure of Invention
Therefore, the invention aims to provide an LSTM-based winning bid web page named entity extraction method that can rapidly and effectively identify named entities in the winning bid item detail pages of bidding websites.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a method for extracting a winning bid web page named entity based on LSTM specifically comprises the following steps:
step A: cleaning text data of the bid-winning web page to be extracted to obtain a bid-winning text;
and (B) step (B): taking a Lattice-LSTM model as a coding layer, and taking a winning bid text as input of the coding layer to obtain semantic information characteristics of the winning bid text;
step C: taking the LSTM model as a decoding layer, and taking the semantic information characteristics of the obtained winning bid text as the input of the decoding layer, and marking each word in the winning bid text;
step D: performing rule correction and formatting treatment on the obtained marked winning bid text;
step E: and outputting the identified named entity.
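The five steps above can be sketched end-to-end as follows. This is a minimal illustration, and every function body, label name, and regular expression in it is a hypothetical stand-in rather than the patent's implementation:

```python
import re

def clean_page(raw_html: str) -> str:
    # Step A (simplified): strip tags and collapse whitespace to get the bid text.
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return re.sub(r"\s+", " ", text).strip()

def extract_entities(raw_html: str, tagger) -> list:
    # Steps B-C are delegated to `tagger`, a stand-in for the Lattice-LSTM
    # encoder plus LSTM decoder; it returns (entity_text, label) pairs.
    text = clean_page(raw_html)
    candidates = tagger(text)
    # Step D (one sample rule): drop "amount" entities containing no numeral.
    return [(e, lab) for e, lab in candidates
            if lab != "AMOUNT" or re.search(r"[0-9一二三四五六七八九十百千万亿壹贰叁]", e)]
```

A trained model would play the role of `tagger`; here any callable returning candidate (text, label) pairs fits the interface.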
Further, the step B specifically includes:
step B1: converting the characters in the winning bid text into character vectors;
wherein the j-th character $c_j$ in the winning bid text is converted to a character vector $x_j^c$; the calculation formula is as follows:

$$x_j^c = e^c(c_j)$$

wherein $e^c$ represents the character-embedding lookup table.
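As a concrete sketch of this lookup (the character vocabulary, the 4-dimensional size, and the random embeddings below are toy assumptions, not values from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)
char2id = {ch: i for i, ch in enumerate("中标金额公司")}  # toy character vocabulary
e_c = rng.normal(size=(len(char2id), 4))                  # e^c: embedding matrix

def char_vectors(sentence: str) -> np.ndarray:
    # x_j^c = e^c(c_j): one embedding row per character of the sentence
    return e_c[[char2id[ch] for ch in sentence]]

vecs = char_vectors("中标金额")   # shape (4, 4): 4 characters, 4-dim vectors
```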
Step B2: converting words in the winning bid text into word vectors;
step B3: and inputting the word vector into a Lattice-LSTM model, and obtaining semantic information characteristics of the winning bid text by using the Lattice-LSTM model.
Further, the step B2 specifically includes:
step B21: constructing a word list D with a trie according to the large-scale corpus;
step B22: initializing an empty matching word set P of the winning bid text;
step B23: taking the first character of the winning bid text as the current character, and executing step B24;
step B24: match in the word list D every word $w_{b,e}$ that takes the current character as its first character, and add it to the set P;
wherein b represents the position in the sentence of the first character of the word, and e represents the position of its last character;
step B25, taking the next character of the current character as the current character, and iteratively executing the step B24 until the last character of the winning bid text is finished;
step B26: after the traversal is finished, convert each word $w_{b,e}$ in the set P to a word vector $x_{b,e}^w$; the calculation formula is as follows:

$$x_{b,e}^w = e^w(w_{b,e})$$

wherein $e^w$ is the word-embedding lookup table.
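Steps B21–B26 can be illustrated as follows. A real implementation would walk a trie, but a set of lexicon words plus a maximum word length (an assumption here) yields the same match set P:

```python
def match_words(sentence: str, lexicon: set, max_len: int = 4) -> list:
    # Collect every lexicon word w_{b,e} (0-based inclusive indices) that
    # starts at position b; a trie would prune this inner loop, but a set
    # plus a maximum word length gives the same match set P.
    matches = []
    for b in range(len(sentence)):
        for e in range(b + 1, min(b + max_len, len(sentence))):
            word = sentence[b:e + 1]
            if word in lexicon:
                matches.append((b, e, word))
    return matches

P = match_words("中标金额为", {"中标", "中标金额", "金额"})
# -> [(0, 1, '中标'), (0, 3, '中标金额'), (2, 3, '金额')]
```

Note that single characters are never added to P; only multi-character lexicon words feed the word-level path of the Lattice-LSTM.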
Further, the step B3 specifically includes the following steps:
for each sentence in the text, sequentially inputting the word vector sequence obtained in the step B1
Figure BDA0001938106730000034
And the word vector sequence obtained in the step B2 +.>
Figure BDA0001938106730000035
In the Lattice-LSTM model, a vector representation sequence of semantic information of each word context is output, and a specific calculation formula is as follows:
Figure BDA0001938106730000036
Figure BDA0001938106730000037
Figure BDA0001938106730000038
Figure BDA0001938106730000039
Figure BDA00019381067300000310
Figure BDA00019381067300000311
/>
Figure BDA00019381067300000312
Figure BDA00019381067300000313
Figure BDA00019381067300000314
Figure BDA00019381067300000315
Figure BDA00019381067300000316
Figure BDA00019381067300000317
Figure BDA0001938106730000041
is the word vector of the j-th word in the sentence,/>
Figure BDA0001938106730000042
Is a word vector of words ending with the j-th word in the sentence,/for example>
Figure BDA0001938106730000043
The output of the moment j; />
Figure BDA0001938106730000044
Weight matrix for word level LSTM, +.>
Figure BDA0001938106730000045
Figure BDA0001938106730000046
Bias terms for word level LSTM; />
Figure BDA0001938106730000047
Is the forget gate of the word level LSTM at the moment j; />
Figure BDA0001938106730000048
Is the input gate of the word level LSTM at the moment j; />
Figure BDA0001938106730000049
Is a candidate memory vector of the word level LSTM at the moment j; />
Figure BDA00019381067300000410
Is the memory vector of the word level LSTM at the moment j;
Figure BDA00019381067300000411
weight matrix for character level LSTM, +.>
Figure BDA00019381067300000412
Bias terms that are character level LSTM; />
Figure BDA00019381067300000413
Is the input gate of the character level LSTM at the moment j; />
Figure BDA00019381067300000414
Is a candidate memory vector of the word level LSTM at the moment j; />
Figure BDA00019381067300000415
Is the memory vector of the word level LSTM at the moment j; />
Figure BDA00019381067300000416
Is the output gate of the word level LSTM at the moment j; />
Figure BDA00019381067300000417
Figure BDA00019381067300000418
Is to calculate->
Figure BDA00019381067300000419
Weight at that time.
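The heart of this computation is the final merge, where the gate activation of the character candidate and the gates of all matched word cells are normalized and used to mix their memory vectors. A minimal numpy sketch of that merge (toy dimensions and values; the element-wise softmax mirrors the normalized weights):

```python
import numpy as np

def lattice_merge(i_char, c_char_tilde, word_gates, word_cells):
    # Normalize the char-candidate gate i_j^c against the gates i_{b,j}^c of
    # all lexicon words ending at j (element-wise softmax), then mix the
    # corresponding memory vectors to obtain c_j^c.
    gates = np.stack([i_char] + word_gates)        # (1 + n_words, d)
    alphas = np.exp(gates) / np.exp(gates).sum(axis=0)
    cells = np.stack([c_char_tilde] + word_cells)  # candidate + word memories
    return (alphas * cells).sum(axis=0)            # c_j^c

d = 3
c_j = lattice_merge(np.zeros(d), np.ones(d),
                    [np.zeros(d)], [np.full(d, 3.0)])
# equal gate values -> equal weights 0.5 -> every component is (1 + 3) / 2
```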
Further, the step C specifically includes:
step C1: aiming at the named entity recognition task of the winning web page, dividing words in the data into two types;
wherein the first class represents words that are not related to any entity, denoted by the label "O"; the second class represents words associated with an entity, and the labels of the words in this class consist of three parts;
step C2: the hidden-state sequence $h_j^c$ obtained in step B, which represents the semantic information of the text, is input to the LSTM model of the decoding layer, which calculates the output state of each character under the influence of its context characters. The specific calculation formulas are as follows:

$$i_j = \sigma \left( W^{i\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^i \right)$$

$$f_j = \sigma \left( W^{f\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^f \right)$$

$$o_j = \sigma \left( W^{o\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^o \right)$$

$$\tilde{c}_j = \tanh \left( W^{u\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^u \right)$$

$$c_j = f_j \odot c_{j-1} + i_j \odot \tilde{c}_j$$

$$h_j = o_j \odot \tanh(c_j)$$

$$y_j = h_j$$

wherein $y_j$ is the label vector of the j-th character;
step C3: the label vector $y_j$ is input to a Softmax classifier, which normalizes it and calculates the probability that each character in the text is marked with each kind of label. The specific formulas are as follows:

$$o_j = W_y \, y_j + b_y$$

$$p(t_j = k \mid x) = \frac{\exp(o_{j,k})}{\sum_{k'=1}^{N_t} \exp(o_{j,k'})}$$

wherein $W_y$ is a weight matrix, $b_y$ is a bias term, and $N_t$ is the number of kinds of labels;
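A minimal sketch of this classification step (the 3-label identity weights and zero bias are toy assumptions):

```python
import numpy as np

def label_probs(y_j, W_y, b_y):
    # o = W_y y_j + b_y, then softmax over the N_t label scores
    o = W_y @ y_j + b_y
    p = np.exp(o - o.max())          # max-shift for numerical stability
    return p / p.sum()

W_y = np.eye(3)                      # toy weights for N_t = 3 labels
p = label_probs(np.array([2.0, 0.0, 0.0]), W_y, np.zeros(3))
# p sums to 1 and puts the highest probability on label 0
```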
step C4: the negative log-likelihood is used as the loss function, and the model parameters are updated by back-propagation through a stochastic gradient descent optimization method so as to minimize the loss function and train the model. The specific calculation formula is as follows:

$$L(\Theta) = -\sum_{x \in D} \sum_{j=1}^{L_x} I(t_j) \log p(t_j \mid x; \Theta)$$

wherein D represents the training set, $L_x$ is the length of the sentence x, $t_j$ is the gold label of the j-th character in x, $p(t_j \mid x; \Theta)$ is its normalized probability, and $\Theta$ represents the model parameters; $I(\cdot)$ is a selection function that distinguishes the loss on the label 'O' from the loss on labels that indicate an entity; the specific calculation formula is as follows:

$$I(t) = \begin{cases} \lambda, & t = \text{'O'} \\ 1, & \text{otherwise} \end{cases}$$

where $0 < \lambda < 1$ is a weighting coefficient.
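A sketch of the weighted loss; the down-weighting value `lambda_o` is an assumed illustration, since the patent states only that the loss of the label 'O' is treated separately from entity labels:

```python
import numpy as np

def weighted_nll(gold_probs, gold_labels, lambda_o=0.5):
    # gold_probs[j]: model probability of the gold label of character j;
    # 'O' labels contribute lambda_o times their normal loss.
    loss = 0.0
    for p, t in zip(gold_probs, gold_labels):
        weight = lambda_o if t == "O" else 1.0
        loss -= weight * np.log(p)
    return loss

loss = weighted_nll([0.5, 0.5], ["O", "B-AMOUNT"])
# 0.5 * ln 2 for the 'O' character + 1.0 * ln 2 for the entity character
```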
further, the named entity comprises a bid-winning organization, a region where the bid-winning organization is located, a bid-winning amount, a bid-winning organization contact person and a bid-winning project name, and a bid-winning time.
Further, the step D specifically includes:
step D1: c, carrying out regular correction processing on the marked data obtained in the step C;
step D2: and formatting the corrected data.
Further, the step D1 specifically includes:
step D11: for the winning bid amount, a regular expression is used to judge whether the entity contains Arabic numerals or Chinese capital numerals; if not, the entity is judged not to be a winning bid amount and is discarded;
step D12: for the winning bid time, entities that do not conform to a date composition pattern are discarded;
step D13: for the project name, since the string of a project-name entity is generally long and an entity of only two or three characters basically does not occur, recognized project-name entities whose string length is less than 4 are discarded;
step D14: when entities of the same category are tagged multiple times, only the named entity with the longest string is retained.
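Rules D11, D13, and D14 can be sketched as follows (the regex, label names, and candidate format are illustrative assumptions, not the patent's exact patterns):

```python
import re

CN_NUM = r"[0-9〇一二三四五六七八九十百千万亿壹贰叁肆伍陆柒捌玖拾佰仟]"

def correct(entities):
    # entities: (text, label) candidates emitted by the decoding layer
    best = {}
    for text, label in entities:
        if label == "AMOUNT" and not re.search(CN_NUM, text):
            continue                        # D11: an amount must contain a numeral
        if label == "PROJECT" and len(text) < 4:
            continue                        # D13: drop project names under 4 chars
        if label not in best or len(text) > len(best[label]):
            best[label] = text              # D14: keep the longest per category
    return [(t, lab) for lab, t in best.items()]
```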
Further, in the step D2, the formatting process is performed on the named entity, which specifically includes the following steps:
step D21, judging whether the entity contains units of hundred, bai, qian, wan, yi, and U.S. dollars and Japanese, and if so, converting the units;
step D22: the winning bid time is converted to the date format YYYY-MM-DD.
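A sketch of steps D21–D22 (the unit table and date patterns are illustrative assumptions covering only a few common forms):

```python
import re

UNITS = {"百": 100, "千": 1_000, "万": 10_000, "亿": 100_000_000}

def normalize_amount(text: str) -> float:
    # "3.5万" -> 35000.0; unitless strings pass through unchanged
    m = re.match(r"([0-9.,]+)([百千万亿]?)", text)
    value = float(m.group(1).replace(",", ""))
    return value * UNITS.get(m.group(2), 1)

def normalize_date(text: str) -> str:
    # accepts 2019年1月7日, 2019-1-7, 2019.01.07 and emits YYYY-MM-DD
    m = re.match(r"(\d{4})[年./-](\d{1,2})[月./-](\d{1,2})", text)
    return f"{int(m.group(1)):04d}-{int(m.group(2)):02d}-{int(m.group(3)):02d}"
```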
Compared with the prior art, the invention has the following beneficial effects:
the invention is based on the Lattice-LSTM-LSTM model, can efficiently identify the named entity in the detail page of the bid-winning item of the bid-winning website, and can well identify the long entity.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples.
Referring to fig. 1, the invention provides a method for extracting named entities of a winning bid web page based on LSTM, which specifically includes the following steps:
step A: cleaning text data of the bid-winning web page to be extracted to obtain a bid-winning text;
and (B) step (B): taking a Lattice-LSTM model as a coding layer, and taking a winning bid text as input of the coding layer to obtain semantic information characteristics of the winning bid text;
step B1: converting the characters in the winning bid text into character vectors;
wherein the j-th character $c_j$ in the winning bid text is converted to a character vector $x_j^c$; the calculation formula is as follows:

$$x_j^c = e^c(c_j)$$

wherein $e^c$ represents the character-embedding lookup table.
Step B2: converting words in the winning bid text into word vectors;
step B21: constructing a word list D with a trie according to the large-scale corpus;
step B22: initializing an empty matching word set P of the winning bid text;
step B23: taking the first character of the winning bid text as the current character, and executing step B24;
step B24: match in the word list D every word $w_{b,e}$ that takes the current character as its first character, and add it to the set P;
wherein b represents the position in the sentence of the first character of the word, and e represents the position of its last character;
step B25, taking the next character of the current character as the current character, and iteratively executing the step B24 until the last character of the winning bid text is finished;
step B26: after the traversal is finished, convert each word $w_{b,e}$ in the set P to a word vector $x_{b,e}^w$; the calculation formula is as follows:

$$x_{b,e}^w = e^w(w_{b,e})$$

wherein $e^w$ is the word-embedding lookup table.
Step B3: and inputting the word vector into a Lattice-LSTM model, and obtaining semantic information characteristics of the winning bid text by using the Lattice-LSTM model.
For each sentence in the text, the character-vector sequence $x_1^c, \dots, x_n^c$ obtained in step B1 and the word-vector sequence $\{x_{b,e}^w\}$ obtained in step B2 are input in turn to the Lattice-LSTM model, which outputs a sequence of vector representations of the contextual semantic information of each character. The specific calculation formulas are as follows:

$$\begin{bmatrix} i_j^c \\ f_j^c \\ o_j^c \\ \tilde{c}_j^c \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} \left( W^{c\top} \begin{bmatrix} x_j^c \\ h_{j-1}^c \end{bmatrix} + b^c \right)$$

$$\begin{bmatrix} i_{b,e}^w \\ f_{b,e}^w \\ \tilde{c}_{b,e}^w \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix} \left( W^{w\top} \begin{bmatrix} x_{b,e}^w \\ h_b^c \end{bmatrix} + b^w \right)$$

$$c_{b,e}^w = f_{b,e}^w \odot c_b^c + i_{b,e}^w \odot \tilde{c}_{b,e}^w$$

$$i_{b,e}^c = \sigma \left( W^{l\top} \begin{bmatrix} x_e^c \\ c_{b,e}^w \end{bmatrix} + b^l \right)$$

$$\alpha_{b,j}^c = \frac{\exp(i_{b,j}^c)}{\exp(i_j^c) + \sum_{b' : w_{b',j} \in D} \exp(i_{b',j}^c)}$$

$$\alpha_j^c = \frac{\exp(i_j^c)}{\exp(i_j^c) + \sum_{b' : w_{b',j} \in D} \exp(i_{b',j}^c)}$$

$$c_j^c = \sum_{b : w_{b,j} \in D} \alpha_{b,j}^c \odot c_{b,j}^w + \alpha_j^c \odot \tilde{c}_j^c$$

$$h_j^c = o_j^c \odot \tanh(c_j^c)$$

wherein $x_j^c$ is the character vector of the j-th character in the sentence, $x_{b,e}^w$ is the word vector of a word in the sentence ending with the e-th character, and $h_j^c$ is the output at position j; $W^w$ and $b^w$ are the weight matrix and bias term of the word-level LSTM, whose forget gate, input gate, candidate memory vector, and memory vector are $f_{b,e}^w$, $i_{b,e}^w$, $\tilde{c}_{b,e}^w$, and $c_{b,e}^w$; $W^c$ and $b^c$ are the weight matrix and bias term of the character-level LSTM, whose input gate, forget gate, output gate, candidate memory vector, and memory vector at position j are $i_j^c$, $f_j^c$, $o_j^c$, $\tilde{c}_j^c$, and $c_j^c$; $\alpha_{b,j}^c$ and $\alpha_j^c$ are the normalized weights used when computing $c_j^c$.
Step C: taking the LSTM model as a decoding layer, and taking the semantic information characteristics of the obtained winning bid text as the input of the decoding layer, and marking each word in the winning bid text;
step C1: aiming at the named entity recognition task of the winning web page, dividing words in the data into two types;
wherein the first class represents words that are not related to any entity, denoted by the label "O"; the second class represents words associated with an entity, and the labels of the words in this class consist of three parts;
step C2: the hidden-state sequence $h_j^c$ obtained in step B, which represents the semantic information of the text, is input to the LSTM model of the decoding layer, which calculates the output state of each character under the influence of its context characters. The specific calculation formulas are as follows:

$$i_j = \sigma \left( W^{i\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^i \right)$$

$$f_j = \sigma \left( W^{f\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^f \right)$$

$$o_j = \sigma \left( W^{o\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^o \right)$$

$$\tilde{c}_j = \tanh \left( W^{u\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^u \right)$$

$$c_j = f_j \odot c_{j-1} + i_j \odot \tilde{c}_j$$

$$h_j = o_j \odot \tanh(c_j)$$

$$y_j = h_j$$

wherein $y_j$ is the label vector of the j-th character;
step C3: the label vector $y_j$ is input to a Softmax classifier, which normalizes it and calculates the probability that each character in the text is marked with each kind of label. The specific formulas are as follows:

$$o_j = W_y \, y_j + b_y$$

$$p(t_j = k \mid x) = \frac{\exp(o_{j,k})}{\sum_{k'=1}^{N_t} \exp(o_{j,k'})}$$

wherein $W_y$ is a weight matrix, $b_y$ is a bias term, and $N_t$ is the number of kinds of labels;
step C4: the negative log-likelihood is used as the loss function, and the model parameters are updated by back-propagation through a stochastic gradient descent optimization method so as to minimize the loss function and train the model. The specific calculation formula is as follows:

$$L(\Theta) = -\sum_{x \in D} \sum_{j=1}^{L_x} I(t_j) \log p(t_j \mid x; \Theta)$$

wherein D represents the training set, $L_x$ is the length of the sentence x, $t_j$ is the gold label of the j-th character in x, $p(t_j \mid x; \Theta)$ is its normalized probability, and $\Theta$ represents the model parameters; $I(\cdot)$ is a selection function that distinguishes the loss on the label 'O' from the loss on labels that indicate an entity; the specific calculation formula is as follows:

$$I(t) = \begin{cases} \lambda, & t = \text{'O'} \\ 1, & \text{otherwise} \end{cases}$$

where $0 < \lambda < 1$ is a weighting coefficient.
step D: performing rule correction and formatting treatment on the obtained marked winning bid text;
step D1: c, carrying out regular correction processing on the marked data obtained in the step C;
step D11: for the winning bid amount, a regular expression is used to judge whether the entity contains Arabic numerals or Chinese capital numerals; if not, the entity is judged not to be a winning bid amount and is discarded;
step D12: for the winning bid time, entities that do not conform to a date composition pattern are discarded;
step D13: for the project name, since the string of a project-name entity is generally long and an entity of only two or three characters basically does not occur, recognized project-name entities whose string length is less than 4 are discarded;
step D14: when entities of the same category are tagged multiple times, only the named entity with the longest string is retained.
Step D2: and formatting the corrected data.
step D21: for the winning bid amount, judging whether the entity contains Chinese magnitude units such as bai (hundred), qian (thousand), wan (ten thousand), or yi (hundred million), or currency units such as US dollars or Japanese yen, and if so, converting the unit;
step D22: the winning bid time is converted to the date format YYYY-MM-DD.
Step E: and outputting the identified bidding institutions, the regions where the bidding institutions are located, the bid amount, the bidding institution contacts, the bidding project names and the named entities of the bid time.
The foregoing description is only of the preferred embodiments of the invention, and all changes and modifications that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (6)

1. The method for extracting the named entity of the winning bid web page based on the LSTM is characterized by comprising the following steps of:
step A: cleaning text data of the bid-winning web page to be extracted to obtain a bid-winning text;
and (B) step (B): taking a Lattice-LSTM model as a coding layer, and taking a winning bid text as input of the coding layer to obtain semantic information characteristics of the winning bid text;
step C: taking the LSTM model as a decoding layer, and taking the semantic information characteristics of the obtained winning bid text as the input of the decoding layer, and marking each word in the winning bid text;
step D: performing rule correction and formatting treatment on the obtained marked winning bid text;
step E: outputting the identified named entity;
the step B specifically comprises the following steps:
step B1: converting the characters in the winning bid text into character vectors;
wherein the j-th character $c_j$ in the winning bid text is converted to a character vector $x_j^c$; the calculation formula is as follows:

$$x_j^c = e^c(c_j)$$

wherein $e^c$ represents the character-embedding lookup table;
step B2: converting words in the winning bid text into word vectors;
step B3: inputting the word vector into a Lattice-LSTM model, and obtaining semantic information characteristics of the winning bid text by using the Lattice-LSTM model;
the step B3 specifically comprises the following steps:
for each sentence in the text, sequentially inputting the word vector sequence obtained in the step B1
Figure FDA0004100483970000013
And the word vector sequence obtained in the step B2 +.>
Figure FDA0004100483970000014
In the Lattice-LSTM model, a vector representation sequence of semantic information of each word context is output, and a specific calculation formula is as follows:
Figure FDA0004100483970000015
Figure FDA0004100483970000016
Figure FDA0004100483970000017
Figure FDA0004100483970000018
Figure FDA0004100483970000021
Figure FDA0004100483970000022
Figure FDA0004100483970000023
Figure FDA0004100483970000024
Figure FDA0004100483970000025
Figure FDA0004100483970000026
/>
Figure FDA0004100483970000027
Figure FDA0004100483970000028
Figure FDA0004100483970000029
is the word vector of the j-th word in the sentence,/>
Figure FDA00041004839700000210
Is a word vector of words ending with the j-th word in the sentence,/for example>
Figure FDA00041004839700000226
The output of the moment j; />
Figure FDA00041004839700000211
Weight matrix for word level LSTM, +.>
Figure FDA00041004839700000212
Bias terms for word level LSTM; />
Figure FDA00041004839700000213
Is the forget gate of the word level LSTM at the moment j; />
Figure FDA00041004839700000214
Is the input gate of the word level LSTM at the moment j; />
Figure FDA00041004839700000215
Is a candidate memory vector of the word level LSTM at the moment j; />
Figure FDA00041004839700000216
Is the memory vector of the word level LSTM at the moment j;
Figure FDA00041004839700000217
weight matrix for character level LSTM, +.>
Figure FDA00041004839700000218
Bias terms that are character level LSTM; />
Figure FDA00041004839700000219
Is the input gate of the character level LSTM at the moment j; />
Figure FDA00041004839700000220
Is a candidate memory vector of the word level LSTM at the moment j; />
Figure FDA00041004839700000221
Is the memory vector of the word level LSTM at the moment j; />
Figure FDA00041004839700000222
Is the output gate of the word level LSTM at the moment j; />
Figure FDA00041004839700000223
Figure FDA00041004839700000224
Is to calculate->
Figure FDA00041004839700000225
Weight at time;
the step C specifically comprises the following steps:
step C1: aiming at the named entity recognition task of the winning web page, dividing words in the data into two types;
wherein the first class represents words that are not related to the entity, denoted by the label "O"; the second class represents words associated with an entity, and the labels of the words in this class consist of three parts;
step C2: the hidden-state sequence $h_j^c$ obtained in step B, which represents the semantic information of the text, is input to the LSTM model of the decoding layer, which calculates the output state of each character under the influence of its context characters. The specific calculation formulas are as follows:

$$i_j = \sigma \left( W^{i\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^i \right)$$

$$f_j = \sigma \left( W^{f\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^f \right)$$

$$o_j = \sigma \left( W^{o\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^o \right)$$

$$\tilde{c}_j = \tanh \left( W^{u\top} \begin{bmatrix} h_j^c \\ h_{j-1} \end{bmatrix} + b^u \right)$$

$$c_j = f_j \odot c_{j-1} + i_j \odot \tilde{c}_j$$

$$h_j = o_j \odot \tanh(c_j)$$

$$y_j = h_j$$

wherein $y_j$ is the label vector of the j-th character;
step C3: the label vector $y_j$ is input to a Softmax classifier, which normalizes it and calculates the probability that each character in the text is marked with each kind of label. The specific formulas are as follows:

$$o_j = W_y \, y_j + b_y$$

$$p(t_j = k \mid x) = \frac{\exp(o_{j,k})}{\sum_{k'=1}^{N_t} \exp(o_{j,k'})}$$

wherein $W_y$ is a weight matrix, $b_y$ is a bias term, and $N_t$ is the number of kinds of labels;
step C4: the negative log-likelihood is used as the loss function, and the model parameters are updated by back-propagation through a stochastic gradient descent optimization method so as to minimize the loss function and train the model. The specific calculation formula is as follows:

$$L(\Theta) = -\sum_{x \in D} \sum_{j=1}^{L_x} I(t_j) \log p(t_j \mid x; \Theta)$$

wherein D represents the training set, $L_x$ is the length of the sentence x, $t_j$ is the gold label of the j-th character in x, $p(t_j \mid x; \Theta)$ is its normalized probability, and $\Theta$ represents the model parameters; $I(\cdot)$ is a selection function that distinguishes the loss on the label 'O' from the loss on labels that indicate an entity; the specific calculation formula is as follows:

$$I(t) = \begin{cases} \lambda, & t = \text{'O'} \\ 1, & \text{otherwise} \end{cases}$$

where $0 < \lambda < 1$ is a weighting coefficient.
2. the method for extracting the named entity of the winning web page based on the LSTM as recited in claim 1, wherein the step B2 specifically includes:
step B21: constructing a word list D with a trie according to the large-scale corpus;
step B22: initializing an empty matching word set P of the winning bid text;
step B23: taking the first character of the winning bid text as the current character, and executing step B24;
step B24: match in the word list D every word $w_{b,e}$ that takes the current character as its first character, and add it to the set P;
wherein b represents the position in the sentence of the first character of the word, and e represents the position of its last character;
step B25, taking the next character of the current character as the current character, and iteratively executing the step B24 until the last character of the winning bid text is finished;
step B26: after the traversal is finished, convert each word $w_{b,e}$ in the set P to a word vector $x_{b,e}^w$; the calculation formula is as follows:

$$x_{b,e}^w = e^w(w_{b,e})$$

wherein $e^w$ is the word-embedding lookup table.
3. The LSTM-based bid-winning web page named entity extraction method of claim 1, wherein: the named entities comprise a bid-winning institution, a region where the bid-winning institution is located, a bid-winning amount, a bid-winning institution contact person, a bid-winning project name and a bid-winning time.
4. The method for extracting named entities from a winning web page based on LSTM as recited in claim 3, wherein step D is specifically:
step D1: c, carrying out regular correction processing on the marked data obtained in the step C;
step D2: and formatting the corrected data.
5. The LSTM-based extraction method of named entities of a winning web page according to claim 4, wherein the step D1 specifically includes:
step D11: for the winning bid amount, a regular expression is used to judge whether the entity contains Arabic numerals or Chinese capital numerals; if not, it is judged not to be a winning bid amount and is discarded;
step D12: for the winning bid time, entities that do not conform to a date composition pattern are discarded;
step D13: for the project name, since the string of a project-name entity is generally long and an entity of only two or three characters basically does not occur, recognized project-name entities whose string length is less than 4 are discarded;
step D14: and only the named entity with the longest character string length is reserved when one label data appears for a plurality of times in the same category.
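The correction rules D11-D14 amount to a small filtering pass; the sketch below is illustrative, and the regular expressions, category labels, and the length-4 threshold check are assumptions chosen to mirror the claim text rather than the patent's exact patterns.

```python
# Illustrative sketch of correction steps D11-D14; regexes and category
# names ("amount", "time", "project") are assumptions, not the patent's.
import re

HAS_NUMERAL = re.compile(r"[0-9零一二三四五六七八九十壹贰叁肆伍陆柒捌玖拾佰仟万亿]")
DATE_LIKE = re.compile(r"\d{4}\D+\d{1,2}\D+\d{1,2}")

def correct(entities):
    """entities: list of (category, text) pairs produced by step C."""
    kept = []
    for cat, text in entities:
        if cat == "amount" and not HAS_NUMERAL.search(text):
            continue                # D11: an amount must contain a numeral
        if cat == "time" and not DATE_LIKE.search(text):
            continue                # D12: a time must fit a date pattern
        if cat == "project" and len(text) < 4:
            continue                # D13: drop project names shorter than 4
        kept.append((cat, text))
    best = {}                       # D14: keep the longest string per category
    for cat, text in kept:
        if len(text) > len(best.get(cat, "")):
            best[cat] = text
    return best

print(correct([("amount", "金额未知"), ("amount", "100万元"),
               ("project", "路桥"), ("time", "2019年1月7日"),
               ("project", "某市政道路改造工程")]))
# -> {'amount': '100万元', 'time': '2019年1月7日', 'project': '某市政道路改造工程'}
```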
6. The method for extracting named entities from a winning web page based on LSTM of claim 4, wherein in step D2, the named entities are formatted, specifically comprising the following steps:
step D21: for the winning amount, judging whether the entity contains magnitude units such as bai (hundred), qian (thousand), wan (ten thousand), or yi (hundred million), or currency units such as U.S. dollars or Japanese yen, and if so, converting the units;
step D22: for the winning time, converting it into the date format YYYY-MM-DD.
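The step-D2 formatting can be sketched as below. The unit table and regular expressions are assumptions (the claims name the units but not the exact conversion logic), and foreign-currency conversion is left out since it would require an exchange-rate source the patent does not specify.

```python
# Sketch of formatting steps D21-D22 under assumed patterns; not the
# patent's exact implementation.
import re

MAGNITUDE = {"百": 100, "千": 1_000, "万": 10_000, "亿": 100_000_000}

def normalize_amount(text):
    """D21: expand a Chinese magnitude unit into a plain number."""
    m = re.search(r"([\d.]+)([百千万亿]?)", text)
    if not m:
        return None
    return float(m.group(1)) * MAGNITUDE.get(m.group(2), 1)

def normalize_date(text):
    """D22: rewrite a recognized date into the YYYY-MM-DD format."""
    m = re.search(r"(\d{4})\D+(\d{1,2})\D+(\d{1,2})", text)
    if not m:
        return None
    return "%04d-%02d-%02d" % tuple(int(g) for g in m.groups())

print(normalize_amount("中标金额：350万元"))  # 3500000.0
print(normalize_date("2019年1月7日"))         # 2019-01-07
```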
CN201910013185.2A 2019-01-07 2019-01-07 LSTM-based winning bid web page named entity extraction method Active CN109753660B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910013185.2A CN109753660B (en) 2019-01-07 2019-01-07 LSTM-based winning bid web page named entity extraction method


Publications (2)

Publication Number Publication Date
CN109753660A CN109753660A (en) 2019-05-14
CN109753660B true CN109753660B (en) 2023-06-13

Family

ID=66404567


Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334300A (en) * 2019-07-10 2019-10-15 哈尔滨工业大学 Text aid reading method towards the analysis of public opinion
CN110738182A (en) * 2019-10-21 2020-01-31 四川隧唐科技股份有限公司 LSTM model unit training method and device for high-precision identification of bid amount
CN112017016A (en) * 2019-10-29 2020-12-01 河南拓普计算机网络工程有限公司 Method for cleaning bid amount of bid-attracting bulletin
CN110738319A (en) * 2019-11-11 2020-01-31 四川隧唐科技股份有限公司 LSTM model unit training method and device for recognizing bid-winning units based on CRF
CN111078978B (en) * 2019-11-29 2024-02-27 上海观安信息技术股份有限公司 Network credit website entity identification method and system based on website text content
CN111241832B (en) * 2020-01-15 2023-08-15 北京百度网讯科技有限公司 Core entity labeling method and device and electronic equipment
CN111738002A (en) * 2020-05-26 2020-10-02 北京信息科技大学 Ancient text field named entity identification method and system based on Lattice LSTM
CN111737969B (en) * 2020-07-27 2020-12-08 北森云计算有限公司 Resume parsing method and system based on deep learning
CN112990845A (en) * 2021-01-04 2021-06-18 江苏省测绘地理信息局信息中心 Intelligent acquisition method for mapping market project
CN112989807B (en) * 2021-03-11 2021-11-23 重庆理工大学 Long digital entity extraction method based on continuous digital compression coding
CN112948588B (en) * 2021-05-11 2021-07-30 中国人民解放军国防科技大学 Chinese text classification method for quick information editing

Citations (4)

Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN107832400A (en) * 2017-11-01 2018-03-23 山东大学 A kind of method that location-based LSTM and CNN conjunctive models carry out relation classification
CN108416058A (en) * 2018-03-22 2018-08-17 北京理工大学 A kind of Relation extraction method based on the enhancing of Bi-LSTM input informations
CN108509423A (en) * 2018-04-04 2018-09-07 福州大学 A kind of acceptance of the bid webpage name entity abstracting method based on second order HMM

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US8370128B2 (en) * 2008-09-30 2013-02-05 Xerox Corporation Semantically-driven extraction of relations between named entities


Non-Patent Citations (1)

Title
Tang Min. Research on Chinese Entity Relation Extraction Methods Based on Deep Learning. Wanfang Data Dissertation Database. 2018. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant