AU2020103654A4 - Method for intelligent construction of place name annotated corpus based on interactive and iterative learning - Google Patents
- Publication number
- AU2020103654A4
- Authority
- AU
- Australia
- Prior art keywords
- place name
- model
- character
- sentence
- interactive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The present invention discloses a method for intelligent construction of a place
name annotated corpus based on interactive and iterative learning. The method
includes: generating a word vector matrix of a character and a disambiguation matrix
of the character in a sentence in an initial corpus, after splicing the word vector matrix
and the disambiguation matrix, inputting, for training, the word vector matrix and the
disambiguation matrix into a model in which Bi-LSTM and CRF are integrated, and
generating a place name identification model; embedding the place name
identification model into a human-machine interactive place name annotation
platform, and performing human-machine interactive correction; and merging initial
training linguistic data with annotated place name linguistic data, optimizing a
parameter of the place name identification model, and ending iterative training and
learning until the constructed corpus meets a requirement, thereby intelligently
constructing and optimizing the place name corpus. Based on the present invention,
current problems of a lack and slow update of place name linguistic data, and
time-consuming, laborious, and inefficient manual construction of the place name
linguistic data can be effectively resolved, and intelligent update of the place name
annotated corpus facing multi-source, dynamic, heterogeneous, and exponentially
growing Internet texts can be effectively implemented.
Fig. 1
Description
The present invention belongs to the field of geographic information processing
technologies, and specifically, relates to a method for intelligent construction of a
place name annotated corpus based on interactive and iterative learning, to optimize a
parameter of a deep learning model to a full extent, improve a place name
identification effect, and achieve intelligent construction and optimization of the place
name annotated corpus.
With rapid development of the Internet and the advent of the era of big data and
artificial intelligence, the world today is entering a ubiquitous information society and
the era of big data (Chenghu Zhou, 2011; Deren Li, 2012; Goodchild, 2017). Big
location data is an important part of big data, and 80% of information in the world is
related to locations (Williams, 1987; Jingnan Liu, 2014). Place names are the proper
names assigned by people to specific geographic entities in the universe, are an
important part of location information, and are also indispensable information for
digital surveying and mapping products. As one of the most commonly used social
public information, place names are the most acceptable positioning method for
ordinary people, and also provide indispensable basic information resources for
national administration, economic construction, and domestic and foreign exchanges.
Texts are a typical representative of ubiquitous geographic big data sources. The
data scale of such texts keeps growing, and the texts cover an increasing number of
increasingly complex fields. Chinese text expressions have the following
characteristics: being unstructured, vague, and random, and having complex
composition and no obvious separator between words. Description of place name
entities in Chinese texts has the following characteristics: (1) Internal composition of
Chinese place name entities is complex and diverse, including both simple place
names and a large quantity of compound place names, that is, there may be a plurality
of overlapping place name entities, such as "Jiangning Nanjing Jiangsu". (2) Chinese
place names and other categories of entity names often contain each other. For
example, "Zuchong Road" contains a person's name. (3) Lengths of Chinese place names vary
relatively greatly, covering both abbreviations and full names of place names: some
Chinese place names contain only one Chinese character, such as "Ying", "Mei",
and "Hu", while some Chinese place names can have up to a dozen Chinese characters,
such as "Hong Kong Special Administrative Region of the People's Republic of
China". (4) A Chinese sentence is a sequence of Chinese characters, a place name
entity is a segment of the character sequence, and there is no separator between
Chinese characters, which is not conducive to identification of a boundary of place
name entities. (5) Compared with common nouns, a Chinese place name entity has no
obvious distinguishing features such as a case change and a word form change. (6)
Chinese linguistic data resources are small in scale and slowly updated. In particular,
with the rapid development of Internet+ and big data, a large quantity of new and
unregistered place names has emerged. A plurality of the above-mentioned factors
cause identification of Chinese place name entities to fail to meet requirements of
ubiquitous location information services.
At present, identification methods of Chinese place names are mainly classified
into a method based on rules and dictionaries, a method based on statistics, and a
method based on both. The Chinese place name identification method based on rules
and dictionaries mostly uses rule templates manually constructed by linguistic experts.
These rules often depend on specific languages, domains, and text styles, are
compiled in a time-consuming manner and difficult to cover all language phenomena,
have poor system portability, and are costly. According to different linguistic data, the
statistical-based Chinese place name identification method sets up complex feature
templates to extract features, inputs the features into a classification model, and
converts Chinese place name identification into a sentence sequence tagging problem. This
method has the following disadvantages: (1) the method depends heavily on a corpus, and there are currently relatively few large-scale general corpora that can be used to construct and evaluate a place name entity identification system. (2)
Manually designed features require repeated experiments to complete modification,
adjustment and selection. The process is time-consuming and laborious, and requires
researchers to have a lot of linguistic knowledge. (3) Sparse representation of data
leads to excessively large model parameter space and excessive consumption of
model calculation and storage. In recent years, deep learning methods have provided a
new idea and method for extracting natural language information. In the deep learning
methods, feature templates no longer need to be manually formulated, but final output
is optimized by effectively learning features of input linguistic data and context
representation. Currently, deep learning neural networks commonly used for Chinese
named entity identification include feedforward neural network models, recurrent
neural networks (RNN), and the like. The feedforward neural networks generally
select input information by using fixed-length windows. Therefore, when a sentence
is longer than the window, information is lost and the context information of a word
is ignored. A recurrent neural network (RNN) model is
a sequence model whose structure contains directional loops, can make full use of
sequence information, and has a memory function. Therefore, the RNN can handle
short-distance dependencies better, but problems such as vanishing gradients occur
when the RNN deals with long-distance dependencies. To overcome
shortcomings of the RNN model, a variety of complex RNN models have been
proposed, such as a bidirectional recurrent neural network model (Bi-RNN) and a
long short-term memory model (LSTM). Because LSTM can handle long-distance
dependencies, LSTM is effective in natural language processing tasks.
Both traditional methods and the deep learning-based Chinese place name
identification methods rely heavily on corpora. A scale and coverage of training
linguistic data required directly affect an identification effect of Chinese place names.
Existing public place name linguistic data are as follows: (1) the People's Daily
annotated corpus, where the corpus covers a wide range of content, involving finance,
military, sports, entertainment, and the like, but place name information included in the corpus is sparsely and unevenly distributed; (2) linguistic data of "Encyclopedia of
China and Geography of China" (referred to as geographic encyclopedia linguistic
data, http://www.geoip.com.cn:9004/ITIS/corpus.html) is special linguistic data of
place names with independent intellectual property rights of Nanjing Normal
University, and description of place name entities is standardized and evenly
distributed, and include rich spatial semantic relationship information of place names;
(3) the Microsoft MSRA linguistic data is more in line with description characteristics
of free texts, but a quantity of place name entities is relatively small and distribution is
sparse and uneven. At present, large-scale general corpora that can be used to
construct and evaluate place name entity identification are relatively lacking and
slowly updated. Manual construction of the place name linguistic data is
time-consuming, laborious, and inefficient, which makes it impossible to optimize a
model parameter to a full extent during a deep learning training process, thereby
affecting a place name identification effect. In addition, in the era of ubiquitous
geographic information, a large quantity of new place names and unregistered place
names cannot be effectively resolved for exponentially growing multi-source,
dynamic, and heterogeneous Internet texts.
Invention objective: In view of current problems that large-scale general corpora
for place name identification are relatively few and slowly updated, manual
construction of place name linguistic data is time-consuming, laborious, and
inefficient, and place name entity identification cannot meet requirements of
ubiquitous location information services, the objective of the present invention is to
provide a method for intelligent construction of a place name annotated corpus based
on interactive and iterative learning, to optimize a parameter of a deep learning model
to a full extent, improve a place name identification effect, and achieve intelligent
construction and optimization of the place name annotated corpus.
Technical solutions: To implement the foregoing invention objective, the present
invention uses the following technical solutions:
A method for intelligent construction of a place name annotated corpus based on interactive and iterative learning is provided, including the following steps:

step 1: reading initial place name annotated corpus data, including geographic encyclopedia linguistic data and Microsoft MSRA linguistic data;

step 2: preprocessing the place name annotated corpus data, including segmenting sentences by using a blank line, deduplicating sentences, and deleting stop words;

step 3: mixing the geographic encyclopedia linguistic data and the Microsoft MSRA linguistic data, and performing training by using the tool Word2vec, to obtain a character-level word vector model;

step 4: representing each character in the place name annotated corpus by using the word vector model, to generate a 1×100 word vector matrix of each character;

step 5: performing word segmentation and part-of-speech annotation on a sentence by using the tool Jieba, and generating, as a disambiguation matrix of the character, a 1×20 vector matrix of each character in the sentence based on a word segmentation result;

step 6: splicing the word vector matrix of each character in the sentence and the disambiguation matrix of the corresponding character, to finally obtain a word vector matrix of the sentence; inputting, for training, the word vector matrix into a place name identification model in which Bi-LSTM and CRF are integrated; and selecting an optimal place name identification model by using three evaluation indicators of the natural language processing field: precision P, recall rate R, and comprehensive value F;

step 7: developing an interactive Chinese place name annotation platform, and embedding the place name identification model in step 6 into the interactive Chinese place name annotation platform;

step 8: performing place name identification on a new Internet text on the interactive place name annotation platform, and performing human-machine interactive correction on the place name identification result; and visually displaying, in a corresponding window, a place name finally identified in the Internet text, an added place name tag, and a deleted place name tag that is wrongly tagged;

step 9: when a scale of the annotated place name text linguistic data in step 8 reaches a specified threshold, automatically merging, by the interactive place name annotation platform, the initial place name annotation linguistic data with the place name linguistic data on which human-machine interactive correction has been performed, to update the place name corpus;

step 10: continuing to train the place name identification model in step 6 (its training code and model parameters) by using, as training linguistic data, the place name linguistic data generated in step 9, to optimize the parameters of the model and improve the model identification effect; and displaying the model training progress, final precision, recall rate, and value F on the interactive annotation platform; and

step 11: iteratively looping from step 2 to step 10 for new Internet texts, to intelligently update and optimize the place name annotated corpus, and ending iterative training and learning when the place name identification effect and the scale of the place name annotated corpus meet user requirements.
Further, step 6 specifically includes:
step 1: splicing the word vector matrix of each character in the sentence and the
disambiguation matrix of the corresponding character, to obtain the word vector
matrix of the sentence as an input layer, and inputting the word vector matrix into the
Bi-LSTM for training;
step 2: setting a dropout regularization method, to prevent model overfitting;
step 3: using a sentence sequence (x1, x2, ..., xn) of the input layer as input of the
time steps of the Bi-LSTM, where n indicates a quantity of characters in a sentence,
and xi indicates an ith character in the sentence; and then splicing a forward LSTM
hidden output sequence (f1, f2, ..., fn) and a backward LSTM hidden output sequence
(b1, b2, ..., bn) based on positions, to obtain a complete hidden output sequence
(f1, f2, ..., fn, b1, b2, ..., bn), where semantic description information above and below is
fully considered to achieve deep learning and representation of features;
step 4: after dropout is set, connecting a linear layer, to convert the complete hidden output sequence from 2n dimensions to k dimensions, where the complete hidden output sequence is denoted as a matrix P of size n×k, and k is a quantity of tag categories in the annotation set, including four categories of tags in total: B, I, E, and O, where B indicates a beginning character of a place name, I indicates a middle character of the place name, E indicates an end character of the place name, and O indicates a non-place-name character, so that features of the sentence are automatically extracted;

step 5: based on an output layer matrix of the Bi-LSTM model in step 4, setting dropout to prevent model overfitting, and inputting the Bi-LSTM output layer matrix into a CRF model for sentence sequence annotation, that is, predicting a tag for each character; and

step 6: selecting the optimal place name identification model by using the three evaluation indicators of the natural language processing field: the precision P, the recall rate R, and the comprehensive value F.

Further, performing sentence sequence annotation based on the CRF model in step 5 is specifically as follows: for a tag sequence y = (y1, y2, ..., yn) whose length is equal to the sentence length, the model scores a sentence x whose tag sequence is y as follows:

s(x, y) = Σ_{i=1}^{n} P_{i,y_i} + Σ_{i=1}^{n+1} A_{y_{i-1},y_i}

where P_{i,y_i} is a probability of outputting y_i at an ith position, A_{y_{i-1},y_i} is a probability of performing transition from y_{i-1} to y_i, the score of the entire sequence is equal to the sum of scores at various positions, and the score at each position is obtained from two parts: one part is determined by P_{i,y_i} output by the LSTM, and the other part is determined by the transition matrix A of the CRF; and a normalized probability obtained by using Softmax is as follows:

P(y|x) = exp(s(x, y)) / Σ_{y'} exp(s(x, y'))

where the numerator indicates the exponential of the score given by the model to the sentence x with tag sequence y, and the denominator indicates the sum of exponentials of the scores over all candidate tag sequences y'; according to the obtained normalized probability, the candidate tag sequences are ranked to identify a place name.
Further, the interactive Chinese place name annotation platform is implemented
by using the Python GUI programming Tkinter.
Further, the model is optimized on a local server or by uploading the training
code and the model parameter of the place name identification model to the cloud
Google Colaboratory in step 10.
Beneficial effects: Based on the present invention, current problems of a lack and
slow update of place name linguistic data, and time-consuming, laborious, and
inefficient manual construction of the place name linguistic data can be effectively
resolved, and intelligent update of the place name annotated corpus facing
multi-source, dynamic, heterogeneous, and exponentially growing Internet texts can
be effectively implemented. The present invention is widely applied to fields such as
ubiquitous geographic information mining, spatial location services, spatial
information retrieval, and natural language processing.
Fig. 1 is a flowchart of a method for intelligent construction of a place name
annotated corpus based on interactive and iterative learning according to the present
invention;
Fig. 2 is a screenshot of some data of a place name corpus according to an
embodiment of the present invention;
Fig. 3 is a screenshot of a list of some stop words according to an embodiment of
the present invention;
Fig. 4 is a screenshot of a pretrained word vector model according to an
embodiment of the present invention;
Fig. 5 is a screenshot of a result of matching characters in a dictionary and
pretrained word vectors according to an embodiment of the present invention;
Fig. 6 is a structural diagram of a model in which Bi-LSTM and CRF are
integrated according to an embodiment of the present invention;
Fig. 7 is a flowchart of Chinese place name identification in which Bi-LSTM and
CRF are integrated according to an embodiment of the present invention;
Fig. 8 is a screenshot of a CRF feature template according to an embodiment of
the present invention;
Fig. 9 is a screenshot of a training and evaluation result of a model in which
Bi-LSTM and CRF are integrated according to an embodiment of the present
invention;
Fig. 10 is an interface diagram of an interactive Chinese place name identification
and annotation platform according to an embodiment of the present invention;
Fig. 11 is an interface diagram of an identification result of Chinese place names
on an interactive annotation platform according to an embodiment of the present
invention;
Fig. 12 is an interface diagram of a result of human-machine interactive place
name annotation according to an embodiment of the present invention; and
Fig. 13 is an intelligent update interface diagram of an annotated corpus
according to an embodiment of the present invention.
The method of the present invention is further described in detail below with
reference to specific instances.
As shown in Fig. 1, a method for intelligent construction of a place name
annotated corpus based on interactive and iterative learning disclosed in an
embodiment of the present invention uses a method for integrating a bi-directional
long short-term memory model (Bi-LSTM) and a CRF model to implement identification of a place name entity in a text. Based on this, a human-machine interactive Chinese place name annotation platform is constructed, to perform place name identification on an Internet text, and human-machine interactive correction is performed on a place name identification result. When a scale of annotated Chinese place name text linguistic data reaches a specified threshold, initial training linguistic data is merged with place name annotation linguistic data, and the initial training corpus and the place name annotated corpus are re-input into a place name identification model for training, thereby optimizing a model parameter, improving a model identification effect, and adding new linguistic data to the place name annotated corpus. The above steps are iteratively looped, iterative training and learning are ended until a constructed corpus meets a requirement, thereby implementing intelligent construction and optimization of the place name corpus.
The method mainly includes three parts: the place name identification model in
which Bi-LSTM and CRF are integrated, a human-machine interactive Chinese place
name annotation method, and intelligent construction of the place name annotated
corpus based on iterative learning. Detailed steps are as follows:
Step 1: Read an initial place name annotated corpus data.
Place name linguistic data in geographic encyclopedia linguistic data and place
name linguistic data in Microsoft MSRA linguistic data (Fig. 2) are read.
Step 2: Preprocess the corpus data.
Sentences in the corpus data are segmented by using a blank line. Then word
segmentation is performed on the geographic encyclopedia linguistic data and the
Microsoft MSRA place name linguistic data by using a tool Jieba, a sentence is
deduplicated, and a stop word is deleted (Fig. 3).
Step 3: Generate a word vector matrix of the place name linguistic data based on
word2vec.
First, the geographic encyclopedia linguistic data is mixed with the Microsoft
MSRA linguistic data, and training is performed by using the tool Word2vec, to obtain
a character-level word vector model (Fig. 4).
Training parameters are as follows: a minimum quantity of appearance times of a word needing to be trained: min_count=5; word vector scale (dimension): size=100; a quantity of words transferred to a thread in each batch: batch_words=10000; training window: window=5; training algorithm: sg=1 (sg=0 is the CBOW algorithm, and sg=1 is the skip-gram algorithm); threads: workers=4; a quantity of iteration times: iter=50.
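The parameter list above maps onto the keyword arguments of the Word2vec tool. As a hedged sketch (parameter names follow gensim 3.x, which the patent does not name explicitly; newer gensim releases rename size to vector_size and iter to epochs), the configuration could be expressed as:

```python
# Hypothetical configuration dict mirroring the training parameters listed above.
w2v_params = {
    "min_count": 5,        # minimum appearance count for a word to be trained
    "size": 100,           # word vector dimension
    "batch_words": 10000,  # words transferred to a worker thread per batch
    "window": 5,           # context window size
    "sg": 1,               # 1 = skip-gram, 0 = CBOW
    "workers": 4,          # number of threads
    "iter": 50,            # number of training iterations
}

# With gensim installed, training the character-level model would look like:
# from gensim.models import Word2Vec
# model = Word2Vec(sentences=character_sentences, **w2v_params)
```

Here `character_sentences` stands in for the mixed, preprocessed corpus of steps 1 to 3, split into character sequences.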
Step 4: Generate a word vector matrix of a character in a place name linguistic
data set.
Each character in a place name annotated corpus is represented by using the word
vector model, to generate a 1×100 word vector matrix of each character (Fig. 5).
Step 5: Generate a disambiguation matrix of the character in the place name
linguistic data set.
Word segmentation and part-of-speech annotation are performed on the sentence
by using the tool Jieba. Based on a word segmentation result, meanings of characters
in the sentence are classified into 4 categories, represented by the numbers 0, 1, 2, and 3:
0 indicates that a character is a single-character word, 1 indicates that a character is the
beginning of a word, 2 indicates that the character is in the middle of a word, and 3
indicates that the character is the end of a word. For example, the five-character Chinese
sentence for "I am Chinese" may be expressed as [0, 0, 1, 2, 3]. Based on the word
segmentation result, a 1×20 vector matrix (briefly referred to as a disambiguation matrix
of the character) is generated for each character in the sentence, to achieve the purpose
of eliminating a plurality of semantic expressions of the character. For example, "shang"
may be an independent positional preposition or a character in the noun "Shanghai".
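The 0/1/2/3 position encoding can be sketched as follows. This is a minimal illustration, not the patent's own code: `position_labels` is a hypothetical helper name, and the Jieba call is replaced by a fixed token list standing in for a segmentation result.

```python
def position_labels(tokens):
    """Map a Jieba-style segmentation (list of word tokens) to per-character codes:
    0 = single-character word, 1 = word beginning, 2 = word middle, 3 = word end."""
    labels = []
    for tok in tokens:
        if len(tok) == 1:
            labels.append(0)
        else:
            labels.extend([1] + [2] * (len(tok) - 2) + [3])
    return labels

# "I am Chinese" segmented into three words of 1, 1, and 3 characters.
print(position_labels(["我", "是", "中国人"]))  # [0, 0, 1, 2, 3]
```

The resulting per-character codes would then be expanded into the 1×20 disambiguation vectors described above.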
Step 6: Place name identification model in which Bi-LSTM and CRF are
integrated.
The word vector matrix of each character in the sentence and the disambiguation
matrix of the corresponding character are spliced, to finally obtain a word vector matrix of the sentence; the word vector matrix of the sentence is input, for training, into the place name identification model in which Bi-LSTM and CRF are integrated.
An optimal place name identification model is selected by using three evaluation
indicators of a natural language processing field: precision P, a recall rate R, and a
comprehensive value F (referring to Fig. 6 and Fig. 7). Details are specifically as
follows:
Step 1: Splice the word vector matrix of each character in the sentence and the
disambiguation matrix of the corresponding character, to obtain the word vector
matrix of the sentence as an input layer (a first layer of the model), and input the word
vector matrix into the Bi-LSTM for training.
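The splicing in Step 1 can be sketched numerically. This is a minimal sketch using NumPy with random stand-in vectors; the 100- and 20-dimensional sizes follow steps 4 and 5 above, and the sentence length is an arbitrary example value.

```python
import numpy as np

n = 7  # characters in an example sentence
rng = np.random.default_rng(0)
word_vecs = rng.standard_normal((n, 100))  # 1x100 word vector per character
disamb = rng.standard_normal((n, 20))      # 1x20 disambiguation vector per character

# Splice the two matrices character by character: each row becomes 120-dimensional,
# giving the word vector matrix of the sentence that is fed to the Bi-LSTM.
sentence_matrix = np.concatenate([word_vecs, disamb], axis=1)
print(sentence_matrix.shape)  # (7, 120)
```

Each row of `sentence_matrix` is one character's spliced representation, used as one time-step input of the Bi-LSTM.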
Step 2: Set a dropout regularization method, to prevent model overfitting.
During a training process of dropout, some input is randomly discarded. In this case, a
parameter corresponding to the discarded part is not updated. Equivalently, dropout is
an integration method, in which results of all sub-networks are combined, and various
sub-networks may be obtained by randomly discarding input.
Step 3: Use a sentence sequence (x1, x2, ..., xn) of the input layer as input of the
time steps of the Bi-LSTM, where xi indicates an ith character in the sentence; and
then splice a forward LSTM hidden output sequence (f1, f2, ..., fn) and a backward
LSTM hidden output sequence (b1, b2, ..., bn) based on positions, to obtain a complete
hidden output sequence (f1, f2, ..., fn, b1, b2, ..., bn), where semantic description
information above and below is fully considered to achieve deep learning and
representation of features.
Step 4: After dropout is set, connect a linear layer, to convert the complete hidden
output sequence from 2n dimensions to k dimensions, where n indicates a
quantity of characters in the sentence, and k is a quantity of tag categories in the
annotation set. There are four categories of tags in total in the annotated corpus: B, I, E,
and O (B indicates a beginning character of a place name, I indicates a middle character of the place name, E indicates an end character of the place name, and O indicates a non-place-name character). The complete hidden output sequence is recorded as a matrix P of size n×k, so that features of the sentence are automatically extracted.
Step 5: Based on an output layer matrix of a Bi-LSTM model, set dropout to
prevent model overfitting; and input the output layer matrix into a CRF model for
sentence sequence annotation, that is, predict a tag for each character.
For a tag sequence y = (y1, y2, ..., yn) whose length is equal to the sentence
length, the model scores a sentence x whose tag sequence is y as follows:

s(x, y) = Σ_{i=1}^{n} P_{i,yi} + Σ_{i=1}^{n+1} A_{y(i-1),yi}

where P_{i,yi} is the probability of outputting yi at the ith position, that
is, an initial score; A_{y(i-1),yi} is the probability of performing transition
from y(i-1) to yi, that is, a conversion score. The score of the entire
sequence is equal to the sum of the scores at the various positions, and the
score at each position is obtained from two parts: one part is determined by
P_{i,yi} output by the LSTM, and the other part is determined by the transition
matrix A of the CRF. The normalized probability obtained by using Softmax is as
follows:

P(y|x) = exp(s(x, y)) / Σ_{y'} exp(s(x, y'))

where the numerator indicates the exponential of the score given by the model
to the sentence x whose tag sequence is y, and the denominator indicates the
sum of the exponentials of the scores over all candidate tag sequences y'.
According to the obtained normalized probability, the candidate tag sequences
are sorted to identify a place name.
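The scoring and normalization formulas above can be checked with a brute-force sketch that enumerates every candidate tag sequence (feasible only for toy sizes; real CRF implementations use dynamic programming). START/STOP handling and the function names are illustrative assumptions:

```python
import math
from itertools import product

def path_score(P, A, y, start, stop):
    """s(x, y): the emission scores P[i][y_i] plus the n+1 transition
    scores A[y_{i-1}][y_i], taking y_0 = START and y_{n+1} = STOP."""
    tags = [start] + list(y) + [stop]
    emission = sum(P[i][t] for i, t in enumerate(y))
    transition = sum(A[tags[i]][tags[i + 1]] for i in range(len(tags) - 1))
    return emission + transition

def normalized_prob(P, A, y, start, stop):
    """Softmax over every candidate tag sequence y' of the same length:
    exp(s(x, y)) / sum_{y'} exp(s(x, y'))."""
    k = len(P[0])
    z = sum(math.exp(path_score(P, A, yp, start, stop))
            for yp in product(range(k), repeat=len(P)))
    return math.exp(path_score(P, A, y, start, stop)) / z
```

With zero transitions the probabilities reduce to a softmax over summed emission scores, which is a quick sanity check on the formula.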
Step 6: Select the optimal place name identification model by using the three
evaluation indicators of the natural language processing field: the precision P, the
recall rate R, and the comprehensive value F.
Step 7: Human-machine interactive Chinese place name annotation.
First, an interactive Chinese place name annotation platform is developed through
Python GUI programming (Tkinter), the Chinese place name identification model in
step 6 is embedded into the interactive Chinese place name annotation platform, and
place name identification is performed on an Internet text. Then human-machine
interactive correction is performed on a Chinese place name identification result.
Finally, a place name finally identified in the Internet text, an added place name tag,
and a deleted place name tag that is wrongly tagged are all visually displayed in a
corresponding window.
Step 8: Update the place name annotated corpus.
When the annotated Chinese place name text linguistic data in step 7 reaches a
specified quantity of characters (a threshold), the interactive place name
annotation platform automatically merges the initial training linguistic data
with the newly annotated text linguistic data, to update the place name
annotated corpus.
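The threshold-triggered merge in step 8 might look like the following sketch. The function name and corpus representation are assumptions; the 100,000-character threshold is the value given for the platform in the embodiment.

```python
# Threshold at which the platform merges new annotations into the corpus
# (100,000 characters, per the embodiment described in the specification).
THRESHOLD = 100_000

def maybe_update_corpus(initial_corpus, new_annotated_texts):
    """Merge the newly annotated texts into the training corpus once their
    accumulated character count reaches the threshold; until then, keep
    accumulating and leave the corpus unchanged."""
    accumulated = sum(len(text) for text in new_annotated_texts)
    if accumulated < THRESHOLD:
        return initial_corpus, False                   # keep collecting
    return initial_corpus + new_annotated_texts, True  # retrain on merged corpus
```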
Step 9: Iteratively optimize the Chinese place name identification model.
The training code and the model parameter of the place name identification
model in step 6 are uploaded to a local server or the cloud Google Colaboratory, and
training is continued by using, as training linguistic data, the place name linguistic
data generated in step 8, to optimize the parameter of the model, and improve a model
identification effect. A model training progress, final precision, the recall rate, and the
value F are displayed on the interactive annotation platform.
Step 10: Intelligently update the place name annotated corpus.
Iterative looping from step 2 to step 9 is performed, to intelligently optimize the
annotated corpus, and iterative training and learning are ended until the place name
identification effect and the scale of the place name corpus meet a user requirement.
Main parts of the solutions of the embodiments of the present invention are
further described below with reference to specific experimental examples.
Part 1: A Chinese place name identification method in which Bi-LSTM and CRF
are integrated.
The corpus data in this method separately uses the geographic encyclopedia
linguistic data, the Microsoft MSRA linguistic data, and linguistic data
obtained by mixing the geographic encyclopedia and Microsoft MSRA corpora
(referred to as mixed linguistic data below).
The geographic encyclopedia linguistic data has about 1.18 million characters,
among which a character quantity in a training set accounts for about 82%, a character
quantity in a verification set accounts for about 5%, and a character quantity in a test
set accounts for about 13%. The geographic encyclopedia linguistic data is thematic
linguistic data of place names. Place name entities are in a large quantity and evenly
distributed in a text, and a description text contains rich geographic semantic
relations.
The Microsoft MSRA linguistic data has about 2.36 million characters, among
which a character quantity in a training set accounts for about 85%, a character
quantity in a verification set accounts for about 7%, and a character quantity in a test
set accounts for about 8%. Place name entities in the Microsoft MSRA linguistic data
are in a relatively small quantity in a text and are sparsely and unevenly distributed.
The mixed linguistic data has about 3.57 million characters, among which a
character quantity in a training set accounts for about 85%, a character quantity in a
verification set accounts for about 6%, and a character quantity in a test set accounts
for about 9%. Place name entities in the mixed linguistic data are in an intermediate
quantity in a text, and are relatively evenly distributed.
In this example, 7 groups of experiments (see Table 1) are set for comparison, to
evaluate an effect of this method.
Table 1 Settings of place name identification experiments

Experiment name | Experiment content |
---|---|
Experiment 1 | Use the geographic encyclopedia linguistic data and a CRF-based method |
Experiment 2 | Use the Microsoft linguistic data and a CRF-based method |
Experiment 3 | Use the mixed linguistic data and a CRF-based method |
Experiment 4 | Geographic encyclopedia linguistic data with a randomly generated word vector matrix as input layer + dropout + Bi-LSTM + dropout + CRF |
Experiment 5 | Geographic encyclopedia linguistic data + disambiguation + pre-trained word vector + dropout + Bi-LSTM + dropout + CRF |
Experiment 6 | Microsoft linguistic data + disambiguation + pre-trained word vector + dropout + Bi-LSTM + dropout + CRF |
Experiment 7 | Linguistic data obtained by mixing the geographic encyclopedia corpus and the Microsoft corpus + disambiguation + pre-trained word vector + dropout + Bi-LSTM + dropout + CRF |
(1) Experiments 1, 2, and 3
The experiments 1, 2, and 3 apply the traditional CRF-based Chinese place name
identification method to different linguistic data. The same feature template
(Fig. 8) is used, and each corpus is trained to obtain a corresponding CRF
model. Model evaluation results are shown in Table 2.
Table 2 Place name evaluation results of the experiments 1, 2, and 3

Experiment name | Precision P (%) | Recall rate R (%) | Comprehensive value F (%) |
---|---|---|---|
Experiment 1 | 89.82 | 88.61 | 89.21 |
Experiment 2 | 89.81 | 79.18 | 84.16 |
Experiment 3 | 88.24 | 83.94 | 86.04 |
(2) Experiment 4
First, deduplication and stop word deletion are performed on a geographic
encyclopedia data set, and a word vector matrix corresponding to each character in the
data set is randomly generated by using a tool Word2vec. The word vector matrix is then input into Bi-LSTM+CRF for training to obtain a model. Settings of training parameters of the Bi-LSTM model are shown in Table 3, and evaluation results are shown in Table 4.
Table 3 Settings of the training parameters of the Bi-LSTM model

Parameter | Value |
---|---|
Learning rate | 0.001 |
Dropout | 0.5 |
Maximum gradient | 5 |
Quantity of model iteration times | 100 |
Tag category | Four categories (BIEO) |
Table 4 Place name identification and evaluation result of the experiment 4

Experiment name | Precision P (%) | Recall rate R (%) | Comprehensive value F (%) |
---|---|---|---|
Experiment 4 | 80.73 | 84.44 | 82.54 |
(3) Experiments 5, 6, and 7
The experiments 5, 6, and 7 apply the same integrated "bidirectional long
short-term memory model and CRF model" place name identification method to
different linguistic data. Therefore, the experiment steps are the same.
First, the geographic encyclopedia linguistic data is mixed with the Microsoft
linguistic data, deduplication and stop word deletion are performed, training is
performed by using the tool Word2vec, to obtain a character-level word vector model,
and each character in the place name annotated corpus is represented by using the
word vector model, to generate a word vector matrix of each character. Then, word
segmentation and part-of-speech annotation are performed on a sentence by using the
tool Jieba, to generate a disambiguation matrix of the character, and the
disambiguation matrix and the word vector matrix of each character in the sentence
are spliced and input to the Bi-LSTM model for training. In addition, 100 model
results are evaluated and compared to obtain an optimal model (as shown in Fig. 9).
The evaluation results are shown in Table 5.
Table 5 Place name identification and evaluation results of the experiments 5, 6, and 7

Experiment name | Precision P (%) | Recall rate R (%) | Comprehensive value F (%) |
---|---|---|---|
Experiment 5 | 95.09 | 93.17 | 94.12 |
Experiment 6 | 92.86 | 89.91 | 91.36 |
Experiment 7 | 90.87 | 89.53 | 90.65 |
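The feature construction shared by experiments 5, 6, and 7 splices each character's pre-trained word vector with its disambiguation vector. A sketch with the 1×100 and 1×20 sizes stated in the claims; the values here are random stand-ins, not trained vectors:

```python
import random

random.seed(0)
# Toy stand-ins for a 3-character sentence: a 100-dimensional pre-trained
# character vector and a 20-dimensional disambiguation vector per character
# (sizes per the claims; the values are random placeholders).
word_vecs = [[random.random() for _ in range(100)] for _ in range(3)]
disamb_vecs = [[random.random() for _ in range(20)] for _ in range(3)]

def splice_features(word_vec, disamb_vec):
    """Concatenate a character's word vector and disambiguation vector into
    one 120-dimensional input row; stacking the rows gives the sentence
    matrix fed to the Bi-LSTM."""
    return word_vec + disamb_vec

sentence_matrix = [splice_features(w, d) for w, d in zip(word_vecs, disamb_vecs)]
```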
Based on a same corpus, compared with the traditional CRF-based Chinese place
name identification method, precision, a recall rate, and a comprehensive value in this
method are all increased (see Table 6).
Table 6 Comparison of place name identification and evaluation results of same
linguistic data and different identification models

Experiment | Linguistic data | Variation (%) of the value P | Variation (%) of the value R | Variation (%) of the value F |
---|---|---|---|---|
Experiment 5 vs experiment 1 | Geographic encyclopedia | 5.27 | 4.56 | 4.91 |
Experiment 6 vs experiment 2 | Microsoft linguistic data | 3.05 | 10.73 | 7.2 |
Experiment 7 vs experiment 3 | Mixed linguistic data | 2.63 | 5.59 | 4.61 |
Part 2: A method for intelligent construction of a place name corpus based on
interactive and iterative learning
Step 1: First, develop an interactive Chinese place name annotation platform (see
Fig. 10) through Python GUI programming (Tkinter), and embed, into the interactive
Chinese place name annotation platform, the Chinese place name identification model
in which Bi-LSTM and CRF are integrated; and when a button "place name
identification" is clicked, perform place name entity identification on an input Internet text, and automatically attach a place name tag to a place name (see Fig. 11).
Step 2: Manually perform interactive correction on the Chinese place name
identification result. For a place name that is not identified, right-click it
and select the function "set as a place name" to add a place name tag to the
untagged place name; for a place name that is wrongly identified, right-click
the wrongly tagged place name and select the function "cancel setting" to
delete the corresponding place name tag.
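The two right-click actions can be sketched as edits on a per-character BIEO tag sequence. The function names are illustrative, and the treatment of one-character place names is an assumption not stated in the source:

```python
def set_as_place_name(tags, start, end):
    """The "set as a place name" action: mark characters start..end
    (inclusive) as a place name span with B/I/E tags. How a one-character
    name is tagged is not stated in the source; a lone character gets B
    here as an assumption."""
    tags = list(tags)
    for i in range(start, end + 1):
        if i == start:
            tags[i] = "B"
        elif i == end:
            tags[i] = "E"
        else:
            tags[i] = "I"
    return tags

def cancel_setting(tags, start, end):
    """The "cancel setting" action: delete a wrongly added place name tag
    by restoring the span to O (non-place-name)."""
    tags = list(tags)
    for i in range(start, end + 1):
        tags[i] = "O"
    return tags
```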
Step 3: Visually display, in a corresponding window, a place name finally
identified in the Internet text, an added place name tag, and a deleted place name tag
that is wrongly tagged (see Fig. 12).
Step 4: Save the foregoing final tagging result by clicking a button "save a place
name annotation result"; when an accumulated quantity of saved characters in Internet
texts annotated with place names is greater than a threshold (in the present invention,
the quantity is set to 100,000 characters), the platform automatically merges an initial
training corpus with a text corpus in which place names are tagged, and inputs the
initial training linguistic data and the text linguistic data into the Chinese place name
identification model in which Bi-LSTM and CRF are integrated in the part 1 for
retraining, thereby optimizing a parameter of the model, and improving a model
identification effect; and display a model training progress, final precision, a recall
rate, and the value F on an interface (see Fig. 13).
Step 5: Add the foregoing new linguistic data to the place name annotated corpus,
perform iterative looping from step 1 to step 4, and end iterative training and learning
until a place name identification effect and a scale of the place name corpus meet a
user requirement.
Claims (5)

1. A method for intelligent construction of a place name annotated corpus based on interactive and iterative learning, comprising the following steps:
step 1: reading initial place name annotated corpus data, comprising geographic encyclopedia linguistic data and Microsoft MSRA linguistic data;
step 2: preprocessing the place name annotated corpus data, comprising segmenting sentences by using a blank line, deduplicating sentences, and deleting stop words;
step 3: mixing the geographic encyclopedia linguistic data and the Microsoft MSRA linguistic data, and performing training by using the tool Word2vec, to obtain a character-level word vector model;
step 4: representing each character in the place name annotated corpus by using the word vector model, to generate a 1×100 word vector matrix of each character;
step 5: performing word segmentation and part-of-speech annotation on a sentence by using the tool Jieba, and generating, as a disambiguation matrix of the character, a 1×20 vector matrix of each character in the sentence based on the word segmentation result;
step 6: splicing the word vector matrix of each character in the sentence and the disambiguation matrix of the corresponding character, to finally obtain a word vector matrix of the sentence; inputting, for training, the word vector matrix into a place name identification model in which Bi-LSTM and CRF are integrated; and selecting an optimal place name identification model by using three evaluation indicators of a natural language processing field: precision P, a recall rate R, and a comprehensive value F;
step 7: developing an interactive Chinese place name annotation platform, and embedding the place name identification model in step 6 into the interactive Chinese place name annotation platform;
step 8: performing place name identification on a new Internet text on the interactive place name annotation platform, and performing human-machine interactive correction on a place name identification result; and visually displaying, in a corresponding window, a place name finally identified in the Internet text, an added place name tag, and a deleted place name tag that is wrongly tagged;
step 9: when a scale of annotated place name text linguistic data in step 8 reaches a specified threshold, automatically merging, by the interactive place name annotation platform, initial place name annotation linguistic data with place name linguistic data on which human-machine interactive correction is performed, to update the place name corpus;
step 10: continuing training, with the training code and the model parameter of the place name identification model in step 6, by using, as training linguistic data, the place name linguistic data generated in step 9, to optimize the parameter of the model and improve a model identification effect; and displaying a model training progress, final precision, the recall rate, and the value F on the interactive annotation platform; and
step 11: performing iterative looping from step 2 to step 10 for the new Internet text, to intelligently update and optimize the place name annotated corpus, and ending iterative training and learning until the place name identification effect and the scale of the place name annotated corpus meet a user requirement.
- 2. The method for intelligent construction of the place name annotated corpus based on interactive and iterative learning according to claim 1, wherein step 6 specifically comprises:
step 1: splicing the word vector matrix of each character in the sentence and the disambiguation matrix of the corresponding character, to obtain the word vector matrix of the sentence as an input layer, and inputting the word vector matrix into the Bi-LSTM for training;
step 2: setting a dropout regularization method, to prevent model overfitting;
step 3: using a sentence sequence (x1, x2, ..., xn) of the input layer as input of time steps of the Bi-LSTM, wherein n indicates a quantity of characters in a sentence, and xi indicates an ith character in the sentence; and then splicing a forward LSTM hidden output sequence (f1, f2, ..., fn) and a backward LSTM hidden output sequence (b1, b2, ..., bn) based on positions, to obtain a complete hidden output sequence (f1, f2, ..., fn, b1, b2, ..., bn), wherein semantic description information above and below is fully considered to achieve deep learning and representation of features;
step 4: after dropout is set, connecting a linear layer, to convert the complete hidden output sequence from 2n dimensions to k dimensions, wherein the complete hidden output sequence is denoted as a matrix P of size n × k, wherein k is a quantity of tag categories in an annotation set, including four categories of tags: B, I, E, and O, wherein B indicates a beginning character of a place name, I indicates a middle character of the place name, E indicates an end character of the place name, and O indicates a non-place-name character, so that features of the sentence are automatically extracted;
step 5: based on an output layer matrix of a Bi-LSTM model in step 4, setting dropout to prevent model overfitting, and inputting the output layer matrix into a CRF model for sentence sequence annotation, that is, predicting a tag for each character; and
step 6: selecting the optimal place name identification model by using the three evaluation indicators of the natural language processing field: the precision P, the recall rate R, and the comprehensive value F.
- 3. The method for intelligent construction of the place name annotated corpus based on interactive and iterative learning according to claim 2, wherein performing sentence sequence annotation based on the CRF model in step 5 is specifically as follows:
for a tag sequence y = (y1, y2, ..., yn) whose length is equal to a sentence length, a model scores a sentence x whose tag is equal to y as follows:
s(x, y) = Σ_{i=1}^{n} P_{i,yi} + Σ_{i=1}^{n+1} A_{y(i-1),yi}
wherein P_{i,yi} is a probability of outputting yi at an ith position, A_{y(i-1),yi} is a probability of performing transition from y(i-1) to yi, a score of the entire sequence is equal to a sum of scores at various positions, and a score at each position is obtained based on two parts: one part is determined by P_{i,yi} output by the LSTM, and the other part is determined by a transition matrix A of the CRF; and a normalized probability obtained by using Softmax is as follows:
P(y|x) = exp(s(x, y)) / Σ_{y'} exp(s(x, y'))
wherein a numerator indicates the exponential of the score given by the model to the sentence x whose tag is equal to y, and a denominator indicates the sum of the exponentials of the scores over all candidate tag sequences y'; according to the obtained normalized probability, the candidate tag sequences are sorted to identify a place name.
- 4. The method for intelligent construction of the place name annotated corpus based on interactive and iterative learning according to claim 1, wherein the interactive Chinese place name annotation platform is implemented by using the Python GUI programming tool Tkinter.
- 5. The method for intelligent construction of the place name annotated corpus based on interactive and iterative learning according to claim 1, wherein in step 10 the model is optimized on a local server or by uploading the training code and the model parameter of the place name identification model to the cloud Google Colaboratory.

[Figs. 1 to 13: drawing sheets 1/10 to 10/10 accompanying the specification]
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911029958.2 | 2019-10-28 | ||
CN201911029958.2A CN110826331B (en) | 2019-10-28 | 2019-10-28 | Intelligent construction method of place name labeling corpus based on interactive and iterative learning |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2020103654A4 true AU2020103654A4 (en) | 2021-01-14 |
Family
ID=69550890
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2020103654A Ceased AU2020103654A4 (en) | 2019-10-28 | 2020-04-21 | Method for intelligent construction of place name annotated corpus based on interactive and iterative learning |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN110826331B (en) |
AU (1) | AU2020103654A4 (en) |
WO (1) | WO2021082366A1 (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113407439A (en) * | 2021-05-24 | 2021-09-17 | 西北工业大学 | Detection method for software self-recognition type technical debt |
CN113657103A (en) * | 2021-08-18 | 2021-11-16 | 哈尔滨工业大学 | Non-standard Chinese express mail information identification method and system based on NER |
CN113722530A (en) * | 2021-09-08 | 2021-11-30 | 云南大学 | Fine-grained geographical position positioning method |
CN114169330A (en) * | 2021-11-24 | 2022-03-11 | 匀熵教育科技(无锡)有限公司 | Chinese named entity identification method fusing time sequence convolution and Transformer encoder |
CN114943230A (en) * | 2022-04-17 | 2022-08-26 | 西北工业大学 | Chinese specific field entity linking method fusing common knowledge |
CN117436449A (en) * | 2023-11-01 | 2024-01-23 | 哈尔滨工业大学 | Crowd-sourced named entity recognition model and system based on multi-source domain adaptation and reinforcement learning |
CN117669574A (en) * | 2024-02-01 | 2024-03-08 | 浙江大学 | Artificial intelligence field entity identification method and system based on multi-semantic feature fusion |
CN117669574B (en) * | 2024-02-01 | 2024-05-17 | 浙江大学 | Artificial intelligence field entity identification method and system based on multi-semantic feature fusion |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110826331B (en) * | 2019-10-28 | 2023-04-18 | 南京师范大学 | Intelligent construction method of place name labeling corpus based on interactive and iterative learning |
CN111522914B (en) * | 2020-04-20 | 2023-05-12 | 北大方正集团有限公司 | Labeling data acquisition method and device, electronic equipment and storage medium |
CN112711621A (en) * | 2021-01-18 | 2021-04-27 | 湛江市前程网络有限公司 | Universal object interconnection training platform and control method and device |
US11769015B2 (en) | 2021-04-01 | 2023-09-26 | International Business Machines Corporation | User interface disambiguation |
CN113190678B (en) * | 2021-05-08 | 2023-10-31 | 陕西师范大学 | Chinese dialect language classification system based on parameter sparse sharing |
CN113221575B (en) * | 2021-05-28 | 2022-08-02 | 北京理工大学 | PU reinforcement learning remote supervision named entity identification method |
CN113486173B (en) * | 2021-06-11 | 2023-09-12 | 南京邮电大学 | Text labeling neural network model and labeling method thereof |
CN113255328B (en) * | 2021-06-28 | 2024-02-02 | 北京京东方技术开发有限公司 | Training method and application method of language model |
CN113486127A (en) * | 2021-07-23 | 2021-10-08 | 上海明略人工智能(集团)有限公司 | Knowledge alignment method, system, electronic device and medium |
CN113610993B (en) * | 2021-08-05 | 2022-05-17 | 南京师范大学 | 3D map building object annotation method based on candidate label evaluation |
CN113642336B (en) * | 2021-08-27 | 2024-03-08 | 青岛全掌柜科技有限公司 | SaaS-based insurance automatic question-answering method and system |
CN113901826A (en) * | 2021-12-08 | 2022-01-07 | 中国电子科技集团公司第二十八研究所 | Military news entity identification method based on serial mixed model |
CN114818717A (en) * | 2022-05-25 | 2022-07-29 | 华侨大学 | Chinese named entity recognition method and system fusing vocabulary and syntax information |
CN117435746B (en) * | 2023-12-18 | 2024-02-27 | 广东信聚丰科技股份有限公司 | Knowledge point labeling method and system based on natural language processing |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7069216B2 (en) * | 2000-09-29 | 2006-06-27 | Nuance Communications, Inc. | Corpus-based prosody translation system |
CN102314417A (en) * | 2011-09-22 | 2012-01-11 | 西安电子科技大学 | Method for identifying Web named entity based on statistical model |
CN107102989B (en) * | 2017-05-24 | 2020-09-29 | 南京大学 | Entity disambiguation method based on word vector and convolutional neural network |
CN107861939B (en) * | 2017-09-30 | 2021-05-14 | 昆明理工大学 | Domain entity disambiguation method fusing word vector and topic model |
CN108446269B (en) * | 2018-03-05 | 2021-11-23 | 昆明理工大学 | Word sense disambiguation method and device based on word vector |
CN109359291A (en) * | 2018-08-28 | 2019-02-19 | 昆明理工大学 | A kind of name entity recognition method |
CN109885824B (en) * | 2019-01-04 | 2024-02-20 | 北京捷通华声科技股份有限公司 | Hierarchical Chinese named entity recognition method, hierarchical Chinese named entity recognition device and readable storage medium |
CN110134956A (en) * | 2019-05-14 | 2019-08-16 | 南京邮电大学 | Place name tissue name recognition method based on BLSTM-CRF |
CN110287482B (en) * | 2019-05-29 | 2022-07-08 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Semi-automatic participle corpus labeling training device |
CN110826331B (en) * | 2019-10-28 | 2023-04-18 | 南京师范大学 | Intelligent construction method of place name labeling corpus based on interactive and iterative learning |
2019
- 2019-10-28 CN CN201911029958.2A patent/CN110826331B/en active Active

2020
- 2020-04-21 WO PCT/CN2020/085809 patent/WO2021082366A1/en active Application Filing
- 2020-04-21 AU AU2020103654A patent/AU2020103654A4/en not_active Ceased
Also Published As
Publication number | Publication date |
---|---|
WO2021082366A1 (en) | 2021-05-06 |
CN110826331A (en) | 2020-02-21 |
CN110826331B (en) | 2023-04-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2020103654A4 (en) | Method for intelligent construction of place name annotated corpus based on interactive and iterative learning | |
Chang et al. | Chinese named entity recognition method based on BERT | |
CN109271529B (en) | Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian | |
CN110502644B (en) | Active learning method for field level dictionary mining construction | |
CN114036933B (en) | Information extraction method based on legal documents | |
CN110489523B (en) | Fine-grained emotion analysis method based on online shopping evaluation | |
CN114065758A (en) | Document keyword extraction method based on hypergraph random walk | |
CN111651572A (en) | Multi-domain task type dialogue system, method and terminal | |
Li et al. | Integrating language model and reading control gate in BLSTM-CRF for biomedical named entity recognition | |
CN115859980A (en) | Semi-supervised named entity identification method, system and electronic equipment | |
Xi et al. | Global encoding for long Chinese text summarization | |
Wei et al. | GP-GCN: Global features of orthogonal projection and local dependency fused graph convolutional networks for aspect-level sentiment classification | |
CN111178080A (en) | Named entity identification method and system based on structured information | |
CN112836062B (en) | Relation extraction method of text corpus | |
Xue et al. | A method of chinese tourism named entity recognition based on bblc model | |
Zhou et al. | Named entity recognition of ancient poems based on Albert-BiLSTM-MHA-CRF model | |
CN112257442A (en) | Policy document information extraction method based on corpus expansion neural network | |
Liu et al. | The extension of domain ontology based on text clustering | |
CN113779987A (en) | Event co-reference disambiguation method and system based on self-attention enhanced semantics | |
CN113869054A (en) | Deep learning-based electric power field project feature identification method | |
Kan et al. | Grid structure attention for natural language interface to bash commands | |
Shi et al. | Improve on Entity Recognition Method Based on BiLSTM-CRF Model for the Nuclear Technology Knowledge Graph | |
Qiao et al. | A Survey of Deep learning-based Image caption | |
Wang et al. | A text classification model for hypergraph convolutional neural networks with multi-feature fusion | |
Zhu et al. | Image based agorithm for automatic generation of chinese couplets |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |