AU2020103654A4 - Method for intelligent construction of place name annotated corpus based on interactive and iterative learning - Google Patents
- Publication number
- AU2020103654A4
- Authority
- AU
- Australia
- Prior art keywords
- place name
- model
- character
- sentence
- interactive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The present invention discloses a method for intelligent construction of a place
name annotated corpus based on interactive and iterative learning. The method
includes: generating a word vector matrix of a character and a disambiguation matrix
of the character in a sentence in an initial corpus, after splicing the word vector matrix
and the disambiguation matrix, inputting, for training, the word vector matrix and the
disambiguation matrix into a model in which Bi-LSTM and CRF are integrated, and
generating a place name identification model; embedding the place name
identification model into a human-machine interactive place name annotation
platform, and performing human-machine interactive correction; and merging initial
training linguistic data with annotated place name linguistic data, optimizing a
parameter of the place name identification model, and ending iterative training and
learning until the constructed corpus meets a requirement, thereby intelligently
constructing and optimizing the place name corpus. Based on the present invention,
current problems of a lack and slow update of place name linguistic data, and
time-consuming, laborious, and inefficient manual construction of the place name
linguistic data can be effectively resolved, and intelligent update of the place name
annotated corpus facing multi-source, dynamic, heterogeneous, and exponentially
growing Internet texts can be effectively implemented.
Fig. 1
Description
The present invention belongs to the field of geographic information processing
technologies, and specifically, relates to a method for intelligent construction of a
place name annotated corpus based on interactive and iterative learning, to optimize a
parameter of a deep learning model to a full extent, improve a place name
identification effect, and achieve intelligent construction and optimization of the place
name annotated corpus.
With rapid development of the Internet and the advent of the era of big data and
artificial intelligence, the world today is entering a ubiquitous information society and
the era of big data (Chenghu Zhou, 2011; Deren Li, 2012; Goodchild, 2017). Big
location data is an important part of big data, and 80% of information in the world is
related to locations (Williams, 1987; Jingnan Liu, 2014). Place names are the proper
names assigned by people to specific geographic entities in the universe, are an
important part of location information, and are also indispensable information for
digital surveying and mapping products. As one of the most commonly used social
public information, place names are the most acceptable positioning method for
ordinary people, and also provide indispensable basic information resources for
national administration, economic construction, and domestic and foreign exchanges.
Texts are a typical representative of ubiquitous geographic big data sources. The
data scale of such texts keeps growing, and the texts cover an increasing number of
increasingly complex fields. Chinese text expressions have the following
characteristics: being unstructured, vague, and random, and having complex
composition and no obvious separator between words. Description of place name
entities in Chinese texts has the following characteristics: (1) Internal composition of
Chinese place name entities is complex and diverse, including both simple place
names and a large quantity of compound place names, that is, there may be a plurality
of overlapping place name entities, such as "Jiangning Nanjing Jiangsu". (2) Chinese
place names and other categories of entity names often contain each other. For
example, "Zuchong Road" contains a person's name. (3) Lengths of Chinese place names vary
relatively greatly, covering both abbreviations and full names of place names: some
Chinese place names contain only one Chinese character, such as "Ying", "Mei",
and "Hu", while some Chinese place names can have up to a dozen Chinese characters,
such as "Hong Kong Special Administrative Region of the People's Republic of
China". (4) A Chinese sentence is a sequence of Chinese characters, a place name
entity is a segment of the character sequence, and there is no separator between
Chinese characters, which is not conducive to identification of a boundary of place
name entities. (5) Compared with common nouns, a Chinese place name entity has no
obvious distinguishing features such as a case change and a word form change. (6)
Chinese linguistic data resources are small in scale and slowly updated. In particular,
with the rapid development of Internet+ and big data, a large quantity of new and
unregistered place names has emerged. A plurality of the above-mentioned factors
cause identification of Chinese place name entities to fail to meet requirements of
ubiquitous location information services.
At present, identification methods of Chinese place names are mainly classified
into a method based on rules and dictionaries, a method based on statistics, and a
method based on both. The Chinese place name identification method based on rules
and dictionaries mostly uses rule templates manually constructed by linguistic experts.
These rules often depend on specific languages, domains, and text styles, are
compiled in a time-consuming manner and difficult to cover all language phenomena,
have poor system portability, and are costly. According to different linguistic data, the
statistical-based Chinese place name identification method sets up complex feature
templates to extract features, inputs the features into a classification model, and
converts Chinese place name identification into a sentence sequence tagging problem. This
method has the following disadvantages: (1) the method depends heavily on a corpus, and there are currently relatively few large-scale general corpora that can be used to construct and evaluate a place name entity identification system. (2)
Manually designed features require repeated experiments to complete modification,
adjustment and selection. The process is time-consuming and laborious, and requires
researchers to have a lot of linguistic knowledge. (3) Sparse representation of data
leads to excessively large model parameter space and excessive consumption of
model calculation and storage. In recent years, deep learning methods have provided a
new idea and method for extracting natural language information. In the deep learning
methods, feature templates no longer need to be manually formulated, but final output
is optimized by effectively learning features of input linguistic data and context
representation. Currently, deep learning neural networks commonly used for Chinese
named entity identification include feedforward neural network models, recurrent
neural networks (RNN), and the like. The feedforward neural networks generally
select input information by using fixed-length windows. Therefore, when a sentence
is longer than the window, information is lost and the context information of a word
is ignored. A recurrent neural network (RNN) model is
a sequence model whose structure contains directional loops, can make full use of
sequence information, and has a memory function. Therefore, the RNN can handle
short-distance dependencies better, but problems such as vanishing gradients occur
when the RNN deals with long-distance dependencies. To overcome
shortcomings of the RNN model, a variety of complex RNN models have been
proposed, such as a bidirectional recurrent neural network model (Bi-RNN) and a
long short-term memory model (LSTM). Because LSTM can handle long-distance
dependencies, LSTM is effective in natural language processing tasks.
Both traditional methods and the deep learning-based Chinese place name
identification methods rely heavily on corpora. A scale and coverage of training
linguistic data required directly affect an identification effect of Chinese place names.
Existing public place name linguistic data are as follows: (1) the People's Daily
annotated corpus, where the corpus covers a wide range of content, involving finance,
military, sports, entertainment, and the like, but place name information included in the corpus is sparsely and unevenly distributed; (2) linguistic data of "Encyclopedia of
China and Geography of China" (referred to as geographic encyclopedia linguistic
data, http://www.geoip.com.cn:9004/ITIS/corpus.html) is special linguistic data of
place names with independent intellectual property rights of Nanjing Normal
University, and description of place name entities is standardized and evenly
distributed, and include rich spatial semantic relationship information of place names;
(3) the Microsoft MSRA linguistic data is more in line with description characteristics
of free texts, but a quantity of place name entities is relatively small and distribution is
sparse and uneven. At present, large-scale general corpora that can be used to
construct and evaluate place name entity identification are relatively lacking and
slowly updated. Manual construction of the place name linguistic data is
time-consuming, laborious, and inefficient, which makes it impossible to optimize a
model parameter to a full extent during a deep learning training process, thereby
affecting a place name identification effect. In addition, in the era of ubiquitous
geographic information, a large quantity of new place names and unregistered place
names cannot be effectively resolved for exponentially growing multi-source,
dynamic, and heterogeneous Internet texts.
Invention objective: In view of current problems that large-scale general corpora
for place name identification are relatively few and slowly updated, manual
construction of place name linguistic data is time-consuming, laborious, and
inefficient, and place name entity identification cannot meet requirements of
ubiquitous location information services, the objective of the present invention is to
provide a method for intelligent construction of a place name annotated corpus based
on interactive and iterative learning, to optimize a parameter of a deep learning model
to a full extent, improve a place name identification effect, and achieve intelligent
construction and optimization of the place name annotated corpus.
Technical solutions: To implement the foregoing invention objective, the present
invention uses the following technical solutions:
A method for intelligent construction of a place name annotated corpus based on interactive and iterative learning is provided, including the following steps:

step 1: reading initial place name annotated corpus data, including geographic encyclopedia linguistic data and Microsoft MSRA linguistic data;

step 2: preprocessing the place name annotated corpus data, including segmenting sentences by using a blank line, deduplicating sentences, and deleting stop words;

step 3: mixing the geographic encyclopedia linguistic data and the Microsoft MSRA linguistic data, and performing training by using the tool Word2vec, to obtain a character-level word vector model;

step 4: representing each character in the place name annotated corpus by using the word vector model, to generate a 1×100 word vector matrix of each character;

step 5: performing word segmentation and part-of-speech annotation on a sentence by using the tool Jieba, and generating, as a disambiguation matrix of the character, a 1×20 vector matrix of each character in the sentence based on a word segmentation result;

step 6: splicing the word vector matrix of each character in the sentence and the disambiguation matrix of the corresponding character, to finally obtain a word vector matrix of the sentence; inputting, for training, the word vector matrix into a place name identification model in which Bi-LSTM and CRF are integrated; and selecting an optimal place name identification model by using three evaluation indicators of the natural language processing field: precision P, recall rate R, and comprehensive value F;

step 7: developing an interactive Chinese place name annotation platform, and embedding the place name identification model in step 6 into the interactive Chinese place name annotation platform;

step 8: performing place name identification on a new Internet text on the interactive place name annotation platform, and performing human-machine interactive correction on the place name identification result; and visually displaying, in a corresponding window, a place name finally identified in the Internet text, an added place name tag, and a deleted place name tag that is wrongly tagged;

step 9: when a scale of the annotated place name text linguistic data in step 8 reaches a specified threshold, automatically merging, by the interactive place name annotation platform, the initial place name annotation linguistic data with the place name linguistic data on which human-machine interactive correction has been performed, to update the place name corpus;

step 10: continuing to train the place name identification model in step 6 (its training code and model parameters) by using, as training linguistic data, the place name linguistic data generated in step 9, to optimize the parameters of the model and improve the model identification effect; and displaying the model training progress, final precision, recall rate, and value F on the interactive annotation platform; and

step 11: iteratively looping from step 2 to step 10 for new Internet texts, to intelligently update and optimize the place name annotated corpus, and ending iterative training and learning when the place name identification effect and the scale of the place name annotated corpus meet user requirements.
Further, step 6 specifically includes:
step 1: splicing the word vector matrix of each character in the sentence and the
disambiguation matrix of the corresponding character, to obtain the word vector
matrix of the sentence as an input layer, and inputting the word vector matrix into the
Bi-LSTM for training;
step 2: setting a dropout regularization method, to prevent model overfitting;
step 3: using a sentence sequence (x1, x2, ..., xn) of the input layer as input of the
time steps of the Bi-LSTM, where n indicates a quantity of characters in a sentence,
and xi indicates an ith character in the sentence; and then splicing a forward LSTM
hidden output sequence (f1, f2, ..., fn) and a backward LSTM hidden output sequence
(b1, b2, ..., bn) based on positions, to obtain a complete hidden output sequence
(f1, f2, ..., fn, b1, b2, ..., bn), where semantic description information above and below is
fully considered to achieve deep learning and representation of features;
step 4: after dropout is set, connecting a linear layer, to convert the complete hidden output sequence from 2n dimensions to k dimensions, where the complete hidden output sequence is denoted as a matrix P of size n×k, and k is a quantity of tag categories in the annotation set, including four categories of tags in total: B, I, E, and O, where B indicates a beginning character of a place name, I indicates a middle character of the place name, E indicates an end character of the place name, and O indicates a non-place-name character, so that features of the sentence are automatically extracted;

step 5: based on an output layer matrix of the Bi-LSTM model in step 4, setting dropout to prevent model overfitting, and inputting the Bi-LSTM output layer matrix into a CRF model for sentence sequence annotation, that is, predicting a tag for each character; and

step 6: selecting the optimal place name identification model by using the three evaluation indicators of the natural language processing field: the precision P, the recall rate R, and the comprehensive value F.

Further, performing sentence sequence annotation based on the CRF model in step 5 is specifically as follows: for a tag sequence y = (y1, y2, ..., yn) whose length is equal to the sentence length, the model scores a sentence x whose tag sequence is y as follows:

s(x, y) = Σ_{i=1}^{n} P_{i,y_i} + Σ_{i=1}^{n+1} A_{y_{i-1},y_i}

where P_{i,y_i} is a probability of outputting y_i at an ith position, A_{y_{i-1},y_i} is a probability of performing transition from y_{i-1} to y_i, the score of the entire sequence is equal to the sum of scores at various positions, and the score at each position is obtained from two parts: one part is determined by P_{i,y_i} output by the LSTM, and the other part is determined by the transition matrix A of the CRF; and a normalized probability obtained by using Softmax is as follows:

P(y|x) = exp(s(x, y)) / Σ_{y'} exp(s(x, y'))

where the numerator indicates the exponential of the score given by the model to the sentence x with tag sequence y, and the denominator indicates the sum of exponentials of the scores over all candidate tag sequences y'; according to the obtained normalized probability, the candidate tag sequences are ranked to identify a place name.
Further, the interactive Chinese place name annotation platform is implemented
by using the Python GUI programming Tkinter.
Further, the model is optimized on a local server or by uploading the training
code and the model parameter of the place name identification model to the cloud
Google Colaboratory in step 10.
Beneficial effects: Based on the present invention, current problems of a lack and
slow update of place name linguistic data, and time-consuming, laborious, and
inefficient manual construction of the place name linguistic data can be effectively
resolved, and intelligent update of the place name annotated corpus facing
multi-source, dynamic, heterogeneous, and exponentially growing Internet texts can
be effectively implemented. The present invention is widely applied to fields such as
ubiquitous geographic information mining, spatial location services, spatial
information retrieval, and natural language processing.
Fig. 1 is a flowchart of a method for intelligent construction of a place name
annotated corpus based on interactive and iterative learning according to the present
invention;
Fig. 2 is a screenshot of some data of a place name corpus according to an
embodiment of the present invention;
Fig. 3 is a screenshot of a list of some stop words according to an embodiment of
the present invention;
Fig. 4 is a screenshot of a pretrained word vector model according to an
embodiment of the present invention;
Fig. 5 is a screenshot of a result of matching characters in a dictionary and
pretrained word vectors according to an embodiment of the present invention;
Fig. 6 is a structural diagram of a model in which Bi-LSTM and CRF are
integrated according to an embodiment of the present invention;
Fig. 7 is a flowchart of Chinese place name identification in which Bi-LSTM and
CRF are integrated according to an embodiment of the present invention;
Fig. 8 is a screenshot of a CRF feature template according to an embodiment of
the present invention;
Fig. 9 is a screenshot of a training and evaluation result of a model in which
Bi-LSTM and CRF are integrated according to an embodiment of the present
invention;
Fig. 10 is an interface diagram of an interactive Chinese place name identification
and annotation platform according to an embodiment of the present invention;
Fig. 11 is an interface diagram of an identification result of Chinese place names
on an interactive annotation platform according to an embodiment of the present
invention;
Fig. 12 is an interface diagram of a result of human-machine interactive place
name annotation according to an embodiment of the present invention; and
Fig. 13 is an intelligent update interface diagram of an annotated corpus
according to an embodiment of the present invention.
The method of the present invention is further described in detail below with
reference to specific instances.
As shown in Fig. 1, a method for intelligent construction of a place name
annotated corpus based on interactive and iterative learning disclosed in an
embodiment of the present invention uses a method for integrating a bi-directional
long short-term memory model (Bi-LSTM) and a CRF model to implement identification of a place name entity in a text. Based on this, a human-machine interactive Chinese place name annotation platform is constructed, to perform place name identification on an Internet text, and human-machine interactive correction is performed on a place name identification result. When a scale of annotated Chinese place name text linguistic data reaches a specified threshold, initial training linguistic data is merged with place name annotation linguistic data, and the initial training corpus and the place name annotated corpus are re-input into a place name identification model for training, thereby optimizing a model parameter, improving a model identification effect, and adding new linguistic data to the place name annotated corpus. The above steps are iteratively looped, iterative training and learning are ended until a constructed corpus meets a requirement, thereby implementing intelligent construction and optimization of the place name corpus.
The method mainly includes three parts: the place name identification model in
which Bi-LSTM and CRF are integrated, a human-machine interactive Chinese place
name annotation method, and intelligent construction of the place name annotated
corpus based on iterative learning. Detailed steps are as follows:
Step 1: Read an initial place name annotated corpus data.
Place name linguistic data in geographic encyclopedia linguistic data and place
name linguistic data in Microsoft MSRA linguistic data (Fig. 2) are read.
Step 2: Preprocess the corpus data.
Sentences in the corpus data are segmented by using a blank line. Then word
segmentation is performed on the geographic encyclopedia linguistic data and the
Microsoft MSRA place name linguistic data by using a tool Jieba, a sentence is
deduplicated, and a stop word is deleted (Fig. 3).
Step 3: Generate a word vector matrix of the place name linguistic data based on
word2vec.
First, the geographic encyclopedia linguistic data is mixed with the Microsoft
MSRA linguistic data, and training is performed by using the tool Word2vec, to obtain
a character-level word vector model (Fig. 4).
Training parameters are as follows: a minimum quantity of appearance times of a word needing to be trained: min_count=5; word vector scale (dimension): size=100; a quantity of words transferred to a thread in each batch: batch_words=10000; training window: window=5; training algorithm: sg=1 (sg=0 is the CBOW algorithm, and sg=1 is the skip-gram algorithm); threads: workers=4; a quantity of iteration times: iter=50.
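The parameter list above maps onto the keyword arguments of the Word2vec tool. As a hedged sketch (parameter names follow gensim 3.x, which the patent does not name explicitly; newer gensim releases rename size to vector_size and iter to epochs), the configuration could be expressed as:

```python
# Hypothetical configuration dict mirroring the training parameters listed above.
w2v_params = {
    "min_count": 5,        # minimum appearance count for a word to be trained
    "size": 100,           # word vector dimension
    "batch_words": 10000,  # words transferred to a worker thread per batch
    "window": 5,           # context window size
    "sg": 1,               # 1 = skip-gram, 0 = CBOW
    "workers": 4,          # number of threads
    "iter": 50,            # number of training iterations
}

# With gensim installed, training the character-level model would look like:
# from gensim.models import Word2Vec
# model = Word2Vec(sentences=character_sentences, **w2v_params)
```

Here `character_sentences` stands in for the mixed, preprocessed corpus of steps 1 to 3, split into character sequences.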
Step 4: Generate a word vector matrix of a character in a place name linguistic
data set.
Each character in a place name annotated corpus is represented by using the word
vector model, to generate a 1×100 word vector matrix of each character (Fig. 5).
Step 5: Generate a disambiguation matrix of the character in the place name
linguistic data set.
Word segmentation and part-of-speech annotation are performed on the sentence
by using the tool Jieba. Based on a word segmentation result, meanings of characters
in the sentence are classified into 4 categories, represented by the numbers 0, 1, 2, and 3:
0 indicates that a character is a single-character word, 1 indicates that a character is the
beginning of a word, 2 indicates that the character is in the middle of a word, and 3
indicates that the character is the end of a word. For example, the five-character Chinese
sentence for "I am Chinese" may be expressed as [0, 0, 1, 2, 3]. Based on the word
segmentation result, a 1×20 vector matrix (briefly referred to as a disambiguation matrix
of the character) is generated for each character in the sentence, to achieve the purpose
of eliminating a plurality of semantic expressions of the character. For example, "shang"
may be an independent positional preposition or a character in the noun "Shanghai".
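The 0/1/2/3 position encoding can be sketched as follows. This is a minimal illustration, not the patent's own code: `position_labels` is a hypothetical helper name, and the Jieba call is replaced by a fixed token list standing in for a segmentation result.

```python
def position_labels(tokens):
    """Map a Jieba-style segmentation (list of word tokens) to per-character codes:
    0 = single-character word, 1 = word beginning, 2 = word middle, 3 = word end."""
    labels = []
    for tok in tokens:
        if len(tok) == 1:
            labels.append(0)
        else:
            labels.extend([1] + [2] * (len(tok) - 2) + [3])
    return labels

# "I am Chinese" segmented into three words of 1, 1, and 3 characters.
print(position_labels(["我", "是", "中国人"]))  # [0, 0, 1, 2, 3]
```

The resulting per-character codes would then be expanded into the 1×20 disambiguation vectors described above.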
Step 6: Place name identification model in which Bi-LSTM and CRF are
integrated.
The word vector matrix of each character in the sentence and the disambiguation
matrix of the corresponding character are spliced, to finally obtain a word vector matrix of the sentence; the word vector matrix of the sentence is input, for training, into the place name identification model in which Bi-LSTM and CRF are integrated.
An optimal place name identification model is selected by using three evaluation
indicators of a natural language processing field: precision P, a recall rate R, and a
comprehensive value F (referring to Fig. 6 and Fig. 7). Details are specifically as
follows:
Step 1: Splice the word vector matrix of each character in the sentence and the
disambiguation matrix of the corresponding character, to obtain the word vector
matrix of the sentence as an input layer (a first layer of the model), and input the word
vector matrix into the Bi-LSTM for training.
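The splicing in Step 1 can be sketched numerically. This is a minimal sketch using NumPy with random stand-in vectors; the 100- and 20-dimensional sizes follow steps 4 and 5 above, and the sentence length is an arbitrary example value.

```python
import numpy as np

n = 7  # characters in an example sentence
rng = np.random.default_rng(0)
word_vecs = rng.standard_normal((n, 100))  # 1x100 word vector per character
disamb = rng.standard_normal((n, 20))      # 1x20 disambiguation vector per character

# Splice the two matrices character by character: each row becomes 120-dimensional,
# giving the word vector matrix of the sentence that is fed to the Bi-LSTM.
sentence_matrix = np.concatenate([word_vecs, disamb], axis=1)
print(sentence_matrix.shape)  # (7, 120)
```

Each row of `sentence_matrix` is one character's spliced representation, used as one time-step input of the Bi-LSTM.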
Step 2: Set a dropout regularization method, to prevent model overfitting.
During a training process of dropout, some input is randomly discarded. In this case, a
parameter corresponding to the discarded part is not updated. Equivalently, dropout is
an integration method, in which results of all sub-networks are combined, and various
sub-networks may be obtained by randomly discarding input.
Step 3: Use a sentence sequence (x1, x2, ..., xn) of the input layer as input of the
time steps of the Bi-LSTM, where xi indicates an ith character in the sentence; and
then splice a forward LSTM hidden output sequence (f1, f2, ..., fn) and a backward
LSTM hidden output sequence (b1, b2, ..., bn) based on positions, to obtain a complete
hidden output sequence (f1, f2, ..., fn, b1, b2, ..., bn), where semantic description
information above and below is fully considered to achieve deep learning and
representation of features.
Step 4: After dropout is set, connect a linear layer, to convert the complete hidden
output sequence from 2n dimensions to k dimensions, where n indicates a
quantity of characters in the sentence, and k is a quantity of tag categories in the
annotation set. There are four categories of tags in total in the annotated corpus: B, I, E,
and O (B indicates a beginning character of a place name, I indicates a middle character of the place name, E indicates an end character of the place name, and O indicates a non-place-name character). The complete hidden output sequence is recorded as a matrix P of size n×k, so that features of the sentence are automatically extracted.
Step 5: Based on an output layer matrix of a Bi-LSTM model, set dropout to
prevent model overfitting; and input the output layer matrix into a CRF model for
sentence sequence annotation, that is, predict a tag for each character.
For a tag sequence y = (y1, y2, ..., yn) whose length is equal to the sentence
length, the model scores a sentence x whose tag sequence is y as follows:

s(x, y) = Σ_{i=1}^{n} P_{i,yi} + Σ_{i=1}^{n+1} A_{y(i-1),yi}

where P_{i,yi} is the probability of outputting yi at the ith position, that
is, an initial score; A_{y(i-1),yi} is the probability of performing transition
from y(i-1) to yi, that is, a conversion score. The score of the entire
sequence is equal to the sum of the scores at the various positions, and the
score at each position is obtained from two parts: one part is determined by
P_{i,yi} output by the LSTM, and the other part is determined by the transition
matrix A of the CRF. The normalized probability obtained by using Softmax is as
follows:

P(y|x) = exp(s(x, y)) / Σ_{y'} exp(s(x, y'))

where the numerator indicates the exponential of the score given by the model
to the sentence x whose tag sequence is y, and the denominator indicates the
sum of the exponentials of the scores over all candidate tag sequences y'.
According to the obtained normalized probability, the candidate tag sequences
are sorted to identify a place name.
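The scoring and normalization formulas above can be checked with a brute-force sketch that enumerates every candidate tag sequence (feasible only for toy sizes; real CRF implementations use dynamic programming). START/STOP handling and the function names are illustrative assumptions:

```python
import math
from itertools import product

def path_score(P, A, y, start, stop):
    """s(x, y): the emission scores P[i][y_i] plus the n+1 transition
    scores A[y_{i-1}][y_i], taking y_0 = START and y_{n+1} = STOP."""
    tags = [start] + list(y) + [stop]
    emission = sum(P[i][t] for i, t in enumerate(y))
    transition = sum(A[tags[i]][tags[i + 1]] for i in range(len(tags) - 1))
    return emission + transition

def normalized_prob(P, A, y, start, stop):
    """Softmax over every candidate tag sequence y' of the same length:
    exp(s(x, y)) / sum_{y'} exp(s(x, y'))."""
    k = len(P[0])
    z = sum(math.exp(path_score(P, A, yp, start, stop))
            for yp in product(range(k), repeat=len(P)))
    return math.exp(path_score(P, A, y, start, stop)) / z
```

With zero transitions the probabilities reduce to a softmax over summed emission scores, which is a quick sanity check on the formula.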
Step 6: Select the optimal place name identification model by using the three
evaluation indicators of the natural language processing field: the precision P, the
recall rate R, and the comprehensive value F.
Step 7: Human-machine interactive Chinese place name annotation.
First, an interactive Chinese place name annotation platform is developed through
Python GUI programming (Tkinter), the Chinese place name identification model in
step 6 is embedded into the interactive Chinese place name annotation platform, and
place name identification is performed on an Internet text. Then human-machine
interactive correction is performed on a Chinese place name identification result.
Finally, a place name finally identified in the Internet text, an added place name tag,
and a deleted place name tag that is wrongly tagged are all visually displayed in a
corresponding window.
Step 8: Update the place name annotated corpus.
When the annotated Chinese place name text linguistic data in step 7 reaches a
specified quantity of characters (a threshold), the interactive place name
annotation platform automatically merges the initial training linguistic data
with the newly annotated text linguistic data, to update the place name
annotated corpus.
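The threshold-triggered merge in step 8 might look like the following sketch. The function name and corpus representation are assumptions; the 100,000-character threshold is the value given for the platform in the embodiment.

```python
# Threshold at which the platform merges new annotations into the corpus
# (100,000 characters, per the embodiment described in the specification).
THRESHOLD = 100_000

def maybe_update_corpus(initial_corpus, new_annotated_texts):
    """Merge the newly annotated texts into the training corpus once their
    accumulated character count reaches the threshold; until then, keep
    accumulating and leave the corpus unchanged."""
    accumulated = sum(len(text) for text in new_annotated_texts)
    if accumulated < THRESHOLD:
        return initial_corpus, False                   # keep collecting
    return initial_corpus + new_annotated_texts, True  # retrain on merged corpus
```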
Step 9: Iteratively optimize the Chinese place name identification model.
The training code and the model parameter of the place name identification
model in step 6 are uploaded to a local server or the cloud Google Colaboratory, and
training is continued by using, as training linguistic data, the place name linguistic
data generated in step 8, to optimize the parameter of the model, and improve a model
identification effect. A model training progress, final precision, the recall rate, and the
value F are displayed on the interactive annotation platform.
Step 10: Intelligently update the place name annotated corpus.
Iterative looping from step 2 to step 9 is performed, to intelligently optimize the
annotated corpus, and iterative training and learning are ended until the place name
identification effect and the scale of the place name corpus meet a user requirement.
Main parts of the solutions of the embodiments of the present invention are
further described below with reference to specific experimental examples.
Part 1: A Chinese place name identification method in which Bi-LSTM and CRF
are integrated.
The corpus data in this method separately uses the geographic encyclopedia
linguistic data, the Microsoft MSRA linguistic data, and linguistic data
obtained by mixing the geographic encyclopedia and Microsoft MSRA corpora
(referred to as mixed linguistic data below).
The geographic encyclopedia linguistic data has about 1.18 million characters,
among which a character quantity in a training set accounts for about 82%, a character
quantity in a verification set accounts for about 5%, and a character quantity in a test
set accounts for about 13%. The geographic encyclopedia linguistic data is thematic
linguistic data of place names. Place name entities are in a large quantity and evenly
distributed in a text, and a description text contains rich geographic semantic
relations.
The Microsoft MSRA linguistic data has about 2.36 million characters, among
which a character quantity in a training set accounts for about 85%, a character
quantity in a verification set accounts for about 7%, and a character quantity in a test
set accounts for about 8%. Place name entities in the Microsoft MSRA linguistic data
are in a relatively small quantity in a text and are sparsely and unevenly distributed.
The mixed linguistic data has about 3.57 million characters, among which a
character quantity in a training set accounts for about 85%, a character quantity in a
verification set accounts for about 6%, and a character quantity in a test set accounts
for about 9%. Place name entities in the mixed linguistic data are in an intermediate
quantity in a text, and are relatively evenly distributed.
In this example, 7 groups of experiments (see Table 1) are set for comparison, to
evaluate an effect of this method.
Table 1 Settings of place name identification experiments

Experiment name | Experiment content |
---|---|
Experiment 1 | Use the geographic encyclopedia linguistic data and a CRF-based method |
Experiment 2 | Use the Microsoft linguistic data and a CRF-based method |
Experiment 3 | Use the mixed linguistic data and a CRF-based method |
Experiment 4 | Geographic encyclopedia linguistic data with a randomly generated word vector matrix as input layer + dropout + Bi-LSTM + dropout + CRF |
Experiment 5 | Geographic encyclopedia linguistic data + disambiguation + pre-trained word vector + dropout + Bi-LSTM + dropout + CRF |
Experiment 6 | Microsoft linguistic data + disambiguation + pre-trained word vector + dropout + Bi-LSTM + dropout + CRF |
Experiment 7 | Linguistic data obtained by mixing the geographic encyclopedia corpus and the Microsoft corpus + disambiguation + pre-trained word vector + dropout + Bi-LSTM + dropout + CRF |
(1) Experiments 1, 2, and 3
The experiments 1, 2, and 3 apply the traditional CRF-based Chinese place name
identification method to different linguistic data. The same feature template
(Fig. 8) is used, and each corpus is trained to obtain a corresponding CRF
model. Model evaluation results are shown in Table 2.
Table 2 Place name evaluation results of the experiments 1, 2, and 3

Experiment name | Precision P (%) | Recall rate R (%) | Comprehensive value F (%) |
---|---|---|---|
Experiment 1 | 89.82 | 88.61 | 89.21 |
Experiment 2 | 89.81 | 79.18 | 84.16 |
Experiment 3 | 88.24 | 83.94 | 86.04 |
(2) Experiment 4
First, deduplication and stop word deletion are performed on a geographic
encyclopedia data set, and a word vector matrix corresponding to each character in the
data set is randomly generated by using a tool Word2vec. The word vector matrix is then input into Bi-LSTM+CRF for training to obtain a model. Settings of training parameters of the Bi-LSTM model are shown in Table 3, and evaluation results are shown in Table 4.
Table 3 Settings of the training parameters of the Bi-LSTM model

Parameter | Value |
---|---|
Learning rate | 0.001 |
Dropout | 0.5 |
Maximum gradient | 5 |
Quantity of model iteration times | 100 |
Tag category | Four categories (BIEO) |
Table 4 Place name identification and evaluation result of the experiment 4

Experiment name | Precision P (%) | Recall rate R (%) | Comprehensive value F (%) |
---|---|---|---|
Experiment 4 | 80.73 | 84.44 | 82.54 |
(3) Experiments 5, 6, and 7
The experiments 5, 6, and 7 apply the same integrated "bidirectional long
short-term memory model and CRF model" place name identification method to
different linguistic data. Therefore, the experiment steps are the same.
First, the geographic encyclopedia linguistic data is mixed with the Microsoft
linguistic data, deduplication and stop word deletion are performed, training is
performed by using the tool Word2vec, to obtain a character-level word vector model,
and each character in the place name annotated corpus is represented by using the
word vector model, to generate a word vector matrix of each character. Then, word
segmentation and part-of-speech annotation are performed on a sentence by using the
tool Jieba, to generate a disambiguation matrix of the character, and the
disambiguation matrix and the word vector matrix of each character in the sentence
are spliced and input to the Bi-LSTM model for training. In addition, 100 model
results are evaluated and compared to obtain an optimal model (as shown in Fig. 9).
The evaluation results are shown in Table 5.
Table 5 Place name identification and evaluation results of the experiments 5, 6, and 7

Experiment name | Precision P (%) | Recall rate R (%) | Comprehensive value F (%) |
---|---|---|---|
Experiment 5 | 95.09 | 93.17 | 94.12 |
Experiment 6 | 92.86 | 89.91 | 91.36 |
Experiment 7 | 90.87 | 89.53 | 90.65 |
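The feature construction shared by experiments 5, 6, and 7 splices each character's pre-trained word vector with its disambiguation vector. A sketch with the 1×100 and 1×20 sizes stated in the claims; the values here are random stand-ins, not trained vectors:

```python
import random

random.seed(0)
# Toy stand-ins for a 3-character sentence: a 100-dimensional pre-trained
# character vector and a 20-dimensional disambiguation vector per character
# (sizes per the claims; the values are random placeholders).
word_vecs = [[random.random() for _ in range(100)] for _ in range(3)]
disamb_vecs = [[random.random() for _ in range(20)] for _ in range(3)]

def splice_features(word_vec, disamb_vec):
    """Concatenate a character's word vector and disambiguation vector into
    one 120-dimensional input row; stacking the rows gives the sentence
    matrix fed to the Bi-LSTM."""
    return word_vec + disamb_vec

sentence_matrix = [splice_features(w, d) for w, d in zip(word_vecs, disamb_vecs)]
```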
Based on a same corpus, compared with the traditional CRF-based Chinese place
name identification method, precision, a recall rate, and a comprehensive value in this
method are all increased (see Table 6).
Table 6 Comparison of place name identification and evaluation results of same
linguistic data and different identification models

Experiment | Linguistic data | Variation (%) of the value P | Variation (%) of the value R | Variation (%) of the value F |
---|---|---|---|---|
Experiment 5 vs experiment 1 | Geographic encyclopedia | 5.27 | 4.56 | 4.91 |
Experiment 6 vs experiment 2 | Microsoft linguistic data | 3.05 | 10.73 | 7.2 |
Experiment 7 vs experiment 3 | Mixed linguistic data | 2.63 | 5.59 | 4.61 |
Part 2: A method for intelligent construction of a place name corpus based on
interactive and iterative learning
Step 1: First, develop an interactive Chinese place name annotation platform (see
Fig. 10) through Python GUI programming (Tkinter), and embed, into the interactive
Chinese place name annotation platform, the Chinese place name identification model
in which Bi-LSTM and CRF are integrated; and when a button "place name
identification" is clicked, perform place name entity identification on an input Internet text, and automatically attach a place name tag to a place name (see Fig. 11).
Step 2: Manually perform interactive correction on the Chinese place name
identification result. For a place name that is not identified, right-click it
and select the function "set as a place name" to add a place name tag to the
untagged place name; for a place name that is wrongly identified, right-click
the wrongly tagged place name and select the function "cancel setting" to
delete the corresponding place name tag.
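The two right-click actions can be sketched as edits on a per-character BIEO tag sequence. The function names are illustrative, and the treatment of one-character place names is an assumption not stated in the source:

```python
def set_as_place_name(tags, start, end):
    """The "set as a place name" action: mark characters start..end
    (inclusive) as a place name span with B/I/E tags. How a one-character
    name is tagged is not stated in the source; a lone character gets B
    here as an assumption."""
    tags = list(tags)
    for i in range(start, end + 1):
        if i == start:
            tags[i] = "B"
        elif i == end:
            tags[i] = "E"
        else:
            tags[i] = "I"
    return tags

def cancel_setting(tags, start, end):
    """The "cancel setting" action: delete a wrongly added place name tag
    by restoring the span to O (non-place-name)."""
    tags = list(tags)
    for i in range(start, end + 1):
        tags[i] = "O"
    return tags
```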
Step 3: Visually display, in a corresponding window, a place name finally
identified in the Internet text, an added place name tag, and a deleted place name tag
that is wrongly tagged (see Fig. 12).
Step 4: Save the foregoing final tagging result by clicking a button "save a place
name annotation result"; when an accumulated quantity of saved characters in Internet
texts annotated with place names is greater than a threshold (in the present invention,
the quantity is set to 100,000 characters), the platform automatically merges an initial
training corpus with a text corpus in which place names are tagged, and inputs the
initial training linguistic data and the text linguistic data into the Chinese place name
identification model in which Bi-LSTM and CRF are integrated in the part 1 for
retraining, thereby optimizing a parameter of the model, and improving a model
identification effect; and display a model training progress, final precision, a recall
rate, and the value F on an interface (see Fig. 13).
Step 5: Add the foregoing new linguistic data to the place name annotated corpus,
perform iterative looping from step 1 to step 4, and end iterative training and learning
until a place name identification effect and a scale of the place name corpus meet a
user requirement.
Claims (5)

1. A method for intelligent construction of a place name annotated corpus based on interactive and iterative learning, comprising the following steps:
step 1: reading initial place name annotated corpus data, comprising geographic encyclopedia linguistic data and Microsoft MSRA linguistic data;
step 2: preprocessing the place name annotated corpus data, comprising segmenting sentences by using a blank line, deduplicating sentences, and deleting stop words;
step 3: mixing the geographic encyclopedia linguistic data and the Microsoft MSRA linguistic data, and performing training by using the tool Word2vec, to obtain a character-level word vector model;
step 4: representing each character in the place name annotated corpus by using the word vector model, to generate a 1×100 word vector matrix of each character;
step 5: performing word segmentation and part-of-speech annotation on a sentence by using the tool Jieba, and generating, as a disambiguation matrix of the character, a 1×20 vector matrix of each character in the sentence based on the word segmentation result;
step 6: splicing the word vector matrix of each character in the sentence and the disambiguation matrix of the corresponding character, to finally obtain a word vector matrix of the sentence; inputting, for training, the word vector matrix into a place name identification model in which Bi-LSTM and CRF are integrated; and selecting an optimal place name identification model by using three evaluation indicators of a natural language processing field: precision P, a recall rate R, and a comprehensive value F;
step 7: developing an interactive Chinese place name annotation platform, and embedding the place name identification model in step 6 into the interactive Chinese place name annotation platform;
step 8: performing place name identification on a new Internet text on the interactive place name annotation platform, and performing human-machine interactive correction on a place name identification result; and visually displaying, in a corresponding window, a place name finally identified in the Internet text, an added place name tag, and a deleted place name tag that is wrongly tagged;
step 9: when a scale of annotated place name text linguistic data in step 8 reaches a specified threshold, automatically merging, by the interactive place name annotation platform, initial place name annotation linguistic data with place name linguistic data on which human-machine interactive correction is performed, to update the place name corpus;
step 10: continuing training, with the training code and the model parameter of the place name identification model in step 6, by using, as training linguistic data, the place name linguistic data generated in step 9, to optimize the parameter of the model and improve a model identification effect; and displaying a model training progress, final precision, the recall rate, and the value F on the interactive annotation platform; and
step 11: performing iterative looping from step 2 to step 10 for the new Internet text, to intelligently update and optimize the place name annotated corpus, and ending iterative training and learning until the place name identification effect and the scale of the place name annotated corpus meet a user requirement.
- 2. The method for intelligent construction of the place name annotated corpus based on interactive and iterative learning according to claim 1, wherein step 6 specifically comprises:
step 1: splicing the word vector matrix of each character in the sentence and the disambiguation matrix of the corresponding character, to obtain the word vector matrix of the sentence as an input layer, and inputting the word vector matrix into the Bi-LSTM for training;
step 2: setting a dropout regularization method, to prevent model overfitting;
step 3: using a sentence sequence (x1, x2, ..., xn) of the input layer as input of time steps of the Bi-LSTM, wherein n indicates a quantity of characters in a sentence, and xi indicates an ith character in the sentence; and then splicing a forward LSTM hidden output sequence (f1, f2, ..., fn) and a backward LSTM hidden output sequence (b1, b2, ..., bn) based on positions, to obtain a complete hidden output sequence (f1, f2, ..., fn, b1, b2, ..., bn), wherein semantic description information above and below is fully considered to achieve deep learning and representation of features;
step 4: after dropout is set, connecting a linear layer, to convert the complete hidden output sequence from 2n dimensions to k dimensions, wherein the complete hidden output sequence is denoted as a matrix P of size n × k, wherein k is a quantity of tag categories in an annotation set, including four categories of tags: B, I, E, and O, wherein B indicates a beginning character of a place name, I indicates a middle character of the place name, E indicates an end character of the place name, and O indicates a non-place-name character, so that features of the sentence are automatically extracted;
step 5: based on an output layer matrix of a Bi-LSTM model in step 4, setting dropout to prevent model overfitting, and inputting the output layer matrix into a CRF model for sentence sequence annotation, that is, predicting a tag for each character; and
step 6: selecting the optimal place name identification model by using the three evaluation indicators of the natural language processing field: the precision P, the recall rate R, and the comprehensive value F.
- 3. The method for intelligent construction of the place name annotated corpus based on interactive and iterative learning according to claim 2, wherein performing sentence sequence annotation based on the CRF model in step 5 is specifically as follows:
for a tag sequence y = (y1, y2, ..., yn) whose length is equal to a sentence length, a model scores a sentence x whose tag is equal to y as follows:
s(x, y) = Σ_{i=1}^{n} P_{i,yi} + Σ_{i=1}^{n+1} A_{y(i-1),yi}
wherein P_{i,yi} is a probability of outputting yi at an ith position, A_{y(i-1),yi} is a probability of performing transition from y(i-1) to yi, a score of the entire sequence is equal to a sum of scores at various positions, and a score at each position is obtained based on two parts: one part is determined by P_{i,yi} output by the LSTM, and the other part is determined by a transition matrix A of the CRF; and a normalized probability obtained by using Softmax is as follows:
P(y|x) = exp(s(x, y)) / Σ_{y'} exp(s(x, y'))
wherein a numerator indicates the exponential of the score given by the model to the sentence x whose tag is equal to y, and a denominator indicates the sum of the exponentials of the scores over all candidate tag sequences y'; according to the obtained normalized probability, the candidate tag sequences are sorted to identify a place name.
- 4. The method for intelligent construction of the place name annotated corpus based on interactive and iterative learning according to claim 1, wherein the interactive Chinese place name annotation platform is implemented by using the Python GUI programming tool Tkinter.
- 5. The method for intelligent construction of the place name annotated corpus based on interactive and iterative learning according to claim 1, wherein in step 10 the model is optimized on a local server or by uploading the training code and the model parameter of the place name identification model to the cloud Google Colaboratory.

[Figs. 1 to 13: drawing sheets 1/10 to 10/10 accompanying the specification]
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911029958.2 | 2019-10-28 | ||
CN201911029958.2A CN110826331B (en) | 2019-10-28 | 2019-10-28 | Intelligent construction method of place name labeling corpus based on interactive and iterative learning |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2020103654A4 true AU2020103654A4 (en) | 2021-01-14 |
Family
ID=69550890
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2020103654A Ceased AU2020103654A4 (en) | 2019-10-28 | 2020-04-21 | Method for intelligent construction of place name annotated corpus based on interactive and iterative learning |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN110826331B (en) |
AU (1) | AU2020103654A4 (en) |
WO (1) | WO2021082366A1 (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113407439A (en) * | 2021-05-24 | 2021-09-17 | 西北工业大学 | Detection method for software self-recognition type technical debt |
CN113657103A (en) * | 2021-08-18 | 2021-11-16 | 哈尔滨工业大学 | Non-standard Chinese express mail information identification method and system based on NER |
CN113722530A (en) * | 2021-09-08 | 2021-11-30 | 云南大学 | Fine-grained geographical position positioning method |
CN114169330A (en) * | 2021-11-24 | 2022-03-11 | 匀熵教育科技(无锡)有限公司 | Chinese named entity identification method fusing time sequence convolution and Transformer encoder |
CN114943230A (en) * | 2022-04-17 | 2022-08-26 | 西北工业大学 | Chinese specific field entity linking method fusing common knowledge |
CN117436449A (en) * | 2023-11-01 | 2024-01-23 | 哈尔滨工业大学 | Crowd-sourced named entity recognition model and system based on multi-source domain adaptation and reinforcement learning |
CN117669574A (en) * | 2024-02-01 | 2024-03-08 | 浙江大学 | Artificial intelligence field entity identification method and system based on multi-semantic feature fusion |
CN117669574B (en) * | 2024-02-01 | 2024-05-17 | 浙江大学 | Artificial intelligence field entity identification method and system based on multi-semantic feature fusion |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110826331B (en) * | 2019-10-28 | 2023-04-18 | 南京师范大学 | Intelligent construction method of place name labeling corpus based on interactive and iterative learning |
CN111522914B (en) * | 2020-04-20 | 2023-05-12 | 北大方正集团有限公司 | Labeling data acquisition method and device, electronic equipment and storage medium |
CN112711621A (en) * | 2021-01-18 | 2021-04-27 | 湛江市前程网络有限公司 | Universal object interconnection training platform and control method and device |
US11769015B2 (en) | 2021-04-01 | 2023-09-26 | International Business Machines Corporation | User interface disambiguation |
CN113190678B (en) * | 2021-05-08 | 2023-10-31 | 陕西师范大学 | Chinese dialect language classification system based on parameter sparse sharing |
CN113221575B (en) * | 2021-05-28 | 2022-08-02 | 北京理工大学 | PU reinforcement learning remote supervision named entity identification method |
CN113486173B (en) * | 2021-06-11 | 2023-09-12 | 南京邮电大学 | Text labeling neural network model and labeling method thereof |
CN113255328B (en) * | 2021-06-28 | 2024-02-02 | 北京京东方技术开发有限公司 | Training method and application method of language model |
CN113486127A (en) * | 2021-07-23 | 2021-10-08 | 上海明略人工智能(集团)有限公司 | Knowledge alignment method, system, electronic device and medium |
CN113610993B (en) * | 2021-08-05 | 2022-05-17 | 南京师范大学 | 3D map building object annotation method based on candidate label evaluation |
CN113642336B (en) * | 2021-08-27 | 2024-03-08 | 青岛全掌柜科技有限公司 | SaaS-based insurance automatic question-answering method and system |
CN113901826A (en) * | 2021-12-08 | 2022-01-07 | 中国电子科技集团公司第二十八研究所 | Military news entity identification method based on serial mixed model |
CN114818717A (en) * | 2022-05-25 | 2022-07-29 | 华侨大学 | Chinese named entity recognition method and system fusing vocabulary and syntax information |
CN117435746B (en) * | 2023-12-18 | 2024-02-27 | 广东信聚丰科技股份有限公司 | Knowledge point labeling method and system based on natural language processing |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7069216B2 (en) * | 2000-09-29 | 2006-06-27 | Nuance Communications, Inc. | Corpus-based prosody translation system |
CN102314417A (en) * | 2011-09-22 | 2012-01-11 | 西安电子科技大学 | Method for identifying Web named entity based on statistical model |
CN107102989B (en) * | 2017-05-24 | 2020-09-29 | 南京大学 | Entity disambiguation method based on word vector and convolutional neural network |
CN107861939B (en) * | 2017-09-30 | 2021-05-14 | 昆明理工大学 | Domain entity disambiguation method fusing word vector and topic model |
CN108446269B (en) * | 2018-03-05 | 2021-11-23 | 昆明理工大学 | Word sense disambiguation method and device based on word vector |
CN109359291A (en) * | 2018-08-28 | 2019-02-19 | 昆明理工大学 | A kind of name entity recognition method |
CN109885824B (en) * | 2019-01-04 | 2024-02-20 | 北京捷通华声科技股份有限公司 | Hierarchical Chinese named entity recognition method, hierarchical Chinese named entity recognition device and readable storage medium |
CN110134956A (en) * | 2019-05-14 | 2019-08-16 | 南京邮电大学 | Place name tissue name recognition method based on BLSTM-CRF |
CN110287482B (en) * | 2019-05-29 | 2022-07-08 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Semi-automatic participle corpus labeling training device |
CN110826331B (en) * | 2019-10-28 | 2023-04-18 | 南京师范大学 | Intelligent construction method of place name labeling corpus based on interactive and iterative learning |
2019
- 2019-10-28 CN CN201911029958.2A patent/CN110826331B/en active Active

2020
- 2020-04-21 WO PCT/CN2020/085809 patent/WO2021082366A1/en active Application Filing
- 2020-04-21 AU AU2020103654A patent/AU2020103654A4/en not_active Ceased
Also Published As
Publication number | Publication date |
---|---|
WO2021082366A1 (en) | 2021-05-06 |
CN110826331A (en) | 2020-02-21 |
CN110826331B (en) | 2023-04-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2020103654A4 (en) | Method for intelligent construction of place name annotated corpus based on interactive and iterative learning | |
Chang et al. | Chinese named entity recognition method based on BERT | |
CN109271529B (en) | Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian | |
CN110502644B (en) | Active learning method for field level dictionary mining construction | |
CN114036933B (en) | Information extraction method based on legal documents | |
CN110489523B (en) | Fine-grained emotion analysis method based on online shopping evaluation | |
CN114065758A (en) | Document keyword extraction method based on hypergraph random walk | |
CN111651572A (en) | Multi-domain task type dialogue system, method and terminal | |
Li et al. | Integrating language model and reading control gate in BLSTM-CRF for biomedical named entity recognition | |
CN115859980A (en) | Semi-supervised named entity identification method, system and electronic equipment | |
Xi et al. | Global encoding for long Chinese text summarization | |
Wei et al. | GP-GCN: Global features of orthogonal projection and local dependency fused graph convolutional networks for aspect-level sentiment classification | |
CN111178080A (en) | Named entity identification method and system based on structured information | |
CN112836062B (en) | Relation extraction method of text corpus | |
Xue et al. | A method of chinese tourism named entity recognition based on bblc model | |
Zhou et al. | Named entity recognition of ancient poems based on Albert-BiLSTM-MHA-CRF model | |
CN112257442A (en) | Policy document information extraction method based on corpus expansion neural network | |
Liu et al. | The extension of domain ontology based on text clustering | |
CN113779987A (en) | Event co-reference disambiguation method and system based on self-attention enhanced semantics | |
CN113869054A (en) | Deep learning-based electric power field project feature identification method | |
Kan et al. | Grid structure attention for natural language interface to bash commands | |
Shi et al. | Improve on Entity Recognition Method Based on BiLSTM-CRF Model for the Nuclear Technology Knowledge Graph | |
Qiao et al. | A Survey of Deep learning-based Image caption | |
Wang et al. | A text classification model for hypergraph convolutional neural networks with multi-feature fusion | |
Zhu et al. | Image based agorithm for automatic generation of chinese couplets |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |