CN110826331B

CN110826331B - Intelligent construction method of place name labeling corpus based on interactive and iterative learning

Info

Publication number: CN110826331B
Application number: CN201911029958.2A
Authority: CN
Inventors: 张春菊; 陈玉冰; 张雪英; 汪陈; 陈书慧
Original assignee: Nanjing Normal University; Hefei University of Technology
Current assignee: Nanjing Normal University; Hefei University of Technology
Priority date: 2019-10-28
Filing date: 2019-10-28
Publication date: 2023-04-18
Anticipated expiration: 2039-10-28
Also published as: AU2020103654A4; WO2021082366A1; CN110826331A

Abstract

The invention discloses an intelligent construction method of a place name labeling corpus based on interactive and iterative learning, which comprises the following steps: generating a word vector matrix and a disambiguation matrix for the characters of the sentences in an initial corpus, splicing the word vector matrix and the disambiguation matrix, and inputting the result into a Bi-LSTM and CRF integrated model for training to generate a place name recognition model; embedding the place name recognition model into a human-computer interactive place name labeling platform for human-computer interactive correction; and fusing the initial training corpus with the labeled place name corpus, optimizing the parameters of the place name recognition model, and ending iterative training and learning once the constructed corpus meets the requirements, thereby realizing intelligent construction and optimization of the place name corpus. The method addresses the current shortage and slow updating of place name corpus data and the time-consuming, labor-intensive and inefficient manual construction of place name corpora, and enables intelligent updating of the place name labeling corpus for multi-source, dynamic, heterogeneous and exponentially growing internet text.

Description

Intelligent construction method of place name labeling corpus based on interactive and iterative learning

Technical Field

The invention belongs to the technical field of geographic information processing, and particularly relates to an intelligent construction method of a place name labeling corpus based on interactive and iterative learning, which can be used for optimizing deep learning model parameters to the maximum extent, improving the place name recognition effect and realizing intelligent construction and optimization of the place name labeling corpus.

Background

With the rapid development of the Internet and the arrival of the era of big data and artificial intelligence, the world is entering the ubiquitous information society (Zhou Chenghu, 2011; Li Deren, 2012; Goodchild, 2017). Location big data is an important component of big data, and 80% of the world's information is related to location (Williams, 1987; Liu Jing, 2014). A place name is a proper name that people give to a specific geographic entity, is an important component of location information, and is indispensable information for surveying and mapping digital products. Place names are among the most common kinds of public information, are the positioning mode most readily accepted by ordinary people, and provide indispensable basic information resources for national administration, economic construction, domestic and foreign communication and the like.

Text is a typical ubiquitous source of geographic big data; its volume keeps growing, and its content spans many domains and is increasingly complex. Chinese text is unstructured, ambiguous and free-form, has complex composition, and has no explicit separators between words. Descriptions of place name entities in Chinese text have the following characteristics: (1) the internal composition of Chinese place name entities is complex and varied: besides simple place names there are many compound place names, that is, a place name entity may contain several nested place name entities, such as "Jiangsu, Nanjing, Jiangning"; (2) Chinese place names often contain other types of entity nouns, such as "ancestral road", which contains a person's name; (3) the length of Chinese place names varies greatly and includes both short names and full names: some consist of a single Chinese character, such as the one-character short names for England, America and Shanghai, while others exceed ten characters, such as "the Hong Kong Special Administrative Region of the People's Republic of China"; (4) a Chinese sentence is a sequence of characters and a place name entity is a segment of that sequence, and because there are no separators between Chinese words, the boundaries of place name entities are hard to recognize; (5) compared with common nouns, Chinese place name entities have no obvious distinguishing features such as capitalization or inflection; and (6) Chinese corpus resources are small in scale and slow to update. In particular, with the rapid development of "Internet+" and big data, new place names and unregistered place names are emerging in large numbers. Because of these factors, Chinese place name entity recognition cannot meet the requirements of ubiquitous location information services.

At present, Chinese place name recognition methods fall mainly into rule- and dictionary-based methods, statistical methods, and hybrids of the two. Rule- and dictionary-based methods mostly rely on linguistic experts to construct rule templates by hand; the rules usually depend on the specific language, domain and text style, are time-consuming to write and can hardly cover all language phenomena, and the resulting systems are costly and port poorly. Statistical methods design complex feature templates to extract features from different corpora, input the features into a classification model, and treat Chinese place name recognition as a sentence sequence labeling problem. These methods have the following defects: (1) they depend heavily on the corpus, and few large-scale general corpora are currently available for building and evaluating place name entity recognition systems; (2) hand-designed features require repeated experiments for modification, adjustment and selection, which is time-consuming and labor-intensive and requires the researcher to have extensive linguistic knowledge; (3) sparse data representation leads to an excessively large model parameter space and excessive computation and storage cost. In recent years, deep learning has provided a new approach to extracting information from natural language. Deep learning does not require hand-made feature templates; instead it optimizes the final output by learning features of the input corpus and representations of the context. The deep neural networks commonly used for Chinese named entity recognition at present include feedforward neural networks, recurrent neural networks (RNN) and the like. A feedforward neural network generally selects its input with a fixed-length window, so it falls short on sentences longer than the window and ignores the wider context of a word. A recurrent neural network (RNN) is a sequence model whose structure contains directed cycles; it can make full use of sequence information and has a memory function, so it handles short-distance dependencies well, but it suffers from vanishing gradients on long-distance dependencies. To overcome the shortcomings of the RNN, a variety of more complex RNN models have been proposed in succession, such as the bidirectional recurrent neural network (Bi-RNN) and the long short-term memory model (LSTM). LSTM performs notably well in natural language processing tasks because it can handle long-range dependencies.

Both traditional methods and deep-learning-based Chinese place name recognition methods depend heavily on a corpus, and the scale and coverage of the training corpus directly influence the recognition effect. The existing public place name corpora are as follows: (1) the People's Daily annotated corpus covers a wide range of content, including finance, military affairs, sports and entertainment, but the place name information it contains is sparsely and unevenly distributed; (2) the Encyclopedia of China, Chinese Geography corpus (geographic encyclopedia corpus for short, http://www.geoip.com.cn:9004/ITIS/corpus.html) is a place-name-specific corpus with independent intellectual property rights of Nanjing university; its place name entities are described in a standard way, are uniformly distributed, and carry rich spatial semantic relation information; (3) the Microsoft MSRA corpus matches the descriptive characteristics of free text more closely, but its place name entities are few and their distribution is sparse and uneven. At present, large-scale general corpora that can be used for building and evaluating place name entity recognition are scarce and slow to update, and manual construction of place name corpora is time-consuming, labor-intensive and inefficient, so model parameters cannot be optimized to the maximum extent during deep learning training, which degrades the place name recognition effect. Meanwhile, in the geographic information era, the large numbers of new and unregistered place names appearing in exponentially growing multi-source, dynamic and heterogeneous internet text cannot be handled effectively.

Disclosure of Invention

Purpose of the invention: the invention provides an intelligent construction method of a place name labeling corpus based on interactive and iterative learning, in order to optimize deep learning model parameters to the maximum extent, improve the place name recognition effect, and realize intelligent construction and optimization of the place name labeling corpus.

The technical scheme is as follows: in order to achieve the purpose, the invention adopts the following technical scheme:

an intelligent construction method of a place name labeling corpus based on interactive and iterative learning comprises the following steps:

step one: reading the data of the initial place name labeling corpus, which comprises the geographic encyclopedia corpus and the Microsoft MSRA corpus;

step two: preprocessing the place name labeling corpus data, wherein the preprocessing comprises separating sentences with blank lines, and deduplicating the sentences and deleting stop words;

step three: mixing the geographic encyclopedia corpus and the Microsoft MSRA corpus, and training by using a Word2vec tool to obtain a character-level Word vector model;

step four: each character in the place name labeling corpus is represented by the word vector model, and a 1×100 word vector matrix is generated for each character;

Step five: using the Jieba tool to perform word segmentation and part-of-speech tagging on the sentence, and generating, based on the word segmentation result, a 1×20 vector matrix for each character in the sentence as the disambiguation matrix of the character;

step six: splicing the word vector matrix of each character in the sentence with the disambiguation matrix of the corresponding character to obtain the word vector matrix of the sentence, inputting it into the Bi-LSTM and CRF integrated place name recognition model for training, and selecting the optimal place name recognition model using the three evaluation indexes common in natural language processing: precision P, recall R and the comprehensive value F;

step seven: developing an interactive Chinese place name labeling platform, and embedding the place name recognition model obtained in the step six into the interactive Chinese place name labeling platform;

step eight: performing place name recognition on new internet text in the interactive place name labeling platform, and performing human-computer interactive correction on the recognition result; the place names finally recognized in the internet text, the place name labels that were added, and the wrongly assigned place name labels that were deleted are displayed visually in corresponding windows;

step nine: when the size of the place name text corpus marked in the step eight reaches a set threshold value, the interactive place name marking platform automatically fuses the place name corpus corrected by human-computer interaction and the initial place name marking corpus data to realize the updating of the place name corpus;

step ten: using the place name corpus generated in step nine as the training corpus, continuing training with the place name recognition model training code and model parameters from step six, thereby optimizing the model parameters and improving the recognition effect of the model; the model training progress and the final precision, recall and F value are displayed on the interactive labeling platform;

step eleven: and for a new internet text, iteratively circulating the step two to the step ten to realize intelligent updating and optimization of the place name labeling corpus until the place name recognition effect and the place name labeling corpus scale reach the requirements of a user, and finishing iterative training learning.

Further, the sixth step specifically includes:

step1: splicing the word vector matrix of each character in the sentence with the disambiguation matrix of the corresponding character to obtain the word vector matrix of the sentence, using it as the input layer, and inputting it into the Bi-LSTM for training;

step2: setting Dropout regularization method to prevent over fitting of model;

step3: using the sentence sequence $(x_1, x_2, \ldots, x_n)$ of the input layer as the input for each time step of the Bi-LSTM, where n represents the number of characters in the sentence and $x_i$ represents the i-th character in the sentence, and then splicing the hidden output sequence $(f_1, f_2, \ldots, f_n)$ of the forward LSTM with the hidden output sequence $(b_1, b_2, \ldots, b_n)$ of the backward LSTM by position to obtain the complete hidden output sequence $(f_1, f_2, \ldots, f_n, b_1, b_2, \ldots, b_n)$, so that the preceding and following semantic context is fully considered and deep learning and representation of features are realized;

step4: after Dropout is set, connecting a linear layer in order to convert the complete hidden output sequence from 2 dimensions to k dimensions, denoted as the matrix $P^{n \times k}$, wherein k is the number of label categories in the label set; the label set is B, I, E, O, 4 labels in total, where B represents the initial character of a place name, I represents a middle character of a place name, E represents the ending character of a place name, and O represents a non-place-name character, so that the sentence features are automatically extracted;

step5: on the basis of the Bi-LSTM output layer matrix from Step4, setting Dropout to prevent overfitting of the model, and inputting the result into a CRF model to label the sentence sequence and predict the label of each character;

step6: selecting the optimal place name recognition model using the three evaluation indexes common in natural language processing: precision P, recall R and the comprehensive value F.

Further, in Step5, the method for labeling sentence sequences based on the CRF model specifically comprises the following steps:

for a tag sequence $y = (y_1, y_2, \ldots, y_n)$ whose length is equal to the sentence length, the model's score for labeling sentence x with y is:

$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=2}^{n} A_{y_{i-1}, y_i}$$

wherein $P_{i, y_i}$ is the score of outputting $y_i$ at the i-th position, and $A_{y_{i-1}, y_i}$ is the transition score from $y_{i-1}$ to $y_i$; the score of the entire sequence is equal to the sum of the scores of the positions, and the score of each position is derived from two parts, one part being determined by the LSTM output $P_{i, y_i}$ and the other part by the transfer matrix A of the CRF;

the normalized probability using Softmax is:

$$p(y \mid x) = \frac{\exp\big(\mathrm{score}(x, y)\big)}{\sum_{y'} \exp\big(\mathrm{score}(x, y')\big)}$$

wherein the numerator is the exponential of the score the model assigns to labeling sentence x with y, and the denominator is the sum of the exponentials of the scores over all candidate label sequences $y'$ of the sentence; the candidate label sequences are ranked according to the obtained normalized probability, and the purpose of place name identification is achieved.

Further, the interactive Chinese place name labeling platform is implemented with Tkinter, the Python GUI programming toolkit.

Further, in step ten, the place name recognition model training code and the model parameters are uploaded to a local server or to the cloud-based Google Colaboratory to optimize the model.

Beneficial effects: the method addresses the current shortage and slow updating of place name corpus data and the time-consuming, labor-intensive and inefficient manual construction of place name corpora, enables intelligent updating of the place name labeling corpus for multi-source, dynamic, heterogeneous and exponentially growing internet text, and can be widely applied in fields such as ubiquitous geographic information mining, spatial location services, spatial information retrieval and natural language processing.

Drawings

FIG. 1 is a flowchart of an intelligent construction method of a place name labeling corpus based on interactive and iterative learning according to the present invention;

FIG. 2 is a partial data screenshot of a place name corpus in an embodiment of the present invention;

FIG. 3 is a screenshot of a portion of a stop word list in an embodiment of the invention;

FIG. 4 is a diagram of a pre-trained word vector model screenshot in an embodiment of the present invention;

FIG. 5 is a screenshot of a result of a match between a word in a dictionary and a pre-training word vector in an embodiment of the present invention;

FIG. 6 is the structure diagram of the Bi-LSTM and CRF integrated model in the embodiment of the present invention;

FIG. 7 is a flow chart of Chinese place name recognition based on Bi-LSTM and CRF integration in the embodiment of the present invention;

FIG. 8 is a screenshot of a CRF feature template in an embodiment of the present invention;

FIG. 9 is a screenshot of the results of training and evaluating the Bi-LSTM and CRF integrated models in the embodiment of the present invention;

FIG. 10 is an interface diagram of an interactive Chinese place name recognition and annotation platform in an embodiment of the present invention;

FIG. 11 is an interface diagram of the recognition result of the Chinese place name in the interactive annotation platform according to the embodiment of the present invention;

FIG. 12 is a diagram illustrating a human-computer interactive place name labeling result interface according to an embodiment of the present invention;

FIG. 13 is a diagram of an intelligent update interface of an annotation corpus in an embodiment of the present invention.

Detailed Description

The process of the present invention is described in further detail below with reference to specific examples.

As shown in FIG. 1, the intelligent construction method of a place name labeling corpus based on interactive and iterative learning disclosed in the embodiment of the invention integrates a bidirectional long short-term memory model (Bi-LSTM) with a CRF model to recognize place name entities in text. On this basis, a human-computer interactive Chinese place name labeling platform is built to recognize place names in internet text and to correct the recognition results interactively. When the size of the labeled Chinese place name text corpus reaches a set threshold, the initial training corpus and the labeled place name corpus are fused and input into the place name recognition model again for training, which optimizes the model parameters, improves the recognition effect, and adds the new corpus to the place name labeling corpus. These steps are iterated until the constructed corpus meets the requirements, at which point iterative training and learning ends and intelligent construction and optimization of the place name corpus is achieved.

The method mainly comprises three parts: a place name recognition model integrating Bi-LSTM and CRF, a human-computer interactive Chinese place name labeling method, and an iterative-learning-based intelligent construction method for the place name labeling corpus. The detailed steps are as follows:

Step one: reading initial place name labeling corpus data

The place name corpus data of the geographic encyclopedia corpus and the Microsoft MSRA corpus are read (FIG. 2).

Step two: corpus data preprocessing

Sentences in the corpus data are separated by blank lines. The geographic encyclopedia corpus and the Microsoft MSRA place name corpus are then segmented into words with the Jieba tool, and the sentences are deduplicated and stop words are deleted (FIG. 3).

Step three: Word2vec-based generation of the place name corpus word vector matrix
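The preprocessing described above can be sketched as follows. This is a minimal illustration, assuming the corpus is plain text with sentences separated by blank lines and tokens separated by whitespace; the function name `preprocess`, the sample text and the one-entry stop-word set are hypothetical and are not taken from the embodiment.

```python
import re

def preprocess(raw_text, stop_words):
    """Split on blank lines, delete stop words, and drop duplicate sentences."""
    seen = set()
    sentences = []
    for block in re.split(r"\n\s*\n", raw_text.strip()):
        # Tokens are assumed to be whitespace-separated (e.g. after segmentation).
        tokens = [tok for tok in block.split() if tok not in stop_words]
        key = tuple(tokens)
        if tokens and key not in seen:   # skip empty and already-seen sentences
            seen.add(key)
            sentences.append(tokens)
    return sentences

# Hypothetical sample: the first two sentences are duplicates.
raw = "南京 位于 江苏\n\n南京 位于 江苏\n\n上海 的 外滩"
print(preprocess(raw, stop_words={"的"}))
```

In this sketch duplicates are detected after stop-word deletion; the source does not say which order the embodiment uses.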

Firstly, mixing geographic encyclopedia linguistic data and Microsoft MSRA linguistic data, and training by using a Word2vec tool to obtain a character-level Word vector model (figure 4);

The training parameters were as follows:

minimum number of occurrences required for a word to be trained: min_count=5;

word vector size (dimension): size=100;

number of words passed to a thread per batch: batch_words=10000;

training window: window=5;

training algorithm: sg=1 (sg=0 for the CBOW algorithm, sg=1 for the skip-gram algorithm);

threads: workers=4;

number of iterations: iter=50.
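The parameter names listed above match the keyword arguments of the `Word2Vec` class in the gensim library, so the sketch below assumes gensim was the Word2vec tool; the source does not name the library. In gensim 4.x the arguments `size` and `iter` were renamed to `vector_size` and `epochs`, and the helper `word2vec_kwargs` (a hypothetical name) performs that renaming. The training function imports gensim lazily and is only a sketch of how the call might look.

```python
# Parameters as listed in the embodiment (gensim 3.x argument names).
PARAMS_GENSIM3 = {
    "min_count": 5,
    "size": 100,
    "batch_words": 10000,
    "window": 5,
    "sg": 1,          # 1 = skip-gram, 0 = CBOW
    "workers": 4,
    "iter": 50,
}

# gensim 4.x renamed these two keyword arguments.
RENAMED_IN_GENSIM4 = {"size": "vector_size", "iter": "epochs"}

def word2vec_kwargs(gensim_major):
    """Return the keyword arguments appropriate for the given gensim major version."""
    if gensim_major >= 4:
        return {RENAMED_IN_GENSIM4.get(k, k): v for k, v in PARAMS_GENSIM3.items()}
    return dict(PARAMS_GENSIM3)

def train_char_vectors(sentences):
    """Train character-level vectors; `sentences` is a list of character lists.

    Requires gensim to be installed (assumption: gensim is the Word2vec tool).
    """
    import gensim
    from gensim.models import Word2Vec
    major = int(gensim.__version__.split(".")[0])
    return Word2Vec(sentences=sentences, **word2vec_kwargs(major))

print(word2vec_kwargs(4))
```

For character-level vectors each sentence would be passed as a list of single characters, for example `list("南京位于江苏")`.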

Step four: word vector matrix generation for words of a corpus data set of place names

Each character in the place name labeling corpus is represented by the word vector model, and a 1×100 word vector matrix is generated for each character (FIG. 5).

Step five: disambiguation matrix generation for words of a corpus data set of place names

The Jieba tool is used to perform word segmentation and part-of-speech tagging on the sentences. Based on the word segmentation result, each character in the sentence is assigned to one of 4 categories, represented by the numbers 0, 1, 2 and 3: 0 indicates that the character is a single-character word, 1 that the character begins a word, 2 that the character is in the middle of a word, and 3 that the character ends a word. For example, the five-character Chinese sentence meaning "I am Chinese." can be expressed as [0,0,1,2,3]. Based on the word segmentation result, a 1×20 vector matrix (the character disambiguation matrix for short) is generated for each character in the sentence, in order to resolve the multiple senses a character can express. For example, the character meaning "up" may be an independent preposition or may be the first character of the word "Shanghai".
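The four-category encoding can be sketched as a small function over an already segmented sentence. The sketch assumes that the translated example corresponds to 我是中国人 segmented as 我 / 是 / 中国人, which is the kind of output a segmenter such as Jieba's `jieba.lcut` would be expected to give; the function name `segmentation_tags` is hypothetical.

```python
def segmentation_tags(words):
    """Map segmented words to per-character position categories.

    0 = single-character word, 1 = word start, 2 = word middle, 3 = word end.
    """
    tags = []
    for word in words:
        if not word:
            continue                      # ignore empty tokens
        if len(word) == 1:
            tags.append(0)
        else:
            tags.extend([1] + [2] * (len(word) - 2) + [3])
    return tags

# Assumed segmentation of the example sentence.
print(segmentation_tags(["我", "是", "中国人"]))  # [0, 0, 1, 2, 3]
```

The same character therefore receives a different category depending on its word: 上 gets 0 as a standalone word and 1 as the first character of 上海.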

Step six: bi-LSTM and CRF integration-based place name recognition model

The word vector matrix of each character in the sentence is spliced with the disambiguation matrix of the corresponding character to obtain the word vector matrix of the sentence, which is input into the Bi-LSTM and CRF integrated place name recognition model for training; the optimal place name recognition model is selected using the three evaluation indexes common in natural language processing: precision P, recall R and the comprehensive value F (see FIGS. 6 and 7). The method specifically comprises the following steps:

Step1: The word vector matrix of each character in the sentence is spliced with the disambiguation matrix of the corresponding character to obtain the word vector matrix of the sentence, which serves as the input layer (the first layer of the model) and is input into the Bi-LSTM for training.

Step2: The Dropout regularization method is set to prevent overfitting of the model. Dropout randomly discards part of the input during training, and the parameters corresponding to the discarded part are not updated. Dropout is in effect an ensemble method: randomly dropping inputs yields many different sub-networks, and their results are combined.

Step3: The sentence sequence $(x_1, x_2, \ldots, x_n)$ of the input layer is used as the input for each time step of the Bi-LSTM, where $x_i$ represents the i-th character in the sentence; then the hidden output sequence $(f_1, f_2, \ldots, f_n)$ of the forward LSTM and the hidden output sequence $(b_1, b_2, \ldots, b_n)$ of the backward LSTM are spliced by position to obtain the complete hidden output sequence $(f_1, f_2, \ldots, f_n, b_1, b_2, \ldots, b_n)$, so that the preceding and following semantic context is fully considered and deep learning and representation of features are realized.
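As a rough illustration of the splicing in Step1, the sketch below concatenates a 100-dimensional character vector with a 20-dimensional disambiguation vector to give a 120-dimensional input per character. The source does not say how the 1×20 disambiguation vector is derived from the segmentation category, so the random lookup table `seg_table` is purely an assumption, as are the placeholder character vectors.

```python
import random

CHAR_DIM, SEG_DIM = 100, 20

random.seed(0)
# Hypothetical lookup table: one 20-dimensional vector per segmentation category 0..3.
seg_table = [[random.uniform(-1, 1) for _ in range(SEG_DIM)] for _ in range(4)]

def sentence_matrix(char_vectors, seg_tags):
    """Concatenate each character vector with its disambiguation vector."""
    if len(char_vectors) != len(seg_tags):
        raise ValueError("one segmentation tag is needed per character")
    return [vec + seg_table[tag] for vec, tag in zip(char_vectors, seg_tags)]

# Placeholder character vectors for a five-character sentence.
chars = [[0.0] * CHAR_DIM for _ in range(5)]
matrix = sentence_matrix(chars, [0, 0, 1, 2, 3])
print(len(matrix), len(matrix[0]))  # 5 120
```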

Step4: After Dropout is set, a linear layer is connected in order to convert the complete hidden output sequence from 2 dimensions to k dimensions, denoted as the matrix $P^{n \times k}$, wherein n represents the number of characters in the sentence and k is the number of label categories in the label set. The labeled corpus uses 4 labels in total, B, I, E and O (B represents the initial character of a place name, I a middle character of a place name, E the ending character of a place name, and O a non-place-name character), so that the sentence features are automatically extracted.
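To make the B/I/E/O scheme concrete, the following sketch reads place names back out of a predicted label sequence. The example sentence and labels are invented for illustration, and the function name `extract_place_names` is hypothetical. The sketch assumes every place name spans at least two characters (a B eventually closed by an E); the source does not say how a single-character place name would be labeled, so a B that is never closed is simply discarded here.

```python
def extract_place_names(chars, tags):
    """Collect character spans labeled B (I ...) E as place names."""
    names, current = [], []
    for ch, tag in zip(chars, tags):
        if tag == "B":
            current = [ch]                     # start a new candidate span
        elif tag == "I" and current:
            current.append(ch)                 # continue the open span
        elif tag == "E" and current:
            current.append(ch)                 # close the span and emit it
            names.append("".join(current))
            current = []
        else:
            current = []                       # O, or I/E without an open span
    return names

chars = list("我在南京市工作")
tags = ["O", "O", "B", "I", "E", "O", "O"]
print(extract_place_names(chars, tags))  # ['南京市']
```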

Step5: setting Dropout to prevent overfitting of the model based on the Bi-LSTM model output layer matrix; and inputting the sentence sequence into a CRF model for sentence sequence annotation, namely predicting the label of each word.

For a tag sequence $y = (y_1, y_2, \ldots, y_n)$ whose length is equal to the sentence length, the model's score for labeling sentence x with y is:

$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=2}^{n} A_{y_{i-1}, y_i}$$

wherein $P_{i, y_i}$ is the probability of outputting $y_i$ at the i-th position, i.e. the initial score, and $A_{y_{i-1}, y_i}$ is the transition probability from $y_{i-1}$ to $y_i$, i.e. the transition score. The score of the whole sequence is equal to the sum of the scores of the positions, and the score of each position comes from two parts: one part is determined by the LSTM output $P_{i, y_i}$, and the other part is determined by the transfer matrix A of the CRF. The normalized probability using Softmax is:

$$p(y \mid x) = \frac{\exp\big(\mathrm{score}(x, y)\big)}{\sum_{y'} \exp\big(\mathrm{score}(x, y')\big)}$$

where the numerator is the exponential of the score the model assigns to labeling sentence x with y, and the denominator is the sum of the exponentials of the scores over all candidate label sequences $y'$ of the sentence. The candidate label sequences are ranked according to the obtained normalized probability to achieve the purpose of Chinese place name identification.
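The scoring and normalization described above can be checked on a toy example. The sketch below assumes the sequence score is the sum of the per-position scores P[i][y_i] from the LSTM plus the transition scores A[y_{i-1}][y_i] between adjacent labels, and it normalizes by brute-force enumeration of all label sequences, which is only feasible for very short sentences; a practical CRF layer would use dynamic programming instead. The matrices `P` and `A` are invented values, not results from the embodiment.

```python
import math
from itertools import product

LABELS = ["B", "I", "E", "O"]

def sequence_score(P, A, y):
    """Sum of emission scores plus transition scores between adjacent labels."""
    emission = sum(P[i][y[i]] for i in range(len(y)))
    transition = sum(A[y[i - 1]][y[i]] for i in range(1, len(y)))
    return emission + transition

def all_sequences(P):
    """Every possible label-index sequence for a sentence of len(P) characters."""
    return product(range(len(P[0])), repeat=len(P))

def sequence_probability(P, A, y):
    """Softmax of the sequence score over all candidate label sequences."""
    z = sum(math.exp(sequence_score(P, A, cand)) for cand in all_sequences(P))
    return math.exp(sequence_score(P, A, y)) / z

def best_sequence(P, A):
    """Highest-scoring label sequence (brute force; Viterbi would be used in practice)."""
    return max(all_sequences(P), key=lambda cand: sequence_score(P, A, cand))

# Toy scores for a three-character sentence (rows: positions, columns: B, I, E, O).
P = [[2.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 0.0],
     [0.0, 0.0, 0.0, 2.0]]
A = [[0.0] * 4 for _ in range(4)]
A[0][2] = 1.0    # B -> E is encouraged
A[3][1] = -5.0   # O -> I is discouraged

best = best_sequence(P, A)
print([LABELS[i] for i in best])  # ['B', 'E', 'O']
```

Because the softmax is monotonic in the score, the sequence with the highest score is also the one with the highest normalized probability.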

Step seven: man-machine interactive Chinese place name label

First, an interactive Chinese place name labeling platform is developed through Python GUI programming (Tkinter), the Chinese place name recognition model from step six is embedded in it, and place name recognition is performed on internet text; then, the Chinese place name recognition result is corrected through human-computer interaction; finally, the place names finally recognized in the internet text, the place name labels that were added, and the wrongly assigned place name labels that were deleted are displayed visually in corresponding windows.

Step eight: updating of place name labeling corpus

When the Chinese place name text corpus labeled in the step seven reaches a certain number of words (threshold value), the interactive place name labeling platform can automatically fuse the initial training corpus and the text corpus labeled with the place name, so as to update the place name labeling corpus.
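A minimal sketch of this threshold-triggered fusion is given below. It assumes each sentence is stored as a tuple of (character, label) pairs and that the threshold counts labeled characters; the function name `maybe_fuse`, the duplicate check and the sample data are illustrative assumptions rather than details from the embodiment.

```python
def maybe_fuse(initial_corpus, corrected, threshold):
    """Fuse corrected sentences into the corpus once enough text is labeled.

    Returns (corpus, pending, fused): `pending` holds sentences still waiting
    to be fused, and `fused` tells whether the corpus was updated.
    """
    labeled_size = sum(len(sentence) for sentence in corrected)
    if labeled_size < threshold:
        return initial_corpus, corrected, False
    seen = set(initial_corpus)
    merged = list(initial_corpus)
    for sentence in corrected:
        if sentence not in seen:          # avoid adding duplicate sentences
            seen.add(sentence)
            merged.append(sentence)
    return merged, [], True

initial = [(("南", "B"), ("京", "E"))]
corrected = [(("上", "B"), ("海", "E")), (("南", "B"), ("京", "E"))]
corpus, pending, fused = maybe_fuse(initial, corrected, threshold=4)
print(len(corpus), len(pending), fused)  # 2 0 True
```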

Step nine: optimization of iterative Chinese place name recognition model

The training code and model parameters of the place name recognition model from step six are uploaded to a local server or to the cloud-based Google Colaboratory, and training continues with the place name corpus generated in step eight as the training corpus, which optimizes the model parameters and improves the recognition effect of the model; the model training progress and the final precision, recall and F value are displayed on the interactive labeling platform.
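The three indexes shown on the platform can be computed as in the sketch below. It assumes evaluation at the entity level, with each place name identified by a (sentence index, start, end) triple, and it assumes the comprehensive value F is the balanced F1 measure; the source does not spell out either choice, and the sample spans are invented.

```python
def precision_recall_f(predicted, gold):
    """Entity-level precision, recall and F1 over sets of place name spans."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical spans: (sentence index, start, end).
gold = [(0, 2, 4), (0, 7, 8), (1, 0, 1), (2, 3, 5), (2, 9, 11)]
predicted = [(0, 2, 4), (1, 0, 1), (2, 3, 5), (2, 6, 7)]
p, r, f = precision_recall_f(predicted, gold)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.75 0.6 0.6667
```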

Step ten: intelligent optimization of place name labeling corpus

Steps two to nine are iterated to realize intelligent optimization of the labeled corpus until the place name recognition effect and the scale of the place name corpus meet the user's requirements, at which point iterative training and learning ends.
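The overall loop can be outlined as follows. This is a schematic sketch only: `train`, `recognize`, `correct` and `evaluate` are hypothetical callables standing in for model training, automatic recognition, human correction on the platform and F-value evaluation, and the threshold here counts sentences rather than words for simplicity.

```python
def build_corpus(corpus, text_batches, train, recognize, correct, evaluate,
                 threshold, target_f, target_size):
    """Iterate recognition, correction, fusion and retraining until targets are met."""
    model = train(corpus)
    pending = []
    for batch in text_batches:
        # Automatic labeling followed by human-computer interactive correction.
        pending.extend(correct(recognize(model, batch)))
        if len(pending) >= threshold:
            corpus = corpus + pending          # fuse new labels into the corpus
            pending = []
            model = train(corpus)              # continue training on the fused corpus
            if evaluate(model) >= target_f and len(corpus) >= target_size:
                break                          # user requirements reached
    return corpus, model

# Stub components so the control flow can be exercised without a real model.
final_corpus, final_model = build_corpus(
    corpus=["s0"] * 4,
    text_batches=[["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]],
    train=lambda c: len(c),                    # the "model" is just the corpus size
    recognize=lambda model, batch: batch,
    correct=lambda labeled: labeled,
    evaluate=lambda model: min(1.0, model / 10),
    threshold=2, target_f=0.8, target_size=8,
)
print(len(final_corpus), final_model)  # 8 8
```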

The main part of the scheme of the embodiment of the present invention will be further explained below with reference to specific experimental examples.

A first part: bi-LSTM and CRF integration-based Chinese place name identification method

The corpus data used by the method are the geographic encyclopedia corpus, the Microsoft MSRA corpus, and a corpus mixing the geographic encyclopedia and Microsoft MSRA corpora (hereinafter called the mixed corpus).

The geographic encyclopedia corpus comprises about 1.18 million words, of which the training set accounts for about 82%, the verification set for about 5%, and the test set for about 13%. The geographic encyclopedia corpus is a place-name-specific corpus; its place name entities are numerous and uniformly distributed in the text, and the text describes rich geographic semantic relationships.

The Microsoft MSRA corpus comprises about 2.36 million words, of which the training set accounts for about 85%, the verification set for about 7%, and the test set for about 8%. The place name entities in the Microsoft MSRA corpus are few, and their distribution is sparse and uneven.

The mixed corpus comprises about 3.57 million words, of which the training set accounts for about 85%, the verification set for about 6%, and the test set for about 9%. The place name entities in the mixed corpus are moderate in number and fairly uniformly distributed in the text.

This example sets up 7 experiments (see table 1) for comparison to evaluate the effectiveness of the method.

TABLE 1 location identification experiment setup

(1) Experiment one, two and three

Experiments one, two and three are Chinese place name recognition methods based on the traditional statistical CRF model, applied to different corpora. The same feature template (see FIG. 8) is used, a CRF model is trained on each corpus, and the model evaluation results are shown in Table 2.

TABLE 2 results of the location name identification and evaluation of experiments I, II and III

(2) Experiment four

First, deduplication and stop word deletion are performed on the geographic encyclopedia data set, and a word vector matrix corresponding to each character in the data set is randomly generated using the Word2vec tool; the word vector matrix is then input into Bi-LSTM + CRF for training to obtain a model. The training parameter settings of the Bi-LSTM model are shown in Table 3, and the evaluation results are shown in Table 4.

TABLE 3 training parameter settings for the Bi-LSTM model

TABLE 4 location name identification and evaluation results of experiment four

(3) Experiments five, six and seven

Experiments five, six and seven apply the same method, the integration of a bidirectional long short-term memory model and a CRF model, to different corpora, so the experimental steps are the same.

First, the geographic encyclopedia corpus and the Microsoft MSRA corpus are mixed, duplicate removal and stop-word deletion are performed, and the Word2vec tool is used to train a character-level word vector model; each character in the place name labeling corpus is represented with this model to generate its word vector matrix. Then the Jieba tool is used to perform word segmentation and part-of-speech tagging on each sentence to generate a disambiguation matrix for the characters; the disambiguation matrix is spliced with the word vector matrix of each character in the sentence and input into the Bi-LSTM model for training. The model results from 100 runs are evaluated and compared to obtain the optimal model (as shown in FIG. 9). The evaluation results are shown in Table 5.

TABLE 5 Place name recognition evaluation results of experiments five, six and seven
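The splicing of the two matrices can be illustrated with a minimal sketch. Several things here are assumptions made only for illustration: the segmentation result is given as a ready-made list of words (no segmenter is called), the 1×20 disambiguation vector is laid out as a one-hot code of the character's position in its word followed by zeros, and the 1×100 character vectors are random placeholders rather than vectors from a trained Word2vec model. The dimensions 100 and 20 follow the claims.

```python
import random

CHAR_DIM = 100      # length of each character vector (1 x 100 in the claims)
DISAMBIG_DIM = 20   # length of each disambiguation vector (1 x 20 in the claims)
POSITIONS = ["B", "M", "E", "S"]  # hypothetical codes: begin, middle, end, single


def char_positions(words):
    """Map segmented words to one position code per character."""
    codes = []
    for word in words:
        if len(word) == 1:
            codes.append("S")
        else:
            codes.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return codes


def disambiguation_vector(code):
    """One-hot encode a position code, zero-padded to DISAMBIG_DIM (assumed layout)."""
    vec = [0.0] * DISAMBIG_DIM
    vec[POSITIONS.index(code)] = 1.0
    return vec


def build_sentence_matrix(words, char_vectors):
    """Splice each character vector with its disambiguation vector."""
    chars = [ch for word in words for ch in word]
    codes = char_positions(words)
    return [char_vectors[ch] + disambiguation_vector(code)
            for ch, code in zip(chars, codes)]


# Hypothetical segmentation result for one sentence.
words = ["南京", "位于", "长江", "下游"]
rng = random.Random(0)
# Placeholder vectors; a trained character-level model would supply these.
char_vectors = {ch: [rng.uniform(-1, 1) for _ in range(CHAR_DIM)]
                for word in words for ch in word}
matrix = build_sentence_matrix(words, char_vectors)
```

Each row of `matrix` has 120 entries (100 + 20), one row per character, which is the shape that would be fed to the Bi-LSTM input layer under these assumptions.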

On the same corpus, the method improves precision, recall and F value over the traditional CRF-based Chinese place name recognition method (see Table 6).

TABLE 6 Comparison of place name recognition evaluation results for the same corpus and different recognition models
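The three evaluation measures can be sketched as follows, assuming the usual entity-level definitions of precision, recall and F value, in which a predicted place name counts as correct only when its boundaries match a reference place name exactly. The span representation and the example spans are hypothetical and are not taken from the experiments.

```python
def evaluate(gold_spans, predicted_spans):
    """Return (precision, recall, f_value) for exact-match place name spans."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Invented example: 3 reference place names, 4 predictions, 2 exact matches.
gold = [(0, 2), (5, 7), (10, 11)]
predicted = [(0, 2), (5, 6), (10, 11), (14, 15)]
precision, recall, f_value = evaluate(gold, predicted)
```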

A second part: intelligent construction method of place name corpus based on interactive and iterative learning

Step1: First, an interactive Chinese place name labeling platform (see FIG. 10) is developed with Python GUI programming (Tkinter), and the integrated Bi-LSTM and CRF Chinese place name recognition model is embedded in it. When the 'place name recognition' button is clicked, place name entity recognition is performed on the input internet text and place name labels are automatically attached to the recognized place names (see FIG. 11).

Step2: The Chinese place name recognition result is corrected interactively by a human: for a missed place name, the text is selected and the 'set as place name' function is chosen to add a place name label; for a wrongly labeled place name, right-clicking the label and choosing the 'cancel the setting' function deletes the corresponding label.
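The effect of the two correction functions on the label data can be sketched without any GUI code. This is only an illustration under assumptions: place name labels are stored as half-open `(start, end)` character spans, and the function name and example sentence are hypothetical rather than part of the platform.

```python
def apply_corrections(predicted_spans, added=(), removed=()):
    """Return the corrected, sorted list of place name spans.

    removed: spans whose label the user cancelled ('cancel the setting')
    added:   spans the user marked by hand ('set as place name')
    """
    corrected = set(predicted_spans)
    corrected -= set(removed)
    corrected |= set(added)
    return sorted(corrected)


text = "我从南京出发去合肥"
# Invented model output: one correct label, one wrong label, one missed name.
predicted = [(2, 4), (4, 6)]
corrected = apply_corrections(predicted, added=[(7, 9)], removed=[(4, 6)])
place_names = [text[start:end] for start, end in corrected]
```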

Step3: The finally recognized place names, the added place name labels and the deleted wrong place name labels in the internet text are displayed visually in corresponding windows (see FIG. 12).

Step4: The final labeling result is saved by clicking the 'save place name labeling result' button. When the cumulative number of characters of saved internet text with place name labels exceeds a threshold (set to 100,000 characters), the platform automatically fuses the initial training corpus with the labeled text corpus and inputs them into the integrated Bi-LSTM and CRF Chinese place name recognition model of the first part for retraining, which optimizes the model parameters and improves recognition. The training progress and the final precision, recall and F value are displayed on the interface (see FIG. 13).
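The threshold-triggered retraining in this step can be sketched as follows. The class name, the representation of a labeled text as a plain string, and the `retrain` callback are all hypothetical; the real platform retrains the integrated Bi-LSTM and CRF model, which is stood in for here by a function that only records that it was called.

```python
RETRAIN_THRESHOLD = 100_000  # characters, the threshold value given in the description


class CorpusManager:
    """Accumulates saved labeled texts and triggers retraining past a threshold."""

    def __init__(self, initial_corpus, retrain, threshold=RETRAIN_THRESHOLD):
        self.corpus = list(initial_corpus)  # current training corpus
        self.pending = []                   # saved texts not yet fused
        self.retrain = retrain              # callback standing in for model training
        self.threshold = threshold

    def save(self, labeled_text):
        """Store one labeled text; fuse and retrain once the threshold is exceeded."""
        self.pending.append(labeled_text)
        if sum(len(text) for text in self.pending) > self.threshold:
            self.corpus.extend(self.pending)
            self.pending = []
            self.retrain(self.corpus)
            return True
        return False


# Small threshold so the behaviour is visible with two short texts.
calls = []
manager = CorpusManager(["初始语料"],
                        retrain=lambda corpus: calls.append(len(corpus)),
                        threshold=10)
first = manager.save("南京位于长江下游")  # 8 characters saved, below the threshold
second = manager.save("合肥是安徽省会")   # 15 characters saved in total, retraining runs
```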

Step5: The newly added corpus is added to the place name labeling corpus, and Step1 to Step4 are iterated until the place name recognition performance and the scale of the place name corpus meet the user's requirements, at which point the iterative training is complete.

Claims

1. An intelligent construction method of a place name labeling corpus based on interactive and iterative learning is characterized by comprising the following steps:

step one: reading the initial place name labeling corpus data, comprising the geographic encyclopedia corpus and the Microsoft MSRA corpus;

step three: mixing the geographic encyclopedia corpus and the Microsoft MSRA corpus, and training with the Word2vec tool to obtain a character-level word vector model;

step four: representing each character in the place name labeling corpus with the word vector model to generate a 1×100 word vector matrix for each character;

step five: using the Jieba tool to perform word segmentation and part-of-speech tagging on the sentence, and generating, based on the segmentation result, a 1×20 vector matrix for each character in the sentence as the disambiguation matrix of that character;

step seven: developing an interactive Chinese place name labeling platform, and embedding the place name recognition model in the step six into the interactive Chinese place name labeling platform;

step eight: performing place name recognition on new internet text in the interactive place name labeling platform, and performing man-machine interactive correction of the place name recognition result; visually displaying, in corresponding windows, the finally recognized place names, the added place name labels and the deleted wrong place name labels in the internet text;

2. The intelligent construction method of the place name labeling corpus based on interactive and iterative learning according to claim 1, wherein the sixth step specifically comprises:

Step2: setting the Dropout regularization method to prevent model over-fitting;

Step3: using the sentence sequence $(x_1, x_2, \ldots, x_n)$ of the input layer as the input for each time step of the Bi-LSTM, where $n$ is the number of characters in the sentence and $x_i$ is the $i$-th character in the sentence; then splicing the hidden sequence $(f_1, f_2, \ldots, f_n)$ output by the forward LSTM with the hidden sequence $(b_1, b_2, \ldots, b_n)$ output by the backward LSTM by position to obtain the complete hidden output sequence $(f_1, f_2, \ldots, f_n, b_1, b_2, \ldots, b_n)$, so that the semantic information of the preceding and following context is fully considered and features are learned and represented in depth;

Step4: after Dropout is set, connecting a linear layer to convert the complete hidden output sequence from 2-dimension to $k$-dimension, denoted as an $n \times k$ matrix $P$, where $k$ is the number of label categories in the label set, comprising the 4 labels B, I, E and O: B denotes the first character of a place name, I a middle character of a place name, E the last character of a place name, and O a non-place-name character, so that sentence features are extracted automatically;
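The B, I, E, O label set can be illustrated with a minimal sketch that converts place name spans into one label per character. The function name, the half-open span convention and the example sentence are assumptions for illustration. The claim does not say how a place name of a single character is labeled, so this sketch rejects that case rather than guess.

```python
def spans_to_tags(length, spans):
    """Convert half-open (start, end) place name spans to B/I/E/O labels."""
    tags = ["O"] * length
    for start, end in spans:
        if end - start < 2:
            # Labeling of one-character names is not specified in the claim.
            raise ValueError("single-character place names are not handled in this sketch")
        tags[start] = "B"
        for i in range(start + 1, end - 1):
            tags[i] = "I"
        tags[end - 1] = "E"
    return tags


sentence = "南京市位于长江下游"
# Hypothetical labels: "南京市" at characters 0-2 and "长江" at characters 5-6.
tags = spans_to_tags(len(sentence), [(0, 3), (5, 7)])
```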

3. The intelligent construction method of place name labeling corpus based on interactive and iterative learning as claimed in claim 2, wherein the sentence sequence labeling based on the CRF model in Step5 is specifically:

for a tag sequence $y = (y_1, y_2, \ldots, y_n)$ whose length equals the sentence length, the score given by the model to sentence $x$ having the label sequence $y$ is:

$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n+1} A_{y_{i-1}, y_i}$$

wherein $P_{i, y_i}$ is the score of outputting $y_i$ at the $i$-th position, and $A_{y_{i-1}, y_i}$ is the transition score from $y_{i-1}$ to $y_i$; the score of the whole sequence is equal to the sum of the scores of the positions, and the score of each position consists of two parts, one determined by the LSTM output $P_i$ and the other determined by the transition matrix $A$ of the CRF;

the normalized probability obtained using Softmax is:

$$P(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y'} \exp(\mathrm{score}(x, y'))}$$
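The sequence score and its Softmax normalization can be worked through on a toy example. This is a brute-force sketch for illustration only: all numbers are invented, transitions are summed between adjacent positions only (any transition into the first label or out of the last label is treated as zero), and the normalizer is computed by enumerating every label sequence, which is practical only for very short sentences.

```python
import itertools
import math


def sequence_score(emissions, transitions, tags):
    """Sum the per-position output scores and the label-to-label transition scores."""
    emission_part = sum(emissions[i][tag] for i, tag in enumerate(tags))
    transition_part = sum(transitions[prev][cur]
                          for prev, cur in zip(tags, tags[1:]))
    return emission_part + transition_part


def sequence_probability(emissions, transitions, tags):
    """Softmax of the sequence score over all possible label sequences."""
    n, k = len(emissions), len(emissions[0])
    all_scores = [sequence_score(emissions, transitions, candidate)
                  for candidate in itertools.product(range(k), repeat=n)]
    top = max(all_scores)  # subtract the maximum for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in all_scores))
    return math.exp(sequence_score(emissions, transitions, tags) - log_z)


# Toy values: 3 characters, 4 labels indexed as B, I, E, O.
B, I, E, O = range(4)
emissions = [[2.0, 0.1, 0.1, 0.5],
             [0.2, 0.3, 1.8, 0.4],
             [0.1, 0.2, 0.3, 1.5]]
transitions = [[0.0, 0.5, 1.0, -1.0],    # from B
               [-1.0, 0.5, 1.0, -1.0],   # from I
               [0.5, -1.0, -1.0, 0.8],   # from E
               [0.8, -1.0, -1.0, 0.5]]   # from O
example_tags = [B, E, O]
score = sequence_score(emissions, transitions, example_tags)
probability = sequence_probability(emissions, transitions, example_tags)
```

In practice the normalizer is computed with dynamic programming rather than enumeration; the enumeration is used here only because it mirrors the definition directly.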

4. The intelligent construction method of the place name labeling corpus based on interactive and iterative learning according to claim 1, wherein the interactive Chinese place name labeling platform is implemented using the Python GUI programming toolkit Tkinter.

5. The intelligent construction method of the place name labeling corpus based on interactive and iterative learning according to claim 1, characterized in that, in the tenth step, the training code and model parameters of the place name recognition model are uploaded to the cloud platform Google Colaboratory, so as to optimize the model on a local server or another server.