CN110826331B - Intelligent construction method of place name labeling corpus based on interactive and iterative learning - Google Patents

Publication number
CN110826331B
Authority
CN
China
Prior art keywords
place name
corpus
model
sentence
interactive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911029958.2A
Other languages
Chinese (zh)
Other versions
CN110826331A (en)
Inventor
张春菊
陈玉冰
张雪英
汪陈
陈书慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Normal University
Hefei University of Technology
Original Assignee
Nanjing Normal University
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Normal University, Hefei University of Technology filed Critical Nanjing Normal University
Priority to CN201911029958.2A priority Critical patent/CN110826331B/en
Publication of CN110826331A publication Critical patent/CN110826331A/en
Priority to PCT/CN2020/085809 priority patent/WO2021082366A1/en
Priority to AU2020103654A priority patent/AU2020103654A4/en
Application granted granted Critical
Publication of CN110826331B publication Critical patent/CN110826331B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an intelligent construction method for a place name labeling corpus based on interactive and iterative learning, comprising the following steps: generating a word vector matrix and a disambiguation matrix for each character of the sentences in an initial corpus, splicing the two matrices, and inputting the result into an integrated Bi-LSTM and CRF model for training to generate a place name recognition model; embedding the place name recognition model into a man-machine interactive place name labeling platform for man-machine interactive correction; and fusing the initial training corpus with the newly labeled place name corpus, optimizing the parameters of the place name recognition model, and repeating the iterative training until the constructed corpus meets requirements, thereby realizing intelligent construction and optimization of the place name corpus. The method addresses the current scarcity and slow updating of place name corpus data and the time-consuming, labor-intensive and inefficient manual construction of place name corpora, and it enables intelligent updating of a place name labeling corpus for multi-source, dynamic, heterogeneous and exponentially growing internet text.

Description

Intelligent construction method of place name labeling corpus based on interactive and iterative learning
Technical Field
The invention belongs to the technical field of geographic information processing, and particularly relates to an intelligent construction method of a place name labeling corpus based on interactive and iterative learning, which can be used for optimizing deep learning model parameters to the maximum extent, improving the place name recognition effect and realizing intelligent construction and optimization of the place name labeling corpus.
Background
With the rapid development of the Internet and the arrival of the era of big data and artificial intelligence, the world is entering a ubiquitous information society (Zhou Chenghu, 2011; Li Deren, 2012; Goodchild, 2017). Location big data is an important component of big data, and 80% of the world's information is related to location (Williams, 1987; Liu Jing, 2014). A place name is a proper name that people give to a specific geographic entity; it is an important component of location information and indispensable for surveying and mapping digital products. Place names are among the most common items of public information and the positioning method most readily accepted by ordinary people, and they provide an indispensable basic information resource for national administration, economic construction, and domestic and foreign communication.
Text is a typical ubiquitous geographic big data source: its volume keeps growing, and its content spans multiple domains and is increasingly complex. Chinese text is unstructured, ambiguous and loosely composed, and has no explicit separators between words. The description of place name entities in Chinese text has the following characteristics: (1) The internal composition of Chinese place name entities is complex and varied. There are simple place names and also a large number of compound place names, that is, a place name entity may contain several nested place name entities, such as Jiangsu, Nanjing, Jiangning; (2) Chinese place names often contain other types of entity nouns, such as "ancestral road", which contains a person's name; (3) The length of Chinese place names varies widely and covers both abbreviations and full names. Some consist of a single Chinese character, such as the one-character abbreviations for Britain, the United States and Shanghai, while others exceed ten characters, such as the Hong Kong Special Administrative Region of the People's Republic of China; (4) A Chinese sentence is a sequence of characters and a place name entity is a segment of that sequence. Because there is no separator between Chinese words, the boundaries of place name entities are hard to recognize; (5) Unlike in some other languages, Chinese place name entities have no obvious features that distinguish them from common nouns, such as capitalization or inflection; and (6) Chinese corpus resources are small in scale and slow to update. In particular, with the rapid development of Internet Plus and big data, new and unregistered place names are emerging in large numbers.
As a result of these factors, Chinese place name entity recognition cannot yet meet the requirements of ubiquitous location information services.
At present, Chinese place name recognition methods fall mainly into rule- and dictionary-based methods, statistical methods, and hybrids of the two. Rule- and dictionary-based methods mostly rely on linguistic experts to construct rule templates by hand. The rules often depend on a specific language, domain and text style, are time-consuming to write and can hardly cover all language phenomena, and the resulting systems are costly and port poorly. Statistical methods design complex feature templates to extract features from different corpora, feed the features into a classification model, and treat Chinese place name recognition as a sentence sequence labeling problem. These methods have the following shortcomings: (1) They depend heavily on the corpus, and few large-scale general corpora are currently available for building and evaluating place name entity recognition systems; (2) Hand-designed features must be modified, adjusted and selected through repeated experiments, which is time-consuming and labor-intensive and requires researchers to have extensive linguistic knowledge; (3) Sparse data representation leads to an excessively large model parameter space and excessive computation and storage costs. In recent years, deep learning has offered a new approach to natural language information extraction. Deep learning methods need no hand-made feature templates; instead they optimize the final output by learning the features of the input corpus and its context representation. The deep neural networks commonly used for Chinese named entity recognition include feedforward neural networks and recurrent neural networks (RNN).
A feedforward neural network generally selects its input with a fixed-length window, so it falls short on sentences longer than the window and ignores the wider context of a word. An RNN is a sequence model whose structure contains directed cycles, so it can make full use of sequence information and has a memory function. It handles short-distance dependencies well but suffers from vanishing gradients on long-distance dependencies. To overcome these shortcomings, more complex RNN variants such as the bidirectional recurrent neural network (Bi-RNN) and the long short-term memory model (LSTM) have been proposed in succession. LSTM performs notably well in natural language processing tasks because it can handle long-range dependencies.
Both the traditional methods and the deep-learning-based Chinese place name recognition methods depend heavily on a corpus, and the scale and coverage of the training corpus directly affect recognition performance. The existing public place name corpora are as follows: (1) The annotated People's Daily corpus covers a wide range of content, including finance, military affairs, sports and entertainment, but the place name information it contains is sparse and unevenly distributed; (2) The Chinese geography corpus of the Encyclopedia of China (geographic encyclopedia corpus for short, http://www.geoip.com.cn:9004/ITIS/corpus.html) is a dedicated place name corpus with independent intellectual property rights of Nanjing university. Its place name entities are described in a standard way, are evenly distributed, and carry rich spatial semantic relation information; (3) The Microsoft MSRA corpus better matches the descriptive characteristics of free text, but it contains few place name entities and their distribution is sparse and uneven. At present, large-scale general corpora for building and evaluating place name entity recognition are scarce and slow to update, and manual construction of place name corpora is time-consuming, labor-intensive and inefficient. As a result, model parameters cannot be optimized to the greatest extent during deep learning training, which degrades place name recognition performance. Meanwhile, in the geographic information era, the large number of new and unregistered place names appearing in exponentially growing multi-source, dynamic and heterogeneous internet text cannot be handled effectively.
Disclosure of Invention
Purpose of the invention: the invention provides an intelligent construction method of a place name labeling corpus based on interactive and iterative learning, so as to optimize deep learning model parameters to the greatest extent, improve the place name recognition effect, and realize intelligent construction and optimization of the place name labeling corpus.
Technical solution: to achieve the above purpose, the invention adopts the following technical solution:
An intelligent construction method of a place name labeling corpus based on interactive and iterative learning comprises the following steps:
Step one: reading the initial place name labeling corpus data, which comprises the geographic encyclopedia corpus and the Microsoft MSRA corpus;
Step two: preprocessing the place name labeling corpus data, wherein the preprocessing comprises separating sentences with empty lines, removing duplicate sentences and deleting stop words;
Step three: mixing the geographic encyclopedia corpus and the Microsoft MSRA corpus, and training with the Word2vec tool to obtain a character-level word vector model;
Step four: representing each character in the place name labeling corpus with the word vector model, and generating a 1×100 word vector matrix for each character;
Step five: using the Jieba tool to perform word segmentation and part-of-speech tagging on each sentence, and generating, based on the word segmentation result, a 1×20 vector matrix for each character in the sentence as the disambiguation matrix of the character;
Step six: splicing the word vector matrix of each character in the sentence with the disambiguation matrix of the corresponding character to obtain the word vector matrix of the sentence, inputting it into an integrated Bi-LSTM and CRF place name recognition model for training, and selecting the optimal place name recognition model using three evaluation indexes from the field of natural language processing: precision P, recall R and the comprehensive F value;
Step seven: developing an interactive Chinese place name labeling platform, and embedding the place name recognition model obtained in step six into the platform;
Step eight: performing place name recognition on new internet text in the interactive place name labeling platform, and correcting the recognition result through man-machine interaction; visually displaying, in corresponding windows, the place names finally recognized in the internet text, the place name labels that were added, and the wrongly labeled place names that were deleted;
Step nine: when the size of the place name text corpus labeled in step eight reaches a set threshold, the interactive place name labeling platform automatically fuses the place name corpus corrected through man-machine interaction with the initial place name labeling corpus data, thereby updating the place name corpus;
Step ten: using the place name corpus generated in step nine as the training corpus, continuing training from the place name recognition model training code and model parameters of step six, optimizing the parameters of the model and improving its recognition effect; displaying the model training progress and the final precision, recall and F value on the interactive labeling platform;
Step eleven: for new internet text, iterating step two to step ten to realize intelligent updating and optimization of the place name labeling corpus, and finishing the iterative training when the place name recognition effect and the scale of the place name labeling corpus meet the user's requirements.
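The iterative cycle of steps nine to eleven can be sketched as a simple loop. This is a minimal illustration only: the function name `iterate_corpus`, the `train` callable (standing in for the Bi-LSTM and CRF training of step six and returning an F value) and both stopping targets are hypothetical and do not come from the patent text.

```python
from typing import Callable, List, Tuple


def iterate_corpus(
    corpus: List[str],
    corrected_batches: List[List[str]],
    train: Callable[[List[str]], float],
    target_f: float,
    target_size: int,
) -> Tuple[List[str], float, int]:
    """Fuse corrected batches into the corpus and retrain until both targets are met.

    `train` is a hypothetical stand-in for model training; it returns the F value.
    """
    f_value = train(corpus)              # initial model trained on the initial corpus
    rounds = 0
    for batch in corrected_batches:      # each batch is one round of human-corrected text
        if f_value >= target_f and len(corpus) >= target_size:
            break                        # user requirements reached: stop iterating
        corpus = corpus + batch          # fuse corrected corpus with the existing corpus
        f_value = train(corpus)          # continue training on the updated corpus
        rounds += 1
    return corpus, f_value, rounds
```

The loop stops on two conditions at once, recognition quality and corpus size, mirroring the requirement that both reach the user's targets before iteration ends.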
Further, the sixth step specifically includes:
Step1: splicing the word vector matrix of each character in the sentence with the disambiguation matrix of the corresponding character to obtain the word vector matrix of the sentence, which serves as the input layer and is input into the Bi-LSTM for training;
Step2: applying the Dropout regularization method to prevent overfitting of the model;
Step3: taking the sentence sequence $(x_1, x_2, \ldots, x_n)$ of the input layer as the input for each time step of the Bi-LSTM, where $n$ is the number of characters in the sentence and $x_i$ is the i-th character in the sentence; then splicing the hidden output sequence $(f_1, f_2, \ldots, f_n)$ of the forward LSTM with the hidden output sequence $(b_1, b_2, \ldots, b_n)$ of the backward LSTM position by position to obtain the complete hidden output sequence $(f_1, f_2, \ldots, f_n, b_1, b_2, \ldots, b_n)$, so that the preceding and following semantic context is fully considered and deep learning and representation of the features is realized;
Step4: after Dropout is applied, attaching a linear layer that converts the complete hidden output sequence from 2 dimensions to $k$ dimensions, denoted as the matrix $P_{n \times k}$, where $k$ is the number of label categories in the label set; the label set is B, I, E, O, so $k = 4$, where B denotes the first character of a place name, I a middle character of a place name, E the last character of a place name, and O a non-place-name character, so that sentence features are extracted automatically;
Step5: on the basis of the Bi-LSTM output layer matrix from Step4, applying Dropout to prevent overfitting of the model, and inputting the matrix into the CRF model to label the sentence sequence, that is, to predict the label of each character;
Step6: selecting the optimal place name recognition model using three evaluation indexes from the field of natural language processing: precision P, recall R and the comprehensive F value.
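The three evaluation indexes of Step6 can be computed at the entity level from B/I/E/O tag sequences. The sketch below makes several assumptions that the source does not state: a place name counts as correct only if its span matches the gold span exactly, F is the balanced harmonic mean of P and R, and only spans that open with B and close with E are counted, because the source does not say how single-character place names are tagged. The function names are illustrative.

```python
from typing import List, Set, Tuple


def extract_spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Return (start, end) index pairs of place names that open with B and close with E."""
    spans: Set[Tuple[int, int]] = set()
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i                    # a new place name begins here
        elif tag == "E" and start is not None:
            spans.add((start, i))        # a complete place name ends here
            start = None
        elif tag == "O":
            start = None                 # an unfinished span is discarded
    return spans


def precision_recall_f(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level precision P, recall R and balanced F value (assumed to be F1)."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    correct = len(gold_spans & pred_spans)
    p = correct / len(pred_spans) if pred_spans else 0.0
    r = correct / len(gold_spans) if gold_spans else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For example, with gold tags `BIEOOBEO` and predicted tags `BIEOBEOO`, one of two predicted place names matches one of two gold place names, so P, R and F are all 0.5.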
Further, in Step5, the method for labeling sentence sequences based on the CRF model is specifically as follows:
for a tag sequence $y = (y_1, y_2, \ldots, y_n)$ whose length equals the sentence length, the model's score for labeling sentence $x$ with $y$ is:
$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=2}^{n} A_{y_{i-1}, y_i}$$
where $P_{i, y_i}$ is the score of outputting $y_i$ at the i-th position and $A_{y_{i-1}, y_i}$ is the transition score from $y_{i-1}$ to $y_i$. The score of the whole sequence equals the sum of the scores of all positions, and the score of each position has two parts: one part, $P_{i, y_i}$, is determined by the output of the LSTM, and the other part is determined by the transition matrix $A$ of the CRF;
the normalized probability using Softmax is:
$$p(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y'} \exp(\mathrm{score}(x, y'))}$$
where the numerator is the exponential of the model's score for labeling sentence $x$ with $y$, and the denominator is the sum of the exponentials of the scores over all candidate tag sequences $y'$; candidate tag sequences are ranked according to the normalized probability, which achieves the purpose of place name recognition.
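One common way to train and decode against this normalized probability is shown below as a worked derivation. The source does not state its loss function, so the negative log-likelihood objective is an assumption. Here $\mathrm{score}(x, y)$ denotes the sequence score, $p(y \mid x)$ the Softmax-normalized probability, and $y'$ ranges over all candidate tag sequences of the sentence.

```latex
\begin{aligned}
-\log p(y \mid x)
  &= -\log \frac{\exp\big(\mathrm{score}(x, y)\big)}{\sum_{y'} \exp\big(\mathrm{score}(x, y')\big)} \\
  &= \log \sum_{y'} \exp\big(\mathrm{score}(x, y')\big) - \mathrm{score}(x, y), \\
y^{*} &= \arg\max_{y'} \, p(y' \mid x) = \arg\max_{y'} \, \mathrm{score}(x, y').
\end{aligned}
```

The last equality holds because the denominator is the same for every candidate, so the highest-scoring tag sequence is also the most probable one.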
Further, the interactive Chinese place name labeling platform is implemented through Python GUI programming with Tkinter.
Further, in step ten, the place name recognition model training code and model parameters are uploaded to a local server or to the cloud (Google Colaboratory) to optimize the model.
Beneficial effects: the method addresses the current scarcity and slow updating of place name corpus data and the time-consuming, labor-intensive and inefficient manual construction of place name corpora, enables intelligent updating of a place name labeling corpus for multi-source, dynamic, heterogeneous and exponentially growing internet text, and can be widely applied in fields such as ubiquitous geographic information mining, spatial location services, spatial information retrieval and natural language processing.
Drawings
FIG. 1 is a flowchart of an intelligent construction method of a place name labeling corpus based on interactive and iterative learning according to the present invention;
FIG. 2 is a partial data screenshot of a place name corpus in an embodiment of the present invention;
FIG. 3 is a screenshot of a portion of a stop word list in an embodiment of the invention;
FIG. 4 is a diagram of a pre-trained word vector model screenshot in an embodiment of the present invention;
FIG. 5 is a screenshot of a result of a match between a word in a dictionary and a pre-training word vector in an embodiment of the present invention;
FIG. 6 is the structure diagram of the Bi-LSTM and CRF integrated model in the embodiment of the present invention;
FIG. 7 is a flow chart of Chinese place name recognition based on Bi-LSTM and CRF integration in the embodiment of the present invention;
FIG. 8 is a screenshot of a CRF feature template in an embodiment of the present invention;
FIG. 9 is a screenshot of the results of training and evaluating the Bi-LSTM and CRF integrated models in the embodiment of the present invention;
FIG. 10 is an interface diagram of an interactive Chinese place name recognition and annotation platform in an embodiment of the present invention;
FIG. 11 is an interface diagram of the recognition result of the Chinese place name in the interactive annotation platform according to the embodiment of the present invention;
FIG. 12 is a diagram illustrating a human-computer interactive place name labeling result interface according to an embodiment of the present invention;
FIG. 13 is a diagram of an intelligent update interface of an annotation corpus in an embodiment of the present invention.
Detailed Description
The process of the present invention is described in further detail below with reference to specific examples.
As shown in FIG. 1, the intelligent construction method of a place name labeling corpus based on interactive and iterative learning disclosed in this embodiment integrates a bidirectional long short-term memory model (Bi-LSTM) with a CRF model to recognize place name entities in text. On this basis, a man-machine interactive Chinese place name labeling platform is built to recognize place names in internet text and to correct the recognition results through man-machine interaction. When the scale of the labeled Chinese place name text corpus reaches a set threshold, the initial training corpus and the newly labeled place name corpus are fused and input into the place name recognition model again for training, which optimizes the parameters of the model, improves its recognition effect, and adds the new corpus to the place name labeling corpus. These steps are iterated until the constructed corpus meets requirements, at which point the iterative training ends and intelligent construction and optimization of the place name corpus is realized.
The method mainly comprises three parts: a place name recognition model integrating Bi-LSTM and CRF, a man-machine interactive Chinese place name labeling method, and an intelligent construction method for the place name labeling corpus based on iterative learning. The detailed steps are as follows:
Step one: reading initial place name labeling corpus data
The place name corpus data of the geographic encyclopedia corpus and the Microsoft MSRA corpus are read (FIG. 2).
Step two: corpus data preprocessing
Sentences in the corpus data are separated by empty lines. The geographic encyclopedia corpus and the Microsoft MSRA place name corpus are then segmented with the Jieba tool, duplicate sentences are removed, and stop words are deleted (FIG. 3).
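The preprocessing described here can be sketched in a few lines of plain Python. The sketch assumes the text has already been segmented (for instance by Jieba) into whitespace-separated words and that sentences are separated by empty lines. The function name `preprocess` and the one-word stop list are illustrative, and realigning place name labels after stop word deletion is not handled.

```python
from typing import Iterable, List


def preprocess(raw: str, stop_words: Iterable[str]) -> List[List[str]]:
    """Split blank-line-separated sentences, drop duplicate sentences, delete stop words."""
    stops = set(stop_words)
    seen = set()
    cleaned: List[List[str]] = []
    for block in raw.split("\n\n"):          # sentences are separated by empty lines
        tokens = tuple(block.split())        # words were segmented beforehand
        if not tokens or tokens in seen:     # skip empty blocks and duplicate sentences
            continue
        seen.add(tokens)
        cleaned.append([t for t in tokens if t not in stops])
    return cleaned
```

Deduplication is done on the full token sequence before stop words are removed, so two sentences that differ only in a stop word are both kept.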
Step three: word2 vec-based generation of locality name corpus word vector matrix
Firstly, mixing geographic encyclopedia linguistic data and Microsoft MSRA linguistic data, and training by using a Word2vec tool to obtain a character-level Word vector model (figure 4);
the training parameters were as follows:
Figure BDA0002249845070000071
a minimum number of occurrences of the training word is required: min _ count =5;
Figure BDA0002249845070000072
word vector scale (dimension): size =100;
Figure BDA0002249845070000073
number of words passed to thread per batch: batch _ words =10000;
Figure BDA0002249845070000074
training a window: window =5;
Figure BDA0002249845070000075
training algorithm: sg =1 (sg =0 for cbow algorithm, sg =1 for skip-gram algorithm);
Figure BDA0002249845070000076
thread: workers =4;
Figure BDA0002249845070000077
iteration times are as follows: iter =50.
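The parameter names listed above match the keyword arguments of the `Word2Vec` class in the gensim library, so the sketch below assumes gensim was the Word2vec tool used; the source does not name the library. In gensim 4.x the arguments `size` and `iter` were renamed to `vector_size` and `epochs`, and the helper names `to_gensim4` and `to_char_sentences` are illustrative. The gensim call itself is shown only as a comment so the block runs without the library installed.

```python
from typing import Dict, List

# Parameter values as listed in the source (names follow older gensim releases).
SOURCE_PARAMS: Dict[str, int] = {
    "min_count": 5,
    "size": 100,
    "batch_words": 10000,
    "window": 5,
    "sg": 1,          # 1 = skip-gram, 0 = CBOW
    "workers": 4,
    "iter": 50,
}

# gensim 4.x renamed these two keyword arguments.
RENAMED = {"size": "vector_size", "iter": "epochs"}


def to_gensim4(params: Dict[str, int]) -> Dict[str, int]:
    """Translate the listed parameter names to their gensim 4.x equivalents."""
    return {RENAMED.get(name, name): value for name, value in params.items()}


def to_char_sentences(sentences: List[str]) -> List[List[str]]:
    """Character-level input: every sentence becomes a list of single characters."""
    return [list(sentence) for sentence in sentences]


# With gensim 4.x installed, training would look like this (not executed here):
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences=to_char_sentences(corpus), **to_gensim4(SOURCE_PARAMS))
```

Feeding single characters rather than segmented words is what makes the resulting model character-level, which is why each character later maps to one 100-dimensional vector.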
Step four: word vector matrix generation for words of a corpus data set of place names
Each character in the place name labeling corpus is represented by a word vector model, and a word vector matrix of each character is generated 1×100 (FIG. 5).
Step five: disambiguation matrix generation for the characters of the place name corpus data set
Word segmentation and part-of-speech tagging are performed on the sentences with the Jieba tool. Based on the word segmentation result, each character in a sentence is assigned one of 4 categories, represented by the numbers 0, 1, 2 and 3: 0 indicates that the character is a single-character word, 1 that it is the beginning of a word, 2 that it is in the middle of a word, and 3 that it is the end of a word. For example, the five-character sentence "我是中国人" ("I am Chinese") can be expressed as [0,0,1,2,3]. A 1×20 vector matrix (character disambiguation matrix for short) is generated for each character in the sentence based on the word segmentation result, in order to eliminate the multiple semantic readings of a character. For example, "上" ("up") may be an independent preposition or a character in the word "上海" (Shanghai).
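The 0/1/2/3 position categories can be derived directly from a segmentation result. The sketch below takes a list of already segmented words (which Jieba would produce in the described method) and returns one category per character; the function name `segmentation_tags` is illustrative. How each category is then expanded into the 1×20 disambiguation vector is not specified in the source and is not shown.

```python
from typing import List


def segmentation_tags(words: List[str]) -> List[int]:
    """Map segmented words to per-character tags: 0 single, 1 begin, 2 middle, 3 end."""
    tags: List[int] = []
    for word in words:
        if len(word) == 1:
            tags.append(0)                              # single-character word
        else:
            tags.extend([1] + [2] * (len(word) - 2) + [3])  # begin, middles, end
    return tags


# Assumed segmentation of a five-character sentence into three words.
example = segmentation_tags(["我", "是", "中国人"])
```

With this input, `example` is `[0, 0, 1, 2, 3]`. The character "上" receives 0 when it stands alone and 1 when it opens the word "上海", which is the distinction the disambiguation matrix is meant to carry.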
Step six: bi-LSTM and CRF integration-based place name recognition model
And splicing the word vector matrix of each character in the sentence with the disambiguation matrix of the corresponding character to finally obtain the word vector matrix of the sentence, inputting a Bi-LSTM and CRF integrated place name recognition model for training, and selecting an optimal place name recognition model by adopting three evaluation indexes of accuracy P, recall rate R and comprehensive value F in the natural language processing field (see fig. 6 and 7). The method specifically comprises the following steps:
step1: and splicing the word vector matrix of each character in the sentence with the disambiguation matrix of the corresponding character to obtain the word vector matrix of the sentence, wherein the word vector matrix is used as an input layer (a first layer of the model), and the input Bi-LSTM is used for training.
Step2: the Dropout regularization method is set to prevent overfitting of the model. Dropout randomly discards a part of the input during the training process, and the parameters corresponding to the discarded part are not updated. Equivalent to Dropout is an integrated approach, combining all sub-net results, and by randomly dropping the input, various sub-nets can be obtained.
Step3: inputting the sentence sequence (x) of the layer 1 ,x 2 ,…x n ) As input for each time step of Bi-LSTM, where x i Representing the ith word in the sentence; then the forward LSTM is output with the sequence (f) 1 ,f 2 ,…f n ) With an inverse LSTM hidden input sequence (b) 1 ,b 2 ,…b n ) Splicing according to positions to obtain a complete hidden output sequence (f) 1 ,f 2 ,…f n ,b 1 ,b 2 ,…b n ) And the semantic description information above and below is fully considered, so that deep learning and representation of the features are realized.
Step4: after Dropout is set, a linear layer is accessed, the purpose is to convert the complete hidden output sequence from 2 dimensions to k dimensions, wherein n represents the number of words in a sentence, k is the label category number of a label set, and the total number of 4 labels (B represents place name initial words, I represents place name intermediate words, E represents place name ending words, and O represents non-place name words) in the label corpus is B, I, E, O (B represents place name initial words, I represents place name intermediate words, E represents place name ending words, and O represents non-place name words) and is marked as a matrix P n×k Thereby automatically extracting sentence features.
Step5: setting Dropout to prevent overfitting of the model based on the Bi-LSTM model output layer matrix; and inputting the sentence sequence into a CRF model for sentence sequence annotation, namely predicting the label of each word.
If one keeps a tag sequence y = (y) with length equal to sentence length 1 ,y 2 ,...,y n ) Then the model scores for sentence x with a label equal to y as:
Figure BDA0002249845070000081
wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0002249845070000082
output y for the ith position i The probability of (c) is the initial score. />
Figure BDA0002249845070000083
Is from y i-1 To y i The transition probability of (2) is the transition score. The score of the whole sequence is equal to the score of each positionThe sum of points, and scoring for each position is achieved by two parts, one part being output by LSTM->
Figure BDA0002249845070000084
The other part is determined by the transfer matrix A of the CRF. The normalized probability using Softmax is:
Figure BDA0002249845070000085
where the numerator represents the index value where the model equates to the scoring index value of y for sentence x's tag and the denominator represents the index sum where the model equates to the scoring index sum for y for all sentence tags. And sequencing sentences according to the obtained normalized probability to achieve the purpose of Chinese place name identification.
Step6: The optimal place name recognition model is selected using three evaluation indexes from the field of natural language processing: precision P, recall R, and the comprehensive value F.
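The CRF scoring and Softmax normalization described in Step5 can be illustrated with a minimal Python sketch. It is not the patented implementation: the emission and transition values are toy numbers, the `start` scores for the first position are an assumption, and the denominator is computed by brute-force enumeration of all tag sequences, which is only feasible for very short sentences (a practical CRF layer would use dynamic programming instead).

```python
import math
from itertools import product

TAGS = ["B", "I", "E", "O"]  # label set named in Step4


def sequence_score(emissions, transitions, start, tags):
    """Sum, over positions, of the emission score and the transition score."""
    score = 0.0
    prev = None
    for i, tag in enumerate(tags):
        score += emissions[i][tag]
        # Assumption: the first position uses a start score instead of a transition.
        score += start[tag] if prev is None else transitions[prev][tag]
        prev = tag
    return score


def sequence_probability(emissions, transitions, start, tags):
    """Softmax of the sequence score over all k**n candidate tag sequences."""
    n = len(emissions)
    numerator = math.exp(sequence_score(emissions, transitions, start, tags))
    denominator = sum(
        math.exp(sequence_score(emissions, transitions, start, candidate))
        for candidate in product(TAGS, repeat=n)
    )
    return numerator / denominator


def best_sequence(emissions, transitions, start):
    """Highest-scoring tag sequence, found by exhaustive search (toy sizes only)."""
    n = len(emissions)
    return max(
        product(TAGS, repeat=n),
        key=lambda c: sequence_score(emissions, transitions, start, c),
    )


def one_hot_scores(tag, value=3.0):
    """Toy emission scores that favour a single tag at one position."""
    return {t: (value if t == tag else 0.0) for t in TAGS}


# Hypothetical scores for a 3-character sentence: a 2-character place name, then O.
emissions = [one_hot_scores("B"), one_hot_scores("E"), one_hot_scores("O")]
transitions = {a: {b: 0.0 for b in TAGS} for a in TAGS}
transitions["B"]["E"] = 1.0
transitions["E"]["O"] = 1.0
start = {t: 0.0 for t in TAGS}

print(best_sequence(emissions, transitions, start))
```

Because the probabilities are normalized over every candidate sequence, they sum to 1, and the sequence with the highest score is also the one with the highest probability.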
Step seven: human-computer interactive Chinese place name labeling
First, an interactive Chinese place name labeling platform is developed with Python GUI programming (Tkinter), the Chinese place name recognition model from step six is embedded in it, and place name recognition is performed on Internet text. Next, the Chinese place name recognition results are corrected through human-computer interaction. Finally, the place names finally recognized in the Internet text, the place name labels that were added, and the wrongly assigned place name labels that were deleted are displayed visually in their corresponding windows.
Step eight: updating of place name labeling corpus
When the Chinese place name text corpus labeled in step seven reaches a set number of words (the threshold), the interactive place name labeling platform automatically fuses the initial training corpus with the newly labeled text corpus, thereby updating the place name labeling corpus.
Step nine: optimization of iterative Chinese place name recognition model
The training code and model parameters of the place name recognition model from step six are uploaded to a local server or to cloud-based Google Colaboratory. Training continues with the place name corpus generated in step eight as the training corpus, which optimizes the model parameters and improves the recognition performance of the model. The model training progress and the final precision, recall, and F value are displayed on the interactive labeling platform.
Step ten: intelligent optimization of place name labeling corpus
Steps two through nine are iterated in a loop to intelligently optimize the labeled corpus. Iterative training ends once the place name recognition performance and the size of the place name corpus meet the user's requirements.
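The threshold-triggered update and retraining loop of steps eight through ten can be sketched as follows. This is an illustrative sketch only: the class name `CorpusUpdater`, the representation of a sentence as a list of (character, label) pairs, and the `retrain` callback are all hypothetical, and the real platform's retraining step is stood in for by an arbitrary callable.

```python
from dataclasses import dataclass, field


@dataclass
class CorpusUpdater:
    """Collects corrected sentences and merges them once a size threshold is reached."""

    threshold: int                                 # labeled characters that trigger a merge
    corpus: list = field(default_factory=list)     # current training corpus
    pending: list = field(default_factory=list)    # corrected sentences not yet merged

    def pending_size(self):
        """Number of labeled characters waiting to be merged."""
        return sum(len(sentence) for sentence in self.pending)

    def add_corrected(self, sentence, retrain):
        """Store one corrected sentence; merge and retrain when the threshold is reached.

        `sentence` is a list of (character, label) pairs and `retrain` is any
        callable that accepts the merged corpus (a stand-in for model training).
        Returns True if a merge and retraining were triggered.
        """
        self.pending.append(sentence)
        if self.pending_size() >= self.threshold:
            self.corpus.extend(self.pending)
            self.pending.clear()
            retrain(self.corpus)
            return True
        return False


# Usage with a tiny threshold and a retrain stub that only records the corpus size.
history = []
updater = CorpusUpdater(threshold=5, corpus=[[("我", "O")]])
updater.add_corrected([("南", "B"), ("京", "E"), ("好", "O")], lambda c: history.append(len(c)))
updater.add_corrected([("合", "B"), ("肥", "E"), ("好", "O")], lambda c: history.append(len(c)))
```

Keeping the pending sentences separate from the training corpus means that retraining happens only at merge points, which mirrors the batch-style update the description outlines.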
The main part of the scheme of the embodiment of the present invention will be further explained below with reference to specific experimental examples.
Part one: Chinese place name recognition method based on the integration of Bi-LSTM and CRF
The method uses three corpora: the geographic encyclopedia corpus, the Microsoft MSRA corpus, and a corpus that mixes the two (hereinafter the mixed corpus).
The geographic encyclopedia corpus contains about 1.18 million words, of which the training set accounts for about 82%, the validation set about 5%, and the test set about 13%. It is a place-name thematic corpus: place name entities are numerous and evenly distributed in the text, and the text describes rich geographic semantic relationships.
The Microsoft MSRA corpus contains about 2.36 million words, of which the training set accounts for about 85%, the validation set about 7%, and the test set about 8%. Its place name entities are few in number, sparse, and unevenly distributed.
The mixed corpus contains about 3.57 million words, of which the training set accounts for about 85%, the validation set about 6%, and the test set about 9%. Its place name entities are moderate in number and evenly distributed in the text.
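As a quick arithmetic check, the approximate split sizes implied by the percentages above can be computed as below. The totals and percentages are the rounded figures from the text, so the resulting counts are approximations rather than the exact sizes of the authors' data sets; the dictionary and function names are illustrative.

```python
# Approximate corpus sizes (in words) and train/validation/test percentages from the text.
CORPORA = {
    "geographic encyclopedia": (1_180_000, (82, 5, 13)),
    "Microsoft MSRA": (2_360_000, (85, 7, 8)),
    "mixed": (3_570_000, (85, 6, 9)),
}


def split_sizes(total, percents):
    """Return approximate word counts for the training, validation, and test sets."""
    train, validation, test = (total * p // 100 for p in percents)
    return {"train": train, "validation": validation, "test": test}


for name, (total, percents) in CORPORA.items():
    print(name, split_sizes(total, percents))
```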
This example sets up 7 comparative experiments (see Table 1) to evaluate the effectiveness of the method.
Table 1. Experimental setup for place name recognition
Figure BDA0002249845070000091
Figure BDA0002249845070000101
(1) Experiments one, two, and three
Experiments one, two, and three apply a Chinese place name recognition method based on the traditional statistical model CRF to different corpora. The same feature templates (see FIG. 8) are used, each corpus is trained separately to obtain its own CRF model, and the model evaluation results are shown in Table 2.
Table 2. Place name recognition evaluation results of experiments one, two, and three
Figure BDA0002249845070000102
(2) Experiment four
First, deduplication and stop-word deletion are performed on the geographic encyclopedia data set, and a word vector matrix for each word in the data set is randomly generated with the Word2vec tool. The word vector matrix is then input into Bi-LSTM + CRF for training to obtain a model. The training parameter settings of the Bi-LSTM model are shown in Table 3, and the evaluation results are shown in Table 4.
Table 3. Training parameter settings of the Bi-LSTM model
Figure BDA0002249845070000103
Figure BDA0002249845070000111
Table 4. Place name recognition evaluation results of experiment four
Figure BDA0002249845070000112
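The deduplication and stop-word deletion used in experiment four can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: sentences are taken to be separated by blank lines (as in the preprocessing step of the claims), the stop-word list is a made-up one-entry example, and deduplication is applied before stop-word deletion, an ordering the text does not specify.

```python
def preprocess(raw_text, stop_words):
    """Split on blank lines, drop duplicate sentences, and delete stop words."""
    seen = set()
    cleaned = []
    for block in raw_text.split("\n\n"):
        sentence = block.strip()
        if not sentence or sentence in seen:
            continue  # skip empty blocks and sentences already kept
        seen.add(sentence)
        for word in stop_words:
            sentence = sentence.replace(word, "")
        if sentence:
            cleaned.append(sentence)
    return cleaned


# Hypothetical input: the first sentence appears twice, and "的" is treated as a stop word.
raw = "南京位于江苏省\n\n南京位于江苏省\n\n合肥是安徽的省会"
print(preprocess(raw, stop_words=["的"]))
```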
(3) Experiments five, six, and seven
Experiments five, six, and seven apply the same place name recognition method, the integration of a bidirectional long short-term memory model with a CRF model, to different corpora, so their experimental steps are identical.
First, the geographic encyclopedia corpus and the Microsoft corpus are mixed, deduplication and stop-word deletion are performed, and a character-level word vector model is trained with the Word2vec tool. Each character in the place name labeling corpus is represented with this model to generate its word vector matrix. Then the Jieba tool performs word segmentation and part-of-speech tagging on each sentence to generate a disambiguation matrix for each character, which is spliced with the word vector matrix of that character and input into the Bi-LSTM model for training. The model results of 100 training rounds are evaluated and compared to obtain the optimal model (see FIG. 9). The evaluation results are shown in Table 5.
Table 5. Place name recognition evaluation results of experiments five, six, and seven
Figure BDA0002249845070000113
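The splicing of character vectors with segmentation-based disambiguation vectors used in experiments five through seven can be sketched as follows. The dimensions 100 and 20 follow the claims; everything else is an assumption made for illustration: the vectors are random toy values rather than trained Word2vec embeddings, the segmentation is passed in as a list of words instead of being produced by Jieba, and the encoding of the segmentation result as a B/M/E/S position tag looked up in a fixed table is hypothetical, since the text does not say how the 1 × 20 matrix is built.

```python
import random

CHAR_DIM, SEG_DIM = 100, 20  # dimensions stated in the claims
_rng = random.Random(0)      # fixed seed so the toy vectors are reproducible


def toy_vector(dim):
    """Random stand-in for a trained embedding vector."""
    return [_rng.uniform(-1.0, 1.0) for _ in range(dim)]


# Hypothetical lookup table: one 20-dimensional vector per position-in-word tag.
SEG_TABLE = {tag: toy_vector(SEG_DIM) for tag in "BMES"}


def position_tags(words):
    """Map a segmented sentence to one tag per character (B/M/E, or S for single characters)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def sentence_matrix(words, char_vectors):
    """Concatenate each character vector with its segmentation vector (n rows of 120 values)."""
    chars = [ch for word in words for ch in word]
    tags = position_tags(words)
    return [char_vectors[ch] + SEG_TABLE[tag] for ch, tag in zip(chars, tags)]


# Usage with a hand-segmented sentence standing in for the segmenter's output.
words = ["南京", "位于", "江苏省"]
char_vectors = {ch: toy_vector(CHAR_DIM) for ch in "南京位于江苏省"}
matrix = sentence_matrix(words, char_vectors)
print(len(matrix), len(matrix[0]))
```

The point of the extra 20 columns is that the same character receives a different input row depending on where it falls inside a segmented word, which gives the Bi-LSTM a word-boundary signal that a purely character-level vector lacks.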
On the same corpus, the method improves precision, recall, and the comprehensive F value over the traditional CRF-based Chinese place name recognition method (see Table 6).
Table 6. Comparison of place name recognition evaluation results for the same corpus under different recognition models
Figure BDA0002249845070000114
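The three evaluation indexes used in the experiments above can be computed at the entity level as in the following sketch. Two assumptions are made that the text does not state: the comprehensive value F is taken to be the harmonic mean of P and R (the usual F1), and a place name is counted only when it is opened by a B label and closed by an E label, and it is correct only if both boundaries match the reference.

```python
def extract_spans(tags):
    """Return the set of (start, end) index pairs of place names in a B/I/E/O tag sequence."""
    spans = set()
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i
        elif tag == "E" and start is not None:
            spans.add((start, i))
            start = None
        elif tag == "O":
            start = None  # an unfinished span is discarded
    return spans


def precision_recall_f(gold_tags, pred_tags):
    """Entity-level precision, recall, and F value (harmonic mean, by assumption)."""
    gold = extract_spans(gold_tags)
    pred = extract_spans(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Toy example: the reference has two place names and the prediction gets one of them right.
gold = ["B", "E", "O", "B", "I", "E", "O"]
pred = ["B", "E", "O", "O", "B", "E", "O"]
print(precision_recall_f(gold, pred))
```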
Part two: Intelligent construction method of a place name corpus based on interactive and iterative learning
Step1: First, an interactive Chinese place name labeling platform (see FIG. 10) is developed with Python GUI programming (Tkinter), and the Chinese place name recognition model based on the integration of Bi-LSTM and CRF is embedded in it. When the 'place name recognition' button is clicked, place name entity recognition is performed on the input Internet text and place name labels are attached to the place names automatically (see FIG. 11).
Step2: The Chinese place name recognition results are corrected manually and interactively. For a missed place name, the user selects it by clicking and chooses the 'set as place name' function to add a place name label. For a wrongly recognized place name, the user right-clicks the wrong place name label and chooses the 'cancel setting' function to delete the corresponding label.
Step3: The place names finally recognized in the Internet text, the place name labels that were added, and the wrongly assigned place name labels that were deleted are displayed visually in their corresponding windows (see FIG. 12).
Step4: The final labeling result is saved by clicking the 'save place name labeling result' button. When the cumulative number of words of saved Internet text with place name labels exceeds a threshold (set to 100,000 words), the platform automatically fuses the initial training corpus with the place-name-labeled text corpus and inputs the result into the Bi-LSTM and CRF integrated Chinese place name recognition model of part one for retraining, which optimizes the model parameters and improves the recognition performance. The progress of model training and the final precision, recall, and F value are displayed on the interface (see FIG. 13).
Step5: The newly added corpus is added to the place name labeling corpus, and Step1 through Step4 are iterated in a loop. Iterative training ends once the place name recognition performance and the size of the place name corpus meet the user's requirements.
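The interactive correction in Step2 and the conversion of the corrected result back into training labels can be sketched without any GUI code. This is an illustration only: place names are represented as (start, end) character index pairs with an inclusive end, the function names are hypothetical, and a single-character place name is written as a lone B label because the four-label set has no dedicated single-character label.

```python
def apply_corrections(predicted, added=(), removed=()):
    """Combine model output with human corrections.

    `added` plays the role of 'set as place name' and `removed` the role of
    'cancel setting'; all arguments are iterables of (start, end) pairs.
    """
    spans = set(predicted)
    spans |= set(added)
    spans -= set(removed)
    return sorted(spans)


def spans_to_tags(length, spans):
    """Turn place name spans into one B/I/E/O label per character."""
    tags = ["O"] * length
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
        if end > start:
            tags[end] = "E"  # assumption: a one-character span keeps only the B label
    return tags


# Usage: the model wrongly labels characters 0-1 and misses the place name at 5-7.
text = "我从南京去合肥市"
predicted = [(0, 1), (2, 3)]
corrected = apply_corrections(predicted, added=[(5, 7)], removed=[(0, 1)])
labels = spans_to_tags(len(text), corrected)
print(list(zip(text, labels)))
```

Sentences labeled this way are in the same character-plus-label form as the training corpus, so they can be appended to it directly when the threshold in Step4 is reached.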

Claims (5)

1. An intelligent construction method of a place name labeling corpus based on interactive and iterative learning, characterized by comprising the following steps:
step one: reading initial place name labeling corpus data, comprising the geographic encyclopedia corpus and the Microsoft MSRA corpus;
step two: preprocessing the place name labeling corpus data, wherein the preprocessing comprises separating sentences with blank lines and performing deduplication and stop-word deletion on the sentences;
step three: mixing the geographic encyclopedia corpus and the Microsoft MSRA corpus, and training with the Word2vec tool to obtain a character-level word vector model;
step four: representing each character in the place name labeling corpus with the word vector model to generate a 1 × 100 word vector matrix for each character;
step five: using the Jieba tool to perform word segmentation and part-of-speech tagging on the sentence, and generating, based on the word segmentation result, a 1 × 20 vector matrix for each character in the sentence as the disambiguation matrix of that character;
step six: splicing the word vector matrix of each character in the sentence with the disambiguation matrix of the corresponding character to obtain the word vector matrix of the sentence, inputting it into a Bi-LSTM and CRF integrated place name recognition model for training, and selecting the optimal place name recognition model using three evaluation indexes from the field of natural language processing: precision P, recall R, and the comprehensive value F;
step seven: developing an interactive Chinese place name labeling platform, and embedding the place name recognition model of step six into it;
step eight: performing place name recognition on new Internet text in the interactive place name labeling platform, and correcting the place name recognition result through human-computer interaction; visually displaying, in corresponding windows, the place names finally recognized in the Internet text, the place name labels that were added, and the wrongly assigned place name labels that were deleted;
step nine: when the size of the place name text corpus labeled in step eight reaches a set threshold, the interactive place name labeling platform automatically fuses the place name corpus corrected through human-computer interaction with the initial place name labeling corpus data, thereby updating the place name corpus;
step ten: continuing training, on the basis of the training code and model parameters of the place name recognition model in step six, with the place name corpus generated in step nine as the training corpus, optimizing the model parameters and improving the recognition performance of the model; displaying the model training progress and the final precision, recall, and F value on the interactive labeling platform;
step eleven: for new Internet text, iterating steps two through ten in a loop to intelligently update and optimize the place name labeling corpus, and ending iterative training once the place name recognition performance and the size of the place name labeling corpus meet the user's requirements.
2. The intelligent construction method of the place name labeling corpus based on interactive and iterative learning according to claim 1, wherein step six specifically comprises:
step1: splicing the word vector matrix of each character in the sentence with the disambiguation matrix of the corresponding character to obtain the word vector matrix of the sentence, and using it as the input layer that is input into the Bi-LSTM for training;
step2: setting the Dropout regularization method to prevent model overfitting;
step3: taking the sentence sequence $(x_1, x_2, \ldots, x_n)$ of the input layer as the input for each time step of the Bi-LSTM, where n is the number of words in the sentence and $x_i$ is the i-th word in the sentence, and then splicing, position by position, the hidden sequence $(f_1, f_2, \ldots, f_n)$ output by the forward LSTM with the hidden sequence $(b_1, b_2, \ldots, b_n)$ output by the backward LSTM to obtain the complete hidden output sequence $(f_1, f_2, \ldots, f_n, b_1, b_2, \ldots, b_n)$, which fully takes into account the semantic information of the preceding and following context and realizes deep learning and representation of features;
step4: after Dropout is set, attaching a linear layer in order to convert the complete hidden output sequence from 2 dimensions to k dimensions, recorded as the matrix $P_{n \times k}$, where k is the number of label categories in the label set, which comprises 4 labels in total, B, I, E, and O, with B marking the initial word of a place name, I a middle word of a place name, E the ending word of a place name, and O a non-place-name word, so that sentence features are extracted automatically;
step5: on the basis of the Bi-LSTM output-layer matrix of Step4, setting Dropout to prevent overfitting of the model, inputting the result into a CRF model to label the sentence sequence, and predicting the label of each word;
step6: selecting the optimal place name recognition model using three evaluation indexes from the field of natural language processing: precision P, recall R, and the comprehensive value F.
3. The intelligent construction method of place name labeling corpus based on interactive and iterative learning as claimed in claim 2, wherein the sentence sequence labeling based on the CRF model in Step5 is specifically:
for a tag sequence $y = (y_1, y_2, \ldots, y_n)$ whose length equals the sentence length, the model's score for labeling sentence $x$ with $y$ is:

$$\mathrm{score}(x, y) = \sum_{i=1}^{n} \left( P_{i, y_i} + A_{y_{i-1}, y_i} \right)$$

wherein $P_{i, y_i}$ is the score of outputting $y_i$ at the $i$-th position and $A_{y_{i-1}, y_i}$ is the score of moving from $y_{i-1}$ to $y_i$, with $y_0$ taken as the start of the sentence; the score of the whole sequence equals the sum of the scores of all positions, and the score of each position has two parts, one part, $P_{i, y_i}$, being output by the LSTM and the other part being determined by the transition matrix $A$ of the CRF;
the probability normalized with Softmax is:

$$p(y \mid x) = \frac{\exp\left(\mathrm{score}(x, y)\right)}{\sum_{y'} \exp\left(\mathrm{score}(x, y')\right)}$$

wherein the numerator is the exponential of the model's score for labeling sentence $x$ with $y$, and the denominator is the sum of the exponentials of the scores of all candidate tag sequences $y'$ for the sentence; the tag sequences of a sentence are ranked by the obtained normalized probability, which achieves place name recognition.
4. The intelligent construction method of the place name labeling corpus based on interactive and iterative learning according to claim 1, wherein the interactive Chinese place name labeling platform is implemented with Python GUI programming (Tkinter).
5. The intelligent construction method of the place name labeling corpus based on interactive and iterative learning according to claim 1, characterized in that, in step ten, the training code and model parameters of the place name recognition model are uploaded to a local server or to another server such as cloud-based Google Colaboratory, where the model is optimized.
CN201911029958.2A 2019-10-28 2019-10-28 Intelligent construction method of place name labeling corpus based on interactive and iterative learning Active CN110826331B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201911029958.2A CN110826331B (en) 2019-10-28 2019-10-28 Intelligent construction method of place name labeling corpus based on interactive and iterative learning
PCT/CN2020/085809 WO2021082366A1 (en) 2019-10-28 2020-04-21 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
AU2020103654A AU2020103654A4 (en) 2019-10-28 2020-04-21 Method for intelligent construction of place name annotated corpus based on interactive and iterative learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911029958.2A CN110826331B (en) 2019-10-28 2019-10-28 Intelligent construction method of place name labeling corpus based on interactive and iterative learning

Publications (2)

Publication Number Publication Date
CN110826331A CN110826331A (en) 2020-02-21
CN110826331B true CN110826331B (en) 2023-04-18

Family

ID=69550890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911029958.2A Active CN110826331B (en) 2019-10-28 2019-10-28 Intelligent construction method of place name labeling corpus based on interactive and iterative learning

Country Status (3)

Country Link
CN (1) CN110826331B (en)
AU (1) AU2020103654A4 (en)
WO (1) WO2021082366A1 (en)

Also Published As

Publication number Publication date
AU2020103654A4 (en) 2021-01-14
WO2021082366A1 (en) 2021-05-06
CN110826331A (en) 2020-02-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant