CN111144119A - Entity identification method for improving knowledge migration - Google Patents


Info

Publication number
CN111144119A
CN111144119A
Authority
CN
China
Prior art keywords
word
sentence
auxiliary
model
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911374613.0A
Other languages
Chinese (zh)
Other versions
CN111144119B (en)
Inventor
赵平
孙连英
涂帅
王金峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Union University
Original Assignee
Beijing Union University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Union University filed Critical Beijing Union University
Priority to CN201911374613.0A priority Critical patent/CN111144119B/en
Publication of CN111144119A publication Critical patent/CN111144119A/en
Application granted granted Critical
Publication of CN111144119B publication Critical patent/CN111144119B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/14Travel agencies
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a scenic spot entity identification method. It uses the idea of knowledge migration to address the difficulty of obtaining labeled data in the tourism domain, reduces the heavy dependence of deep-learning scenic spot recognition on the quantity and quality of labeled data, and uses a BERT + BiLSTM + CRF method incorporating a language model to handle the word-ambiguity problem typical of Chinese named entity recognition feature representation. The method evaluates the labeled data of existing auxiliary-domain samples at three levels (keyword, sentence and expandability) and uses the evaluation results to expand the target-domain training set. Experiments show that the invention achieves a clearly better recognition effect with only 1/4 of the labeled data. In addition, without requiring large-scale manual annotation, the existing labeled auxiliary-domain samples are used for expansion, reducing the time and effort spent on manual annotation without affecting recognition efficiency.

Description

Entity identification method for improving knowledge migration
Technical Field
The invention relates to a scenic spot entity identification method, and in particular to a scenic spot entity identification model with improved knowledge migration.
Background
Efficient information management and data mining over massive unstructured texts such as travel notes are important for research on question-answering systems, public-opinion analysis and personalized recommendation in the tourism domain, and the accuracy of scenic spot entity recognition directly affects information extraction in this field.
Current approaches to tourist-attraction recognition fall into two categories: machine-learning methods (hidden Markov models and conditional random fields) and deep-learning methods (convolutional neural networks). A hidden Markov model treats scenic spot recognition as a doubly stochastic process; it cannot exploit contextual semantic information and cannot resolve the word-ambiguity problem in text feature representation. Scenic spot words in the tourism domain often carry different meanings in different contexts: for example, "Huangshan" may refer to Huangshan City in Anhui Province (a place name) or to the Huangshan scenic area (a tourist attraction), so HMM-based recognition performance is mediocre. Conditional-random-field methods rely mainly on hand-built feature templates; in the tourism domain the scenic spot entities are too numerous to enumerate, building templates by hand is time- and labor-consuming, and contextual semantics still cannot be captured. Convolutional-neural-network methods recognize scenic spots efficiently but require a large amount of manually annotated corpora; the result depends heavily on annotation quality, manual annotation takes great effort, and the quality of an automatically annotated training corpus directly affects recognition performance.
Therefore, the biggest problems in tourist-attraction recognition today are: 1) for duplicated attraction names, the feature representation cannot distinguish the different meanings a scenic spot word takes in different contexts; 2) for a specific tourism domain, the scenic spot entities are too numerous to enumerate, hand-building feature templates is time- and labor-consuming, machine-learning algorithms require manually labeled data, the model depends heavily on label quality, and labeled data are difficult to obtain.
disclosure of Invention
The invention aims to solve the above problems and provides a scenic spot entity recognition model with improved knowledge migration. The auxiliary-domain text is normalized, labeled data, so the difficulty of migration lies in how to evaluate the similarity between the auxiliary domain and the target domain, and how to expand as much target-domain-relevant semantic information from the auxiliary domain as possible without causing negative transfer during feature extraction and knowledge migration.
To this end, the method proposes two measures tailored to tourism-domain text, keyword importance and sample expandability, to evaluate the quality of a sample, and designs three levels of similarity to evaluate the similarity between the auxiliary and target domains. Its advantage is that the target-domain training set is expanded with auxiliary-domain data, and scenic spots can be identified accurately and effectively.
In order to achieve the purpose, the invention adopts the following technical scheme:
a scenic spot entity recognition model for improving knowledge migration comprises the following specific steps:
The method comprises the following steps. Step one: use the auxiliary-domain training set to train a Chinese named entity recognition model with the BERT + BiLSTM + CRF method. The model comprises a BERT model, a BiLSTM and a CRF layer: the training set is passed through the BERT model to obtain text word vectors, the BiLSTM then learns contextual feature information for named entity recognition, and the CRF layer finally processes the BiLSTM output sequence.
Step two: train a word2vec model with the auxiliary-domain training set (the trained model is called the auxiliary-domain word vectorization model), and train a word2vec model with the target-domain training set (the trained model is called the target-domain word vectorization model);
step three: for each sample in the auxiliary-domain training set, calculate the importance of each word and sort the words in descending order of importance; the top m words are the auxiliary-domain keywords. For each sample in the target-domain training set, do the same; the top m words are the target-domain keywords;
step four: calculate the similarity between the auxiliary-domain keywords and the target-domain keywords obtained in step three, and set a keyword-level similarity threshold;
step five: calculate the similarity between the auxiliary-domain sentences and the target-domain sentences, and set a sentence-level similarity threshold;
step six: calculate the expandability of the auxiliary-domain samples, and set an expandability threshold;
step seven: expand the target-domain samples with the auxiliary-domain samples, and train a scenic spot entity recognition classifier on the expanded target-domain samples.
In step one, the named entity recognition model is built as follows:
(1-1) input the auxiliary-domain training set into the BERT model; the auxiliary-domain training set is a text set collected from the People's Daily and annotated with person, place and organization names, and the BERT model outputs text word vectors;
(1-2) input the text word vectors from step (1-1) into the BiLSTM and extract context information by deep learning;
(1-3) process the BiLSTM output sequence with the CRF layer, combining the state-transition matrix in the CRF with the constraints between adjacent labels to obtain the globally optimal sequence;
(1-4) the Chinese named entity recognition model outputs the predicted entity labels.
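The CRF decoding in step (1-3) amounts to a Viterbi search over the BiLSTM's per-token emission scores and the CRF's tag-transition matrix. A minimal pure-Python sketch; the BIO tag set and the hand-picked scores in the usage below are illustrative assumptions, not the patent's trained parameters:

```python
def viterbi_decode(emissions, transitions, tags):
    """Highest-scoring tag path given per-token emission scores
    (a list of {tag: score} dicts, as a BiLSTM would produce) and a
    tag-to-tag transition table (the CRF layer's parameters)."""
    score = dict(emissions[0])          # best score of a path ending in each tag
    back = []                           # back-pointers for path recovery
    for emission in emissions[1:]:
        ptr, new_score = {}, {}
        for t in tags:
            # best previous tag to move into tag t
            prev = max(tags, key=lambda p: score[p] + transitions[(p, t)])
            ptr[t] = prev
            new_score[t] = score[prev] + transitions[(prev, t)] + emission[t]
        back.append(ptr)
        score = new_score
    best = max(tags, key=lambda t: score[t])   # best final tag
    path = [best]
    for ptr in reversed(back):                 # follow back-pointers
        path.append(ptr[path[-1]])
    path.reverse()
    return path
```

With a transition table that rewards B followed by I and penalizes O followed by I, the decoder prefers well-formed BIO spans even when the emission scores alone are ambiguous.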
the second step comprises the following specific steps:
(2-1) the target-domain sample set consists of travel notes manually crawled from travel websites such as Mafengwo;
(2-2) segment the auxiliary-domain sample set with the jieba segmenter to obtain the auxiliary-domain segmented text, and segment the target-domain sample set with jieba to obtain the target-domain segmented text;
(2-3) load stop words and a user-defined dictionary, the user-defined dictionary consisting of words that should not be split apart by the segmenter;
(2-4) train a word2vec model on the auxiliary-domain segmented text to obtain the auxiliary-domain word vectorization model, and train a word2vec model on the target-domain segmented text to obtain the target-domain word vectorization model.
the third step comprises the following specific steps:
(3-1) Calculate the keyword frequency KF_{i,j} in each sentence of the auxiliary-domain samples, and KF'_{i,j} in each sentence of the target-domain samples, where the frequency of keyword i in sentence j is:

KF_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} is the number of times keyword i appears in sentence j and the denominator is the total number of word occurrences in sentence j.
(3-2) Calculate the inverse sentence frequency ISF for the auxiliary-domain samples and ISF' for the target-domain samples:

ISF_i = log( |S| / (1 + |{ j : t_i ∈ S_j }|) )

where SF denotes sentence frequency, ISF denotes inverse sentence frequency, ISF_i is the inverse sentence frequency of word i, |S| is the total number of sentences, |{ j : t_i ∈ S_j }| is the number of sentences S_j containing word t_i, and the added 1 prevents the denominator from becoming zero.
(3-3) For the auxiliary-domain samples, the importance of word i in sentence j is calculated as I(i, j) = KF_{i,j} * ISF_i;
(3-4) for the target-domain samples, the importance of word i in sentence j is calculated as I'(i, j) = KF'_{i,j} * ISF'_i.
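The importance score I(i, j) = KF_{i,j} * ISF_i of steps (3-1) to (3-4) can be sketched in pure Python. Normalizing KF by the sentence length (the standard term-frequency convention) is an assumption here, since the patent's formula image is not reproduced in the text:

```python
import math

def word_importance(sentences):
    """Rank (word, sentence-index) pairs by I(i, j) = KF_ij * ISF_i,
    the patent's sentence-level analogue of TF-IDF. `sentences` is a
    list of tokenized sentences (lists of words)."""
    total = len(sentences)
    vocab = {w for s in sentences for w in s}
    # inverse sentence frequency; the +1 keeps the denominator nonzero
    isf = {w: math.log(total / (1 + sum(1 for s in sentences if w in s)))
           for w in vocab}
    scores = []
    for j, s in enumerate(sentences):
        for w in set(s):
            kf = s.count(w) / len(s)   # keyword frequency within sentence j
            scores.append((w, j, kf * isf[w]))
    scores.sort(key=lambda x: x[2], reverse=True)
    return scores
```

Taking the top m entries of the returned list yields the domain keywords of step three; words that occur in every sentence get a non-positive ISF and sink to the bottom.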
The specific steps of step four are:
(4-1) For the auxiliary-domain keywords obtained in step three, compute their word vectors L_word = {l_1, l_2, ..., l_n} with the auxiliary-domain word2vec language model trained in step two;
(4-2) for the target-domain keywords obtained in step three, compute their word vectors M_word = {m_1, m_2, ..., m_n} with the target-domain word2vec language model trained in step two;
(4-3) calculate the keyword similarity of L_word and M_word by cosine similarity:

sim_word = (L_word · M_word) / (|L_word| * |M_word|)

(4-4) set the keyword-level similarity threshold in the range (0.4, 0.6);
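The cosine similarity of step (4-3), which is reused unchanged for the sentence-level comparison in step five, is a one-liner over plain Python lists:

```python
import math

def cosine_similarity(u, v):
    """sim = (u . v) / (|u| |v|); used for both the keyword-level
    comparison (sim_word) and the sentence-level one (sim_sen)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Collinear vectors score 1.0, orthogonal ones 0.0; the zero-norm guard avoids division by zero for empty or all-zero vectors.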
The specific steps of step five are:
(5-1) For each sentence x_s in the auxiliary-domain samples, compute its sentence vector with the auxiliary-domain word2vec language model trained in step two, obtaining L_sen = {l_1, l_2, ..., l_n};
(5-2) for each sentence x_t in the target-domain samples, compute its sentence vector with the target-domain word2vec language model trained in step two, obtaining M_sen = {m_1, m_2, ..., m_n};
(5-3) calculate the sentence-level similarity of L_sen and M_sen by cosine similarity:

sim_sen = (L_sen · M_sen) / (|L_sen| * |M_sen|)

(5-4) set the sentence-level similarity threshold in the range (0.4, 0.6);
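Steps (5-1) and (5-2) need a sentence vector built from the word2vec word vectors. The patent does not say how the sentence vector is formed; averaging the token vectors is a common choice and is purely an assumption in this sketch:

```python
def sentence_vector(tokens, word_vectors, dim=4):
    """Build a sentence vector by averaging the word2vec vectors of the
    sentence's tokens. Averaging is an assumption: the patent does not
    state how sentence vectors are derived from the word model.
    Out-of-vocabulary tokens are skipped."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]
```

The resulting vectors feed directly into the cosine similarity of step (5-3).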
Step six determines the sample expandability, as follows:
(6-1) Using the sim_sen and sim_word obtained above, calculate the sample expandability:

SEA = α * sim_sen + β * sim_word

where α is the weight of the sentence-level similarity in SEA and β is the weight of the keyword similarity;
(6-2) set the expandability threshold in the range (0.4, 0.6);
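The SEA score of step (6-1) is a weighted sum of the two similarities. Taking β = 1 − α so the weights sum to one is an assumption in this sketch; the patent only constrains α to (0, 0.5) in claim 9:

```python
def expandability(sim_sen, sim_word, alpha=0.4):
    """SEA = alpha * sim_sen + beta * sim_word. Taking beta = 1 - alpha
    (so the weights sum to one) is an assumption; the patent only
    bounds alpha to the range (0, 0.5)."""
    beta = 1.0 - alpha
    return alpha * sim_sen + beta * sim_word
```

Because α < 0.5, keyword similarity dominates the score, which matches the patent's emphasis on keyword-level evaluation.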
The specific steps of step seven are:
(7-1) According to the keyword-similarity threshold, expand the samples with high keyword similarity into the target-domain sample set;
(7-2) according to the sentence-level similarity threshold, expand the samples with high sentence similarity into the target-domain samples;
(7-3) according to the expandability threshold, expand the samples with high expandability into the target-domain samples.
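Steps (7-1) to (7-3) can be sketched as a single filter over pre-scored auxiliary samples. Reading the three criteria as independent admission rules (a sample passes if any one score clears its threshold) is an interpretation; the dict field names and the mid-range default thresholds are illustrative:

```python
def expand_training_set(target_texts, aux_samples,
                        kw_thr=0.5, sen_thr=0.5, sea_thr=0.5):
    """Append an auxiliary-domain sample to the target training set
    when any of its three scores clears the matching threshold.
    Thresholds default to the middle of the patent's (0.4, 0.6) range;
    the dict field names are illustrative assumptions."""
    expanded = list(target_texts)
    for s in aux_samples:
        if (s["sim_word"] > kw_thr
                or s["sim_sen"] > sen_thr
                or s["sea"] > sea_thr):
            expanded.append(s["text"])
    return expanded
```

The expanded list is then what step seven's BERT + BiLSTM + CRF classifier is retrained on.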
Advantageous effects:
The invention uses the idea of knowledge migration to address the difficulty of obtaining labeled data in the tourism domain, while reducing the heavy dependence of deep-learning scenic spot recognition on labeled data and its quality. Using transfer learning, the labeled data of existing auxiliary-domain samples are evaluated at three levels (keyword, sentence and expandability), and the evaluation results are used to expand the target-domain training set.
Extensive experiments show that the invention achieves a clearly better recognition effect with only 1/4 of the labeled data. In addition, without requiring large-scale manual annotation, the existing labeled auxiliary-domain samples can be used for expansion to realize entity extraction on target-domain samples, greatly reducing the time and effort of manual annotation without affecting recognition efficiency.
Drawings
FIG. 1 is a diagram of the algorithm structure of the present invention
FIG. 2 is a diagram of classifier models
FIG. 3 is a graph of classifier hierarchy verification
FIG. 4 is a graph of similarity thresholds for different keywords
FIG. 5 is a graph of similarity thresholds for different sentences
FIG. 6 is a graph of different SEAs
FIG. 7 is a graph of sample size impact results for different target domains
Detailed Description
The invention will be further explained with reference to the following drawings and examples
As shown in FIG. 1, the entity recognition model with improved knowledge migration is applied through the following specific steps:
(1) For the auxiliary-domain samples X_s and a small number of target-domain samples X_t, train a Chinese entity recognition classifier C_1(x) and a scenic spot entity recognition classifier C_2(x) using the classifier model of FIG. 2. C_1(x) addresses the word-polysemy problem in Chinese named entity recognition; C_2(x) addresses scenic spot entity recognition within tourism-domain named entity recognition. The classifier design of FIG. 2 uses a BERT-embedded entity recognition model to resolve the ambiguity of Chinese word representations during feature extraction, a bidirectional long short-term memory network to learn contextual feature information, and a conditional random field to process the previous layer's output sequence, combining the CRF state-transition matrix to extract the globally optimal sequence.
Test C_1(x) with the auxiliary-domain test set to obtain the result curves shown in FIG. 3, where the P value is precision, the R value is recall, and the F value is the combined evaluation metric.
(2) Initialize the similarity threshold m and the proportion μ of the small labeled target-domain sample set within the expanded training set;
(3) For each target-domain training set X_t = Tr_t, preprocess X_s and X_t and train the corresponding language models. For any x_s ∈ X_s there is a vector v(x_s) = (v_1, v_2, ..., v_n), where n is the vector dimension; likewise, for any x_t ∈ X_t there is v(x_t) = (v_1, v_2, ..., v_n).
(4) Compute keyword frequency, text similarity and expandability for each sample in the auxiliary domain and for the samples in the target domain:
① For each sample, calculate the importance of its words and take the top m most relevant keywords, using I(i, j) = KF_{i,j} * ISF_i, where KF_{i,j} is the frequency of keyword i in sentence j and ISF_i is the inverse sentence frequency of word i;
② for each pair of sentence vectors v_sen(x_s) ∈ v(x_s) and v_sen(x_t) ∈ v(x_t), calculate the sentence-level text similarity by cosine similarity;
③ for each auxiliary-domain sample, calculate its expandability as SEA = α * sim_sen + β * sim_word, where α and β are weight coefficients;
(5) According to the SEA values obtained in ③, expand the auxiliary-domain samples with stronger expandability into the target-domain sample set Tr_t to obtain the expanded training set;
(6) train a new scenic spot entity recognition model C(x) on the expanded training set with the BERT + BiLSTM + CRF method;
(7) obtain a set of recognition results with the target-domain test set Te_t;
(8) update the value of m and repeat the verification to obtain the experimental results shown in FIG. 5;
(9) update the SEA threshold and repeat the verification to obtain the experimental results shown in FIG. 6;
(10) update the value of μ and repeat the verification to obtain the experimental results shown in FIG. 7.
It can be seen that with labeled data from only 1/4 of the target domain, the proposed method reaches an accuracy of 95.06% on the test set.

Claims (10)

1. An entity identification method for improving knowledge migration, characterized by:
the method comprises the following steps. Step one: use the auxiliary-domain training set to train a Chinese named entity recognition model with the BERT + BiLSTM + CRF method, the model comprising a BERT model, a BiLSTM and a CRF layer: the training set is passed through the BERT model to obtain text word vectors, the BiLSTM then learns contextual feature information for named entity recognition, and the CRF layer finally processes the BiLSTM output sequence;
step two: training a word2Vec model by using an auxiliary field training set, wherein the trained word2Vec model is called an auxiliary field word vectorization model, training the word2Vec model by using a target field training set, and the trained word2Vec model is called a target field word vectorization model;
step three: for each auxiliary-domain sample, calculate the importance of each word and sort the words in descending order of importance; the top m words are the auxiliary-domain keywords. For each target-domain sample, calculate the importance of each word and sort the words in descending order of importance; the top m words are the target-domain keywords;
step four: calculate the similarity between the auxiliary-domain keywords and the target-domain keywords obtained in step three, and set a similarity threshold;
step five: calculating the similarity between the auxiliary field sentence and the target field sentence, and setting a sentence level similarity threshold;
step six: calculate the expandability of the auxiliary-domain samples, and set an expandability threshold;
step seven: train a scenic spot entity recognition classifier on the expanded target-domain samples with the BERT + BiLSTM + CRF method of step one, the scenic spot entity recognition classifier having the same structure as the Chinese named entity recognition model.
2. The entity identification method for improving knowledge migration of claim 1, wherein the Chinese named entity recognition model in step one operates as follows:
(1-1) input the auxiliary-domain training set into the BERT model, the auxiliary-domain training set being a text set collected from the People's Daily and annotated with person names, place names and organization names; the BERT model outputs text word vectors;
(1-2) input the text word vectors from step (1-1) into the BiLSTM and extract context information;
(1-3) process the BiLSTM output sequence with the CRF layer to obtain predicted scores for the different entity-label types;
the model is optimized with a maximum-likelihood loss function, and the named-entity labels follow the BIO tagging scheme.
3. The entity identification method for improving knowledge migration of claim 1, wherein in step two:
the target-domain sample set consists of travel notes manually crawled from travel websites;
the auxiliary-domain sample set is further segmented with the jieba segmenter to obtain the auxiliary-domain segmented text, and the target-domain sample set is segmented with the jieba segmenter to obtain the target-domain segmented text;
the step further comprises loading stop words and a user-defined dictionary, the user-defined dictionary consisting of words that should not be split apart by the segmenter.
4. The entity identification method for improving knowledge migration of claim 1, wherein the specific steps for calculating the importance of auxiliary-domain words in step three are:
(3-1) calculate the word frequency in each auxiliary-domain sentence, where the frequency KF_{i,j} of the i-th word in sentence j is obtained by:

KF_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} is the number of times word i appears in sentence j;
(3-2) calculate the inverse sentence frequency for the auxiliary-domain samples, where the inverse sentence frequency ISF_i of word i is obtained by:

ISF_i = log( |S| / (1 + |{ j : t_i ∈ S_j }|) )

where |S| is the total number of sentences in the auxiliary-domain sample set, |{ j : t_i ∈ S_j }| is the number of sentences containing t_i, t_i is a word in a sentence, and S_j is the j-th sentence;
(3-3) calculate the importance of each word in each sentence of the auxiliary-domain samples, where the importance I(i, j) of word i in sentence j is:

I(i, j) = KF_{i,j} * ISF_i
5. The entity identification method for improving knowledge migration of claim 4, wherein the target-domain keywords in step three are calculated in the same way as the auxiliary-domain keywords, the only difference being that the data involved are the samples of the target-domain training set.
6. The entity identification method for improving knowledge migration of claim 1, wherein the specific steps of step four are:
(4-1) obtain the word vector L_word of each auxiliary-domain keyword with the auxiliary-domain word vectorization model trained in step two;
(4-2) obtain the word vector M_word of each target-domain keyword with the target-domain word vectorization model trained in step two;
(4-3) calculate the similarity between the auxiliary-domain and target-domain keywords by cosine similarity:

sim_word = (L_word · M_word) / (|L_word| * |M_word|)

where L_word = {l_1, l_2, ..., l_n} denotes the word vector of an auxiliary-domain keyword and M_word = {m_1, m_2, ..., m_n} denotes the word vector of a target-domain keyword.
7. The entity identification method for improving knowledge migration of claim 1, wherein the keyword-level similarity threshold in step four lies in the range (0.4, 0.6).
8. The entity identification method for improving knowledge migration of claim 1, wherein the similarity between the auxiliary-domain and target-domain sentences in step five is calculated by:
(5-1) obtain the sentence vector L_sen = {l_1, l_2, ..., l_n} of each sentence x_s in the auxiliary-domain samples with the auxiliary-domain word vectorization model trained in step two;
(5-2) obtain the sentence vector M_sen = {m_1, m_2, ..., m_n} of each sentence x_t in the target-domain samples with the target-domain word vectorization model trained in step two;
(5-3) calculate the sentence-level similarity of L_sen and M_sen by cosine similarity:

sim_sen = (L_sen · M_sen) / (|L_sen| * |M_sen|)

and the sentence-level similarity threshold in step five lies in the range (0.4, 0.6).
9. The entity identification method for improving knowledge migration of claim 1, wherein the expandability SEA of an auxiliary-domain sample in step six is calculated by:

SEA = α * sim_sen + β * sim_word

where α is a weight coefficient in the range (0, 0.5), and the expandability threshold lies in the range (0.4, 0.6).
10. The entity identification method for improving knowledge migration of claim 1, wherein the sample expansion conditions in step seven are:
(7-1) according to the keyword-similarity threshold, expand samples whose keyword similarity exceeds the threshold into the target-domain sample set;
(7-2) according to the sentence-level similarity threshold, expand samples whose sentence similarity exceeds the threshold into the target-domain samples;
(7-3) according to the expandability threshold, expand samples whose expandability exceeds the threshold into the target-domain samples.
CN201911374613.0A 2019-12-27 2019-12-27 Entity identification method for improving knowledge migration Active CN111144119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911374613.0A CN111144119B (en) 2019-12-27 2019-12-27 Entity identification method for improving knowledge migration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911374613.0A CN111144119B (en) 2019-12-27 2019-12-27 Entity identification method for improving knowledge migration

Publications (2)

Publication Number Publication Date
CN111144119A true CN111144119A (en) 2020-05-12
CN111144119B CN111144119B (en) 2024-03-29

Family

ID=70520780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911374613.0A Active CN111144119B (en) 2019-12-27 2019-12-27 Entity identification method for improving knowledge migration

Country Status (1)

Country Link
CN (1) CN111144119B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150286629A1 (en) * 2014-04-08 2015-10-08 Microsoft Corporation Named entity recognition
CN108763201A (en) * 2018-05-17 2018-11-06 南京大学 A kind of open field Chinese text name entity recognition method based on semi-supervised learning
CN109871538A (en) * 2019-02-18 2019-06-11 华南理工大学 A kind of Chinese electronic health record name entity recognition method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wu Hui; Lyu Li; Yu Bihui: "Chinese Named Entity Recognition Based on Transfer Learning and BiLSTM-CRF" *
Wang Hongbin; Shen Qiang; Xian Yantuan: "Chinese Named Entity Recognition Fusing Transfer Learning" *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666414A (en) * 2020-06-12 2020-09-15 上海观安信息技术股份有限公司 Method for detecting cloud service by sensitive data and cloud service platform
CN111666414B (en) * 2020-06-12 2023-10-17 上海观安信息技术股份有限公司 Method for detecting cloud service by sensitive data and cloud service platform
CN111695346A (en) * 2020-06-16 2020-09-22 广州商品清算中心股份有限公司 Method for improving public opinion entity recognition rate in financial risk prevention and control field
CN111695346B (en) * 2020-06-16 2024-05-07 广州商品清算中心股份有限公司 Method for improving public opinion entity recognition rate in financial risk prevention and control field
WO2022227164A1 (en) * 2021-04-29 2022-11-03 平安科技(深圳)有限公司 Artificial intelligence-based data processing method and apparatus, device, and medium
CN113191148A (en) * 2021-04-30 2021-07-30 西安理工大学 Rail transit entity identification method based on semi-supervised learning and clustering
CN113191148B (en) * 2021-04-30 2024-05-28 西安理工大学 Rail transit entity identification method based on semi-supervised learning and clustering
CN114610852A (en) * 2022-05-10 2022-06-10 天津大学 Course learning-based fine-grained Chinese syntax analysis method and device


Similar Documents

Publication Title
CN111444726B (en) Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure
CN110019839B (en) Medical knowledge graph construction method and system based on neural network and remote supervision
CN111144119B (en) Entity identification method for improving knowledge migration
CN109284400B (en) Named entity identification method based on Lattice LSTM and language model
CN111209401A (en) System and method for classifying and processing sentiment polarity of online public opinion text information
CN110633409A (en) Rule and deep learning fused automobile news event extraction method
CN104794169B (en) A kind of subject terminology extraction method and system based on sequence labelling model
CN111738007B (en) Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network
CN110688836A (en) Automatic domain dictionary construction method based on supervised learning
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN112101028A (en) Multi-feature bidirectional gating field expert entity extraction method and system
CN113761890B (en) Multi-level semantic information retrieval method based on BERT context awareness
CN113515632B (en) Text classification method based on graph path knowledge extraction
CN112364623A (en) Bi-LSTM-CRF-based three-in-one word notation Chinese lexical analysis method
CN110287298A (en) A kind of automatic question answering answer selection method based on question sentence theme
CN113869053A (en) Method and system for recognizing named entities oriented to judicial texts
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN112883722A (en) Distributed text summarization method based on cloud data center
CN111581364A (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN114707516A (en) Long text semantic similarity calculation method based on contrast learning
CN114064901B (en) Book comment text classification method based on knowledge graph word meaning disambiguation
CN112966518B (en) High-quality answer identification method for large-scale online learning platform
CN109325243A (en) Mongolian word cutting method and its word cutting system of the character level based on series model
CN117131932A (en) Semi-automatic construction method and system for domain knowledge graph ontology based on topic model
CN115600602B (en) Method, system and terminal device for extracting key elements of long text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant