CA3039280C - Method for recognizing network text named entity based on neural network probability disambiguation - Google Patents
- Publication number: CA3039280C
- Authority: CA (Canada)
- Prior art keywords: neural network, word, named entity, word vector, probability
- Legal status: Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
A method for recognizing network text named entities based on neural network probability disambiguation, comprising: carrying out word segmentation on an unlabeled corpus and using Word2Vec to extract word vectors; converting a sample corpus into a word feature matrix and windowing it; building and training a deep neural network, with a softmax function added to its output layer for normalization, so as to obtain a probability matrix of the named entity category corresponding to each word; and re-windowing the probability matrix and using a conditional random field model to carry out disambiguation, so as to obtain the final named entity annotation. The probability disambiguation step addresses the non-standard grammatical structures and frequent wrongly written characters found in network text.
Description
TITLE OF THE INVENTION
METHOD FOR RECOGNIZING NETWORK TEXT NAMED ENTITY BASED ON
NEURAL NETWORK PROBABILITY DISAMBIGUATION
TECHNICAL FIELD
The present invention relates to processing and analysis of network text, particularly to a method for recognizing network text named entities based on neural network probability disambiguation.
BACKGROUND ART
Networks have driven the speed and scale of information collection and dissemination to an unprecedented level, made global information sharing and interaction a reality, and become an indispensable infrastructure of the information society. Modern communication and dissemination techniques have greatly improved the speed and breadth of information dissemination. However, there are accompanying problems and "side effects":
sometimes people are overwhelmed by the flood of information, and it is very hard to obtain the precise information needed quickly and accurately from the vast sea of information. Extracting the named entities of interest to Internet users, such as people, places, and organizations, from a mass of network text is a prerequisite for providing important support information to various higher-level applications such as online marketing, group emotion analysis, etc. Accordingly, network text named entity recognition has become an important core technique in network data processing and analysis.
Two kinds of methods for named entity recognition have been considered in the research: rule-based methods and statistics-based methods. As machine learning theory has matured and computing performance has improved greatly, statistics-based methods are increasingly favored.
At present, the statistical models and methods applied in named entity recognition mainly include the hidden Markov model, decision tree, maximum entropy model, support vector machine, conditional random field, and artificial neural network. Artificial neural networks can achieve better results in named entity recognition than conditional random fields, maximum entropy models, and other models, but conditional random field and maximum entropy models remain the dominant practical models. For example, Patent Document No. CN201310182978.X proposes a named entity recognition method and apparatus for MicroBlog text based on a conditional random field and a named entity library, and Patent Document No. CN200710098635.X proposes a named entity recognition method that utilizes word features and models with a maximum entropy model. Artificial neural networks are difficult to use in practice because, in the field of named entity recognition, they require words to be converted into vectors in a word vector space.
Consequently, artificial neural networks cannot be applied in large-scale practical applications, because they are unable to obtain corresponding vectors for new words.
Owing to the situation described above, named entity recognition for network text mainly faces the following problems: firstly, because network text contains a lot of network words, new words, and wrongly written or mispronounced characters, it is impossible to train a word vector space that contains all words with which to train a neural network;
secondly, the accuracy of named entity recognition for network text is degraded by phenomena common in such text, such as arbitrary language forms, non-standard grammatical structures, and wrongly written or mispronounced characters.
SUMMARY OF THE INVENTION
The object of the invention is to overcome the drawbacks in the prior art. The present invention provides a network text named entity recognition method based on neural network probability disambiguation, which extracts word features incrementally without retraining the neural network and performs recognition with the aid of probability disambiguation. The method trains a neural network to obtain a prediction probability matrix over the named entity categories of each word, and performs disambiguation on the prediction matrix outputted from the neural network in a probability model, thereby improving the accuracy and precision of network text named entity recognition.
In order to attain the object described above, the technical scheme employed by the present invention is as follows.
The network text named entity recognition method is based on neural network probability disambiguation: performing word segmentation on an untagged corpus; utilizing Word2Vec to extract word vectors; converting sample corpora into a word feature matrix and windowing it; building a deep neural network for training, adding a softmax function to the output layer of the neural network, and performing normalization, to acquire a probability matrix of the named entity category corresponding to each word; and re-windowing the probability matrix and utilizing a conditional random field model for disambiguation to obtain the final named entity tag.
Specifically, the method comprises the following steps:
step 1: acquiring an untagged corpus by means of a web crawler, acquiring sample corpora with named entity tags from a corpus base, and performing word segmentation on the untagged corpus by a natural language tool;
step 2: performing word vector space training on the segmented untagged corpus and the sample corpora by a Word2Vec tool;
step 3: converting the text in the sample corpora into word vectors representing word features according to the trained Word2Vec model, windowing the word vectors, and taking the resulting two-dimensional matrix, of size equal to the window w times the length d of the word vector, as an input to a neural network; converting the tags in the sample corpora into a one-hot
form and taking them as outputs of the neural network; performing normalization on the output layer of the neural network with a softmax function, so that the categorization result produced by the neural network becomes the probability that the word belongs to an unnamed entity or a named entity; adjusting the structure, depth, number of nodes, step length, activation function, and initial value parameters of the neural network, and selecting an activation function to train the neural network;
step 4: re-windowing the prediction matrix outputted from the neural network, taking the context prediction information of the word to be tagged as a point of correlation with the actual category of the word to be tagged in a conditional random field model, utilizing an EM algorithm to calculate the expected values of all edges according to the training corpora, and training a corresponding conditional random field model;
step 5: in the recognition process, first converting the text to be recognized into word vectors that represent word features according to the trained Word2Vec model and, if the Word2Vec model does not contain a corresponding training word, converting the word into a word vector by means of incremental learning, word vector acquisition, and word vector space backtracking; windowing the word vectors, and taking the resulting two-dimensional matrix of size equal to the window w times the length d of the word vector as an input to the neural network;
then re-windowing the prediction matrix obtained from the neural network, performing disambiguation on the prediction matrix in the trained conditional random field model, and obtaining the final named entity tag of the text to be recognized.
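The patent trains its conditional random field with an EM procedure; as an illustrative sketch only, the decoding half of such a disambiguation step — choosing the most probable tag sequence from a windowed probability matrix given a matrix of transition scores (both matrices here are assumed toy values, not the patented model) — can be written with the Viterbi algorithm:

```python
import numpy as np

def viterbi_decode(probs, trans):
    """Most likely tag sequence given per-word category probabilities
    (T x K, e.g. softmax output of the neural network) and a K x K
    matrix of transition scores between adjacent tags."""
    T, K = probs.shape
    # work in log space to avoid underflow on long sentences
    log_p = np.log(probs + 1e-12)
    log_t = np.log(trans + 1e-12)
    score = log_p[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best score ending in tag j coming from tag i
        cand = score[:, None] + log_t + log_p[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # trace back the best path from the best final tag
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# toy example: 3 words, 2 categories (non-entity / entity)
probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])  # assumed transition scores
print(viterbi_decode(probs, trans))  # → [0, 1, 1]
```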
Preferably, the parameters of the Word2Vec tool are as follows: length of word vector: 200; number of iterations: 25; initial step length: 0.025; minimum step length: 0.0001; and the CBOW model is selected.
Preferably, the parameters of the neural network are as follows: number of hidden layers: 2, number of hidden nodes: 150, step length: 0.01, batchSize: 40, activation function: sigmoid function.
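A minimal sketch of a network with this shape (two hidden layers of 150 sigmoid units and a softmax output layer) in plain NumPy; the input size of 1000 assumes a window of 5 words times 200-dimensional word vectors, the 7 output categories assume the six entity tags plus "/o", and the weights are random rather than trained:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_in, hidden, n_tags = 5 * 200, 150, 7   # window * vector length; 6 entity tags + "/o"
W1 = rng.normal(0, 0.1, (d_in, hidden));   b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.1, (hidden, hidden)); b2 = np.zeros(hidden)
W3 = rng.normal(0, 0.1, (hidden, n_tags)); b3 = np.zeros(n_tags)

def forward(x):
    h1 = sigmoid(x @ W1 + b1)
    h2 = sigmoid(h1 @ W2 + b2)
    return softmax(h2 @ W3 + b3)   # per-word category probabilities

batch = rng.normal(size=(40, d_in))   # batchSize 40, as stated above
probs = forward(batch)
print(probs.shape)                    # each row sums to 1
```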
Preferably, the tags in the sample corpora are converted into a one-hot form with the following method: converting the tags "/o", "/n", and "/p" in the sample corpora into the named entity tags "/Org-B", "/Org-I", "/Per-B", "/Per-I", "/Loc-B", and "/Loc-I" correspondingly, and then converting the named entity tags into the one-hot form.
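The one-hot conversion amounts to a fixed lookup over the tag inventory; a sketch, with the index order being an assumption:

```python
# fixed tag inventory: "/o" for non-entities plus the six entity tags above
TAGS = ["/o", "/Org-B", "/Org-I", "/Per-B", "/Per-I", "/Loc-B", "/Loc-I"]

def one_hot(tag):
    """Return the one-hot vector for a named entity tag."""
    vec = [0] * len(TAGS)
    vec[TAGS.index(tag)] = 1
    return vec

print(one_hot("/Per-B"))  # → [0, 0, 0, 1, 0, 0, 0]
```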
Preferably, the window size for windowing the word vector is 5.
Preferably, in neural network training, one-tenth of the words are extracted from the sample data and excluded from the neural network training, and are instead used as evaluation data for the neural network.
Compared with the prior art, the present invention attains the following beneficial effects:
word vectors may be extracted incrementally without retraining the neural network, prediction may be carried out with the neural network, and disambiguation may be performed with a
probability model, so that the method achieves better practicability, accuracy, and precision in named entity recognition of network text. For the task of named entity recognition in network text, the present invention provides an incremental word vector learning method that does not change the structure of the neural network, addressing the prevalence of network words and new words, and employs a probability disambiguation method to deal with the problems that network texts are non-standard in grammatical structure and contain a lot of wrongly written or mispronounced characters. Thus, the method provided in the present invention attains high accuracy in network text named entity recognition tasks.
BRIEF DESCRIPTION OF DRAWINGS
Fig. 1 is a flow chart of training a network text named entity recognition device based on neural network probability disambiguation according to the present invention;
Fig. 2 is a flow chart of converting a word into word features according to the present invention;
Fig. 3 is a schematic diagram of the text processing and neural network architecture according to the present invention.
DETAILED DESCRIPTION OF THE EMBODIMENTS
Hereunder, the present invention is described in further detail through embodiments, with reference to the accompanying drawings. It should be appreciated that these embodiments are provided only for describing the present invention and shall not be deemed as limiting its scope. After reading the present invention, modifications in various equivalent forms made by those skilled in the art shall be deemed to fall within the scope of protection defined by the claims attached to this application.
A network text named entity recognition method based on neural network probability disambiguation: performing word segmentation on an untagged corpus; utilizing Word2Vec to extract word vectors; converting sample corpora into a word feature matrix and windowing it; building a deep neural network for training, adding a softmax function to the output layer of the neural network, and performing normalization, to acquire a probability matrix of the named entity category corresponding to each word; and re-windowing the probability matrix and utilizing a conditional random field model for disambiguation to obtain the final named entity tag.
Specifically, the method comprises the following steps:
step 1: Acquiring untagged network text by means of a web crawler, downloading corpora with named entity tags as sample corpora from a corpus base, and performing word segmentation on the untagged corpus with a natural language tool;
step 2: Performing word vector space training on the segmented untagged corpus and the sample corpora with a Word2Vec tool;
step 3: Converting the text in the sample corpora to a word vector that represents word features
according to the trained Word2Vec model, and taking the word vectors as inputs to a neural network; converting the tags in the sample corpora into a one-hot form and taking them as outputs of the neural network. In view that a named entity may be divided into several words in a text processing task, the tagging is performed in an IOB pattern, in order to ensure that each recognized named entity is recognized as a complete unit.
Which named entity category a word belongs to should not be judged merely on the basis of the word itself; it should also be judged according to the context information of the word.
Therefore, the concept of a "window" is introduced in the building of the neural network: in the judgment of a word, both the word and the feature information of a fixed-length context around it are taken as inputs to the neural network. Thus, the input to the neural network is no longer a word feature vector of length d, but a two-dimensional matrix of size equal to the window w times the length d of the word feature vector.
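This windowing step can be sketched as follows, assuming zero padding at sentence boundaries (the padding choice is an assumption not stated in the text):

```python
import numpy as np

def window_features(vectors, w=5):
    """vectors: (n_words, d) matrix of word vectors for one sentence.
    Returns an (n_words, w, d) array: for each word, the w x d input
    matrix formed by the word and its (w - 1) // 2 neighbors on each
    side, with zero vectors padding the sentence boundaries."""
    n, d = vectors.shape
    half = w // 2
    padded = np.vstack([np.zeros((half, d)), vectors, np.zeros((half, d))])
    return np.stack([padded[i:i + w] for i in range(n)])

sent = np.random.rand(8, 200)   # 8 words, 200-dimensional vectors
X = window_features(sent, w=5)
print(X.shape)                  # (8, 5, 200); each flattens to 1000 inputs
```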
The output layer of the neural network is normalized with a softmax function, so that the categorization result produced by the neural network becomes the probability that the word belongs to an unnamed entity or a named entity. The structure, depth, number of nodes, step length, activation function, and initial value parameters of the neural network are adjusted, and an activation function is selected to train the neural network.
step 4: Re-windowing the prediction matrix outputted from the neural network, taking the context prediction information of the word to be tagged as a point of correlation with the actual category of the word to be tagged in a conditional random field model, utilizing an EM algorithm to calculate the expected values of all edges according to the training corpora, and training a corresponding conditional random field model;
step 5: In the recognition process, first converting the text to be recognized into word vectors that represent word features according to the trained Word2Vec model and, if the Word2Vec model does not contain a corresponding training word, converting the word into a word vector by means of incremental learning, word vector acquisition, and word vector space backtracking:
(1) matching the word to be converted in a trained word vector space;
(2) converting the word to be converted directly to a corresponding word vector, if the word is matched in the word vector space;
(3) if the Word2Vec model does not contain a corresponding word, backing up the word vector space (to prevent degradation of the accuracy of the neural network model caused by deviation of the word space created during incremental learning), loading the Word2Vec model, acquiring a sentence in which the mismatched word occurs, inputting the sentence into the Word2Vec model and performing incremental training, acquiring the word vector of the word, and utilizing the backup word vector space to perform backtracking of the model;
windowing the word vector, and taking a two-dimensional matrix composed by multiplying the window w by the length d of word vector as an input to the neural network;
then, re-windowing a prediction matrix obtained from the neural network, performing disambiguation on the prediction matrix in the trained conditional random field model, and obtaining a final named entity tag of the text to be recognized.
Example
Network text is acquired by means of a web crawler from the Sogou News website (http://news.sogou.com/), corpora with named entity tags are downloaded from the Datatang corpus base (http://www.datatang.com/) as sample corpora, and word segmentation is performed on the acquired network text with a natural language tool. Word vector space training is performed on the segmented corpus and sample corpora with the gensim package in Python using the Word2Vec model, with the following parameters: length of word vector: 200; number of iterations: 25; initial step length: 0.025; minimum step length: 0.0001; and the CBOW model is selected.
The text in the sample corpora is converted into word vectors that represent word features according to the trained Word2Vec model and, if the Word2Vec model does not contain a corresponding training word, the word is converted into a word vector by means of incremental learning, word vector acquisition, and word vector space backtracking, as the features of the word. The tags "/o", "/n", and "/p" in the sample corpora acquired from Datatang are converted into the named entity tags "/Org-B", "/Org-I", "/Per-B", "/Per-I", "/Loc-B", and "/Loc-I" correspondingly, and then the named entity tags are converted into the one-hot form as outputs of the neural network.
The window size is set to 5, i.e., in considering the named entity category of the current word, the word features of the word itself and of the two words before and two words after it are used as inputs to the neural network, so the input to the neural network is a batchSize*1000 matrix (window 5 times word vector length 200). One-tenth of the words are extracted from the sample data and excluded from the neural network training, and are instead used as evaluation data for the neural network. The output layer of the neural network is normalized with a softmax function, so that the categorization result produced by the neural network becomes the probability that the word belongs to an unnamed entity or a named entity; the category with the maximum probability is temporarily taken as the final categorization result. The parameters of the neural network, such as structure, depth, number of nodes, step length, activation function, and initial value, are adjusted so that the neural network attains high accuracy. The final parameters are as follows: number of hidden layers: 2; number of hidden nodes: 150; step length: 0.01; batchSize: 40; activation function: sigmoid. With these parameters a good categorization effect is attained: the accuracy may be as high as 99.83%, and the F values of the most representative personal names, place names, and organization names may be 93.4%, 84.2%, and 80.4% respectively.
The step of taking the maximum probability value of the prediction matrix outputted from the neural network as the final categorization result is then removed; instead, the probability matrix is re-windowed directly, the context prediction information of the word to be tagged is used as a point of correlation with the actual category of the word to be tagged in a conditional random field model, an EM algorithm is used to calculate the expected values of all edges of the conditional random field according to the training corpora, and a corresponding conditional random field model is trained. After disambiguation with the conditional random field, the F values of personal names, place names, and organization names improve to 94.8%, 85.0%, and 82.0%
respectively.
It is seen from the embodiment described above: compared with the conventional supervised named entity recognition method, the text named entity recognition method based on neural network probability disambiguation provided in the present invention employs a word vector conversion method that can be used to extract word features incrementally without causing deviation of the word vector space; thus, the neural network can be applied to network text that contains a lot of new words and wrongly written or mispronounced characters.
Moreover, in the present invention, the probability matrix outputted from the neural network is re-windowed, and context disambiguation is performed with a conditional random field model, so as to deal with the phenomenon that the network text involves a lot of wrongly written or mispronounced characters and non-standard grammatical structures successfully.
While the present invention is described above in some preferred embodiments, it should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and those improvements and modifications should be deemed as falling in the scope of protection of the present invention.
The named entity category of a word should not be judged merely on the basis of the word itself; it should be further judged according to the context information of the word.
Therefore, a concept of "window" is introduced in the building of the neural network, i.e., in the judgment of a word, both the word and the feature information of a fixed-length context around it are taken as inputs to the neural network; thus, the input to the neural network is no longer a word feature vector of length d, but a two-dimensional matrix of size w*d, where w is the window size and d is the length of the word feature vector.
An output layer of the neural network is normalized with a softmax function, so that a categorization result produced by the neural network becomes a probability of whether the word belongs to an unnamed entity or a named entity. The structure, depth, number of nodes, step length, activation function, and initial value parameters in the neural network are adjusted, and an activation function is selected to train the neural network.
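The windowing described above can be sketched in numpy. The sketch below uses w=2 context words on each side (window size 5) and a toy vector length d=4 purely for illustration; zero-padding at sentence boundaries is an assumption, since the patent does not specify how boundaries are handled:

```python
import numpy as np

def window_features(vectors, w=2):
    """For each word, stack its vector with the vectors of the w words
    before and after it (zero-padding at sentence boundaries), giving a
    (2*w+1) x d matrix per word as the neural network input."""
    n, d = vectors.shape
    padded = np.vstack([np.zeros((w, d)), vectors, np.zeros((w, d))])
    return np.stack([padded[i:i + 2 * w + 1] for i in range(n)])

# toy sentence of 6 words with 4-dimensional word vectors
vecs = np.random.rand(6, 4)
windows = window_features(vecs, w=2)
print(windows.shape)  # (6, 5, 4): six words, each a 5x4 window matrix
```

Each 5x4 window matrix (or its flattened 20-element form) is then one input to the network.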
step 4: Re-windowing a prediction matrix outputted from the neural network, taking context prediction information of the word to be tagged as a point of correlation with an actual category of the word to be tagged in a conditional random field model, utilizing an EM algorithm to calculate expected values on all edges of the conditional random field according to training corpora, and training a corresponding conditional random field model;
step 5: In the recognition process, first, converting the text to be recognized into a word vector that represents word features according to the trained Word2Vec model, and, if the Word2Vec model doesn't contain a corresponding training word, converting the word into a word vector by means of incremental learning, word vector acquisition, and word vector space backtracking, as follows:
(1) matching the word to be converted in a trained word vector space;
(2) converting the word to be converted directly to a corresponding word vector, if the word is matched in the word vector space;
(3) if the Word2Vec model doesn't contain a corresponding word, backing up the word vector space to prevent degradation of the accuracy of the neural network model incurred by deviation of the word vector space created in incremental learning, loading the Word2Vec model, acquiring a sentence where the mismatched word exists, inputting the sentence into the Word2Vec model and performing incremental training, acquiring the word vector of the word, and utilizing the backup word vector space to perform backtracking of the model;
windowing the word vector, and taking a two-dimensional matrix composed by multiplying the window w by the length d of word vector as an input to the neural network;
then, re-windowing a prediction matrix obtained from the neural network, performing disambiguation on the prediction matrix in the trained conditional random field model, and obtaining a final named entity tag of the text to be recognized.
Example
Network text is acquired by means of a web crawler from Sogou News website (http://news.sogou.com/), and corpora with named entity tags are downloaded from the Datatang corpus base (http://www.datatang.com/) as sample corpora. Word segmentation is performed on the acquired network text with a natural language tool, and word vector space training is performed on the segmented corpus and sample corpora with the gensim package in Python, using the Word2Vec model with the following parameters: length of word vector: 200, number of iterations: 25, initial step length: 0.025, and minimum step length: 0.0001; a CBOW model is selected.
The text in the sample corpora is converted into a word vector that represents word features according to the trained Word2Vec model, and, if the Word2Vec model doesn't contain a corresponding training word, the word is converted into a word vector by means of incremental learning, word vector acquisition, and word vector space backtracking, as the features of the word. The tags "/o", "/n", and "/p" in the sample corpora acquired from Datatang are converted into named entity tags "/Org-B", "/Org-I", "/Per-B", "/Per-I", "/Loc-B", and "/Loc-I", etc.
correspondingly, and then the named entity tags are converted into the one-hot form as outputs of the neural network.
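The tag-to-one-hot conversion can be sketched as follows; the exact tag inventory (with "/o" retained as the non-entity tag) is an assumption for illustration:

```python
import numpy as np

# assumed tag inventory: non-entity plus B/I tags for the three entity types
TAGS = ["/o", "/Org-B", "/Org-I", "/Per-B", "/Per-I", "/Loc-B", "/Loc-I"]
INDEX = {t: i for i, t in enumerate(TAGS)}

def one_hot(tag):
    """Convert a named entity tag into its one-hot vector."""
    v = np.zeros(len(TAGS))
    v[INDEX[tag]] = 1.0
    return v

print(one_hot("/Per-B"))  # [0. 0. 0. 1. 0. 0. 0.]
```

These one-hot vectors are the training targets of the neural network's softmax output layer.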
The window size is set to 5, i.e., in the consideration of the named entity category of the current word, the word features of the word, the two words before it, and the two words after it are used as inputs to the neural network, so the input to the neural network is a batchSize*1000 vector. One-tenth of the words are extracted from the sample data and excluded from the neural network training, but are used as evaluation criteria for the neural network. The output layer of the neural network is normalized with a softmax function, so that the categorization result produced by the neural network becomes a probability of whether the word belongs to an unnamed entity or a named entity; the maximum probability value is temporarily taken as the final categorization result. The parameters in the neural network, such as structure, depth, number of nodes, step length, activation function, and initial value, are adjusted to ensure the neural network attains high accuracy; the final parameters are as follows:
number of hidden layers: 2, number of hidden nodes: 150, step length: 0.01, batchSize: 40, activation function:
sigmoid; thus, a good categorization effect can be attained, the accuracy may be as high as 99.83%, and the F values of the most representative personal names, place names, and organization names may be 93.4%, 84.2%, and 80.4% respectively.
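A forward pass of the network just described (two hidden layers of 150 sigmoid nodes, a softmax output, window size 5 with 200-dimensional word vectors giving 1000 inputs, batch size 40) might be sketched as below; the weights are randomly initialized for illustration, not trained:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# layer sizes: 5 * 200 = 1000 inputs, two hidden layers of 150, 7 tag outputs
sizes = [1000, 150, 150, 7]
weights = [rng.normal(0, 0.1, (a, b)) for a, b in zip(sizes, sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def forward(batch):
    h = batch
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)              # hidden layers: sigmoid activation
    return softmax(h @ weights[-1] + biases[-1])  # per-word tag probabilities

probs = forward(rng.random((40, 1000)))     # one batch of batchSize 40
print(probs.shape)  # (40, 7); each row is a probability distribution
```

The softmax rows are exactly the probability matrix that the next step re-windows for disambiguation.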
The step of taking the maximum probability value of the prediction matrix outputted from the neural network as the final categorization result is removed, and the probability matrix is re-windowed directly. The context prediction information of the word to be tagged is used as a point of correlation with the actual category of the word to be tagged in a conditional random field model, an EM algorithm is used to calculate expected values on all edges of the conditional random field according to the training corpora, and a corresponding conditional random field model is trained. After disambiguation with the conditional random field, the F values of personal names, place names, and organization names can be improved to 94.8%, 85.0%, and 82.0%
respectively.
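The patent trains the conditional random field with EM over the re-windowed probability matrix; the effect of its decoding step can be illustrated with a minimal Viterbi pass over an assumed transition matrix that forbids an "I" tag from following "O". The tag set, probabilities, and transition scores below are illustrative, not the trained model:

```python
import numpy as np

TAGS = ["O", "Per-B", "Per-I"]
# assumed transition scores: "O" -> "Per-I" is forbidden, everything else free
trans = np.array([
    [0.0, 0.0, -1e9],   # from O
    [0.0, 0.0, 0.0],    # from Per-B
    [0.0, 0.0, 0.0],    # from Per-I
])

def viterbi(prob_matrix):
    """Decode the most likely valid tag path from per-word probabilities."""
    logp = np.log(prob_matrix)
    n, k = logp.shape
    score = logp[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + trans + logp[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [TAGS[i] for i in reversed(path)]

# per-word maxima would give O, Per-I, O -- an invalid sequence
probs = np.array([[0.6, 0.3, 0.1],
                  [0.4, 0.1, 0.5],
                  [0.5, 0.2, 0.3]])
print(viterbi(probs))  # ['O', 'O', 'O']: the invalid 'Per-I' is overridden
```

This is the sense in which context disambiguation improves on taking the per-word maximum probability.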
As can be seen from the embodiment described above, compared with the conventional supervised named entity recognition method, the text named entity recognition method based on neural network probability disambiguation provided in the present invention employs a word vector conversion method that can be used to extract word features incrementally without causing deviation of the word vector space; thus, the neural network can be applied to network text that contains a lot of new words and wrongly written or mispronounced characters.
Moreover, in the present invention, the probability matrix outputted from the neural network is re-windowed, and context disambiguation is performed with a conditional random field model, so as to successfully deal with the fact that network text involves a lot of wrongly written or mispronounced characters and non-standard grammatical structures.
While the present invention is described above in some preferred embodiments, it should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and those improvements and modifications should be deemed as falling in the scope of protection of the present invention.
Claims (7)
1. A method for recognizing network text named entity based on neural network probability disambiguation, comprising: performing word segmentation on an untagged corpus, utilizing Word2Vec to extract a word vector, converting sample corpora into a word feature matrix, windowing, building a deep neural network for training, adding a softmax function into an output layer of the neural network, and performing normalization, to acquire a probability matrix of named entity category corresponding to each word;
re-windowing the probability matrix, and utilizing a conditional random field model for disambiguation to obtain a final named entity tag.
2. The method for recognizing network text named entity based on neural network probability disambiguation according to claim 1, comprising the following steps:
step 1: acquiring the untagged corpus by means of a web crawler, acquiring sample corpora with named entity tags from a corpus base, and performing word segmentation on the untagged corpus with a natural language tool;
step 2: performing word vector space training on the segmented untagged corpus and the sample corpora by the Word2Vec tool;
step 3: converting the text in the sample corpora into the word vector representing word features according to the trained Word2Vec model, windowing the word vector, and taking a two-dimensional matrix composed by multiplying the window w by the length d of the word vector as an input to the neural network; converting the tags in the sample corpora into a one-hot form and taking them as outputs of the neural network; performing normalization on an output layer of the neural network with the softmax function, so that a categorization result produced by the neural network becomes a probability of whether the word belongs to an unnamed entity or a named entity, adjusting the structure, depth, number of nodes, step length, activation function, and initial value parameters in the neural network, and selecting an activation function to train the neural network;
step 4: re-windowing a prediction matrix outputted from the neural network, taking context prediction information of the word to be tagged as a point of correlation with an actual category of the word to be tagged in the conditional random field model, utilizing an expectation-maximization (EM) algorithm to calculate expected values on all edges of the conditional random field according to training corpora, and training a corresponding conditional random field model;
step 5: in the recognition process, first, converting the text to be recognized into the word vector that represents word features according to the trained Word2Vec model, and, if the Word2Vec model doesn't contain a corresponding word, converting the word into the word vector by means of incremental learning, word vector acquisition, and word vector space backtracking, windowing the word vector, and taking the two-dimensional matrix composed by multiplying the window w by the length d of the word vector as an input to the neural network; then, re-windowing the prediction matrix obtained from the neural network, performing disambiguation on the prediction matrix in the trained conditional random field model, and obtaining the final named entity tag of the text to be recognized.
3. The method for recognizing network text named entity based on neural network probability disambiguation according to claim 1, wherein, the parameters of the Word2Vec tool are as follows: length of word vector: 200, number of iterations: 25, initial step length: 0.025, minimum step length: 0.0001, and a continuous bag-of-words (CBOW) model is selected.
4. The method for recognizing network text named entity based on neural network probability disambiguation according to claim 1, wherein, the parameters of the neural network are as follows: number of hidden layers: 2, number of hidden nodes:
150, step length: 0.01, batch size: 40, activation function: sigmoid function.
5. The method for recognizing network text named entity based on neural network probability disambiguation according to claim 1, wherein, the tags in the sample corpora are converted into a one-hot form with the following method: converting the tags "/o", "/n", and "/p" in the sample corpora into named entity tags "/Org-B", "/Org-I", "/Per-B", "/Per-I", "/Loc-B", and "/Loc-I" correspondingly, and then converting the named entity tags into the one-hot form.
6. The method for recognizing network text named entity based on neural network probability disambiguation according to claim 1, wherein, the window size for windowing the word vector is 5.
7. The method for recognizing network text named entity based on neural network probability disambiguation according to claim 1, wherein, in neural network training, one-tenth of the words are extracted from the sample data and excluded from the neural network training, but are used as evaluation criteria for the neural network.
Date Recue/Date Received 2020-09-10
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710390409.2 | 2017-05-27 | ||
CN201710390409.2A CN107203511B (en) | 2017-05-27 | 2017-05-27 | Network text named entity identification method based on neural network probability disambiguation |
PCT/CN2017/089135 WO2018218705A1 (en) | 2017-05-27 | 2017-06-20 | Method for recognizing network text named entity based on neural network probability disambiguation |
Publications (2)
Publication Number | Publication Date |
---|---|
CA3039280A1 CA3039280A1 (en) | 2018-12-06 |
CA3039280C true CA3039280C (en) | 2021-07-20 |
Family
ID=59905476
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA3039280A Active CA3039280C (en) | 2017-05-27 | 2017-06-20 | Method for recognizing network text named entity based on neural network probability disambiguation |
Country Status (5)
Country | Link |
---|---|
CN (1) | CN107203511B (en) |
AU (1) | AU2017416649A1 (en) |
CA (1) | CA3039280C (en) |
RU (1) | RU2722571C1 (en) |
WO (1) | WO2018218705A1 (en) |
Families Citing this family (67)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107203511B (en) * | 2017-05-27 | 2020-07-17 | 中国矿业大学 | Network text named entity identification method based on neural network probability disambiguation |
CN107665252B (en) * | 2017-09-27 | 2020-08-25 | 深圳证券信息有限公司 | Method and device for creating knowledge graph |
CN107832289A (en) * | 2017-10-12 | 2018-03-23 | 北京知道未来信息技术有限公司 | A kind of name entity recognition method based on LSTM CNN |
CN107908614A (en) * | 2017-10-12 | 2018-04-13 | 北京知道未来信息技术有限公司 | A kind of name entity recognition method based on Bi LSTM |
CN107885721A (en) * | 2017-10-12 | 2018-04-06 | 北京知道未来信息技术有限公司 | A kind of name entity recognition method based on LSTM |
CN107967251A (en) * | 2017-10-12 | 2018-04-27 | 北京知道未来信息技术有限公司 | A kind of name entity recognition method based on Bi-LSTM-CNN |
CN107797989A (en) * | 2017-10-16 | 2018-03-13 | 平安科技(深圳)有限公司 | Enterprise name recognition methods, electronic equipment and computer-readable recording medium |
CN107943788B (en) * | 2017-11-17 | 2021-04-06 | 平安科技(深圳)有限公司 | Enterprise abbreviation generation method and device and storage medium |
CN110019648B (en) * | 2017-12-05 | 2021-02-02 | 深圳市腾讯计算机系统有限公司 | Method and device for training data and storage medium |
CN108052504B (en) * | 2017-12-26 | 2020-11-20 | 浙江讯飞智能科技有限公司 | Structure analysis method and system for mathematic subjective question answer result |
CN108121702B (en) * | 2017-12-26 | 2020-11-24 | 浙江讯飞智能科技有限公司 | Method and system for evaluating and reading mathematical subjective questions |
CN108280062A (en) * | 2018-01-19 | 2018-07-13 | 北京邮电大学 | Entity based on deep learning and entity-relationship recognition method and device |
CN108563626B (en) * | 2018-01-22 | 2022-01-25 | 北京颐圣智能科技有限公司 | Medical text named entity recognition method and device |
CN108388559B (en) * | 2018-02-26 | 2021-11-19 | 中译语通科技股份有限公司 | Named entity identification method and system under geographic space application and computer program |
CN108763192B (en) * | 2018-04-18 | 2022-04-19 | 达而观信息科技(上海)有限公司 | Entity relation extraction method and device for text processing |
CN108805196B (en) * | 2018-06-05 | 2022-02-18 | 西安交通大学 | Automatic incremental learning method for image recognition |
RU2699687C1 (en) * | 2018-06-18 | 2019-09-09 | Общество с ограниченной ответственностью "Аби Продакшн" | Detecting text fields using neural networks |
CN109062983A (en) * | 2018-07-02 | 2018-12-21 | 北京妙医佳信息技术有限公司 | Name entity recognition method and system for medical health knowledge mapping |
CN109241520B (en) * | 2018-07-18 | 2023-05-23 | 五邑大学 | Sentence trunk analysis method and system based on multi-layer error feedback neural network for word segmentation and named entity recognition |
CN109255119B (en) * | 2018-07-18 | 2023-04-25 | 五邑大学 | Sentence trunk analysis method and system of multi-task deep neural network based on word segmentation and named entity recognition |
CN109299458B (en) * | 2018-09-12 | 2023-03-28 | 广州多益网络股份有限公司 | Entity identification method, device, equipment and storage medium |
CN109446514B (en) * | 2018-09-18 | 2024-08-20 | 平安科技(深圳)有限公司 | News entity identification model construction method and device and computer equipment |
CN109657238B (en) * | 2018-12-10 | 2023-10-13 | 宁波深擎信息科技有限公司 | Knowledge graph-based context identification completion method, system, terminal and medium |
CN109710927B (en) * | 2018-12-12 | 2022-12-20 | 东软集团股份有限公司 | Named entity identification method and device, readable storage medium and electronic equipment |
CN109670177A (en) * | 2018-12-20 | 2019-04-23 | 翼健(上海)信息科技有限公司 | One kind realizing the semantic normalized control method of medicine and control device based on LSTM |
CN109858025B (en) * | 2019-01-07 | 2023-06-13 | 鼎富智能科技有限公司 | Word segmentation method and system for address standardized corpus |
CN109767817B (en) * | 2019-01-16 | 2023-05-30 | 南通大学 | Drug potential adverse reaction discovery method based on neural network language model |
CN111563380A (en) * | 2019-01-25 | 2020-08-21 | 浙江大学 | Named entity identification method and device |
CN109800437B (en) * | 2019-01-31 | 2023-11-14 | 北京工业大学 | Named entity recognition method based on feature fusion |
CN109992629B (en) * | 2019-02-28 | 2021-08-06 | 中国科学院计算技术研究所 | Neural network relation extraction method and system fusing entity type constraints |
CN109858041B (en) * | 2019-03-07 | 2023-02-17 | 北京百分点科技集团股份有限公司 | Named entity recognition method combining semi-supervised learning with user-defined dictionary |
CN109933801B (en) * | 2019-03-25 | 2022-03-29 | 北京理工大学 | Bidirectional LSTM named entity identification method based on predicted position attention |
CN111858838A (en) * | 2019-04-04 | 2020-10-30 | 拉扎斯网络科技(上海)有限公司 | Menu calibration method and device, electronic equipment and nonvolatile storage medium |
CN110083778A (en) * | 2019-04-08 | 2019-08-02 | 清华大学 | The figure convolutional neural networks construction method and device of study separation characterization |
CN110334110A (en) * | 2019-05-28 | 2019-10-15 | 平安科技(深圳)有限公司 | Natural language classification method, device, computer equipment and storage medium |
CN110245242B (en) * | 2019-06-20 | 2022-01-18 | 北京百度网讯科技有限公司 | Medical knowledge graph construction method and device and terminal |
CN110298043B (en) * | 2019-07-03 | 2023-04-07 | 吉林大学 | Vehicle named entity identification method and system |
CN110750992B (en) * | 2019-10-09 | 2023-07-04 | 吉林大学 | Named entity recognition method, named entity recognition device, electronic equipment and named entity recognition medium |
CN110781646B (en) * | 2019-10-15 | 2023-08-22 | 泰康保险集团股份有限公司 | Name standardization method, device, medium and electronic equipment |
CN111008271B (en) * | 2019-11-20 | 2022-06-24 | 佰聆数据股份有限公司 | Neural network-based key information extraction method and system |
CN110993081B (en) * | 2019-12-03 | 2023-08-11 | 济南大学 | Doctor online recommendation method and system |
CN111091003B (en) * | 2019-12-05 | 2023-10-10 | 电子科技大学广东电子信息工程研究院 | Parallel extraction method based on knowledge graph query |
CN111209748B (en) * | 2019-12-16 | 2023-10-24 | 合肥讯飞数码科技有限公司 | Error word recognition method, related device and readable storage medium |
CN113139382A (en) * | 2020-01-20 | 2021-07-20 | 北京国双科技有限公司 | Named entity identification method and device |
CN111368545B (en) * | 2020-02-28 | 2024-04-30 | 北京明略软件系统有限公司 | Named entity recognition method and device based on multitask learning |
CN111477320B (en) * | 2020-03-11 | 2023-05-30 | 北京大学第三医院(北京大学第三临床医学院) | Treatment effect prediction model construction system, treatment effect prediction system and terminal |
CN111523323B (en) * | 2020-04-26 | 2022-08-12 | 梁华智能科技(上海)有限公司 | Disambiguation processing method and system for Chinese word segmentation |
CN111581957B (en) * | 2020-05-06 | 2022-04-12 | 浙江大学 | Nested entity detection method based on pyramid hierarchical network |
CN111476022B (en) * | 2020-05-15 | 2023-07-07 | 湖南工商大学 | Character embedding and mixed LSTM entity identification method, system and medium for entity characteristics |
CN111859937B (en) * | 2020-07-20 | 2024-07-30 | 上海汽车集团股份有限公司 | Entity identification method and device |
CN112199953B (en) * | 2020-08-24 | 2024-06-28 | 广州九四智能科技有限公司 | Method and device for extracting information in telephone call and computer equipment |
RU2760637C1 (en) * | 2020-08-31 | 2021-11-29 | Публичное Акционерное Общество "Сбербанк России" (Пао Сбербанк) | Method and system for retrieving named entities |
CN112101041B (en) * | 2020-09-08 | 2022-02-15 | 平安科技(深圳)有限公司 | Entity relationship extraction method, device, equipment and medium based on semantic similarity |
CN112765983A (en) * | 2020-12-14 | 2021-05-07 | 四川长虹电器股份有限公司 | Entity disambiguation method based on neural network combined with knowledge description |
CN112487816B (en) * | 2020-12-14 | 2024-02-13 | 安徽大学 | Named entity identification method based on network classification |
CN112905742B (en) * | 2021-02-20 | 2022-07-29 | 厦门吉比特网络技术股份有限公司 | Method and device for recognizing new vocabulary based on semantic model neural network |
CN113343690B (en) * | 2021-06-22 | 2024-03-12 | 北京语言大学 | Text readability automatic evaluation method and device |
CN114218924A (en) * | 2021-07-27 | 2022-03-22 | 广东电力信息科技有限公司 | Text intention and entity combined identification method based on BERT model |
CN114519355A (en) * | 2021-08-25 | 2022-05-20 | 浙江万里学院 | Medicine named entity recognition and entity standardization method |
CN113849597B (en) * | 2021-08-31 | 2024-04-30 | 艾迪恩(山东)科技有限公司 | Illegal advertisement word detection method based on named entity recognition |
CN113934815A (en) * | 2021-09-18 | 2022-01-14 | 有米科技股份有限公司 | Advertisement and pattern characteristic information identification method and device based on neural network |
CN114036948B (en) * | 2021-10-26 | 2024-05-31 | 天津大学 | Named entity identification method based on uncertainty quantification |
CN114048749B (en) * | 2021-11-19 | 2024-02-02 | 北京第一因科技有限公司 | Chinese named entity recognition method suitable for multiple fields |
CN114510943B (en) * | 2022-02-18 | 2024-05-28 | 北京大学 | Incremental named entity recognition method based on pseudo sample replay |
WO2023204724A1 (en) * | 2022-04-20 | 2023-10-26 | Общество С Ограниченной Ответственностью "Дентонс Юроп" (Ооо "Дентонс Юроп") | Method for analyzing a legal document |
CN115587594B (en) * | 2022-09-20 | 2023-06-30 | 广东财经大学 | Unstructured text data extraction model training method and system for network security |
CN115905456B (en) * | 2023-01-06 | 2023-06-02 | 浪潮电子信息产业股份有限公司 | Data identification method, system, equipment and computer readable storage medium |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7502971B2 (en) * | 2005-10-12 | 2009-03-10 | Hewlett-Packard Development Company, L.P. | Determining a recurrent problem of a computer resource using signatures |
US8583416B2 (en) * | 2007-12-27 | 2013-11-12 | Fluential, Llc | Robust information extraction from utterances |
RU2399959C2 (en) * | 2008-10-29 | 2010-09-20 | Закрытое акционерное общество "Авикомп Сервисез" | Method for automatic text processing in natural language through semantic indexation, method for automatic processing collection of texts in natural language through semantic indexation and computer readable media |
US8239349B2 (en) * | 2010-10-07 | 2012-08-07 | Hewlett-Packard Development Company, L.P. | Extracting data |
CN103455581B (en) * | 2013-08-26 | 2016-05-04 | 北京理工大学 | This information filtering method of Massive short documents based on semantic extension |
CN105404632B (en) * | 2014-09-15 | 2020-07-31 | 深港产学研基地 | System and method for carrying out serialized annotation on biomedical text based on deep neural network |
CN104809176B (en) * | 2015-04-13 | 2018-08-07 | 中央民族大学 | Tibetan language entity relation extraction method |
CN105740349B (en) * | 2016-01-25 | 2019-03-08 | 重庆邮电大学 | A kind of sensibility classification method of combination Doc2vec and convolutional neural networks |
CN105868184B (en) * | 2016-05-10 | 2018-06-08 | 大连理工大学 | A kind of Chinese personal name recognition method based on Recognition with Recurrent Neural Network |
CN106202032B (en) * | 2016-06-24 | 2018-08-28 | 广州数说故事信息科技有限公司 | A kind of sentiment analysis method and its system towards microblogging short text |
CN106202044A (en) * | 2016-07-07 | 2016-12-07 | 武汉理工大学 | A kind of entity relation extraction method based on deep neural network |
CN107203511B (en) * | 2017-05-27 | 2020-07-17 | 中国矿业大学 | Network text named entity identification method based on neural network probability disambiguation |
-
2017
- 2017-05-27 CN CN201710390409.2A patent/CN107203511B/en active Active
- 2017-06-20 AU AU2017416649A patent/AU2017416649A1/en not_active Abandoned
- 2017-06-20 CA CA3039280A patent/CA3039280C/en active Active
- 2017-06-20 RU RU2019117529A patent/RU2722571C1/en active
- 2017-06-20 WO PCT/CN2017/089135 patent/WO2018218705A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2018218705A1 (en) | 2018-12-06 |
CA3039280A1 (en) | 2018-12-06 |
CN107203511B (en) | 2020-07-17 |
CN107203511A (en) | 2017-09-26 |
RU2722571C1 (en) | 2020-06-01 |
AU2017416649A1 (en) | 2019-05-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA3039280C (en) | Method for recognizing network text named entity based on neural network probability disambiguation | |
CN109493977B (en) | Text data processing method and device, electronic equipment and computer readable medium | |
CN110472003B (en) | Social network text emotion fine-grained classification method based on graph convolution network | |
Dashtipour et al. | Exploiting deep learning for Persian sentiment analysis | |
CN110796160A (en) | Text classification method, device and storage medium | |
CN111159405B (en) | Irony detection method based on background knowledge | |
CN115309915B (en) | Knowledge graph construction method, device, equipment and storage medium | |
CN108763192B (en) | Entity relation extraction method and device for text processing | |
CN112528654A (en) | Natural language processing method and device and electronic equipment | |
CN114519356A (en) | Target word detection method and device, electronic equipment and storage medium | |
CN117764084A (en) | Short text emotion analysis method based on multi-head attention mechanism and multi-model fusion | |
CN114936274A (en) | Model training method, dialogue generating device, dialogue training equipment and storage medium | |
Mercan et al. | Abstractive text summarization for resumes with cutting edge NLP transformers and LSTM | |
CN111241273A (en) | Text data classification method and device, electronic equipment and computer readable medium | |
CN110309355A (en) | Generation method, device, equipment and the storage medium of content tab | |
Rajani Shree et al. | POS tagger model for Kannada text with CRF++ and deep learning approaches | |
CN112818124A (en) | Entity relationship extraction method based on attention neural network | |
Li et al. | A recurrent neural network language model based on word embedding | |
CN115796141A (en) | Text data enhancement method and device, electronic equipment and storage medium | |
Hung | College admissions counseling using intelligent question answering system | |
CN113886530A (en) | Semantic phrase extraction method and related device | |
Garrido et al. | Information extraction on weather forecasts with semantic technologies | |
CN113704472A (en) | Hate and offensive statement identification method and system based on topic memory network | |
Meng et al. | Design of Intelligent Recognition Model for English Translation Based on Deep Machine Learning | |
Prajapati et al. | Empirical Analysis of Humor Detection Using Deep Learning and Machine Learning on Kaggle Corpus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
EEER | Examination request |
Effective date: 20190403 |