CN107203511B - Network text named entity identification method based on neural network probability disambiguation - Google Patents
Network text named entity identification method based on neural network probability disambiguation
- Publication number: CN107203511B
- Application number: CN201710390409.2A
- Authority
- CN
- China
- Prior art keywords
- neural network
- word
- named entity
- network
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a network text named entity recognition method based on neural network probability disambiguation. The method comprises: segmenting an unlabeled corpus into words, extracting word vectors with Word2Vec, converting a sample corpus into word feature matrixes and windowing them, constructing a deep neural network for training, and adding a softmax function to the output layer of the neural network for normalization, so as to obtain a probability matrix of the named entity category corresponding to each word; the probability matrix is then re-windowed and disambiguated with a conditional random field model to obtain the final named entity labels. According to the characteristics of network words and new words, the invention provides a word vector incremental learning method that does not require changing the neural network structure, and adopts a probability disambiguation method to address the non-standard grammatical structures and frequent wrongly written characters in network text. The method of the invention therefore achieves higher accuracy in the task of recognizing named entities in network text.
Description
Technical Field
The invention relates to the processing and analysis of web text, and in particular to a method for recognizing named entities in web text based on neural network probability disambiguation.
Background
The network has raised the speed and scale of information acquisition and transmission to unprecedented levels, realized global information sharing and interaction, and become an indispensable infrastructure of the information society. Modern communication technology greatly improves the speed and breadth of information propagation. However, an accompanying problem and "side effect" is that the sheer volume of unrefined information sometimes makes it difficult to obtain the most needed information quickly and accurately from such an ocean of data. Analyzing the named entities that internet users care about, such as people, places and organizations, from massive network text provides important supporting information for various upper-layer applications such as online marketing and group sentiment analysis. This makes named entity recognition for web text an important core technology in network data processing and analysis.
Methods for named entity recognition fall largely into two categories: rule-based methods and statistics-based methods. With the continuous improvement of machine learning theory and the great increase in computing performance, statistics-based methods have become more favored.
At present, the statistical models mainly applied to named entity recognition include: hidden Markov models, decision trees, maximum entropy models, support vector machines, conditional random fields, and artificial neural networks. Named entity recognition with artificial neural networks can obtain better results than conditional random field models, maximum entropy models and other models, but practical systems still mainly adopt conditional random fields and maximum entropy models. For example, patent CN201310182978.X provides a named entity recognition method and device for microblog text that uses a conditional random field combined with a named entity library, and patent CN200710098635.X provides a named entity recognition method that models word features with a maximum entropy model. The reason artificial neural networks are difficult to apply is that, in the field of named entity recognition, they must convert words into vectors in a word vector space; corresponding vectors cannot be obtained for new words, so large-scale practical application has not been achieved.
Given the above situation, named entity recognition for web text mainly faces the following problems. First, because web text contains many network words, new words and wrongly written characters, it is impossible to train a word vector space containing all the words needed to train a neural network. Second, phenomena such as arbitrary language forms, irregular grammatical structures and frequent wrongly written characters reduce the accuracy of named entity recognition on web text.
Disclosure of Invention
The purpose of the invention is as follows: in order to overcome the defects in the prior art, the invention provides a network text named entity recognition method based on neural network probability disambiguation, which can extract word features incrementally without retraining the neural network while performing probability disambiguation during recognition.
The technical scheme is as follows: in order to achieve the above purpose, the invention adopts the following technical scheme.
A network text named entity recognition method based on neural network probability disambiguation: unlabeled corpora are segmented into words, word vectors are extracted with Word2Vec, the sample corpus is converted into word feature matrixes and windowed, a deep neural network is constructed and trained, and a softmax function is added to the output layer of the neural network for normalization, yielding a probability matrix of the named entity category corresponding to each word. The probability matrix is then re-windowed and disambiguated with a conditional random field model to obtain the final named entity labels.
The method specifically comprises the following steps:
Step 1, obtaining an unlabeled corpus through a web crawler, obtaining a sample corpus annotated with named entities from a corpus repository, and segmenting the unlabeled corpus with a natural language tool.
Step 2, training a word vector space on the segmented unlabeled corpus and sample corpus with the Word2Vec tool.
Step 3, converting the text in the sample corpus into word vectors representing word features according to the trained Word2Vec model, windowing the word vectors, and taking the two-dimensional matrix of window size w multiplied by word vector length d as the input of the neural network. The labels in the sample corpus are converted into one-hot form as the output of the neural network. The output layer of the neural network is normalized with a softmax function so that the classification result is the probability that the word belongs to a non-named entity or to each named entity category; the structure, depth, number of nodes, step size, activation function and initial value parameters of the neural network are adjusted, and an activation function is selected, to train the neural network.
Step 4, re-windowing the prediction matrix output by the neural network, taking the context prediction information of the word to be labeled as the correlation points of its actual classification in the conditional random field model, calculating the expected value of each edge with the EM (expectation-maximization) algorithm on the training corpus, and training the corresponding conditional random field model.
Step 5, during recognition, first converting the text to be recognized into word vectors representing word features according to the trained Word2Vec model; if the Word2Vec model does not contain a corresponding trained word, the word is converted into a word vector by incremental learning, obtaining the word vector and then backtracking the word vector space. The word vectors are windowed, and the two-dimensional matrix of window size w multiplied by word vector length d is taken as the input of the neural network. The prediction matrix obtained from the neural network is then windowed again and put into the trained conditional random field model for disambiguation, yielding the final named entity labels in the text to be recognized.
Preferably: the parameters of the Word2Vec tool are: word vector length 200, 25 iterations, initial step size 0.025, minimum step size 0.0001, and the CBOW model.
Preferably: the parameters of the neural network are: 2 hidden layers, 150 hidden nodes per layer, step size 0.01, batch size 40, and the sigmoid function as the activation function.
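A minimal numpy sketch of the forward pass of such a network (the layer sizes follow the preferred parameters above; the 7 output classes and the random weight initialization are illustrative assumptions, not specified by the patent):

```python
import numpy as np

rng = np.random.default_rng(0)

w, d, n_classes = 5, 200, 7            # window 5, vector length 200, e.g. O + 6 entity tags
W1 = rng.normal(0, 0.1, (w * d, 150))  # hidden layer 1: 150 nodes
W2 = rng.normal(0, 0.1, (150, 150))    # hidden layer 2: 150 nodes
W3 = rng.normal(0, 0.1, (150, n_classes))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(batch):
    """batch: (batch_size, w, d) windowed word-feature matrices."""
    x = batch.reshape(batch.shape[0], -1)      # flatten each w x d window
    h1 = sigmoid(x @ W1)
    h2 = sigmoid(h1 @ W2)
    return softmax(h2 @ W3)                    # per-word class probabilities

probs = forward(rng.normal(size=(40, w, d)))   # batch size 40
print(probs.shape)                             # (40, 7), each row sums to 1
```

The softmax output layer is what makes each row a probability distribution over the non-named-entity class and the named entity categories, which the later disambiguation step consumes as a probability matrix.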
Preferably, the method for converting the tags in the sample corpus into one-hot form is: the "/o", "/n", "/p" tags in the sample corpus are correspondingly converted into the named entity tags "/Org-B", "/Org-I", "/Per-B", "/Per-I", "/Loc-B", "/Loc-I", and then converted into one-hot form.
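The tag conversion can be sketched as follows (the "O" class for non-named-entity words is an assumed seventh class; the patent names only the six entity tags):

```python
import numpy as np

# IOB tag inventory: "O" for non-entities (assumed) plus the six entity tags.
TAGS = ["O", "Org-B", "Org-I", "Per-B", "Per-I", "Loc-B", "Loc-I"]
INDEX = {t: i for i, t in enumerate(TAGS)}

def to_iob(entity_words, prefix):
    """First word of a multi-word entity gets -B, the rest get -I."""
    return [prefix + ("-B" if i == 0 else "-I") for i in range(len(entity_words))]

def one_hot(tag):
    v = np.zeros(len(TAGS))
    v[INDEX[tag]] = 1.0
    return v

labels = to_iob(["中国", "矿业", "大学"], "Org")   # ['Org-B', 'Org-I', 'Org-I']
Y = np.stack([one_hot(t) for t in labels])        # (3, 7) one-hot output matrix
print(labels, Y.shape)
```

The B/I split is what preserves the completeness of multi-word entities, as the IOB annotation in step 3 requires.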
Preferably: the window size for word vector windowing is 5.
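Windowing the word vectors with w = 5 can be sketched like this (zero-padding at sentence boundaries is an assumption — the patent does not specify boundary handling):

```python
import numpy as np

def window(vectors, w=5):
    """vectors: (n, d) word vectors of one sentence.
    Returns (n, w, d): for each word, its own vector plus the
    (w - 1) / 2 vectors on each side, zero-padded at the edges."""
    n, d = vectors.shape
    half = w // 2
    padded = np.vstack([np.zeros((half, d)), vectors, np.zeros((half, d))])
    return np.stack([padded[i:i + w] for i in range(n)])

sent = np.random.default_rng(1).normal(size=(8, 200))  # 8 words, d = 200
X = window(sent)            # each sample is a 5 x 200 two-dimensional matrix
print(X.shape)              # (8, 5, 200)
```

Each word thus yields the w × d two-dimensional matrix that steps 3 and 5 feed into the neural network.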
Preferably: when training the neural network, one tenth of the words are extracted from the sample data and withheld from training, to serve as the evaluation standard of the neural network.
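The one-tenth hold-out can be sketched as a simple random split (the random shuffling and fixed seed are assumptions; the patent only states the proportion):

```python
import random

def split_holdout(samples, fraction=0.1, seed=42):
    """Withhold `fraction` of the samples from training for evaluation."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(samples) * fraction)
    held = [samples[i] for i in idx[:cut]]
    train = [samples[i] for i in idx[cut:]]
    return train, held

train, held = split_holdout(list(range(1000)))
print(len(train), len(held))   # 900 100
```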
Compared with the prior art, the invention has the following beneficial effects:
Word features can be extracted incrementally without retraining the neural network, and the neural network predictions are disambiguated with a probability model, so the method has better practicability and accuracy in named entity recognition on web text. For the named entity recognition task on web text, the invention provides a word vector incremental learning method that does not require changing the neural network structure, designed for the characteristics of existing network words and new words, and adopts a probability disambiguation method to address the non-standard grammatical structures and frequent wrongly written characters in web text. The method of the invention therefore achieves higher accuracy in the task of recognizing named entities in web text.
Drawings
FIG. 1 is a flow diagram of training for network text named entity recognition based on neural network probability disambiguation according to the present invention.
Fig. 2 is a flow chart for converting words into word features according to the present invention.
FIG. 3 is a schematic diagram of text processing and neural network architecture in accordance with the present invention.
Detailed Description
The present invention is further illustrated by the following description in conjunction with the accompanying drawings and the specific embodiments, it is to be understood that these examples are given solely for the purpose of illustration and are not intended as a definition of the limits of the invention, since various equivalent modifications will occur to those skilled in the art upon reading the present invention and fall within the limits of the appended claims.
A network text named entity recognition method based on neural network probability disambiguation: unlabeled corpora are segmented into words, word vectors are extracted with Word2Vec, the sample corpus is converted into word feature matrixes and windowed, a deep neural network is constructed and trained, and a softmax function is added to the output layer of the neural network for normalization, yielding a probability matrix of the named entity category corresponding to each word. The probability matrix is then re-windowed and disambiguated with a conditional random field model to obtain the final named entity labels.
The method specifically comprises the following steps:
Step 1, obtaining an unlabeled corpus through a web crawler, obtaining a sample corpus annotated with named entities from a corpus repository, and segmenting the unlabeled corpus with a natural language tool.
Step 2, training a word vector space on the segmented unlabeled corpus and sample corpus with the Word2Vec tool.
Step 3, converting the text in the sample corpus into word vectors representing word features according to the trained Word2Vec model, as the input of the neural network. The labels in the sample corpus are converted into one-hot form as the output of the neural network. In a text processing task, one named entity may be split across several words, so to ensure the completeness of the recognized named entities, the labels are annotated in IOB form.
The word vectors are windowed so that, when a word is classified, the feature information of the word and of its fixed-length context is used as the input of the neural network; the input of the neural network is therefore not a vector of word feature length d but a two-dimensional matrix of window size w multiplied by word feature length d.
The output layer of the neural network is normalized with a softmax function so that the classification result is the probability that the word belongs to a non-named entity or to each named entity category. The structure, depth, number of nodes, step size, activation function and initial value parameters of the neural network are adjusted, and an activation function is selected, to train the neural network.
Step 4, re-windowing the prediction matrix output by the neural network, taking the context prediction information of the word to be labeled as the correlation points of its actual classification in the conditional random field model, calculating the expected value of each edge with the EM (expectation-maximization) algorithm on the training corpus, and training the corresponding conditional random field model.
Step 5, during recognition, first converting the text to be recognized into word vectors representing word features according to the trained Word2Vec model; if the Word2Vec model does not contain a corresponding trained word, the word is converted into a word vector by incremental learning, obtaining the word vector and then backtracking the word vector space, as follows:
(1) Match the word to be converted in the trained word vector space.
(2) If the word to be converted is matched in the word vector space, it is directly converted into the corresponding word vector.
(3) If the Word2Vec model does not contain the corresponding word, the word vector space is first backed up, to prevent the word-space offset produced by incremental learning from reducing the precision of the neural network model; the Word2Vec model is loaded, the sentence containing the unmatched word is put into the Word2Vec model for incremental training, and the word vector of that word is obtained; the model is then backtracked using the backed-up word vector space.
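The back-up-and-backtrack idea in step (3) can be sketched with a plain vector table (the dict representation and the `incremental_train` stub are illustrative assumptions standing in for the Word2Vec model; keeping only the new word's vector after the restore is one reading of the backtracking step):

```python
import numpy as np

rng = np.random.default_rng(2)

# Trained word vector space: word -> 200-dim vector.
space = {"徐州": rng.normal(size=200), "大学": rng.normal(size=200)}

def incremental_train(space, sentence):
    """Stand-in for putting the sentence into Word2Vec for incremental
    training: existing vectors may drift, new words gain vectors."""
    for word in sentence:
        drift = rng.normal(scale=0.01, size=200)
        space[word] = space.get(word, rng.normal(size=200)) + drift

def vector_with_backtrack(space, word, sentence):
    if word in space:
        return space[word]
    backup = {w: v.copy() for w, v in space.items()}  # back up the old space
    incremental_train(space, sentence)                # learn the new word
    vec = space[word].copy()
    space.clear()
    space.update(backup)                              # backtrack: restore old vectors
    space[word] = vec                                 # keep only the new word's vector
    return vec

old = space["徐州"].copy()
v = vector_with_backtrack(space, "矿大", ["矿大", "位于", "徐州"])
print(np.allclose(space["徐州"], old))   # True: existing vectors did not shift
```

Restoring the backed-up vectors is what prevents the word-space offset from degrading a neural network trained against the original space.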
The word vectors are windowed, and the two-dimensional matrix of window size w multiplied by word vector length d is taken as the input of the neural network. The prediction matrix obtained from the neural network is then windowed again and put into the trained conditional random field model for disambiguation, yielding the final named entity labels in the text to be recognized.
Examples of the invention
Web text is crawled from the Sogou News website as the unlabeled corpus, a corpus annotated with named entities is downloaded from a corpus repository as the sample corpus, and the crawled web text is segmented with a natural language tool. A word vector space is trained on the segmented corpus and the sample corpus through the Word2Vec model of the gensim package in Python, with the following parameters: word vector length 200, 25 iterations, initial step size 0.025, minimum step size 0.0001, and the CBOW model.
The text of the sample corpus is converted into word vectors representing word features according to the trained Word2Vec model; if the Word2Vec model does not contain a corresponding trained word, the word is converted into a word vector by the method of incremental learning, word vector acquisition and word vector backtracking. The word vector serves as the feature of each word.
The window size is set to 5; that is, when the named entity category of the current word is considered, the word features of the current word and of the two words on each side are taken as the input of the neural network, so each input of the neural network is a 1000-dimensional vector (5 × 200). One tenth of the words are extracted from the sample data and withheld from training as the evaluation standard of the neural network. The output layer of the neural network is normalized with a softmax function so that the classification result is the probability that the word belongs to a non-named entity or to each named entity category; for now, the maximum probability is taken as the final classification result. The structure, depth, number of nodes, step size, activation function, initial values and other parameters of the neural network are adjusted so that the neural network obtains good accuracy. The final parameters are: 2 hidden layers, 150 hidden nodes per layer, step size 0.01, batch size 40; the sigmoid activation function produces a good classification effect. The accuracy reaches 99.83%, and the F-values of the most representative person, place and organization names reach 93.4%, 84.2% and 80.4% respectively.
The step of taking the maximum probability of the prediction matrix output by the neural network as the final classification result is removed; instead, the probability matrix is directly re-windowed, the context prediction information of the word to be labeled is taken as the correlation points of its actual classification in the conditional random field model, the expected value of each edge of the conditional random field is calculated with the EM (expectation-maximization) algorithm on the training corpus, and the corresponding conditional random field model is trained. After disambiguation with the conditional random field, the F-values for person, place and organization names rise to 94.8%, 85.0% and 82.0% respectively.
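Why disambiguation over the probability matrix beats per-word argmax can be illustrated with a Viterbi decode over the per-word class probabilities and a transition matrix (the tag set and the hand-set transition scores below are illustrative assumptions; in the patent the dependencies come from the trained conditional random field):

```python
import numpy as np

def viterbi(emission, transition):
    """emission: (n, k) per-word class probabilities from the neural network.
    transition: (k, k) score of moving from class i to class j.
    Returns the most probable label sequence (log-space dynamic program)."""
    n, k = emission.shape
    log_e = np.log(emission + 1e-12)
    log_t = np.log(transition + 1e-12)
    score = log_e[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        total = score[:, None] + log_t + log_e[t][None, :]
        back[t] = total.argmax(axis=0)   # best previous class for each current class
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

TAGS = ["O", "Per-B", "Per-I"]
# "Per-I" may not directly follow "O": near-zero transition probability.
T = np.array([[0.6, 0.4, 0.0],
              [0.3, 0.1, 0.6],
              [0.4, 0.1, 0.5]])
# Noisy network output: the second word's argmax is the illegal "Per-I".
E = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.35, 0.45],
              [0.1, 0.1, 0.8]])
print([TAGS[i] for i in viterbi(E, T)])   # → ['O', 'Per-B', 'Per-I']
```

Per-word argmax would emit the illegal sequence O, Per-I, Per-I; decoding against the transition structure corrects the second word to Per-B, which is the kind of context disambiguation that raises the F-values above.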
As can be seen from the above embodiment, compared with conventional supervised named entity recognition methods, the text named entity recognition method based on neural network probability disambiguation provided by the invention uses a word vector conversion method that can extract word features incrementally without producing word vector space offset, so the neural network can be applied to web texts with many new words and wrongly written characters. Moreover, the probability matrix output by the neural network is windowed again and context disambiguation is performed with a conditional random field model, which better handles the frequent wrongly written characters and irregular grammar in web text.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.
Claims (6)
1. A network text named entity identification method based on neural network probability disambiguation, characterized in that: the unlabeled corpus is segmented into words, word vectors are extracted with Word2Vec, the sample corpus is converted into word feature matrixes and windowed, a deep neural network is constructed and trained, a softmax function is added to the output layer of the neural network for normalization, and a probability matrix of the named entity category corresponding to each word is obtained; the probability matrix is re-windowed and disambiguated with a conditional random field model to obtain the final named entity labels, comprising the following steps:
step 1, obtaining an unlabeled corpus through a web crawler, obtaining a sample corpus annotated with named entities from a corpus repository, and segmenting the unlabeled corpus with a natural language tool;
step 2, training a word vector space on the segmented unlabeled corpus and sample corpus with the Word2Vec tool;
step 3, converting the text in the sample corpus into word vectors representing word features according to the trained Word2Vec model, windowing the word vectors, and taking the two-dimensional matrix of window size w multiplied by word vector length d as the input of the neural network; converting the labels in the sample corpus into one-hot form as the output of the neural network; normalizing the output layer of the neural network with a softmax function so that the classification result is the probability that the word belongs to a non-named entity or to each named entity category, adjusting the structure, depth, number of nodes, step size, activation function and initial value parameters of the neural network, and selecting an activation function, to train the neural network;
step 4, re-windowing the prediction matrix output by the neural network, taking the context prediction information of the word to be labeled as the correlation points of its actual classification in the conditional random field model, calculating the expected value of each edge with the EM algorithm on the training corpus, and training the corresponding conditional random field model;
step 5, during recognition, first converting the text to be recognized into word vectors representing word features according to the trained Word2Vec model; if the Word2Vec model does not contain a corresponding word, converting the word into a word vector by incremental learning, obtaining the word vector and backtracking the word vector space; windowing the word vectors and taking the two-dimensional matrix of window size w multiplied by word vector length d as the input of the neural network; then windowing the prediction matrix obtained from the neural network again and putting it into the trained conditional random field model for disambiguation, yielding the final named entity labels in the text to be recognized.
2. The network text named entity recognition method based on neural network probability disambiguation as claimed in claim 1, characterized in that the parameters of the Word2Vec tool are: word vector length 200, 25 iterations, initial step size 0.025, minimum step size 0.0001, and the CBOW model.
3. The network text named entity recognition method based on neural network probability disambiguation as claimed in claim 1, characterized in that the parameters of the neural network are: 2 hidden layers, 150 hidden nodes per layer, step size 0.01, batch size 40, and the sigmoid function as the activation function.
4. The network text named entity recognition method based on neural network probability disambiguation as claimed in claim 1, wherein the method for converting the tags in the sample corpus into one-hot form comprises converting the "/o", "/n", "/p" tags in the sample corpus into the named entity tags "/Org-B", "/Org-I", "/Per-B", "/Per-I", "/Loc-B", "/Loc-I", and then converting them into one-hot form.
5. The network text named entity recognition method based on neural network probability disambiguation as claimed in claim 1, characterized in that: the window size for word vector windowing is 5.
6. The network text named entity recognition method based on neural network probability disambiguation as claimed in claim 1, characterized in that: when training the neural network, one tenth of the words are extracted from the sample data and withheld from training, to serve as the evaluation standard of the neural network.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710390409.2A CN107203511B (en) | 2017-05-27 | 2017-05-27 | Network text named entity identification method based on neural network probability disambiguation |
RU2019117529A RU2722571C1 (en) | 2017-05-27 | 2017-06-20 | Method of recognizing named entities in network text based on elimination of probability ambiguity in neural network |
AU2017416649A AU2017416649A1 (en) | 2017-05-27 | 2017-06-20 | Method for recognizing network text named entity based on neural network probability disambiguation |
PCT/CN2017/089135 WO2018218705A1 (en) | 2017-05-27 | 2017-06-20 | Method for recognizing network text named entity based on neural network probability disambiguation |
CA3039280A CA3039280C (en) | 2017-05-27 | 2017-06-20 | Method for recognizing network text named entity based on neural network probability disambiguation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710390409.2A CN107203511B (en) | 2017-05-27 | 2017-05-27 | Network text named entity identification method based on neural network probability disambiguation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107203511A CN107203511A (en) | 2017-09-26 |
CN107203511B true CN107203511B (en) | 2020-07-17 |
Family
ID=59905476
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710390409.2A Active CN107203511B (en) | 2017-05-27 | 2017-05-27 | Network text named entity identification method based on neural network probability disambiguation |
Country Status (5)
Country | Link |
---|---|
CN (1) | CN107203511B (en) |
AU (1) | AU2017416649A1 (en) |
CA (1) | CA3039280C (en) |
RU (1) | RU2722571C1 (en) |
WO (1) | WO2018218705A1 (en) |
Families Citing this family (63)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107203511B (en) * | 2017-05-27 | 2020-07-17 | 中国矿业大学 | Network text named entity identification method based on neural network probability disambiguation |
CN107665252B (en) * | 2017-09-27 | 2020-08-25 | 深圳证券信息有限公司 | Method and device for creating knowledge graph |
CN107832289A (en) * | 2017-10-12 | 2018-03-23 | 北京知道未来信息技术有限公司 | Named entity recognition method based on LSTM-CNN |
CN107908614A (en) * | 2017-10-12 | 2018-04-13 | 北京知道未来信息技术有限公司 | Named entity recognition method based on Bi-LSTM |
CN107885721A (en) * | 2017-10-12 | 2018-04-06 | 北京知道未来信息技术有限公司 | Named entity recognition method based on LSTM |
CN107967251A (en) * | 2017-10-12 | 2018-04-27 | 北京知道未来信息技术有限公司 | Named entity recognition method based on Bi-LSTM-CNN |
CN107797989A (en) * | 2017-10-16 | 2018-03-13 | 平安科技(深圳)有限公司 | Enterprise name recognition methods, electronic equipment and computer-readable recording medium |
CN107943788B (en) * | 2017-11-17 | 2021-04-06 | 平安科技(深圳)有限公司 | Enterprise abbreviation generation method and device and storage medium |
CN110019648B (en) * | 2017-12-05 | 2021-02-02 | 深圳市腾讯计算机系统有限公司 | Method and device for training data and storage medium |
CN108121702B (en) * | 2017-12-26 | 2020-11-24 | 浙江讯飞智能科技有限公司 | Method and system for grading mathematical subjective questions |
CN108052504B (en) * | 2017-12-26 | 2020-11-20 | 浙江讯飞智能科技有限公司 | Structural analysis method and system for answers to mathematical subjective questions |
CN108280062A (en) * | 2018-01-19 | 2018-07-13 | 北京邮电大学 | Entity based on deep learning and entity-relationship recognition method and device |
CN108563626B (en) * | 2018-01-22 | 2022-01-25 | 北京颐圣智能科技有限公司 | Medical text named entity recognition method and device |
CN108388559B (en) * | 2018-02-26 | 2021-11-19 | 中译语通科技股份有限公司 | Named entity identification method and system under geographic space application and computer program |
CN108763192B (en) * | 2018-04-18 | 2022-04-19 | 达而观信息科技(上海)有限公司 | Entity relation extraction method and device for text processing |
CN108805196B (en) * | 2018-06-05 | 2022-02-18 | 西安交通大学 | Automatic incremental learning method for image recognition |
RU2699687C1 (en) * | 2018-06-18 | 2019-09-09 | Общество с ограниченной ответственностью "Аби Продакшн" | Detecting text fields using neural networks |
CN109062983A (en) * | 2018-07-02 | 2018-12-21 | 北京妙医佳信息技术有限公司 | Name entity recognition method and system for medical health knowledge mapping |
CN109241520B (en) * | 2018-07-18 | 2023-05-23 | 五邑大学 | Sentence trunk analysis method and system based on a multi-layer error-feedback neural network for word segmentation and named entity recognition |
CN109255119B (en) * | 2018-07-18 | 2023-04-25 | 五邑大学 | Sentence trunk analysis method and system using a multi-task deep neural network based on word segmentation and named entity recognition |
CN109299458B (en) * | 2018-09-12 | 2023-03-28 | 广州多益网络股份有限公司 | Entity identification method, device, equipment and storage medium |
CN109446514A (en) * | 2018-09-18 | 2019-03-08 | 平安科技(深圳)有限公司 | Construction method, device and the computer equipment of news property identification model |
CN109657238B (en) * | 2018-12-10 | 2023-10-13 | 宁波深擎信息科技有限公司 | Knowledge graph-based context identification completion method, system, terminal and medium |
CN109710927B (en) * | 2018-12-12 | 2022-12-20 | 东软集团股份有限公司 | Named entity identification method and device, readable storage medium and electronic equipment |
CN109670177A (en) * | 2018-12-20 | 2019-04-23 | 翼健(上海)信息科技有限公司 | LSTM-based control method and control device for medical semantic normalization |
CN109858025B (en) * | 2019-01-07 | 2023-06-13 | 鼎富智能科技有限公司 | Word segmentation method and system for address standardized corpus |
CN109767817B (en) * | 2019-01-16 | 2023-05-30 | 南通大学 | Drug potential adverse reaction discovery method based on neural network language model |
CN111563380A (en) * | 2019-01-25 | 2020-08-21 | 浙江大学 | Named entity identification method and device |
CN109800437B (en) * | 2019-01-31 | 2023-11-14 | 北京工业大学 | Named entity recognition method based on feature fusion |
CN109992629B (en) * | 2019-02-28 | 2021-08-06 | 中国科学院计算技术研究所 | Neural network relation extraction method and system fusing entity type constraints |
CN109858041B (en) * | 2019-03-07 | 2023-02-17 | 北京百分点科技集团股份有限公司 | Named entity recognition method combining semi-supervised learning with user-defined dictionary |
CN109933801B (en) * | 2019-03-25 | 2022-03-29 | 北京理工大学 | Bidirectional LSTM named entity identification method based on predicted position attention |
CN111858838A (en) * | 2019-04-04 | 2020-10-30 | 拉扎斯网络科技(上海)有限公司 | Menu calibration method and device, electronic equipment and nonvolatile storage medium |
CN110083778A (en) * | 2019-04-08 | 2019-08-02 | 清华大学 | The figure convolutional neural networks construction method and device of study separation characterization |
CN110245242B (en) * | 2019-06-20 | 2022-01-18 | 北京百度网讯科技有限公司 | Medical knowledge graph construction method and device and terminal |
CN110298043B (en) * | 2019-07-03 | 2023-04-07 | 吉林大学 | Vehicle named entity identification method and system |
CN110750992B (en) * | 2019-10-09 | 2023-07-04 | 吉林大学 | Named entity recognition method and apparatus, electronic device and medium |
CN110781646B (en) * | 2019-10-15 | 2023-08-22 | 泰康保险集团股份有限公司 | Name standardization method, device, medium and electronic equipment |
CN111008271B (en) * | 2019-11-20 | 2022-06-24 | 佰聆数据股份有限公司 | Neural network-based key information extraction method and system |
CN110993081B (en) * | 2019-12-03 | 2023-08-11 | 济南大学 | Doctor online recommendation method and system |
CN111091003B (en) * | 2019-12-05 | 2023-10-10 | 电子科技大学广东电子信息工程研究院 | Parallel extraction method based on knowledge graph query |
CN111209748B (en) * | 2019-12-16 | 2023-10-24 | 合肥讯飞数码科技有限公司 | Error word recognition method, related device and readable storage medium |
CN113139382A (en) * | 2020-01-20 | 2021-07-20 | 北京国双科技有限公司 | Named entity identification method and device |
CN111368545B (en) * | 2020-02-28 | 2024-04-30 | 北京明略软件系统有限公司 | Named entity recognition method and device based on multitask learning |
CN111477320B (en) * | 2020-03-11 | 2023-05-30 | 北京大学第三医院(北京大学第三临床医学院) | Treatment effect prediction model construction system, treatment effect prediction system and terminal |
CN111523323B (en) * | 2020-04-26 | 2022-08-12 | 梁华智能科技(上海)有限公司 | Disambiguation processing method and system for Chinese word segmentation |
CN111581957B (en) * | 2020-05-06 | 2022-04-12 | 浙江大学 | Nested entity detection method based on pyramid hierarchical network |
CN111476022B (en) * | 2020-05-15 | 2023-07-07 | 湖南工商大学 | Character embedding and mixed LSTM entity identification method, system and medium for entity characteristics |
CN111859937A (en) * | 2020-07-20 | 2020-10-30 | 上海汽车集团股份有限公司 | Entity identification method and device |
RU2760637C1 (en) * | 2020-08-31 | 2021-11-29 | Sberbank of Russia PJSC (PAO Sberbank) | Method and system for retrieving named entities |
CN112101041B (en) * | 2020-09-08 | 2022-02-15 | 平安科技(深圳)有限公司 | Entity relationship extraction method, device, equipment and medium based on semantic similarity |
CN112765983A (en) * | 2020-12-14 | 2021-05-07 | 四川长虹电器股份有限公司 | Entity disambiguation method based on neural network combined with knowledge description |
CN112487816B (en) * | 2020-12-14 | 2024-02-13 | 安徽大学 | Named entity identification method based on network classification |
CN112905742B (en) * | 2021-02-20 | 2022-07-29 | 厦门吉比特网络技术股份有限公司 | Method and device for recognizing new vocabulary based on semantic model neural network |
CN113343690B (en) * | 2021-06-22 | 2024-03-12 | 北京语言大学 | Text readability automatic evaluation method and device |
CN114218924A (en) * | 2021-07-27 | 2022-03-22 | 广东电力信息科技有限公司 | Text intention and entity combined identification method based on BERT model |
CN113849597B (en) * | 2021-08-31 | 2024-04-30 | 艾迪恩(山东)科技有限公司 | Illegal advertisement word detection method based on named entity recognition |
CN114036948B (en) * | 2021-10-26 | 2024-05-31 | 天津大学 | Named entity identification method based on uncertainty quantification |
CN114048749B (en) * | 2021-11-19 | 2024-02-02 | 北京第一因科技有限公司 | Chinese named entity recognition method suitable for multiple fields |
CN114510943B (en) * | 2022-02-18 | 2024-05-28 | 北京大学 | Incremental named entity recognition method based on pseudo sample replay |
WO2023204724A1 (en) * | 2022-04-20 | 2023-10-26 | Dentons Europe LLC (OOO "Dentons Europe") | Method for analyzing a legal document |
CN115587594B (en) * | 2022-09-20 | 2023-06-30 | 广东财经大学 | Unstructured text data extraction model training method and system for network security |
CN115905456B (en) * | 2023-01-06 | 2023-06-02 | 浪潮电子信息产业股份有限公司 | Data identification method, system, equipment and computer readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103455581A (en) * | 2013-08-26 | 2013-12-18 | 北京理工大学 | Mass short message information filtering method based on semantic extension |
CN105740349A (en) * | 2016-01-25 | 2016-07-06 | 重庆邮电大学 | Sentiment classification method combining Doc2vec with a convolutional neural network |
CN105868184A (en) * | 2016-05-10 | 2016-08-17 | 大连理工大学 | Chinese name recognition method based on recurrent neural network |
CN106202032A (en) * | 2016-06-24 | 2016-12-07 | 广州数说故事信息科技有限公司 | Sentiment analysis method and system for microblog short texts |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7502971B2 (en) * | 2005-10-12 | 2009-03-10 | Hewlett-Packard Development Company, L.P. | Determining a recurrent problem of a computer resource using signatures |
US8583416B2 (en) * | 2007-12-27 | 2013-11-12 | Fluential, Llc | Robust information extraction from utterances |
RU2399959C2 (en) * | 2008-10-29 | 2010-09-20 | Avicomp Services CJSC | Method for automatic natural-language text processing via semantic indexing, method for automatic processing of natural-language text collections via semantic indexing, and computer-readable media |
US8239349B2 (en) * | 2010-10-07 | 2012-08-07 | Hewlett-Packard Development Company, L.P. | Extracting data |
CN105404632B (en) * | 2014-09-15 | 2020-07-31 | 深港产学研基地 | System and method for carrying out serialized annotation on biomedical text based on deep neural network |
CN104809176B (en) * | 2015-04-13 | 2018-08-07 | 中央民族大学 | Tibetan language entity relation extraction method |
CN106202044A (en) * | 2016-07-07 | 2016-12-07 | 武汉理工大学 | Entity relation extraction method based on a deep neural network |
CN107203511B (en) * | 2017-05-27 | 2020-07-17 | 中国矿业大学 | Network text named entity identification method based on neural network probability disambiguation |
2017
- 2017-05-27 CN CN201710390409.2A patent/CN107203511B/en active Active
- 2017-06-20 CA CA3039280A patent/CA3039280C/en active Active
- 2017-06-20 AU AU2017416649A patent/AU2017416649A1/en not_active Abandoned
- 2017-06-20 RU RU2019117529A patent/RU2722571C1/en active
- 2017-06-20 WO PCT/CN2017/089135 patent/WO2018218705A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2018218705A1 (en) | 2018-12-06 |
CA3039280C (en) | 2021-07-20 |
CN107203511A (en) | 2017-09-26 |
CA3039280A1 (en) | 2018-12-06 |
AU2017416649A1 (en) | 2019-05-02 |
RU2722571C1 (en) | 2020-06-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107203511B (en) | Network text named entity identification method based on neural network probability disambiguation | |
CN109597997B (en) | Comment entity and aspect-level emotion classification method and device and model training thereof | |
CN108920460B (en) | Training method of multi-task deep learning model for multi-type entity recognition | |
CN110059188B (en) | Chinese emotion analysis method based on bidirectional time convolution network | |
CN106980683B (en) | Blog text abstract generating method based on deep learning | |
CN107085581B (en) | Short text classification method and device | |
CN113239186B (en) | Graph convolution network relation extraction method based on multi-dependency relation representation mechanism | |
CN109753660B (en) | LSTM-based winning bid web page named entity extraction method | |
CN110765775A (en) | Self-adaptive method for named entity recognition field fusing semantics and label differences | |
CN113255320A (en) | Entity relation extraction method and device based on syntax tree and graph attention machine mechanism | |
CN110046356B (en) | Label-embedded microblog text emotion multi-label classification method | |
CN106682089A (en) | RNNs-based method for automatic safety checking of short message | |
CN112069312B (en) | Text classification method based on entity recognition and electronic device | |
CN110968725B (en) | Image content description information generation method, electronic device and storage medium | |
CN112561718A (en) | Case microblog evaluation object emotion tendency analysis method based on BiLSTM weight sharing | |
CN112434514B (en) | Multi-granularity multi-channel neural network based semantic matching method and device and computer equipment | |
CN111159405B (en) | Irony detection method based on background knowledge | |
Wang et al. | Mongolian named entity recognition with bidirectional recurrent neural networks | |
CN115309915A (en) | Knowledge graph construction method, device, equipment and storage medium | |
CN113204975A (en) | Sensitive character risk identification method based on distant supervision | |
Shelke et al. | A novel approach for named entity recognition on Hindi language using residual bilstm network | |
CN115186670B (en) | Method and system for identifying domain named entities based on active learning | |
CN116644148A (en) | Keyword recognition method and device, electronic equipment and storage medium | |
CN115796141A (en) | Text data enhancement method and device, electronic equipment and storage medium | |
CN115577111A (en) | Text classification method based on self-attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||