CN111428501A - Named entity recognition method, recognition system and computer readable storage medium - Google Patents

Named entity recognition method, recognition system and computer readable storage medium

Info

Publication number
CN111428501A
Authority
CN
China
Prior art keywords
named entity
corpus
model
constructing
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910020553.6A
Other languages
Chinese (zh)
Inventor
贾丹丹
张丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd and Beijing Founder Electronics Co Ltd

Abstract

The invention provides a named entity recognition method, a named entity recognition system and a computer-readable storage medium. The method comprises: constructing a named entity word dictionary and a named entity recognition corpus; constructing a named entity recognition model according to the named entity word dictionary and the named entity recognition corpus; and acquiring a corpus to be recognized and inputting it into the named entity recognition model to obtain the named entity recognition result for that corpus. Because the model is trained with the constructed named entity dictionary, its recognition performance on unknown words and its recognition accuracy are improved, so that the named entities in the corpus to be recognized are recognized accurately.

Description

Named entity recognition method, recognition system and computer readable storage medium
Technical Field
The invention relates to the technical field of data processing, in particular to a named entity identification method, a named entity identification system and a computer readable storage medium.
Background
With the advent of the information age, the internet has become an indispensable tool in people's life. The popularization of the internet brings profound influences to life styles of people, such as online shopping, social contact, entertainment, navigation and the like, and the internet brings convenience to people and simultaneously generates massive data which are closely related to the life of people. The text data in the mass data occupies a large part, so people urgently need to automatically extract valuable information in the mass unstructured text, and the named entity identification technology can extract key entity information from the text, has high application value and is an important research direction.
Although natural language processing technology has been developed for decades, research on its basic processing modules, especially Chinese named entity recognition, has not achieved an obvious breakthrough. In the related art, manual processing by domain experts is slow, error-prone, inefficient and costly when faced with massive, complex, multi-domain natural language data, and it has become a bottleneck limiting the development of natural language processing.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art or the related art.
To this end, one aspect of the present invention is directed to a method for identifying a named entity.
Another aspect of the invention is directed to a system for identifying named entities.
Yet another aspect of the present invention is directed to a computer-readable storage medium.
In view of the above, according to an aspect of the present invention, a method for identifying a named entity is provided, including: constructing a named entity word dictionary and constructing a named entity recognition corpus; constructing a named entity recognition model according to the named entity word dictionary and the named entity recognition corpus; and acquiring a corpus to be recognized, and inputting the corpus to be recognized into the named entity recognition model to obtain a recognition result of the named entities of the corpus to be recognized.
The named entity recognition method provided by the invention first constructs a named entity word dictionary, then constructs a named entity recognition corpus, and finally constructs a named entity recognition model that fuses the named entity word dictionary; that is, it builds a named entity recognition system that automatically recognizes and extracts person names, place names and organization names in text. The named entity recognition model may be a dictionary-fused bidirectional long short-term memory network with a conditional random field (BiLSTM-CRF).
The method for identifying the named entity according to the present invention may further have the following technical features:
in the foregoing technical solution, preferably, the corpus to be recognized is input to the named entity recognition model, and a recognition result of the named entity of the corpus to be recognized is obtained, which specifically includes: calculating the probability of the identification label of each character in the corpus to be identified by using the named entity identification model, and taking the identification label with the maximum probability as the target identification label of the character; and identifying the named entity of the corpus to be identified according to the target identification label of the character.
In the technical scheme, the probability of the identification tag of each character in the corpus to be identified is calculated by using the named entity identification model, the identification tag with the maximum probability is used as the target identification tag of the character, and further, the named entity of the corpus to be identified is identified according to the target identification tag of the character, so that the named entity of the corpus to be identified is accurately identified.
In any of the above technical solutions, preferably, constructing the named entity word dictionary specifically includes: acquiring entries from a preset database; converting the character form of the entries, and de-duplicating the converted entries; calculating the sum of the inverse document frequency value of each entry and the entry length value of the entry, comparing the sum with a preset threshold, and filtering out the entries whose sum is smaller than the preset threshold; and constructing the named entity word dictionary from the remaining entries.
In this technical solution, an XML (Extensible Markup Language) compressed file is downloaded from a preset database (the Wikipedia database); entries are extracted from the XML file with the open-source Python tool Wikipedia Extractor; traditional Chinese entries are converted into simplified Chinese with the open-source project OpenCC; and the converted entries are de-duplicated. The sum of the inverse document frequency value and the entry length value of each entry is then calculated, and entries whose sum is smaller than a preset threshold are filtered out. In other words, shorter entries are filtered out and more long words are retained, which removes common words (which are usually short) and avoids the adverse effect of introducing common words into the named entity recognition model.
In any of the above technical solutions, preferably, the constructing of the named entity identification corpus specifically includes: obtaining sample corpora; and marking the named entity words and the non-named entity words in the sample corpus by taking the characters as marking units according to the marking modes of the beginning, the middle or the end, adding the boundary information of the non-named entity words into the marking of the sample corpus, and further constructing the named entity recognition corpus.
In this technical solution, compared with the BIO (beginning, inside, outside) labeling scheme in the related art, the invention also takes the boundary information of non-named entity words into account when constructing the named entity recognition corpus. Boundary information is therefore added to the corpus labels for non-named entity words as well as for named entity words, so that a named entity recognition model trained on this corpus is more accurate.
In any of the above technical solutions, preferably, constructing the named entity recognition model according to the named entity word dictionary and the named entity recognition corpus specifically includes: acquiring embedding vectors of the characters in the named entity recognition corpus, and generating dictionary feature vectors of the characters according to the named entity dictionary; training a first bidirectional long short-term memory network model with the embedding vectors, training a second bidirectional long short-term memory network model with the dictionary feature vectors, and training a conditional random field model with the outputs of the first and second bidirectional long short-term memory network models; and taking the trained first bidirectional long short-term memory network model, the trained second bidirectional long short-term memory network model and the trained conditional random field model together as the named entity recognition model.
In this technical solution, after the named entity dictionary is constructed, its information must be converted into features that take part in model training: dictionary feature vectors of the characters are generated from the named entity dictionary, and embedding vectors of the characters in the named entity recognition corpus are obtained at the same time. The first bidirectional long short-term memory network model is trained with the embedding vectors, the second with the dictionary feature vectors, and the conditional random field model is then trained on the outputs of the two bidirectional long short-term memory network models to obtain the final named entity recognition model.
According to another aspect of the present invention, there is provided a named entity recognition system, comprising: a memory for storing a computer program; and a processor for executing the computer program to: construct a named entity word dictionary and construct a named entity recognition corpus; construct a named entity recognition model according to the named entity word dictionary and the named entity recognition corpus; and acquire a corpus to be recognized, and input the corpus to be recognized into the named entity recognition model to obtain a recognition result of the named entities of the corpus to be recognized.
The named entity recognition system provided by the invention comprises a named entity dictionary, a named entity recognition corpus and a named entity recognition model fused with the named entity dictionary. The named entity recognition model may be a dictionary-fused bidirectional long short-term memory network with a conditional random field (BiLSTM-CRF), and it can automatically recognize and extract person names, place names and organization names in text.
The system for identifying named entities according to the present invention may further have the following technical features:
in the foregoing technical solution, preferably, the processor inputs the corpus to be recognized into the named entity recognition model to obtain a recognition result of the named entity of the corpus to be recognized, and the method specifically includes: calculating the probability of the identification label of each character in the corpus to be identified by using the named entity identification model, and taking the identification label with the maximum probability as the target identification label of the character; and identifying the named entity of the corpus to be identified according to the target identification label of the character.
In the technical scheme, a named entity recognition model is used for obtaining a plurality of recognition tags of each character in the corpus to be recognized and the probability of each recognition tag, the recognition tag with the highest probability is used as a target recognition tag of the character, further, the named entity of the corpus to be recognized is recognized according to the target recognition tag of the character, and accurate recognition of the named entity of the corpus to be recognized is achieved.
In any of the above technical solutions, preferably, the processor constructing the named entity word dictionary specifically includes: acquiring entries from a preset database; converting the character form of the entries, and de-duplicating the converted entries; calculating the sum of the inverse document frequency value of each entry and the entry length value of the entry, comparing the sum with a preset threshold, and filtering out the entries whose sum is smaller than the preset threshold; and constructing the named entity word dictionary from the remaining entries.
In this technical solution, an XML compressed file is downloaded from a preset database (the Wikipedia database); entries are extracted from the XML file with the open-source Python tool Wikipedia Extractor; traditional Chinese entries are converted into simplified Chinese with the open-source project OpenCC; and the converted entries are de-duplicated. The sum of the inverse document frequency value and the entry length value of each entry is then calculated, and entries whose sum is smaller than a preset threshold are filtered out. In other words, shorter entries are filtered out and more long words are retained, which removes common words (which are usually short) and avoids the adverse effect of introducing common words into the named entity recognition model.
In any of the foregoing technical solutions, preferably, the processor constructs the named entity identification corpus, which specifically includes: obtaining sample corpora; and marking the named entity words and the non-named entity words in the sample corpus by taking the characters as marking units according to the marking modes of the beginning, the middle or the end, adding the boundary information of the non-named entity words into the marking of the sample corpus, and further constructing the named entity recognition corpus.
In this technical solution, compared with the BIO (beginning, inside, outside) labeling scheme in the related art, the invention also takes the boundary information of non-named entity words into account when constructing the named entity recognition corpus. Boundary information is therefore added to the corpus labels for non-named entity words as well as for named entity words, so that a named entity recognition model trained on this corpus is more accurate.
In any of the above technical solutions, preferably, the processor constructing the named entity recognition model according to the named entity word dictionary and the named entity recognition corpus specifically includes: acquiring embedding vectors of the characters in the named entity recognition corpus, and generating dictionary feature vectors of the characters according to the named entity dictionary; training a first bidirectional long short-term memory network model with the embedding vectors, training a second bidirectional long short-term memory network model with the dictionary feature vectors, and training a conditional random field model with the outputs of the first and second bidirectional long short-term memory network models; and taking the trained first bidirectional long short-term memory network model, the trained second bidirectional long short-term memory network model and the trained conditional random field model together as the named entity recognition model.
In this technical solution, after the named entity dictionary is constructed, its information must be converted into features that take part in model training: dictionary feature vectors of the characters are generated from the named entity dictionary, and embedding vectors of the characters in the named entity recognition corpus are obtained at the same time. The first bidirectional long short-term memory network model is trained with the embedding vectors, the second with the dictionary feature vectors, and the conditional random field model is then trained on the outputs of the two bidirectional long short-term memory network models to obtain the final named entity recognition model.
According to a further aspect of the present invention, a computer-readable storage medium is proposed, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method for identifying a named entity according to any one of the previous claims.
The computer-readable storage medium provided by the present invention, when being executed by a processor, implements the steps of the method for identifying a named entity according to any one of the above technical solutions, so that the computer-readable storage medium includes all the beneficial effects of the method for identifying a named entity according to any one of the above technical solutions.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 illustrates a flow diagram of a method for named entity identification in accordance with an embodiment of the present invention;
FIG. 2 is a flow chart of a BiLSTM-CRF named entity recognition method based on a common sense dictionary according to a specific embodiment of the invention;
FIG. 3 is a flow diagram illustrating the construction of a named entity word dictionary in accordance with an exemplary embodiment of the present invention;
FIG. 4 is a schematic diagram of a long short-term memory network unit architecture according to an embodiment;
FIG. 5 is a diagram illustrating a process for generating dictionary feature vectors from feature templates in accordance with an embodiment of the present invention;
FIG. 6 is a diagram illustrating the construction of a named entity recognition model according to one embodiment of the invention;
FIG. 7 illustrates a network diagram of a named entity recognition model in accordance with a specific embodiment of the present invention;
FIG. 8 shows a schematic block diagram of a named entity recognition system of one embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.
An embodiment of the first aspect of the present invention provides a method for identifying a named entity, and FIG. 1 illustrates a flow diagram of the method for identifying a named entity according to an embodiment of the present invention. The method comprises the following steps:
Step 102: constructing a named entity word dictionary and constructing a named entity recognition corpus;
Step 104: constructing a named entity recognition model according to the named entity word dictionary and the named entity recognition corpus;
Step 106: acquiring a corpus to be recognized, and inputting the corpus to be recognized into the named entity recognition model to obtain a recognition result of the named entities of the corpus to be recognized.
The named entity recognition method provided by the invention first constructs a named entity word dictionary, then constructs a named entity recognition corpus, and finally constructs a named entity recognition model that fuses the named entity word dictionary; that is, it builds a named entity recognition system that automatically recognizes and extracts person names, place names and organization names in text. The named entity recognition model may be a dictionary-fused bidirectional long short-term memory network with a conditional random field (BiLSTM-CRF).
Optionally, in step 106, the corpus to be recognized is input into the named entity recognition model, and a recognition result of the named entity of the corpus to be recognized is obtained, which specifically includes: calculating the probability of the identification label of each character in the corpus to be identified by using the named entity identification model, and taking the identification label with the maximum probability as the target identification label of the character; and identifying the named entity of the corpus to be identified according to the target identification label of the character.
In the embodiment, the named entity recognition model is used for calculating the probability of the recognition tag of each character in the corpus to be recognized, the recognition tag with the maximum probability is used as the target recognition tag of the character, and further, the named entity of the corpus to be recognized is recognized according to the target recognition tag of the character, so that the named entity of the corpus to be recognized is accurately recognized.
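The decoding described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `pick_labels` and `extract_entities` and the per-character probability dictionaries are assumptions made for the example, and the model that would produce the probabilities is not shown.

```python
from typing import Dict, List, Optional, Tuple


def pick_labels(char_label_probs: List[Dict[str, float]]) -> List[str]:
    """For each character, keep the label with the highest probability."""
    return [max(probs, key=probs.get) for probs in char_label_probs]


def extract_entities(text: str, labels: List[str]) -> List[Tuple[str, str]]:
    """Group B-X / I-X runs into (entity_text, entity_type) pairs.

    Labels "O", "B-O" and "I-O" mark non-entity characters and are skipped.
    """
    entities: List[Tuple[str, str]] = []
    current_chars: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity being built, if any.
        nonlocal current_chars, current_type
        if current_chars and current_type is not None:
            entities.append(("".join(current_chars), current_type))
        current_chars, current_type = [], None

    for ch, label in zip(text, labels):
        prefix, _, kind = label.partition("-")
        is_entity = kind not in ("", "O")
        if is_entity and prefix == "B":
            flush()
            current_chars, current_type = [ch], kind
        elif is_entity and prefix == "I" and current_type == kind:
            current_chars.append(ch)
        else:
            flush()
    flush()
    return entities
```

Here an `I-` label that does not continue an entity of the same type is simply dropped; that policy is a design choice of this sketch and is not specified in the text.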
Optionally, in step 102, constructing the named entity word dictionary specifically includes: acquiring entries from a preset database; converting the character form of the entries, and de-duplicating the converted entries; calculating the sum of the inverse document frequency value of each entry and the entry length value of the entry, comparing the sum with a preset threshold, and filtering out the entries whose sum is smaller than the preset threshold; and constructing the named entity word dictionary from the remaining entries.
In this embodiment, an XML compressed file is downloaded from a preset database (the Wikipedia database); entries are extracted from the XML file with the open-source Python tool Wikipedia Extractor; traditional Chinese entries are converted into simplified Chinese with the open-source project OpenCC; and the converted entries are de-duplicated. The sum of the inverse document frequency value and the entry length value of each entry is then calculated, and entries whose sum is smaller than a preset threshold are filtered out. In other words, shorter entries are filtered out and more long words are retained, which removes common words (which are usually short) and avoids the adverse effect of introducing common words into the named entity recognition model.
Optionally, in step 102, constructing a named entity identification corpus specifically includes: obtaining sample corpora; and marking the named entity words and the non-named entity words in the sample corpus by taking the characters as marking units according to the marking modes of the beginning, the middle or the end, adding the boundary information of the non-named entity words into the marking of the sample corpus, and further constructing the named entity recognition corpus.
In this embodiment, compared with the BIO (beginning, inside, outside) labeling scheme in the related art, the invention also takes the boundary information of non-named entity words into account when constructing the named entity recognition corpus. Boundary information is therefore added to the corpus labels for non-named entity words as well as for named entity words, so that a named entity recognition model trained on this corpus is more accurate.
Optionally, in step 104, constructing the named entity recognition model according to the named entity word dictionary and the named entity recognition corpus specifically includes: acquiring embedding vectors of the characters in the named entity recognition corpus, and generating dictionary feature vectors of the characters according to the named entity dictionary; training a first bidirectional long short-term memory network model with the embedding vectors, training a second bidirectional long short-term memory network model with the dictionary feature vectors, and training a conditional random field model with the outputs of the first and second bidirectional long short-term memory network models; and taking the trained first bidirectional long short-term memory network model, the trained second bidirectional long short-term memory network model and the trained conditional random field model together as the named entity recognition model.
In this embodiment, after the named entity dictionary is constructed, its information must be converted into features that take part in model training: dictionary feature vectors of the characters are generated from the named entity dictionary, and embedding vectors of the characters in the named entity recognition corpus are obtained at the same time. The first bidirectional long short-term memory network model is trained with the embedding vectors, the second with the dictionary feature vectors, and the conditional random field model is then trained on the outputs of the two bidirectional long short-term memory network models to obtain the final named entity recognition model.
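One way to turn dictionary information into per-character features is sketched below. The feature templates themselves are not given in this passage, so the templates used here (whether an n-gram that starts or ends at the current character is a dictionary entry, for n from 2 to `max_len`) are purely an assumption for illustration, as is the function name `dictionary_features`.

```python
from typing import List, Set


def dictionary_features(text: str, dictionary: Set[str],
                        max_len: int = 4) -> List[List[int]]:
    """Return one binary feature vector per character.

    For every n in 2..max_len the vector holds two flags (hypothetical
    templates): the n-gram starting at this character is in the dictionary,
    and the n-gram ending at this character is in the dictionary.
    """
    features: List[List[int]] = []
    for i in range(len(text)):
        row: List[int] = []
        for n in range(2, max_len + 1):
            # n-gram that starts at position i (may be cut off at the end)
            start_gram = text[i:i + n]
            row.append(int(len(start_gram) == n and start_gram in dictionary))
            # n-gram that ends at position i (may be cut off at the start)
            end_gram = text[max(0, i - n + 1):i + 1]
            row.append(int(len(end_gram) == n and end_gram in dictionary))
        features.append(row)
    return features
```

Vectors of this kind would be the input to the second bidirectional network, while the character embeddings feed the first; how the two outputs are combined before the conditional random field layer is not detailed in this passage.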
In a specific embodiment of the invention, a BiLSTM-CRF named entity recognition method based on a common sense dictionary is provided. It builds a named entity recognition system that automatically recognizes and extracts person names, place names and organization names in text, and fuses a prior common sense dictionary to help the model recognize unknown words, improving the model's recognition of unknown words. As shown in FIG. 2, the method mainly comprises the following steps:
Step 202: constructing a named entity word dictionary according to the Wikipedia knowledge base;
Step 204: constructing a named entity recognition corpus;
Step 206: constructing a named entity recognition model, namely a bidirectional long short-term memory network with a conditional random field (BiLSTM-CRF) fused with the named entity word dictionary.
With respect to step 202, a named entity word dictionary is constructed from the Wikipedia knowledge base. The Wikipedia knowledge base contains millions of Chinese entries, which include a large number of named entity words, and new entity words are continuously added over time. The Wikipedia entries are therefore filtered to generate a named entity dictionary, which improves the recognition performance of the named entity recognition model on unknown words. The specific process of constructing the named entity word dictionary from the Wikipedia knowledge base is shown in FIG. 3:
Step 302: download the Wikipedia XML data, that is, download the XML compressed file of the Wikipedia data from the official Wikipedia website;
Step 304: extract the Wikipedia entries from the XML file with the open-source Python tool Wikipedia Extractor;
Step 306: convert traditional Chinese entries into simplified Chinese with the open-source project OpenCC;
Step 308: de-duplicate the entries after the conversion to simplified Chinese;
Step 310: filter the entries by combining the IDF (inverse document frequency) value of each entry with the length of the word;
Step 312: generate the named entity word dictionary.
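Steps 306 to 312 can be sketched as a small pipeline. This is an illustrative outline under stated assumptions: `build_entity_dictionary` is a hypothetical helper, `to_simplified` stands in for whatever traditional-to-simplified converter is used (for example a wrapper around OpenCC), and `score` stands in for the length-and-IDF score used for filtering. Downloading and extracting the entries (steps 302 and 304) is assumed to have happened already.

```python
from typing import Callable, Iterable, List


def build_entity_dictionary(entries: Iterable[str],
                            to_simplified: Callable[[str], str],
                            score: Callable[[str], float],
                            threshold: float) -> List[str]:
    """Convert, de-duplicate and filter entries, preserving input order."""
    seen = set()
    kept: List[str] = []
    for entry in entries:
        simplified = to_simplified(entry.strip())   # step 306
        if not simplified or simplified in seen:     # step 308
            continue
        seen.add(simplified)
        if score(simplified) >= threshold:           # step 310
            kept.append(simplified)
    return kept                                      # step 312
```

Passing the converter and the scoring function as parameters keeps the sketch independent of any particular library; entries whose score is below the threshold are dropped, matching the filtering rule described in the text.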
Many Wikipedia entries are not named entity words. Entries such as 'society' and 'traffic' are common words that appear frequently in the training and test corpora, and introducing them into the model as named entity words may have a large negative effect on the model. Common words among the Wikipedia entries therefore need to be filtered out. Calculating the IDF value requires a document set; here the word-segmented corpus of Wikipedia entries is used as the document set.
Inverse File frequency value (idf)i):
Figure BDA0001940557840000101
Where D is the document set, | D | represents the total number of documents in the document set, tiIs an entry I, djAre documents in a document collection.
Therefore, the present invention uses a combination of the entry length and the IDF value as the basis for filtering Wikipedia entries. Denoting the length of an entry by LEN, LENIDF (the combination of the entry length and the IDF value) is calculated as follows:
$$\mathrm{LENIDF} = \log(\mathrm{LEN}) + \mathrm{IDF} \tag{2}$$
By setting a minimum threshold on the LENIDF of an entry, common words can be eliminated, which avoids the adverse effects of introducing common words into the model.
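The filtering rule can be sketched as below. This is an illustration under stated assumptions: documents are represented as sets of segmented words, LEN is counted in characters, the logarithm is natural, and a term that occurs in no document is treated as occurring in one. The function names, the toy document set and the threshold value are all hypothetical.

```python
import math
from typing import List, Set


def idf(term: str, documents: List[Set[str]]) -> float:
    """log(|D| / number of documents that contain the term)."""
    containing = sum(1 for doc in documents if term in doc)
    # Assumption: a term found in no document is treated as occurring in one,
    # which avoids division by zero.
    return math.log(len(documents) / max(containing, 1))


def lenidf(term: str, documents: List[Set[str]]) -> float:
    """Equation (2): log(LEN) + IDF, with LEN counted in characters."""
    return math.log(len(term)) + idf(term, documents)


def filter_entries(entries: List[str], documents: List[Set[str]],
                   threshold: float) -> List[str]:
    """Keep entries whose LENIDF reaches the minimum threshold."""
    return [e for e in entries if lenidf(e, documents) >= threshold]


# Toy document set: "社会" (society) is common, "北京大学" is rare and longer.
documents = [{"社会", "发展"}, {"社会", "交通"}, {"社会", "北京大学"}, {"交通", "经济"}]
print(filter_entries(["社会", "北京大学"], documents, threshold=2.0))
```

With this toy data the short, frequent word scores about 0.98 and is dropped, while the longer, rarer entry scores about 2.77 and is kept.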
For step 204, the named entity recognition corpus is constructed. Compared with the general BIO tagging format, the named entity recognition corpus constructed by the invention also takes the boundary information of non-entity words into account. The specific flow is as follows:
The invention uses the part-of-speech tagged corpus of the People's Daily covering 1 month and 6 days of 1998. With the character as the tagging unit and following the BIO format, person names, place names and organization names are labeled with 7 tag types: O, B-PER, I-PER, B-LOC, I-LOC, B-ORG and I-ORG. PER marks a person name, LOC a place name, ORG an organization name, and O a non-named-entity word. The prefix B indicates that the current unit is the beginning of an entity, and I indicates that the current unit is the middle or end of an entity. For example, "Wangchua is the originator of the great northern group" is tagged: B-PER O O B-ORG I-ORG I-ORG I-ORG I-ORG O O O O
To take the word boundary information of non-entity words into account, the invention provides an improved BIO tagging scheme comprising B-O, I-O, B-PER, I-PER, B-LOC, I-LOC, B-ORG and I-ORG. The original tag O of non-entity words is split into B-O and I-O, where B-O marks the beginning of a non-entity word and I-O marks the middle or end of a non-entity word. After the improvement:
"Wangchong academician is the founder of the northern university group" is tagged: B-PER I-PER B-O I-O B-O B-ORG I-ORG I-ORG I-ORG I-ORG B-O B-O I-O I-O
It can be seen that, in addition to the boundary information of entity words, the boundary information of non-entity words is also added to the corpus annotation.
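The difference between the two tagging schemes can be sketched as follows. The input format (a list of word and entity-type pairs from a segmented corpus), the function name and the example sentence are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple


def tag_characters(words: List[Tuple[str, Optional[str]]],
                   improved: bool = True) -> List[str]:
    """Produce one tag per character from (word, entity_type) pairs.

    entity_type is "PER", "LOC", "ORG", or None for a non-entity word.
    With improved=True, non-entity words get B-O / I-O instead of plain O.
    """
    tags: List[str] = []
    for word, entity_type in words:
        if not word:
            continue  # skip empty tokens
        if entity_type is None and not improved:
            tags.extend(["O"] * len(word))
            continue
        label = entity_type if entity_type is not None else "O"
        tags.append(f"B-{label}")
        tags.extend([f"I-{label}"] * (len(word) - 1))
    return tags


# Hypothetical segmented sentence: person, non-entity, place, non-entity.
sentence = [("张三", "PER"), ("在", None), ("北京", "LOC"), ("工作", None)]
print(tag_characters(sentence, improved=False))
print(tag_characters(sentence, improved=True))
```

In the improved output the two-character non-entity word is tagged B-O I-O, so the model also sees where non-entity words begin and end.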
The Arabic numerals and English letters in the People's Daily corpus are all in full-width form, whereas most Arabic numerals and English letters in current text are in half-width form, so they are converted to half-width form for consistency with current text. In addition, the date information at the beginning of each line of the People's Daily corpus is removed, so that frequently occurring numbers do not interfere with the recognition result.
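The full-width to half-width conversion can be sketched as below. The function name is hypothetical, and only digits and Latin letters are converted, as the text describes; removal of the leading date field is not shown because its exact format is not given here.

```python
def to_half_width(text: str) -> str:
    """Convert full-width digits and Latin letters to their half-width forms."""
    converted = []
    for ch in text:
        code = ord(ch)
        is_digit = 0xFF10 <= code <= 0xFF19   # full-width 0-9
        is_upper = 0xFF21 <= code <= 0xFF3A   # full-width A-Z
        is_lower = 0xFF41 <= code <= 0xFF5A   # full-width a-z
        if is_digit or is_upper or is_lower:
            # Full-width forms sit at a fixed offset from their ASCII forms.
            converted.append(chr(code - 0xFEE0))
        else:
            converted.append(ch)
    return "".join(converted)


print(to_half_width("１９９８年ＡＢＣｘｙｚ"))
```

Other full-width characters, such as Chinese punctuation, are deliberately left unchanged.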
For step 206, constructing the named entity recognition model, a bidirectional long short-term memory network with a conditional random field (BiLSTM-CRF) fused with the named entity word dictionary, specifically comprises:
(1) BiLSTM (bidirectional long short-term memory network) layer
The long short-term memory network (LSTM) is a variant of the recurrent neural network (RNN). In theory an RNN can process sequence data of any length, but in practice, if the sequence is too long, gradients vanish or explode during optimization. The LSTM addresses this long-term dependency problem of the RNN through a "gate" mechanism. The LSTM unit structure is shown in FIG. 4. The input gate $i$, the forget gate $f$, the candidate cell state $\tilde{c}$, the cell state $c$ and the output gate $o$ of the LSTM are defined as follows:
$$i_i = \sigma(W_i h_{i-1} + U_i e_{x_i} + b_i) \tag{3}$$
$$f_i = \sigma(W_f h_{i-1} + U_f e_{x_i} + b_f) \tag{4}$$
$$\tilde{c}_i = \tanh(W_c h_{i-1} + U_c e_{x_i} + b_c) \tag{5}$$
$$c_i = f_i \odot c_{i-1} + i_i \odot \tilde{c}_i \tag{6}$$
$$o_i = \sigma(W_o h_{i-1} + U_o e_{x_i} + b_o) \tag{7}$$
$$h_i = o_i \odot \tanh(c_i) \tag{8}$$
where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, $W$ and $U$ are weight matrices and $b$ is a bias; $W$, $U$ and $b$ are network parameters learned during training; $e_{x_i}$ is the embedding vector of character $x_i$, and $h_i$ is the state of the hidden layer.
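A single LSTM step can be sketched in plain Python as follows. This is a minimal sketch under assumptions: the candidate cell state and the cell state update use the standard LSTM form, the parameter layout (a dictionary mapping each gate name to a `(W, U, b)` triple) is hypothetical, and vectors are plain lists rather than tensors.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
GateParams = Tuple[Matrix, Matrix, Vector]  # (W, U, b) for one gate


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product on plain lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gate(W: Matrix, U: Matrix, b: Vector, h_prev: Vector, e_x: Vector,
         activation: Callable[[float], float]) -> Vector:
    """activation(W h_prev + U e_x + b), applied element-wise."""
    pre = [wh + ux + bias
           for wh, ux, bias in zip(matvec(W, h_prev), matvec(U, e_x), b)]
    return [activation(p) for p in pre]


def lstm_step(params: Dict[str, GateParams], e_x: Vector,
              h_prev: Vector, c_prev: Vector) -> Tuple[Vector, Vector]:
    """One time step: returns the new hidden state h and cell state c."""
    i = gate(*params["i"], h_prev, e_x, sigmoid)          # input gate
    f = gate(*params["f"], h_prev, e_x, sigmoid)          # forget gate
    c_tilde = gate(*params["c"], h_prev, e_x, math.tanh)  # candidate cell state
    c = [fk * ck + ik * gk
         for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]  # cell state update
    o = gate(*params["o"], h_prev, e_x, sigmoid)          # output gate
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]      # hidden state
    return h, c


# Toy parameters: hidden size 2, input size 1, all weights zero.
zero_gate = ([[0.0, 0.0], [0.0, 0.0]], [[0.0], [0.0]], [0.0, 0.0])
params = {name: zero_gate for name in "ifco"}
h, c = lstm_step(params, e_x=[0.3], h_prev=[0.0, 0.0], c_prev=[1.0, -2.0])
print(h, c)
```

With all-zero weights every gate evaluates to 0.5 and the candidate state to 0, so the new cell state is simply half of the previous one, which makes the update easy to check by hand.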
As can be seen from the LSTM unit structure diagram, the state $h_i$ of the hidden layer only carries information from the preceding context. Many tasks also need the following context: for example, predicting a missing word in a sentence requires considering not only the preceding text but also the following text. A bidirectional LSTM (BiLSTM) therefore combines a forward LSTM and a backward LSTM:
$$h_i = \overrightarrow{h}_i \oplus \overleftarrow{h}_i \tag{9}$$
where $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are the hidden states of the forward LSTM and the backward LSTM at time $i$, respectively, and $\oplus$ denotes the concatenation of $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$.
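The way a bidirectional network combines the two directions can be sketched as follows. To keep the example short, the recurrent step is passed in as a function, and the demonstration uses a running sum as a stand-in for a real LSTM step; all names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]
Step = Callable[[Vector, Vector], Vector]  # (input, previous state) -> new state


def run_direction(step: Step, inputs: List[Vector], h0: Vector) -> List[Vector]:
    """Apply a recurrent step over the inputs and collect every hidden state."""
    states = []
    h = h0
    for x in inputs:
        h = step(x, h)
        states.append(h)
    return states


def bidirectional(step_fwd: Step, step_bwd: Step,
                  inputs: List[Vector], h0: Vector) -> List[Vector]:
    """Concatenate forward and backward hidden states position by position."""
    forward = run_direction(step_fwd, inputs, h0)
    # The backward pass reads the sequence right to left; its states are then
    # reversed again so that index i lines up with input i.
    backward = run_direction(step_bwd, inputs[::-1], h0)[::-1]
    return [f + b for f, b in zip(forward, backward)]  # list concatenation


def running_sum(x: Vector, h: Vector) -> Vector:
    """Stand-in recurrent step (assumption: replaces an LSTM step)."""
    return [h[0] + x[0]]


print(bidirectional(running_sum, running_sum, [[1.0], [2.0], [3.0]], [0.0]))
```

Each output vector holds a summary of everything to the left of the position followed by a summary of everything to its right, which is the property the concatenation is meant to provide.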
(2) CRF (conditional random field) layer
For character-based named entity recognition, the tags of the characters in a sentence depend strongly on one another: the tag of one character largely constrains the tag of the next character, so the dependency between tags at adjacent positions should be taken into account. For example, I-LOC cannot follow I-PER, and the first character of a sequence is tagged "B-" rather than "I-". Therefore, to make the tagging result more accurate, a CRF layer is added on top of the BiLSTM layer.
For a given sentence $x = \{x_1, x_2, \ldots, x_n\}$, the score of the tag sequence $y = \{y_1, y_2, \ldots, y_n\}$ is:
$$s(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \tag{10}$$
The score $s(x, y)$ consists of the tag scores $P$ and the transition scores $A$ between tags. $A$ is the transition score matrix, and $A_{i,j}$ is the transition score from tag $i$ to tag $j$ ($y_0$ and $y_{n+1}$ denote the start and end markers of the sentence, respectively). $P$ is the score matrix output by the BiLSTM layer (of dimension $n \times k$, where $k$ is the number of tag types), and $P_{i,j}$ is the score of assigning the $j$-th tag to the $i$-th character of $x = \{x_1, x_2, \ldots, x_n\}$. The probability that $x = \{x_1, x_2, \ldots, x_n\}$ is tagged as $y = \{y_1, y_2, \ldots, y_n\}$ is then:
$$p(y \mid x) = \frac{e^{s(x, y)}}{\sum_{\tilde{y} \in Y_x} e^{s(x, \tilde{y})}} \tag{11}$$
where $Y_x$ denotes all possible tag sequences for the sentence $x = \{x_1, x_2, \ldots, x_n\}$.
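The sequence score and the resulting probability can be sketched as follows. This is an illustration under assumptions: tags are integers 0 to k-1, the transition matrix `A` has two extra rows and columns for the start and end markers (indices k and k+1), and the normalizer is computed with the standard forward algorithm in log space. The numbers are made up.

```python
import math
from itertools import product
from typing import List, Sequence

Matrix = List[List[float]]


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(P: Matrix, A: Matrix, y: Sequence[int]) -> float:
    """Sum of transition scores (including start and end) and tag scores."""
    k = len(A) - 2
    path = [k] + list(y) + [k + 1]  # start marker, tags, end marker
    transition = sum(A[a][b] for a, b in zip(path, path[1:]))
    emission = sum(P[i][tag] for i, tag in enumerate(y))
    return transition + emission


def log_partition(P: Matrix, A: Matrix) -> float:
    """log of the sum of exp(score) over all tag sequences (forward algorithm)."""
    n, k = len(P), len(A) - 2
    alpha = [A[k][j] + P[0][j] for j in range(k)]
    for i in range(1, n):
        alpha = [logsumexp([alpha[a] + A[a][j] for a in range(k)]) + P[i][j]
                 for j in range(k)]
    return logsumexp([alpha[j] + A[j][k + 1] for j in range(k)])


def log_probability(P: Matrix, A: Matrix, y: Sequence[int]) -> float:
    """log p(y | x) = s(x, y) - log of the normalizer."""
    return sequence_score(P, A, y) - log_partition(P, A)


# Toy example: 2 positions, 2 tags; rows and columns 2 and 3 of A are start/end.
P = [[1.0, 0.2], [0.3, 0.8]]
A = [[0.5, -1.0, 0.0, 0.1],
     [0.2, 0.4, 0.0, -0.3],
     [0.3, -0.2, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
total = sum(math.exp(log_probability(P, A, y))
            for y in product(range(2), repeat=len(P)))
print(total)  # the probabilities of all tag sequences should sum to about 1
```

The forward algorithm avoids enumerating every tag sequence, which would grow exponentially with sentence length.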
In training, the named entity recognition model is trained by maximizing the log-probability $\log p(y \mid x)$. In prediction, the tag sequence with the highest score is obtained with the Viterbi algorithm:
$$y^{*} = \arg\max_{\tilde{y} \in Y_x} s(x, \tilde{y}) \tag{12}$$
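Viterbi decoding over the tag scores and transition scores can be sketched as follows. As an assumption for this illustration, tags are integers 0 to k-1 and the transition matrix `A` carries two extra indices, k for the start marker and k+1 for the end marker; the numbers are made up.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def viterbi(P: Matrix, A: Matrix) -> Tuple[List[int], float]:
    """Return the highest-scoring tag sequence and its score."""
    n, k = len(P), len(A) - 2
    start, end = k, k + 1
    # best[j]: best score of any partial sequence ending in tag j.
    best = [A[start][j] + P[0][j] for j in range(k)]
    backpointers: List[List[int]] = []
    for i in range(1, n):
        new_best, pointers = [], []
        for j in range(k):
            prev = max(range(k), key=lambda a: best[a] + A[a][j])
            pointers.append(prev)
            new_best.append(best[prev] + A[prev][j] + P[i][j])
        best = new_best
        backpointers.append(pointers)
    last = max(range(k), key=lambda j: best[j] + A[j][end])
    score = best[last] + A[last][end]
    # Follow the back-pointers from the last position to the first.
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score


# Toy example: the transition from tag 0 to tag 1 is heavily penalized.
P = [[2.0, 1.0], [0.0, 3.0]]
A = [[0.0, -10.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
print(viterbi(P, A))  # ([1, 1], 4.0)
```

Choosing the best tag at each position independently would give the sequence 0, 1 here, which the transition penalty makes a poor choice; decoding over the whole sequence picks 1, 1 instead.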
(3) Named entity recognition model fused with the named entity word dictionary
After the named entity word dictionary is constructed, the dictionary information needs to be converted into features that take part in the training of the model. Given a sentence $x = \{x_1, x_2, \ldots, x_n\}$, for each character $x_i$ in the sentence, a dictionary feature vector $t_i$ is generated from the named entity word dictionary and the context of $x_i$ according to the feature templates in Table 1.
TABLE 1
Type      Template
2-gram    $x_{i-1}x_i$, $x_i x_{i+1}$
3-gram    $x_{i-2}x_{i-1}x_i$, $x_i x_{i+1}x_{i+2}$
4-gram    $x_{i-3}x_{i-2}x_{i-1}x_i$, $x_i x_{i+1}x_{i+2}x_{i+3}$
5-gram    $x_{i-4}x_{i-3}x_{i-2}x_{i-1}x_i$, $x_i x_{i+1}x_{i+2}x_{i+3}x_{i+4}$
For each text fragment in Table 1, a binary value indicates whether the fragment is in the named entity word dictionary: 1 if it can be found in the dictionary, 0 otherwise. The dictionary feature vector $t_i$ thus represents whether the character $x_i$ and the text fragments formed with its surrounding characters are named entity words. FIG. 5 illustrates the process of generating the dictionary feature vector $t_i$ from the feature templates:
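If the fragments beginning at the current character that correspond to "committee", namely the 2-gram $x_i x_{i+1}$ and the 3-gram $x_i x_{i+1} x_{i+2}$, are in the named entity word dictionary, the dictionary feature vector $t_i$ can be expressed as: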
0 1 0 1 0 0 0 0
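Generating the 8-dimensional dictionary feature vector from the templates in Table 1 can be sketched as follows. The function name, the example sentence and the two dictionary entries are assumptions chosen for illustration; fragments that would extend past the sentence boundary are counted as not found.

```python
from typing import List, Set


def dictionary_features(sentence: str, i: int, dictionary: Set[str]) -> List[int]:
    """Binary features for character i: for n = 2..5, the n-gram ending at i
    and the n-gram starting at i, each tested against the dictionary."""
    features = []
    for n in range(2, 6):
        # n-gram that ends at position i (empty if it would start before 0).
        left = sentence[i - n + 1:i + 1] if i - n + 1 >= 0 else ""
        # n-gram that starts at position i (empty if it would run past the end).
        right = sentence[i:i + n] if i + n <= len(sentence) else ""
        features.append(1 if left and left in dictionary else 0)
        features.append(1 if right and right in dictionary else 0)
    return features


# Hypothetical example: the dictionary holds a 2-character and a 3-character word.
sentence = "出席委员会会议"
dictionary = {"委员", "委员会"}
i = sentence.index("委")
print(dictionary_features(sentence, i, dictionary))  # [0, 1, 0, 1, 0, 0, 0, 0]
```

The features are ordered as in Table 1: for each n-gram length, first the fragment ending at the character, then the fragment starting at it.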
For each character $x_i$ in the sentence $x = \{x_1, x_2, \ldots, x_n\}$, both the embedding vector $e_{x_i}$ of the character and the dictionary feature vector $t_i$ generated from the named entity word dictionary are obtained. The original BiLSTM-CRF named entity recognition model takes only the embedding vector $e_{x_i}$ of each character as the input to the network. Since the feature vectors based on the named entity word dictionary can provide valuable information to the named entity recognition model from a different perspective, the invention, as shown in FIG. 6, adds an independent BiLSTM network to the original BiLSTM-CRF named entity recognition model, then combines the outputs of the two BiLSTM networks, and obtains the final output of the network through the CRF.
The specific structure of the network is shown in FIG. 7. For the sentence $x = \{x_1, x_2, \ldots, x_n\}$, the hidden states of the two BiLSTM networks at time $i$ are, respectively:
$$h_i^{x} = \mathrm{BiLSTM}_1(e_{x_i}) \tag{13}$$
$$h_i^{t} = \mathrm{BiLSTM}_2(t_i) \tag{14}$$
where $e_{x_i}$ is the embedding vector of the character $x_i$ and $t_i$ is the dictionary feature vector generated from the named entity word dictionary. The hidden states of the two networks are concatenated as the input to the CRF layer:
$$h_i = h_i^{x} \oplus h_i^{t} \tag{15}$$
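The fusion step can be sketched as follows. The concatenation follows the description above; the linear projection from the concatenated state to one score per tag is an assumption added for illustration (the text only says the concatenated states feed the CRF layer), and all names and numbers are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def fuse_and_project(h_embed: List[Vector], h_dict: List[Vector],
                     weights: Matrix, bias: Vector) -> Matrix:
    """Concatenate the two hidden states per position, then map to tag scores.

    h_embed: hidden states of the network fed with character embeddings.
    h_dict:  hidden states of the network fed with dictionary feature vectors.
    weights, bias: assumed linear layer producing one score per tag.
    """
    scores = []
    for hx, ht in zip(h_embed, h_dict):
        h = hx + ht  # concatenation of the two hidden states
        scores.append([sum(w * v for w, v in zip(row, h)) + b
                       for row, b in zip(weights, bias)])
    return scores


# One position, embedding-side state of size 2, dictionary-side state of size 1.
h_embed = [[1.0, 2.0]]
h_dict = [[3.0]]
weights = [[1.0, 0.0, 0.0],   # tag 0 looks only at the first component
           [0.0, 1.0, 1.0]]   # tag 1 mixes both networks
bias = [0.5, 0.0]
print(fuse_and_project(h_embed, h_dict, weights, bias))  # [[1.5, 5.0]]
```

Keeping the two networks separate until this point lets the dictionary features be processed independently of the character embeddings before the CRF sees them together.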
The other parts of the named entity recognition model are the same as in the original BiLSTM-CRF model.
The technical scheme of the invention adopts the tagging scheme B-O, I-O, B-PER, I-PER, B-LOC, I-LOC, B-ORG and I-ORG, which considers the boundary information of both entity words and non-entity words and improves the learning capacity of the named entity recognition model. On the basis of the BiLSTM-CRF named entity recognition method, it fuses a common-sense dictionary generated from the Wikipedia knowledge base, which greatly improves the recognition performance of the model on unknown words.
In an embodiment of the second aspect of the present invention, a system for identifying a named entity is proposed. FIG. 8 shows a schematic block diagram of a system 80 for identifying a named entity according to an embodiment of the present invention. The system includes:
a memory 802 for storing a computer program;
a processor 804 for executing a computer program to:
constructing a named entity word dictionary and constructing a named entity recognition corpus; constructing a named entity recognition model according to the named entity word dictionary and the named entity recognition corpus; and acquiring a corpus to be recognized, and inputting the corpus to be recognized into the named entity recognition model to obtain a recognition result of the named entities of the corpus to be recognized.
The named entity recognition system 80 provided by the invention first constructs a named entity word dictionary, then constructs a named entity recognition corpus, and finally constructs a named entity recognition model fused with the named entity word dictionary. That is, it builds a named entity recognition system that automatically recognizes and extracts person names, place names and organization names in text. The named entity recognition model may be a bidirectional long short-term memory network with a conditional random field (BiLSTM-CRF) fused with the dictionary.
Optionally, the processor 804 inputs the corpus to be recognized into the named entity recognition model to obtain a recognition result of the named entity of the corpus to be recognized, which specifically includes: calculating the probability of the identification label of each character in the corpus to be identified by using the named entity identification model, and taking the identification label with the maximum probability as the target identification label of the character; and identifying the named entity of the corpus to be identified according to the target identification label of the character.
In the embodiment, a named entity recognition model is used for obtaining a plurality of recognition tags of each character in the corpus to be recognized and the probability of each recognition tag, the recognition tag with the maximum probability is used as a target recognition tag of the character, and further, the named entity of the corpus to be recognized is recognized according to the target recognition tag of the character, so that the named entity of the corpus to be recognized is accurately recognized.
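The two steps of this embodiment, picking the most probable tag for each character and reading entities off the tag sequence, can be sketched as follows. The function names, the example probabilities and the handling of ill-formed sequences (an I- tag with no matching open entity is ignored) are assumptions for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def pick_tags(probabilities: Sequence[Sequence[float]],
              tag_names: Sequence[str]) -> List[str]:
    """For each character, take the tag with the highest probability."""
    return [tag_names[max(range(len(row)), key=row.__getitem__)]
            for row in probabilities]


def extract_entities(text: str, tags: Sequence[str]) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) pairs from character-level tags."""
    entities: List[Tuple[str, str]] = []
    start: Optional[int] = None
    etype: Optional[str] = None
    # A trailing B-O sentinel closes an entity that runs to the end of the text.
    for idx, tag in enumerate(list(tags) + ["B-O"]):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and start is not None and label == etype:
            continue  # still inside the current entity
        if start is not None:
            entities.append((text[start:idx], etype))
        if prefix == "B" and label != "O":
            start, etype = idx, label
        else:
            start, etype = None, None
    return entities


text = "张三在北京工作"
tags = ["B-PER", "I-PER", "B-O", "B-LOC", "I-LOC", "B-O", "I-O"]
print(extract_entities(text, tags))  # [('张三', 'PER'), ('北京', 'LOC')]
print(pick_tags([[0.1, 0.9], [0.8, 0.2]], ["B-O", "B-PER"]))  # ['B-PER', 'B-O']
```

Non-entity tags, whether written as O or as B-O and I-O, only serve to close the entity that precedes them.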
Optionally, the processor 804 constructs the named entity word dictionary, which specifically includes: acquiring entries in a preset database; performing font conversion on the entries, and deduplicating the entries after font conversion; calculating the sum of the inverse document frequency value of an entry and the entry length value of the entry, comparing the sum with a preset threshold, and filtering out the entries whose sum is smaller than the preset threshold; and constructing the named entity word dictionary from the filtered entries.
In this embodiment, a compressed XML file is downloaded from a preset database (the Wikipedia database), and the open-source Python tool Wikipedia Extractor is used to extract entries from the XML file. The open-source project OpenCC is used to convert Traditional Chinese entries into Simplified Chinese entries, and the converted entries are deduplicated. The sum of the inverse document frequency value of each entry and the entry length value of the entry is calculated, and the entries whose sum is smaller than a preset threshold are filtered out. In other words, shorter entries are filtered out and longer words are retained, which eliminates common words (common words tend to be short) and avoids the adverse effects of introducing common words into the named entity recognition model.
Optionally, the processor 804 constructs the named entity identification corpus, which specifically includes: obtaining sample corpora; and marking the named entity words and the non-named entity words in the sample corpus by taking the characters as marking units according to the marking modes of the beginning, the middle or the end, adding the boundary information of the non-named entity words into the marking of the sample corpus, and further constructing the named entity recognition corpus.
In this embodiment, compared with the BIO (begin, inside, outside) tagging scheme in the related art, the technical solution of the present invention also considers the boundary information of non-named-entity words when constructing the named entity recognition corpus. In addition to the boundary information of named entity words, the boundary information of non-named-entity words is therefore added to the corpus annotation, so that a named entity recognition model trained on this corpus is more accurate.
Optionally, the processor 804 constructs the named entity recognition model according to the named entity word dictionary and the named entity recognition corpus, and specifically includes: acquiring an embedded vector of characters in the named entity recognition corpus, and generating a dictionary feature vector of the characters according to the named entity dictionary; training a first bidirectional long-short time memory network model by using an embedded vector, training a second bidirectional long-short time memory network model by using a dictionary feature vector, and training a conditional random field model by using the output of the first bidirectional long-short time memory network model and the output of the second bidirectional long-short time memory network model; and taking the trained first bidirectional long-short time memory network model, the trained second bidirectional long-short time memory network model and the trained conditional random field model as named entity recognition models.
In this embodiment, after the named entity dictionary is constructed, it is necessary to convert the information of the named entity dictionary into features to participate in the model training process, that is, dictionary feature vectors of characters are generated according to the information of the named entity dictionary, and embedded vectors of the characters in the named entity recognition corpus are obtained at the same time. And training a first bidirectional long-short term memory network model by using the embedded vector, training a second bidirectional long-short term memory network model by using the dictionary characteristic vector, and finally obtaining a named entity recognition model by using the output training conditional random field model of the two bidirectional long-short term memory network models.
An embodiment of the third aspect of the present invention proposes a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for identifying a named entity according to any of the embodiments described above.
The present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method for identifying a named entity according to any one of the above embodiments, and therefore has all the advantages of the method for identifying a named entity according to any one of the above embodiments.
In the description herein, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance unless explicitly stated or limited otherwise; the terms "connected," "mounted," "secured," and the like are to be construed broadly and include, for example, fixed connections, removable connections, or integral connections; may be directly connected or indirectly connected through an intermediate. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the description herein, the description of the terms "one embodiment," "some embodiments," "specific embodiments," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. A method for identifying a named entity, comprising:
constructing a named entity word dictionary and constructing a named entity recognition corpus;
constructing a named entity recognition model according to the named entity word dictionary and the named entity recognition corpus;
and acquiring a corpus to be recognized, and inputting the corpus to be recognized into the named entity recognition model to obtain a recognition result of the named entity of the corpus to be recognized.
2. The method according to claim 1, wherein the step of inputting the corpus to be recognized into the named entity recognition model to obtain a recognition result of the named entity of the corpus to be recognized specifically comprises:
calculating the probability of the identification label of each character in the corpus to be identified by using the named entity identification model, and taking the identification label with the maximum probability as the target identification label of the character;
and identifying the named entity of the corpus to be identified according to the target identification label of the character.
3. The method for identifying named entities according to claim 1 or 2, wherein the constructing a dictionary of named entity words specifically comprises:
acquiring entries in a preset database;
performing font conversion on the entries, and performing duplication removal on the entries after the font conversion;
calculating the sum of the inverse document frequency value of the entry and the entry length value of the entry, comparing the sum with a preset threshold value, and filtering out the entry of which the sum is smaller than the preset threshold value;
and constructing the named entity word dictionary by using the filtered entries.
4. The method according to claim 1 or 2, wherein the constructing of the named entity recognition corpus specifically includes:
obtaining sample corpora;
and marking named entity words and non-named entity words in the sample corpus by taking characters as marking units according to the marking modes of the beginning, the middle or the end, adding the boundary information of the non-named entity words into the marking of the sample corpus, and further constructing the named entity recognition corpus.
5. The method according to claim 1 or 2, wherein the constructing the named entity recognition model according to the named entity word dictionary and the named entity recognition corpus specifically comprises:
acquiring an embedded vector of characters in the named entity recognition corpus, and generating dictionary feature vectors of the characters according to the named entity dictionary;
training a first bidirectional long-short time memory network model by using the embedded vector, training a second bidirectional long-short time memory network model by using the dictionary feature vector, and training a conditional random field model by using the output of the first bidirectional long-short time memory network model and the output of the second bidirectional long-short time memory network model;
and taking the trained first bidirectional long-short time memory network model, the trained second bidirectional long-short time memory network model and the trained conditional random field model as the named entity recognition model.
6. A system for identifying named entities, comprising:
a memory for storing a computer program;
a processor for executing the computer program to:
constructing a named entity word dictionary and constructing a named entity recognition corpus;
constructing a named entity recognition model according to the named entity word dictionary and the named entity recognition corpus;
and acquiring a corpus to be recognized, and inputting the corpus to be recognized into the named entity recognition model to obtain a recognition result of the named entity of the corpus to be recognized.
7. The system for identifying a named entity according to claim 6, wherein the processor inputs the corpus to be identified to the named entity identification model to obtain an identification result of the named entity of the corpus to be identified, and specifically includes:
calculating the probability of the identification label of each character in the corpus to be identified by using the named entity identification model, and taking the identification label with the maximum probability as the target identification label of the character;
and identifying the named entity of the corpus to be identified according to the target identification label of the character.
8. The system for identifying named entities of claim 6 or 7, wherein the processor constructs a dictionary of named entity words, comprising in particular:
acquiring entries in a preset database;
performing font conversion on the entries, and performing duplication removal on the entries after the font conversion;
calculating the sum of the inverse document frequency value of the entry and the entry length value of the entry, comparing the sum with a preset threshold value, and filtering out the entry of which the sum is smaller than the preset threshold value;
and constructing the named entity word dictionary by using the filtered entries.
9. The system for identifying a named entity according to claim 6 or 7, wherein the processor constructs the named entity identification corpus specifically including:
obtaining sample corpora;
and marking named entity words and non-named entity words in the sample corpus by taking characters as marking units according to the marking modes of the beginning, the middle or the end, adding the boundary information of the non-named entity words into the marking of the sample corpus, and further constructing the named entity recognition corpus.
10. The system according to claim 6 or 7, wherein the processor constructs the named entity recognition model based on the named entity word dictionary and the named entity recognition corpus, and specifically comprises:
acquiring an embedded vector of characters in the named entity recognition corpus, and generating dictionary feature vectors of the characters according to the named entity dictionary;
training a first bidirectional long-short time memory network model by using the embedded vector, training a second bidirectional long-short time memory network model by using the dictionary feature vector, and training a conditional random field model by using the output of the first bidirectional long-short time memory network model and the output of the second bidirectional long-short time memory network model;
and taking the trained first bidirectional long-short time memory network model, the trained second bidirectional long-short time memory network model and the trained conditional random field model as the named entity recognition model.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for identifying a named entity according to one of claims 1 to 5.
CN201910020553.6A 2019-01-09 2019-01-09 Named entity recognition method, recognition system and computer readable storage medium Pending CN111428501A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910020553.6A CN111428501A (en) 2019-01-09 2019-01-09 Named entity recognition method, recognition system and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111428501A true CN111428501A (en) 2020-07-17

Family

ID=71545711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910020553.6A Pending CN111428501A (en) 2019-01-09 2019-01-09 Named entity recognition method, recognition system and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111428501A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881692A (en) * 2020-07-28 2020-11-03 平安科技(深圳)有限公司 Mechanism entity extraction method, system and device based on multiple training targets
CN113255356A (en) * 2021-06-10 2021-08-13 杭州费尔斯通科技有限公司 Entity recognition method and device based on entity word list
CN113553853A (en) * 2021-09-16 2021-10-26 南方电网数字电网研究院有限公司 Named entity recognition method and device, computer equipment and storage medium
CN114548109A (en) * 2022-04-24 2022-05-27 阿里巴巴达摩院(杭州)科技有限公司 Named entity recognition model training method and named entity recognition method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644014A (en) * 2017-09-25 2018-01-30 南京安链数据科技有限公司 A kind of name entity recognition method based on two-way LSTM and CRF
CN107908614A (en) * 2017-10-12 2018-04-13 北京知道未来信息技术有限公司 A kind of name entity recognition method based on Bi LSTM
CN108460013A (en) * 2018-01-30 2018-08-28 大连理工大学 A kind of sequence labelling model based on fine granularity vocabulary representation model
CN108920445A (en) * 2018-04-23 2018-11-30 华中科技大学鄂州工业技术研究院 A kind of name entity recognition method and device based on Bi-LSTM-CRF model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QI ZHANG: "Networks Incorporating Dictionaries for Chinese Word Segmentation", 《THE THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881692A (en) * 2020-07-28 2020-11-03 平安科技(深圳)有限公司 Mechanism entity extraction method, system and device based on multiple training targets
WO2021139239A1 (en) * 2020-07-28 2021-07-15 平安科技(深圳)有限公司 Mechanism entity extraction method, system and device based on multiple training targets
CN113255356A (en) * 2021-06-10 2021-08-13 杭州费尔斯通科技有限公司 Entity recognition method and device based on entity word list
CN113255356B (en) * 2021-06-10 2021-09-28 杭州费尔斯通科技有限公司 Entity recognition method and device based on entity word list
CN113553853A (en) * 2021-09-16 2021-10-26 南方电网数字电网研究院有限公司 Named entity recognition method and device, computer equipment and storage medium
CN114548109A (en) * 2022-04-24 2022-05-27 阿里巴巴达摩院(杭州)科技有限公司 Named entity recognition model training method and named entity recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20200717)