CN111428501A - Named entity recognition method, recognition system and computer readable storage medium

Info

Publication number: CN111428501A
Application number: CN201910020553.6A
Authority: CN
Inventors: 贾丹丹; 张丹
Original assignee: Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Priority date: 2019-01-09
Filing date: 2019-01-09
Publication date: 2020-07-17

Abstract

The invention provides a named entity recognition method, a named entity recognition system, and a computer-readable storage medium. The method comprises: constructing a named entity word dictionary and a named entity recognition corpus; constructing a named entity recognition model according to the named entity word dictionary and the named entity recognition corpus; and acquiring a corpus to be recognized and inputting it into the named entity recognition model to obtain a recognition result of the named entities of the corpus to be recognized. Because the named entity recognition model is trained with the constructed named entity word dictionary, its recognition performance on unknown (out-of-vocabulary) words and its recognition accuracy are improved, so that the named entities of the corpus to be recognized are accurately recognized.

Description

Named entity recognition method, recognition system and computer readable storage medium

Technical Field

The invention relates to the technical field of data processing, and in particular to a named entity recognition method, a named entity recognition system, and a computer-readable storage medium.

Background

With the advent of the information age, the Internet has become an indispensable tool in daily life. Its spread has profoundly changed how people live, for example through online shopping, social networking, entertainment, and navigation. While bringing convenience, the Internet also generates massive amounts of data closely related to people's lives. Text accounts for a large share of this data, so there is an urgent need to automatically extract valuable information from massive unstructured text. Named entity recognition technology can extract key entity information from text, has high application value, and is an important research direction.

Although natural language processing technology has been developed for decades, research on its basic processing modules, especially Chinese named entity recognition, has not achieved an obvious breakthrough. In the related art, processing that relies on manual work and domain experts proves slow, error-prone, inefficient, and costly when facing massive, multi-domain, and complex natural language data, and has become a bottleneck limiting the development of natural language processing.

Disclosure of Invention

The present invention is directed to solving at least one of the problems of the prior art or the related art.

To this end, one aspect of the present invention is to provide a named entity recognition method.

Another aspect of the present invention is to provide a named entity recognition system.

Yet another aspect of the present invention is to provide a computer-readable storage medium.

In view of the above, according to one aspect of the present invention, a named entity recognition method is provided, including: constructing a named entity word dictionary and a named entity recognition corpus; constructing a named entity recognition model according to the named entity word dictionary and the named entity recognition corpus; and acquiring a corpus to be recognized and inputting it into the named entity recognition model to obtain a recognition result of the named entities of the corpus to be recognized.

The named entity recognition method provided by the invention first constructs a named entity word dictionary, then constructs a named entity recognition corpus, and finally constructs a named entity recognition model that fuses the named entity word dictionary. In other words, it builds a named entity recognition system that automatically recognizes and extracts person names, place names, and organization names in text. The named entity recognition model may be a bidirectional long short-term memory network with conditional random field (BiLSTM-CRF) model fused with the dictionary.

The method for identifying the named entity according to the present invention may further have the following technical features:

In the foregoing technical solution, preferably, inputting the corpus to be recognized into the named entity recognition model to obtain the recognition result of the named entities of the corpus to be recognized specifically includes: using the named entity recognition model to calculate the probability of each recognition tag for every character in the corpus to be recognized, and taking the recognition tag with the highest probability as the target recognition tag of that character; and recognizing the named entities of the corpus to be recognized according to the target recognition tags of the characters.

In this technical solution, the named entity recognition model is used to calculate the probability of each recognition tag for every character in the corpus to be recognized, the recognition tag with the highest probability is taken as the target recognition tag of that character, and the named entities of the corpus to be recognized are then recognized according to the target recognition tags of the characters, so that the named entities of the corpus to be recognized are accurately recognized.

In any of the above technical solutions, preferably, constructing the named entity word dictionary specifically includes: acquiring entries from a preset database; performing script conversion on the entries and de-duplicating the converted entries; calculating, for each entry, the sum of its inverse document frequency (IDF) value and its entry length value, comparing the sum with a preset threshold, and filtering out the entries whose sum is smaller than the preset threshold; and constructing the named entity word dictionary from the remaining entries.

In this technical solution, an XML (Extensible Markup Language) compressed file is downloaded from the preset database (the Wikipedia database); entries are extracted from the XML file with the open-source Python tool Wikipedia Extractor; traditional Chinese entries are converted into simplified Chinese with the open-source project OpenCC; and the converted entries are de-duplicated. The sum of each entry's inverse document frequency value and entry length value is then calculated, and the entries whose sum is smaller than the preset threshold are filtered out. Shorter entries are thus removed and longer words are retained, which eliminates common words (which are usually short) and avoids the adverse effect of introducing common words into the named entity recognition model.

In any of the above technical solutions, preferably, constructing the named entity recognition corpus specifically includes: obtaining a sample corpus; and, taking the character as the labeling unit, labeling the named entity words and the non-named entity words in the sample corpus according to whether each character is at the beginning, or in the middle or at the end, of a word, so that the boundary information of the non-named entity words is added to the labeling of the sample corpus, thereby constructing the named entity recognition corpus.

In this technical solution, compared with the BIO (begin, inside, outside) labeling scheme in the related art, the boundary information of non-named entity words is also taken into account when constructing the named entity recognition corpus. Boundary information is therefore added to the corpus labeling not only for named entity words but also for non-named entity words, which makes the named entity recognition model trained on this corpus more accurate.

In any of the above technical solutions, preferably, constructing the named entity recognition model according to the named entity word dictionary and the named entity recognition corpus specifically includes: acquiring embedding vectors of the characters in the named entity recognition corpus, and generating dictionary feature vectors of the characters according to the named entity word dictionary; training a first bidirectional long short-term memory (BiLSTM) network model with the embedding vectors, training a second BiLSTM network model with the dictionary feature vectors, and training a conditional random field (CRF) model with the outputs of the first and second BiLSTM network models; and taking the trained first BiLSTM network model, the trained second BiLSTM network model, and the trained CRF model together as the named entity recognition model.

In this technical solution, after the named entity word dictionary is constructed, its information needs to be converted into features that participate in model training; that is, dictionary feature vectors of the characters are generated from the information in the named entity word dictionary, and the embedding vectors of the characters in the named entity recognition corpus are obtained at the same time. A first bidirectional long short-term memory (BiLSTM) network model is trained with the embedding vectors, a second BiLSTM network model is trained with the dictionary feature vectors, and a conditional random field (CRF) model is trained with the outputs of the two BiLSTM network models, finally yielding the named entity recognition model.

According to another aspect of the present invention, a named entity recognition system is provided, comprising: a memory for storing a computer program; and a processor for executing the computer program to: construct a named entity word dictionary and a named entity recognition corpus; construct a named entity recognition model according to the named entity word dictionary and the named entity recognition corpus; and acquire a corpus to be recognized and input it into the named entity recognition model to obtain a recognition result of the named entities of the corpus to be recognized.

The named entity recognition system provided by the invention comprises a named entity word dictionary, a named entity recognition corpus, and a named entity recognition model fused with the named entity word dictionary. The named entity recognition model may be a bidirectional long short-term memory network with conditional random field (BiLSTM-CRF) model fused with the dictionary, and can automatically recognize and extract person names, place names, and organization names in text.

The system for identifying named entities according to the present invention may further have the following technical features:

In the foregoing technical solution, preferably, the processor inputting the corpus to be recognized into the named entity recognition model to obtain the recognition result of the named entities of the corpus to be recognized specifically includes: using the named entity recognition model to calculate the probability of each recognition tag for every character in the corpus to be recognized, and taking the recognition tag with the highest probability as the target recognition tag of that character; and recognizing the named entities of the corpus to be recognized according to the target recognition tags of the characters.

In this technical solution, the named entity recognition model is used to obtain the candidate recognition tags of every character in the corpus to be recognized together with the probability of each tag, the recognition tag with the highest probability is taken as the target recognition tag of that character, and the named entities of the corpus to be recognized are then recognized according to the target recognition tags of the characters, so that the named entities of the corpus to be recognized are accurately recognized.

In any of the above technical solutions, preferably, the processor constructing the named entity word dictionary specifically includes: acquiring entries from a preset database; performing script conversion on the entries and de-duplicating the converted entries; calculating, for each entry, the sum of its inverse document frequency (IDF) value and its entry length value, comparing the sum with a preset threshold, and filtering out the entries whose sum is smaller than the preset threshold; and constructing the named entity word dictionary from the remaining entries.

In this technical solution, an XML compressed file is downloaded from the preset database (the Wikipedia database); entries are extracted from the XML file with the open-source Python tool Wikipedia Extractor; traditional Chinese entries are converted into simplified Chinese with the open-source project OpenCC; and the converted entries are de-duplicated. The sum of each entry's inverse document frequency value and entry length value is then calculated, and the entries whose sum is smaller than the preset threshold are filtered out. Shorter entries are thus removed and longer words are retained, which eliminates common words (which are usually short) and avoids the adverse effect of introducing common words into the named entity recognition model.

In any of the foregoing technical solutions, preferably, the processor constructing the named entity recognition corpus specifically includes: obtaining a sample corpus; and, taking the character as the labeling unit, labeling the named entity words and the non-named entity words in the sample corpus according to whether each character is at the beginning, or in the middle or at the end, of a word, so that the boundary information of the non-named entity words is added to the labeling of the sample corpus, thereby constructing the named entity recognition corpus.

In any of the above technical solutions, preferably, the processor constructing the named entity recognition model according to the named entity word dictionary and the named entity recognition corpus specifically includes: acquiring embedding vectors of the characters in the named entity recognition corpus, and generating dictionary feature vectors of the characters according to the named entity word dictionary; training a first bidirectional long short-term memory (BiLSTM) network model with the embedding vectors, training a second BiLSTM network model with the dictionary feature vectors, and training a conditional random field (CRF) model with the outputs of the first and second BiLSTM network models; and taking the trained first BiLSTM network model, the trained second BiLSTM network model, and the trained CRF model together as the named entity recognition model.

According to a further aspect of the present invention, a computer-readable storage medium is proposed, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method for identifying a named entity according to any one of the previous claims.

The computer program stored on the computer-readable storage medium provided by the present invention, when executed by a processor, implements the steps of the named entity recognition method according to any one of the above technical solutions, so the computer-readable storage medium has all the beneficial effects of the named entity recognition method according to any one of the above technical solutions.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 illustrates a flow diagram of a method for named entity identification in accordance with an embodiment of the present invention;

FIG. 2 is a flow chart of a BiLSTM-CRF named entity recognition method based on a common-sense dictionary according to a specific embodiment of the invention;

FIG. 3 is a flow diagram illustrating the construction of a named entity word dictionary in accordance with an exemplary embodiment of the present invention;

FIG. 4 is a schematic diagram of the structure of a long short-term memory (LSTM) network unit according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating a process for generating dictionary feature vectors from feature templates in accordance with an embodiment of the present invention;

FIG. 6 is a diagram illustrating the construction of a named entity recognition model according to one embodiment of the invention;

FIG. 7 illustrates a network diagram of a named entity recognition model in accordance with a specific embodiment of the present invention;

FIG. 8 shows a schematic block diagram of a named entity recognition system of one embodiment of the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.

An embodiment of the first aspect of the present invention provides a named entity recognition method, and FIG. 1 illustrates a flow diagram of the named entity recognition method according to an embodiment of the present invention. The method comprises the following steps:

Step 102: constructing a named entity word dictionary and a named entity recognition corpus;

Step 104: constructing a named entity recognition model according to the named entity word dictionary and the named entity recognition corpus;

Step 106: acquiring a corpus to be recognized, and inputting it into the named entity recognition model to obtain a recognition result of the named entities of the corpus to be recognized.

Optionally, in step 106, inputting the corpus to be recognized into the named entity recognition model to obtain the recognition result of the named entities of the corpus to be recognized specifically includes: using the named entity recognition model to calculate the probability of each recognition tag for every character in the corpus to be recognized, and taking the recognition tag with the highest probability as the target recognition tag of that character; and recognizing the named entities of the corpus to be recognized according to the target recognition tags of the characters.

In this embodiment, the named entity recognition model is used to calculate the probability of each recognition tag for every character in the corpus to be recognized, the recognition tag with the highest probability is taken as the target recognition tag of that character, and the named entities of the corpus to be recognized are then recognized according to the target recognition tags of the characters, so that the named entities of the corpus to be recognized are accurately recognized.
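The two-stage decision described above (pick the most probable tag for each character, then read the entities off the tags) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names `pick_tags` and `extract_entities`, the representation of the model output as one tag-to-probability dictionary per character, and the example sentence are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def pick_tags(tag_probs: List[Dict[str, float]]) -> List[str]:
    """For each character, keep the tag with the highest probability."""
    return [max(probs, key=probs.get) for probs in tag_probs]


def extract_entities(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- tags into (entity_text, entity_type) pairs.

    Tags "O", "B-O" and "I-O" are treated as non-entity tags and skipped.
    """
    entities: List[Tuple[str, str]] = []
    current_chars: List[str] = []
    current_type: Optional[str] = None
    for ch, tag in zip(chars, tags):
        prefix, _, etype = tag.partition("-")
        is_entity = etype not in ("", "O")
        # Close the open entity when a new one begins, a non-entity tag
        # appears, or the entity type changes.
        if prefix == "B" or not is_entity or etype != current_type:
            if current_chars:
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [], None
        if is_entity:
            current_chars.append(ch)
            current_type = etype
    if current_chars:
        entities.append(("".join(current_chars), current_type))
    return entities


# Hypothetical model output for the five characters of "张三在北京".
probs = [
    {"B-PER": 0.9, "O": 0.1},
    {"I-PER": 0.8, "O": 0.2},
    {"O": 0.7, "B-LOC": 0.3},
    {"B-LOC": 0.6, "O": 0.4},
    {"I-LOC": 0.9, "O": 0.1},
]
tags = pick_tags(probs)
entities = extract_entities(list("张三在北京"), tags)
```

The sketch follows the per-character maximum described in the text; it does not attempt sequence-level decoding over the conditional random field.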

Optionally, in step 102, constructing the named entity word dictionary specifically includes: acquiring entries from a preset database; performing script conversion on the entries and de-duplicating the converted entries; calculating, for each entry, the sum of its inverse document frequency (IDF) value and its entry length value, comparing the sum with a preset threshold, and filtering out the entries whose sum is smaller than the preset threshold; and constructing the named entity word dictionary from the remaining entries.

In this embodiment, an XML compressed file is downloaded from the preset database (the Wikipedia database); entries are extracted from the XML file with the open-source Python tool Wikipedia Extractor; traditional Chinese entries are converted into simplified Chinese with the open-source project OpenCC; and the converted entries are de-duplicated. The sum of each entry's inverse document frequency value and entry length value is then calculated, and the entries whose sum is smaller than the preset threshold are filtered out. Shorter entries are thus removed and longer words are retained, which eliminates common words (which are usually short) and avoids the adverse effect of introducing common words into the named entity recognition model.

Optionally, in step 102, constructing the named entity recognition corpus specifically includes: obtaining a sample corpus; and, taking the character as the labeling unit, labeling the named entity words and the non-named entity words in the sample corpus according to whether each character is at the beginning, or in the middle or at the end, of a word, so that the boundary information of the non-named entity words is added to the labeling of the sample corpus, thereby constructing the named entity recognition corpus.

In this embodiment, compared with the BIO (begin, inside, outside) labeling scheme in the related art, the boundary information of non-named entity words is also taken into account when constructing the named entity recognition corpus. Boundary information is therefore added to the corpus labeling not only for named entity words but also for non-named entity words, which makes the named entity recognition model trained on this corpus more accurate.

Optionally, in step 104, constructing the named entity recognition model according to the named entity word dictionary and the named entity recognition corpus specifically includes: acquiring embedding vectors of the characters in the named entity recognition corpus, and generating dictionary feature vectors of the characters according to the named entity word dictionary; training a first bidirectional long short-term memory (BiLSTM) network model with the embedding vectors, training a second BiLSTM network model with the dictionary feature vectors, and training a conditional random field (CRF) model with the outputs of the first and second BiLSTM network models; and taking the trained first BiLSTM network model, the trained second BiLSTM network model, and the trained CRF model together as the named entity recognition model.

In this embodiment, after the named entity word dictionary is constructed, its information needs to be converted into features that participate in model training; that is, dictionary feature vectors of the characters are generated from the information in the named entity word dictionary, and the embedding vectors of the characters in the named entity recognition corpus are obtained at the same time. A first bidirectional long short-term memory (BiLSTM) network model is trained with the embedding vectors, a second BiLSTM network model is trained with the dictionary feature vectors, and a conditional random field (CRF) model is trained with the outputs of the two BiLSTM network models, finally yielding the named entity recognition model.

A specific embodiment of the invention provides a BiLSTM-CRF named entity recognition method based on a common-sense dictionary. It constructs a named entity recognition system that automatically recognizes and extracts person names, place names, and organization names in text, and fuses a prior common-sense dictionary to help the model recognize unknown words, which improves the recognition effect of the model on unknown words. As shown in FIG. 2, the method mainly comprises the following steps:

Step 202: constructing a named entity word dictionary from the Wikipedia knowledge base;

Step 204: constructing a named entity recognition corpus;

Step 206: constructing a bidirectional long short-term memory network with conditional random field (BiLSTM-CRF) named entity recognition model fused with the named entity word dictionary.

With respect to step 202, a named entity word dictionary is constructed from the Wikipedia knowledge base. The Wikipedia knowledge base contains millions of Chinese entries, which include a large number of named entity words, and new entity words are continuously added over time. The Wikipedia entries are therefore filtered to generate a named entity word dictionary, which improves the recognition performance of the named entity recognition model on unknown words. The specific process of constructing the named entity word dictionary from the Wikipedia knowledge base is shown in FIG. 3:

Step 302: download the Wikipedia XML data, that is, download the XML compressed file of the Wikipedia data from the official Wikipedia website;

Step 304: extract entries with Wikipedia Extractor, that is, use the open-source Python tool Wikipedia Extractor to extract the Wikipedia entries from the XML file;

Step 306: convert traditional Chinese to simplified Chinese with OpenCC, that is, use the open-source project OpenCC to convert traditional Chinese entries into simplified Chinese;

Step 308: de-duplicate, that is, remove duplicates from the entries after conversion to simplified Chinese;

Step 310: filter the entries, that is, filter the entries by combining their IDF (inverse document frequency) value and their word length;

Step 312: generate the named entity word dictionary.
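Steps 306 to 312 can be sketched as a small Python pipeline. This is an illustration under stated assumptions rather than the implementation of the invention: `build_dictionary` is a hypothetical name, and the traditional-to-simplified converter and the scoring function are passed in as plain callables so that the sketch does not depend on any particular library. In practice the converter would be backed by OpenCC; here a one-character toy mapping stands in for it, and entry length stands in for the real score.

```python
from typing import Callable, Iterable, List


def build_dictionary(
    entries: Iterable[str],
    to_simplified: Callable[[str], str],
    score: Callable[[str], float],
    min_score: float,
) -> List[str]:
    """Convert, de-duplicate and filter entries (steps 306 to 312)."""
    seen = set()
    kept = []
    for entry in entries:
        term = to_simplified(entry.strip())   # step 306: script conversion
        if not term or term in seen:          # step 308: de-duplication
            continue
        seen.add(term)
        if score(term) >= min_score:          # step 310: drop low-scoring entries
            kept.append(term)
    return kept                               # step 312: the dictionary


# Toy stand-in for a traditional-to-simplified converter (assumption).
_TOY_TABLE = str.maketrans({"學": "学"})


def toy_to_simplified(text: str) -> str:
    return text.translate(_TOY_TABLE)


dictionary = build_dictionary(
    ["北京大學", "北京大学", "社会", ""],
    to_simplified=toy_to_simplified,
    score=len,        # placeholder score; the text combines length and IDF
    min_score=3,
)
```

Converting before de-duplicating matters: the traditional and simplified spellings of the same entry collapse into one dictionary item only after conversion.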

Many Wikipedia entries are not named entity words. Entries such as 'society' and 'traffic' are common words that appear frequently in the training and test corpora; introducing them into the model as named entity words could have a large negative effect on the model. Common words in Wikipedia therefore need to be filtered out. Calculating the IDF value requires a document set; here the word-segmented corpus of Wikipedia entries is used as the document set.

The inverse document frequency value (idf_i) is calculated as:

$$\mathrm{idf}_i = \log \frac{|D|}{|\{\, j : t_i \in d_j \,\}|} \tag{1}$$

where D is the document set, |D| is the total number of documents in the document set, t_i is entry i, and d_j is a document in the document set.

Therefore, the present invention uses a combination of the entry length and the IDF value as the basis for filtering Wikipedia entries. Letting LEN denote the length of an entry, LENIDF (the combination of the entry length and the IDF value) is calculated as follows:

$$\mathrm{LENIDF} = \log(\mathrm{LEN}) + \mathrm{IDF} \tag{2}$$

By setting a minimum threshold on the LENIDF of an entry, common words can be eliminated, which avoids the adverse effect of introducing common words into the model.
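The LENIDF filter can be sketched in Python as below. The sketch rests on several assumptions that the text does not fix: natural logarithms are used, each document is represented as a set of segmented words, a term that occurs in no document is treated as if it occurred in one (so its IDF is log |D|), and the names `idf`, `lenidf`, and `filter_terms` as well as the toy document set and threshold are hypothetical.

```python
import math
from typing import Iterable, List, Set


def idf(term: str, documents: List[Set[str]]) -> float:
    """Inverse document frequency: log(|D| / number of documents containing term)."""
    containing = sum(1 for doc in documents if term in doc)
    # Assumption: avoid division by zero for terms absent from every document.
    return math.log(len(documents) / max(containing, 1))


def lenidf(term: str, documents: List[Set[str]]) -> float:
    """LENIDF = log(LEN) + IDF, where LEN is the entry length in characters."""
    return math.log(len(term)) + idf(term, documents)


def filter_terms(terms: Iterable[str], documents: List[Set[str]], threshold: float) -> List[str]:
    """Keep only the entries whose LENIDF reaches the minimum threshold."""
    return [t for t in terms if lenidf(t, documents) >= threshold]


# Toy document set: each document is a set of segmented words.
docs = [
    {"社会", "交通"},
    {"社会", "北京大学"},
    {"社会"},
    {"交通", "社会"},
]
kept = filter_terms(["社会", "交通", "北京大学"], docs, threshold=2.0)
```

In this toy data the short word that occurs in every document scores log 2 + 0, while the longer, rarer entry scores log 4 + log 4, so a threshold between the two separates them; a real threshold would have to be tuned on the actual Wikipedia data.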

For step 204, a named entity recognition corpus is constructed. Compared with the general BIO labeling form, the named entity recognition corpus constructed by the invention takes the boundary information of non-entity words into account. The specific process is as follows:

The invention uses the part-of-speech tagged corpus of the People's Daily of 1 month and 6 days of 1998 and, taking the character as the labeling unit, labels three kinds of entities in BIO form: person names, place names, and organization names. Following the BIO labeling form, there are 7 tags: O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, and I-ORG, where PER marks a person name, LOC a place name, ORG an organization name, and O a non-named entity word. The additional prefix B indicates that the current unit is the beginning of an entity, and I indicates that the current unit is the middle or end of an entity. For example (one tag per character of the original Chinese sentence):

Wangchua is the originator of the great northern group
B-PER O O B-ORG I-ORG I-ORG I-ORG I-ORG O O O O

In order to take the boundary information of non-entity words into account, the invention proposes an improved BIO labeling scheme comprising B-O, I-O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, and I-ORG, in which the original tag O of non-entity words is replaced by B-O and I-O: B-O marks the beginning of a non-entity word, and I-O marks the middle or end of a non-entity word. After the improvement:

Wangchong academician is the founder of the northern university group
B-PER I-PER B-O I-O B-O B-ORG I-ORG I-ORG I-ORG I-ORG B-O B-O I-O I-O

It can be seen that boundary information is now labeled not only for entity words but also for non-entity words in the corpus annotation.
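The improved labeling scheme can be illustrated with a short Python sketch that turns a word-segmented sentence into character-level tags. The function name `tag_characters`, the input format (a list of word and entity-type pairs, with `None` for non-entity words), and the example sentence are assumptions made for illustration; they are not taken from the corpus described above.

```python
from typing import List, Optional, Tuple


def tag_characters(words: List[Tuple[str, Optional[str]]]) -> List[Tuple[str, str]]:
    """Label every character with B-/I- plus the word's type.

    Non-entity words (type None) receive B-O / I-O instead of a plain O,
    so their word boundaries are kept in the annotation.
    """
    tagged = []
    for word, etype in words:
        label = etype if etype is not None else "O"
        for i, ch in enumerate(word):
            prefix = "B" if i == 0 else "I"
            tagged.append((ch, f"{prefix}-{label}"))
    return tagged


# Hypothetical segmented sentence: "张三 / 在 / 北京 / 工作".
sentence = [("张三", "PER"), ("在", None), ("北京", "LOC"), ("工作", None)]
tagged = tag_characters(sentence)
tags_only = [tag for _, tag in tagged]
```

Because every word, entity or not, now opens with a B- tag, the tag sequence alone is enough to recover the word segmentation of the sentence.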

It is observed that the Arabic numerals and English letters in the People's Daily corpus are all in full-width form, whereas most Arabic numerals and English letters in current text are in half-width form; they are therefore converted to half-width form to be consistent with current text. In addition, the date information at the beginning of each line of the People's Daily corpus is removed, so that the frequently occurring numbers do not interfere with the recognition effect.
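The width normalization can be sketched in Python using the fixed offset between the Unicode full-width forms and their ASCII counterparts. The function name `to_half_width` is hypothetical, and the choice to convert only digits and Latin letters (leaving full-width punctuation untouched) is an assumption that follows the wording of the text. Removal of the date prefix is not shown because the text does not describe the line format.

```python
def to_half_width(text: str) -> str:
    """Convert full-width digits and Latin letters to their half-width forms."""
    out = []
    for ch in text:
        code = ord(ch)
        is_digit = 0xFF10 <= code <= 0xFF19   # full-width 0-9
        is_upper = 0xFF21 <= code <= 0xFF3A   # full-width A-Z
        is_lower = 0xFF41 <= code <= 0xFF5A   # full-width a-z
        if is_digit or is_upper or is_lower:
            # Full-width forms sit exactly 0xFEE0 above their ASCII counterparts.
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


normalized = to_half_width("１９９８年ＡＢｃ，")
```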

For step 206, constructing the bidirectional long short-term memory network with conditional random field named entity recognition model fused with the named entity word dictionary specifically comprises:

(1) BiLSTM (bidirectional long short-term memory network) layer

The long short-term memory network (LSTM) is a variant of the recurrent neural network (RNN). In theory an RNN can process sequence data of any length, but in practice, if the sequence is too long, vanishing or exploding gradients can occur during optimization. The LSTM addresses this long-term dependency problem of the RNN through a 'gate' mechanism. The LSTM unit structure is shown in FIG. 4. The input gate i, the forget gate f, the cell state c, and the output gate o of the LSTM are defined as follows:

i_i＝σ(W_ih_i-1+U_ie_xi+b_i) (3)

f_i＝σ(W_fh_i-1+U_fe_xi+b_f) (4)

o_i＝σ(W_oh_i-1+U_oe_xi+b_o) (7)

h_i＝o_i⊙tanh(c_i) (8)

wherein sigma is sigmoid function, ⊙ represents point multiplication, W, U is weight matrix, b is bias, W, U, b is used as network parameter to participate in training, h_iIs the state of the hidden layer.

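As can be seen from the LSTM unit structure, the hidden state h_i only captures information from the preceding context, while the following context is also useful. For example, predicting a missing word in a sentence requires not only the preceding text but also the following text. A bidirectional LSTM therefore reads the sequence in both directions and combines the two hidden states:

$$h_i = [\overrightarrow{h_i} ; \overleftarrow{h_i}]$$

where $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ are the hidden states of the forward LSTM and the backward LSTM at time i, respectively, and $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation of $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$.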
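The BiLSTM layer described above can be illustrated with a small pure-Python sketch. It assumes the standard LSTM cell-state update (candidate state through tanh, then a gated sum with the previous cell state), random untrained parameters, and zero initial states; the helper names (`init_params`, `lstm_step`, `run_lstm`, `bilstm`) are hypothetical and the code is meant to show the data flow, not an efficient implementation.

```python
import math
import random


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(input_size, hidden_size, rng):
    """One (W, U, b) triple per gate: input i, forget f, output o, candidate c."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {gate: (mat(hidden_size, hidden_size), mat(hidden_size, input_size), [0.0] * hidden_size)
            for gate in ("i", "f", "o", "c")}


def lstm_step(params, h_prev, c_prev, e_x):
    """One time step: gates from h_{i-1} and the input e_x, then the new c_i and h_i."""
    def pre_activation(gate):
        W, U, b = params[gate]
        return [wh + ue + bias for wh, ue, bias in zip(matvec(W, h_prev), matvec(U, e_x), b)]
    i_gate = [sigmoid(z) for z in pre_activation("i")]
    f_gate = [sigmoid(z) for z in pre_activation("f")]
    o_gate = [sigmoid(z) for z in pre_activation("o")]
    candidate = [math.tanh(z) for z in pre_activation("c")]  # assumed standard form
    c = [f * cp + i * g for f, cp, i, g in zip(f_gate, c_prev, i_gate, candidate)]
    h = [o * math.tanh(ck) for o, ck in zip(o_gate, c)]
    return h, c


def run_lstm(params, inputs, hidden_size):
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    states = []
    for e_x in inputs:
        h, c = lstm_step(params, h, c, e_x)
        states.append(h)
    return states


def bilstm(fwd_params, bwd_params, inputs, hidden_size):
    """Concatenate forward and backward hidden states position by position."""
    forward = run_lstm(fwd_params, inputs, hidden_size)
    backward = run_lstm(bwd_params, inputs[::-1], hidden_size)[::-1]
    return [f + b for f, b in zip(forward, backward)]
```

The backward pass runs on the reversed sequence and its outputs are reversed again, so that position i of the result pairs the state that has seen characters 1..i with the state that has seen characters i..n.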

(2) CRF (conditional random field) layer

In the character-based named entity recognition task, the labels of the characters in a sentence depend strongly on one another: the label of one character largely constrains the label of the next character, so the dependency between labels at adjacent positions should be considered. For example, B-LOC cannot be followed by I-PER, and the label of the first character in a sequence must begin with "B-" rather than "I-". Therefore, to make the labeling result more accurate, a CRF layer is added on top of the BiLSTM layer.

For a given sentence x = {x_1, x_2, ..., x_n}, the score of labeling the sentence with the label sequence y = {y_1, y_2, ..., y_n} is:

$$s(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

The score s(x, y) consists of the label scores P and the transition scores A between labels. A is the transition score matrix, and A_{ij} is the transition score from label i to label j (y_0 and y_{n+1} denote the start and end markers of the sentence, respectively). P is the score matrix output by the BiLSTM layer (of dimension n × k, where k is the number of label types), and P_{i,j} is the score of assigning the j-th label to the i-th character of x = {x_1, x_2, ..., x_n}. The probability of labeling x = {x_1, x_2, ..., x_n} with y = {y_1, y_2, ..., y_n} is then:

$$p(y \mid x) = \frac{\exp(s(x, y))}{\sum_{\tilde{y} \in Y_x} \exp(s(x, \tilde{y}))}$$

where Y_x denotes all possible label sequences for the sentence x = {x_1, x_2, ..., x_n}.
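The score and the probability of a label sequence can be sketched as follows. The sketch assumes the usual linear-chain CRF form (sum of transition scores along the path, including the start and end markers, plus the sum of label scores) and computes the normalizer with the forward algorithm. Representing the start and end markers as two extra indices of the matrix `A` is an assumption of this example, as are the function names.

```python
import math


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(P, A, y, start, end):
    """s(x, y): label scores P[i][y_i] plus transition scores along start -> y -> end."""
    path = [start] + list(y) + [end]
    transition = sum(A[a][b] for a, b in zip(path, path[1:]))
    emission = sum(P[i][tag] for i, tag in enumerate(y))
    return transition + emission


def log_partition(P, A, start, end):
    """log of the sum of exp(s(x, y')) over all label sequences y' (forward algorithm)."""
    k = len(P[0])
    alpha = [A[start][j] + P[0][j] for j in range(k)]
    for i in range(1, len(P)):
        alpha = [logsumexp([alpha[a] + A[a][j] for a in range(k)]) + P[i][j]
                 for j in range(k)]
    return logsumexp([alpha[j] + A[j][end] for j in range(k)])


def log_probability(P, A, y, start, end):
    """log p(y | x) = s(x, y) - log of the normalizer."""
    return sequence_score(P, A, y, start, end) - log_partition(P, A, start, end)
```

The forward algorithm avoids enumerating all k^n label sequences, which is what makes maximizing log p(y | x) practical during training.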

During training, the named entity recognition model is trained by maximizing the log-probability log(p(y | x)). During prediction, the label sequence with the highest score is obtained with the Viterbi algorithm:

$$y^* = \arg\max_{\tilde{y} \in Y_x} s(x, \tilde{y})$$
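A minimal Viterbi decoder for this setting might look as follows. It assumes a label score matrix `P` (one row per character, one column per label) and a transition matrix `A` whose two extra indices `start` and `end` stand for the sentence start and end markers; these conventions and the function name are assumptions of the sketch.

```python
def viterbi_decode(P, A, start, end):
    """Return the highest-scoring label sequence and its score."""
    k = len(P[0])
    # Best score of any path that ends in label j at the first character.
    score = [A[start][j] + P[0][j] for j in range(k)]
    backpointers = []
    for i in range(1, len(P)):
        new_score, pointers = [], []
        for j in range(k):
            best_prev = max(range(k), key=lambda a: score[a] + A[a][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + A[best_prev][j] + P[i][j])
        score = new_score
        backpointers.append(pointers)
    # Close the path with the transition to the end marker.
    last = max(range(k), key=lambda j: score[j] + A[j][end])
    best_score = score[last] + A[last][end]
    # Follow the back-pointers from the last character to the first.
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best_score
```

The decoder keeps only the best predecessor for each label at each position, so its cost grows as n × k² rather than k^n.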

(3) named entity recognition model fused with named entity word dictionary

After the named entity word dictionary is constructed, the dictionary information needs to be converted into features that participate in the training process of the model. Given a sentence x = {x_1, x_2, ..., x_n}, for each character x_i in the sentence, a dictionary feature vector t_i is generated from the named entity word dictionary and the context of the character according to the feature templates in Table 1:

TABLE 1

Type	Template
2-gram	x_{i-1} x_i, x_i x_{i+1}
3-gram	x_{i-2} x_{i-1} x_i, x_i x_{i+1} x_{i+2}
4-gram	x_{i-3} x_{i-2} x_{i-1} x_i, x_i x_{i+1} x_{i+2} x_{i+3}
5-gram	x_{i-4} x_{i-3} x_{i-2} x_{i-1} x_i, x_i x_{i+1} x_{i+2} x_{i+3} x_{i+4}

For each text fragment in Table 1, a binary value indicates whether the fragment is in the named entity word dictionary: 1 if the fragment can be found in the dictionary, and 0 otherwise. The dictionary feature vector t_i thus represents whether the text fragments formed by the character x_i and its surrounding characters are named entity words. FIG. 5 illustrates the process of generating the dictionary feature vector t_i from the feature templates:

If the named entity words corresponding to "committee" or "committee" are in the dictionary, the dictionary feature vector t_i can be expressed as:

0 1 0 1 0
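The feature templates of Table 1 can be turned into code roughly as follows. For each n from 2 to 5 the sketch checks two fragments, the n-gram ending at x_i and the n-gram starting at x_i, giving eight binary values in template order. Treating a fragment that runs past the sentence boundary as 0 is an assumption, and the function name, the sample sentence, and the sample dictionary are illustrative only.

```python
def dictionary_features(sentence, i, dictionary, max_n=5):
    """Binary dictionary features for the character at index i (0-based)."""
    features = []
    for n in range(2, max_n + 1):
        left_start = i - n + 1
        # n-gram ending at position i; empty if it would start before the sentence.
        ending_here = sentence[left_start:i + 1] if left_start >= 0 else ""
        # n-gram starting at position i; empty if it would run past the sentence.
        starting_here = sentence[i:i + n] if i + n <= len(sentence) else ""
        features.append(1 if ending_here in dictionary else 0)
        features.append(1 if starting_here in dictionary else 0)
    return features


sentence = "他在北京大学读书"
dictionary = {"北京", "北京大学"}
print(dictionary_features(sentence, 2, dictionary))  # features for the character "北"
```

Storing the dictionary as a set keeps each of the eight lookups constant-time, so feature generation is linear in the sentence length.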

For each character x_i in the sentence x = {x_1, x_2, ..., x_n}, both the embedding vector e_{x_i} of the character and the dictionary feature vector t_i generated from the named entity word dictionary are obtained. The original BiLSTM-CRF named entity recognition model takes only the embedding vector e_{x_i} of each character as the input of the network. Since the feature vectors based on the named entity word dictionary can provide valuable information to the named entity recognition model from a different aspect, the invention, as shown in FIG. 6, adds an independent BiLSTM network to the original BiLSTM-CRF named entity recognition model, integrates the outputs of the two BiLSTM networks, and obtains the final output of the network through the CRF.

The specific structure of the network is shown in FIG. 7. For the sentence x = {x_1, x_2, ..., x_n}, the hidden states of the two BiLSTM networks at time i, denoted here h_i^{e} and h_i^{t}, are respectively:

$$h_i^{e} = \mathrm{BiLSTM}_e(e_{x_1}, \ldots, e_{x_n})_i, \qquad h_i^{t} = \mathrm{BiLSTM}_t(t_1, \ldots, t_n)_i$$

where e_{x_i} is the embedding vector of the character x_i and t_i is the dictionary feature vector generated from the named entity word dictionary. The hidden states of the two networks are concatenated as the input to the CRF layer:

$$h_i = [h_i^{e} ; h_i^{t}]$$

The other parts of the named entity recognition model are consistent with the original BiLSTM-CRF model.
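The fusion step can be sketched as a per-character concatenation of the two hidden-state sequences. The linear projection from the fused state to one score per label is an assumption added here so that the result has the shape the CRF layer expects; the text only states that the concatenated states are fed to the CRF layer. All names are hypothetical.

```python
def fuse_hidden_states(embedding_states, dictionary_states):
    """Concatenate, per character, the hidden states of the two BiLSTM networks."""
    if len(embedding_states) != len(dictionary_states):
        raise ValueError("both networks must cover the same characters")
    return [h_e + h_t for h_e, h_t in zip(embedding_states, dictionary_states)]


def emission_scores(fused_states, weights, bias):
    """ASSUMED linear layer: one score per label for every fused state."""
    return [
        [sum(w * h for w, h in zip(row, state)) + b for row, b in zip(weights, bias)]
        for state in fused_states
    ]


# Two characters; embedding network with 2 hidden units, dictionary network with 1.
fused = fuse_hidden_states([[1.0, 2.0], [3.0, 4.0]], [[0.5], [1.5]])
scores = emission_scores(fused, [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], [0.0, 0.5])
print(fused, scores)
```

Keeping the two networks separate until this point lets the character embeddings and the dictionary features be encoded independently before the CRF layer combines them.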

The technical scheme of the invention adopts a labeling system of B-O, I-O, B-PER, I-PER, B-LOC, I-LOC, B-ORG and I-ORG, which considers the boundary information of both entity words and non-entity words and thereby improves the learning capacity of the named entity recognition model. On the basis of the BiLSTM-CRF named entity recognition method, it also fuses a general-knowledge dictionary generated from the Wikipedia knowledge base, which greatly improves the recognition performance of the model on unknown words.

In an embodiment of the second aspect of the present invention, a system for identifying a named entity is proposed. FIG. 8 shows a schematic block diagram of a system 80 for identifying a named entity according to an embodiment of the present invention. The system includes:

a memory 802 for storing a computer program;

a processor 804 for executing a computer program to:

constructing a named entity word dictionary and constructing a named entity recognition corpus; constructing a named entity recognition model according to the named entity word dictionary and the named entity recognition corpus; and acquiring the linguistic data to be recognized, and inputting the linguistic data to be recognized into the named entity recognition model to obtain a recognition result of the named entity of the linguistic data to be recognized.

The named entity recognition system 80 provided by the invention first constructs a named entity word dictionary, then constructs a named entity recognition corpus, and then constructs a named entity recognition model fused with the named entity word dictionary, thereby providing a named entity recognition system that automatically recognizes and extracts person names, place names and organization names in text. The named entity recognition model may be a bidirectional long short-term memory network with conditional random field (BiLSTM-CRF) model fused with the dictionary.

Optionally, the processor 804 inputs the corpus to be recognized into the named entity recognition model to obtain a recognition result of the named entity of the corpus to be recognized, which specifically includes: calculating the probability of the identification label of each character in the corpus to be identified by using the named entity identification model, and taking the identification label with the maximum probability as the target identification label of the character; and identifying the named entity of the corpus to be identified according to the target identification label of the character.

In the embodiment, a named entity recognition model is used for obtaining a plurality of recognition tags of each character in the corpus to be recognized and the probability of each recognition tag, the recognition tag with the maximum probability is used as a target recognition tag of the character, and further, the named entity of the corpus to be recognized is recognized according to the target recognition tag of the character, so that the named entity of the corpus to be recognized is accurately recognized.
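Turning per-character label probabilities into named entities can be sketched as below. The input format (one dictionary of label probabilities per character), the function name, and the lenient handling of an I- label that does not continue a span of the same type (it simply opens a new span) are assumptions of this example.

```python
def extract_entities(chars, tag_probabilities):
    """Pick the most probable label per character, then group B-/I- runs into entities."""
    tags = [max(probs, key=probs.get) for probs in tag_probabilities]
    entities, current_chars, current_type = [], [], None
    for char, tag in zip(chars, tags):
        prefix, entity_type = tag.split("-", 1)
        continues = prefix == "I" and entity_type == current_type
        if not continues:
            # Close the previous span; spans of type O are non-entity words.
            if current_type not in (None, "O"):
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [], entity_type
        current_chars.append(char)
    if current_type not in (None, "O"):
        entities.append(("".join(current_chars), current_type))
    return entities


probabilities = [
    {"B-PER": 0.9, "B-O": 0.1},
    {"I-PER": 0.8, "I-O": 0.2},
    {"B-O": 0.7, "B-LOC": 0.3},
    {"B-LOC": 0.6, "B-O": 0.4},
    {"I-LOC": 0.9, "I-O": 0.1},
]
print(extract_entities("张三在北京", probabilities))
```

With the improved labeling system, the B-O and I-O spans are grouped in the same way as entity spans and are only discarded at the end, so word boundaries of non-entity text remain available if needed.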

Optionally, the processor 804 constructs a named entity word dictionary, which specifically includes: acquiring entries in a preset database; performing font conversion on the entries, and removing duplicates from the entries after font conversion; calculating the sum of the inverse document frequency value of each entry and the length value of the entry, comparing the sum with a preset threshold, and filtering out the entries whose sum is smaller than the preset threshold; and constructing the named entity word dictionary from the remaining entries.
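The dictionary construction steps can be sketched as follows. Several details are assumptions: the inverse document frequency is computed as log(N / (1 + df)) over a reference document collection, the font conversion is passed in as a callable `convert` whose identity default is only a placeholder, and the function name and sample data are illustrative.

```python
import math


def build_entity_dictionary(entries, documents, threshold, convert=lambda s: s):
    """Convert, de-duplicate, and filter candidate entries by IDF plus length.

    `convert` stands in for the font conversion step (placeholder by default).
    """
    # dict.fromkeys removes duplicates while keeping the first-seen order.
    unique_entries = list(dict.fromkeys(convert(entry) for entry in entries))
    n_docs = len(documents)
    kept = set()
    for entry in unique_entries:
        doc_freq = sum(1 for doc in documents if entry in doc)
        idf = math.log(n_docs / (1 + doc_freq))  # ASSUMED IDF definition
        if idf + len(entry) >= threshold:
            kept.add(entry)
    return kept


documents = ["我们在北京", "我们去上海", "我们的北京大学"]
entries = ["我们", "北京大学", "北京大学", "上海"]
print(build_entity_dictionary(entries, documents, threshold=2.0))
```

In this toy data the common word that appears in every document gets a low score and is filtered out, while the rarer and longer entries are kept.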

Optionally, the processor 804 constructs the named entity identification corpus, which specifically includes: obtaining sample corpora; and marking the named entity words and the non-named entity words in the sample corpus by taking the characters as marking units according to the marking modes of the beginning, the middle or the end, adding the boundary information of the non-named entity words into the marking of the sample corpus, and further constructing the named entity recognition corpus.

Optionally, the processor 804 constructs the named entity recognition model according to the named entity word dictionary and the named entity recognition corpus, which specifically includes: acquiring embedding vectors of the characters in the named entity recognition corpus, and generating dictionary feature vectors of the characters according to the named entity word dictionary; training a first bidirectional long short-term memory network model with the embedding vectors, training a second bidirectional long short-term memory network model with the dictionary feature vectors, and training a conditional random field model with the output of the first bidirectional long short-term memory network model and the output of the second bidirectional long short-term memory network model; and taking the trained first bidirectional long short-term memory network model, the trained second bidirectional long short-term memory network model and the trained conditional random field model as the named entity recognition model.

An embodiment of the third aspect of the present invention proposes a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for identifying a named entity according to any of the embodiments described above.

The present invention provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the steps of the method for identifying a named entity according to any one of the above embodiments, and the storage medium therefore has all the advantages of the method for identifying a named entity according to any one of the above embodiments.

In the description herein, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance unless explicitly stated or limited otherwise; the terms "connected," "mounted," "secured," and the like are to be construed broadly and include, for example, fixed connections, removable connections, or integral connections; may be directly connected or indirectly connected through an intermediate. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.

In the description herein, the description of the terms "one embodiment," "some embodiments," "specific embodiments," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for identifying a named entity, comprising:

constructing a named entity word dictionary and constructing a named entity recognition corpus;

constructing a named entity recognition model according to the named entity word dictionary and the named entity recognition corpus;

and acquiring a corpus to be recognized, and inputting the corpus to be recognized into the named entity recognition model to obtain a recognition result of the named entity of the corpus to be recognized.

2. The method according to claim 1, wherein the step of inputting the corpus to be recognized into the named entity recognition model to obtain a recognition result of the named entity of the corpus to be recognized specifically comprises:

calculating the probability of the identification label of each character in the corpus to be identified by using the named entity identification model, and taking the identification label with the maximum probability as the target identification label of the character;

and identifying the named entity of the corpus to be identified according to the target identification label of the character.

3. The method for identifying named entities according to claim 1 or 2, wherein the constructing a dictionary of named entity words specifically comprises:

acquiring entries in a preset database;

performing font conversion on the entries, and performing duplication removal on the entries after the font conversion;

calculating the sum of the inverse document frequency value of the entry and the entry length value of the entry, comparing the sum with a preset threshold value, and filtering out the entry whose sum is smaller than the preset threshold value;

and constructing the named entity word dictionary by using the filtered entry.

4. The method according to claim 1 or 2, wherein the constructing of the named entity recognition corpus specifically includes:

obtaining sample corpora;

and marking named entity words and non-named entity words in the sample corpus by taking characters as marking units according to the marking modes of the beginning, the middle or the end, adding the boundary information of the non-named entity words into the marking of the sample corpus, and further constructing the named entity recognition corpus.

5. The method according to claim 1 or 2, wherein the constructing the named entity recognition model according to the named entity word dictionary and the named entity recognition corpus specifically comprises:

acquiring an embedded vector of characters in the named entity recognition corpus, and generating dictionary feature vectors of the characters according to the named entity dictionary;

training a first bidirectional long short-term memory network model by using the embedded vector, training a second bidirectional long short-term memory network model by using the dictionary feature vector, and training a conditional random field model by using the output of the first bidirectional long short-term memory network model and the output of the second bidirectional long short-term memory network model;

and taking the trained first bidirectional long short-term memory network model, the trained second bidirectional long short-term memory network model and the trained conditional random field model as the named entity recognition model.

6. A system for identifying named entities, comprising:

a memory for storing a computer program;

a processor for executing the computer program to:

7. The system for identifying a named entity according to claim 6, wherein the processor inputs the corpus to be identified to the named entity identification model to obtain an identification result of the named entity of the corpus to be identified, and specifically includes:

8. The system for identifying named entities of claim 6 or 7, wherein the processor constructs a dictionary of named entity words, comprising in particular:

acquiring entries in a preset database;

and constructing the named entity word dictionary by using the filtered entry.

9. The system for identifying a named entity according to claim 6 or 7, wherein the processor constructs the named entity identification corpus specifically including:

obtaining sample corpora;

10. The system according to claim 6 or 7, wherein the processor constructs the named entity recognition model based on the named entity word dictionary and the named entity recognition corpus, and specifically comprises:

11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for identifying a named entity according to one of claims 1 to 5.