CN111079418B - Named entity recognition method, device, electronic equipment and storage medium

Info

Publication number: CN111079418B
Application number: CN201911078307.2A
Authority: CN
Inventors: 尹坤; 刘权; 陈志刚; 王智国; 胡国平
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2019-11-06
Filing date: 2019-11-06
Publication date: 2023-12-05
Anticipated expiration: 2039-11-06
Also published as: CN111079418A

Abstract

The embodiment of the invention provides a named entity recognition method, a named entity recognition device, electronic equipment and a storage medium. The method comprises the following steps: determining a text to be recognized; determining a dictionary feature vector of each word in the text to be recognized based on a domain dictionary of the domain corresponding to the text to be recognized; and inputting the word vector and the dictionary feature vector of each word in the text to be recognized into a named entity recognition model to obtain a named entity recognition result output by the named entity recognition model, wherein the named entity recognition model is trained based on the word vector and dictionary feature vector of each sample word in a sample text and the named entity label of each sample word. The method, the device, the electronic equipment and the storage medium provided by the embodiment of the invention solve the problem of low recognition accuracy caused by entry conflicts and improve the accuracy of named entity recognition.

Description

Named entity recognition method, device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of natural language processing technologies, and in particular to a named entity recognition method, device, electronic equipment, and storage medium.

Background

Named Entity Recognition (NER) is an important step in natural language processing and is widely used in tasks such as information extraction, information retrieval, information recommendation, and machine translation. A "named entity" is a proper noun with a specific meaning in natural language, such as the name of a person, a place, an organization, or a song.

In the prior art, named entities are generally recognized by matching against a domain dictionary: the text to be recognized is matched with the domain dictionary of the corresponding domain to determine the named entities in the text. However, entries in the domain dictionary may conflict with one another, which lowers the accuracy of named entity recognition.

Disclosure of Invention

The embodiment of the invention provides a named entity recognition method, a named entity recognition device, electronic equipment and a storage medium, which are used for solving the problem of low named entity recognition accuracy caused by entry conflicts in a domain dictionary.

In a first aspect, an embodiment of the present invention provides a named entity recognition method, including:

determining a text to be recognized;

determining a dictionary feature vector of each word in the text to be recognized based on a domain dictionary of the domain corresponding to the text to be recognized;

inputting the word vector and the dictionary feature vector of each word in the text to be recognized into a named entity recognition model to obtain a named entity recognition result output by the named entity recognition model;

wherein the named entity recognition model is trained based on the word vector and dictionary feature vector of each sample word in a sample text and the named entity label of each sample word.

Preferably, the named entity recognition model comprises an input encoding layer, a dictionary feature selection layer and a label prediction layer;

correspondingly, the step of inputting the word vector and the dictionary feature vector of each word in the text to be recognized into a named entity recognition model to obtain a named entity recognition result output by the named entity recognition model specifically comprises the following steps:

inputting the word vector and the dictionary feature vector of each word in the text to be recognized into the input encoding layer to obtain the hidden layer vector of each word output by the input encoding layer;

inputting the hidden layer vector and the dictionary feature vector of each word into the dictionary feature selection layer to obtain the attention feature vector of each word output by the dictionary feature selection layer;

and inputting the attention feature vector of each word into the label prediction layer to obtain the named entity recognition result output by the label prediction layer.

Preferably, the inputting the hidden layer vector and the dictionary feature vector of each word into the dictionary feature selection layer to obtain the attention feature vector of each word output by the dictionary feature selection layer specifically includes:

determining the weight of any word relative to each dictionary feature based on the hidden layer vector and the dictionary feature vector of that word;

and weighting the dictionary feature vector of that word based on the weight of the word relative to each dictionary feature to obtain the attention feature vector of the word.

Preferably, before the inputting the word vector and the dictionary feature vector of each word in the text to be recognized into a named entity recognition model to obtain a named entity recognition result output by the named entity recognition model, the method further includes:

training an initial model based on a loss function to obtain the named entity recognition model;

wherein the loss function includes a recognition result loss function that corresponds to the named entity recognition result and a weight loss function that corresponds to the weight of a word relative to each dictionary feature.

Preferably, the domain dictionary comprises dictionaries corresponding to different named entity types;

correspondingly, the determining the dictionary feature vector of each word in the text to be recognized based on the domain dictionary of the domain corresponding to the text to be recognized specifically comprises the following steps:

determining the dictionary feature of each word in the text to be recognized corresponding to any named entity type based on the dictionary corresponding to that named entity type;

and determining the dictionary feature vector of any word based on the dictionary features of the word corresponding to each named entity type.

Preferably, the determining the dictionary feature vector of any word based on the dictionary features of the word corresponding to each named entity type specifically includes:

vectorizing the dictionary features of the word corresponding to each named entity type to obtain a feature vector of the word;

and sparsifying the feature vector of the word to obtain the dictionary feature vector of the word.

In a second aspect, an embodiment of the present invention provides a named entity recognition device, including:

a text determining unit for determining a text to be recognized;

a dictionary matching unit for determining a dictionary feature vector of each word in the text to be recognized based on a domain dictionary of the domain corresponding to the text to be recognized;

and a named entity recognition unit for inputting the word vector and the dictionary feature vector of each word in the text to be recognized into a named entity recognition model to obtain a named entity recognition result output by the named entity recognition model.

In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a bus, where the processor, the communication interface, and the memory are in communication with each other through the bus, and the processor may invoke logic instructions in the memory to perform the steps of the method as provided in the first aspect.

In a fourth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as provided by the first aspect.

According to the named entity recognition method, the named entity recognition device, the electronic equipment and the storage medium provided by the embodiment of the invention, the named entity recognition result is output through the named entity recognition model, so that the influence of noise entries on named entity recognition can be weakened, the problem of low recognition accuracy caused by entry conflicts in named entity recognition methods based on a domain dictionary is solved, and the accuracy of named entity recognition is improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a named entity recognition method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an operation flow of a named entity recognition model according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a method for calculating attention feature vectors according to an embodiment of the present invention;

FIG. 4 is a flowchart illustrating a method for determining a dictionary feature vector according to an embodiment of the present invention;

FIG. 5 is a dictionary feature generation diagram provided by an embodiment of the present invention;

FIG. 6 is a schematic diagram of sparse dictionary feature vectors according to an embodiment of the present invention;

FIG. 7 is a flowchart of a training method of a named entity recognition model according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a named entity recognition model according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of a named entity recognition device according to an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

A voice assistant is a smartphone application that helps the user perform operations through intelligent dialogue and real-time question answering. Semantic understanding is one of its key components and enables the machine to understand the user's query. Semantic understanding in a specialized domain generally involves two tasks: intent recognition and NER. In industry, early domain-specific NER systems were widely applied, but they required a great deal of manpower and material resources to write rules and engineer features. In recent years, with the rise of deep learning, its application to NER tasks has also made great progress. However, general NER methods require a large amount of supervised training data, which is expensive to obtain and again requires considerable manpower and material resources.

A major difficulty of domain-specific NER is that training data is sparse and hard to obtain. To address this, extensive research has been carried out on how to use external knowledge. These methods can be broadly divided into two categories:

Language-model-based methods: the NER model is enhanced using a character-level or word-level language model. Because domain-specific corpora appear rarely in general scenarios, a language model trained on a general corpus offers limited help.

Domain-dictionary-based methods: for example, Lattice LSTM, which makes clever use of dictionary features, or unsupervised NER models based on a domain dictionary. However, these methods ignore conflicts within the domain dictionary and do not consider the influence of noise entries on the model.

In this regard, the embodiment of the invention provides a named entity recognition method, which can be used in the field of television broadcasting to determine names of actors or films to be watched by a user according to voice data of the user, so as to play corresponding videos for the user, and can also be applied in the fields of intelligent home control, information retrieval of voice assistants, information recommendation and the like.

Fig. 1 is a flow chart of a named entity recognition method according to an embodiment of the present invention, as shown in fig. 1, the method includes:

Step 110, determining a text to be recognized.

The text to be recognized is the text on which named entity recognition is to be performed. It may be text directly input by the user, text obtained by performing speech recognition on voice data input by the user, or text obtained by applying a text recognition technology such as OCR (Optical Character Recognition) to an image input by the user, which is not specifically limited in the embodiment of the present invention.

Step 120, determining a dictionary feature vector of each word in the text to be recognized based on the domain dictionary of the corresponding domain of the text to be recognized.

Specifically, the domain dictionary corresponds to a domain of text to be recognized, for example, in a television on-demand scene, the text to be recognized is used for video on demand, the domain dictionary corresponds to a video domain, and the entries contained in the domain dictionary may include entries of actor names, entries of movie types, and the like. The domain dictionary is pre-constructed based on the vocabulary entries of the corresponding domain, for example, the vocabulary entries of the video domain can be obtained from the video website through a crawler, so as to construct the domain dictionary.

After the text to be recognized is determined, it can be matched against the domain dictionary of the corresponding domain to obtain the dictionary feature vector of each word in the text. Here, for any word, the dictionary feature vector is the vector corresponding to the dictionary features of the word. A dictionary feature may indicate whether the word hits an entry in the domain dictionary; when the word does hit an entry, the dictionary feature may also indicate the position of the word within the hit entry or the type of the hit entry, which is not specifically limited in the embodiment of the present invention.

Step 130, inputting the word vector and the dictionary feature vector of each word in the text to be recognized into a named entity recognition model to obtain a named entity recognition result output by the named entity recognition model;

wherein the named entity recognition model is trained based on the word vector and dictionary feature vector of each sample word in a sample text and the named entity label of each sample word.

Specifically, when matching is performed based on the domain dictionary, a word may hit no entry, exactly one entry, or several entries. When several entries are hit, the word may actually correspond to only one of them, that is, only one of the hit entries is the correct one. For example, if the text to be recognized is "I want to watch the movie Infernal Affairs" (where the movie title is 无间道), both the entry "无间道" and the entry "道" in the domain dictionary are hit. The word "道" therefore hits two entries, of which "无间道" is the correctly hit entry and "道" is a noise entry. This problem is typically caused by entry conflicts within the domain dictionary.
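The entry conflict described above can be illustrated with a minimal sketch. The function name `find_entry_hits`, the toy dictionary, and the Chinese strings (an assumed reconstruction of the translated example, in which each "word" is a single character) are illustrative assumptions and are not taken from the patent.

```python
def find_entry_hits(text, entries):
    """For each position in text, list the dictionary entries that cover it."""
    hits = [[] for _ in text]
    for entry in entries:
        start = text.find(entry)
        while start != -1:
            for pos in range(start, start + len(entry)):
                hits[pos].append(entry)
            start = text.find(entry, start + 1)
    return hits


# Hypothetical toy movie-name dictionary containing a long entry and a
# single-character entry that overlaps with it.
text = "我想看电影无间道"
entries = ["无间道", "道"]

hits = find_entry_hits(text, entries)

# Positions covered by more than one entry are where conflicts arise.
conflicts = {text[i]: h for i, h in enumerate(hits) if len(h) > 1}
print(conflicts)
```

Plain dictionary matching cannot tell which of the conflicting entries is correct; that decision is what the recognition model is meant to learn.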

To solve this problem, the embodiment of the invention adopts a named entity recognition model, and the named entity recognition result is determined and output by this model.

When a word hits a plurality of entries, the word has a plurality of dictionary features, and its dictionary feature vector contains all of them. Based on the semantics of the word, the named entity recognition model can determine the correlation between each dictionary feature and the word semantics, and then select, from the plurality of dictionary features, the one closest to the word semantics, which improves the precision of named entity recognition.

The named entity recognition result obtained in this way indicates the named entities in the text to be recognized and may also indicate the type of each named entity, which is not specifically limited in the embodiment of the invention.

Before step 130 is performed, the named entity recognition model may be trained in advance, specifically in the following manner. First, a large number of sample texts are collected, and the dictionary feature vector of each sample word in the sample texts is determined based on the domain dictionary of the domain corresponding to the sample texts. In addition, the named entity label of each sample word is determined by manual annotation; the label may indicate whether the sample word belongs to a named entity, the position of the sample word within the named entity, or the type of the named entity to which the sample word belongs. The initial model is then trained based on the word vector, the dictionary feature vector and the named entity label of each sample word in the sample texts to obtain the named entity recognition model.

According to the method provided by the embodiment of the invention, the named entity recognition result is output through the named entity recognition model, so that the influence of noise entries on named entity recognition can be weakened, the problem of low recognition accuracy caused by entry conflicts in named entity recognition methods based on a domain dictionary is solved, and the accuracy of named entity recognition is improved.

Based on the above embodiment, in the method, the named entity recognition model determines, based on an attention mechanism, the attention feature vector corresponding to the dictionary feature vector of each word in the text to be recognized, and outputs the named entity recognition result based on the attention feature vector of each word.

Specifically, when a word hits a plurality of entries, the word has a plurality of dictionary features, and its dictionary feature vector contains all of them. In the named entity recognition model, the attention mechanism determines, based on the semantics of the word, the correlation between each dictionary feature and the word semantics, adjusts the weights of the different dictionary features in the dictionary feature vector accordingly, and computes the attention feature vector with the adjusted weights. When a word hits a plurality of entries, the attention feature vector weights the dictionary features unequally, so that the dictionary features closer to the word semantics are emphasized and the precision of named entity recognition is improved.

Based on any of the above embodiments, in the method, the named entity recognition model includes an input encoding layer, a dictionary feature selection layer, and a label prediction layer. Fig. 2 is a schematic operation flow diagram of a named entity recognition model according to an embodiment of the present invention. As shown in fig. 2, step 130 specifically includes:

Step 131, inputting the word vector and the dictionary feature vector of each word in the text to be recognized into the input encoding layer to obtain the hidden layer vector of each word output by the input encoding layer.

Specifically, the input encoding layer analyzes the semantics of each word in the text to be recognized based on its word vector and dictionary feature vector, and outputs the hidden layer vector of each word. Here, the hidden layer vector is obtained by encoding the word vector and the dictionary feature vector of each word in the text to be recognized. The input encoding layer may be a bidirectional long short-term memory network (Bi-LSTM).

Step 132, inputting the hidden layer vector and the dictionary feature vector of each word into the dictionary feature selection layer to obtain the attention feature vector of each word output by the dictionary feature selection layer.

Specifically, the dictionary feature selection layer uses an attention mechanism to calculate, for any input word, the correlation between the hidden layer vector and each dictionary feature in the dictionary feature vector, adjusts the weights of the different dictionary features accordingly, and outputs the adjusted attention feature vector.

Step 133, inputting the attention feature vector of each word into the label prediction layer to obtain the named entity recognition result output by the label prediction layer.

Specifically, the label prediction layer analyzes, based on the input attention feature vector of each word, the probability that each word corresponds to each dictionary feature, determines the dictionary feature that each word actually corresponds to, and outputs the named entity recognition result. Here, the label prediction layer may have a structure consisting of a bidirectional long short-term memory network (Bi-LSTM) followed by a conditional random field (CRF).

The method provided by the embodiment of the invention combines the hidden layer vector and the dictionary feature vector to calculate the attention feature vector, which helps weaken the influence of noise entries on named entity recognition and improves the accuracy of named entity recognition.

Based on any of the above embodiments, fig. 3 is a flowchart of a method for calculating an attention feature vector according to an embodiment of the present invention. As shown in fig. 3, step 132 specifically includes:

Step 1321, determining the weight of any word relative to each dictionary feature based on the hidden layer vector and the dictionary feature vector of the word.

Specifically, the dictionary feature vector includes vectors corresponding to a plurality of dictionary features. Based on the hidden layer vector and the vector corresponding to any dictionary feature in the dictionary feature vector, the attention mechanism yields the correlation between the hidden layer vector and that dictionary feature. Once the correlation between the hidden layer vector and the vector of every dictionary feature has been obtained, the weight of each dictionary feature can be determined. Here, the higher the correlation, the larger the corresponding weight, and the lower the correlation, the smaller the corresponding weight.

Step 1322, weighting the dictionary feature vector of the word based on the weight of the word relative to each dictionary feature to obtain the attention feature vector of the word.

Specifically, after the weight of the word relative to each dictionary feature is obtained, the vector corresponding to each dictionary feature in the dictionary feature vector of the word is multiplied by its weight, and the weighted sum of these vectors is used as the attention feature vector of the word.

Based on any one of the above embodiments, in the method, step 1321 specifically includes:

the weight of any word relative to each dictionary feature is determined based on the following formula:

$$\mathrm{weight}_i = \delta\left(W h_i \cdot f'_i\right)$$

where $\mathrm{weight}_i$ is the weight of the i-th word in the text to be recognized relative to each dictionary feature, and i is a positive integer; $h_i$ and $f'_i$ are respectively the hidden layer vector and the dictionary feature vector of the i-th word; $W$ is used to map $h_i$ to the same dimension as $f'_i$; and $\delta(\cdot)$ is the Dirac function.

From the above formula, for any word, the correlation between the semantics of the word and each dictionary feature is obtained by multiplying the hidden layer vector of the word by the dictionary feature vector, and the weight of the word relative to each dictionary feature is obtained by applying the Dirac function to that correlation. Here, the weights of the word relative to all dictionary features sum to 1.

Step 1322 specifically includes:

the attention feature vector of the i-th word is calculated based on the following formula:

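$$f_i = \sum_{j=1}^{M} \mathrm{weight}_{i,j}\, f'_{i,j}$$

where $f_i$ is the attention feature vector of the i-th word in the text to be recognized, $M$ is the preset total number of dictionary features, and $\mathrm{weight}_{i,j}$ and $f'_{i,j}$ denote the weight of the i-th word relative to the j-th dictionary feature and the vector of that dictionary feature, respectively.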

As can be seen from the above equation, for any word, the attention feature vector of the word is obtained by multiplying the weight of the word relative to each dictionary feature by the vector of the corresponding dictionary feature in the dictionary feature vector of the word, and the attention feature vector is the result of weighted summation of the vectors of each dictionary feature according to the weight.
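Steps 1321 and 1322 can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the patented implementation: softmax is assumed as the normalising function (the text only states that the weights sum to 1), and the names `softmax`, `attention_feature`, and all toy values are hypothetical.

```python
import math


def softmax(scores):
    """Normalise scores so that they are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_feature(h, W, feats):
    """h: hidden layer vector; W: matrix mapping h to the feature dimension;
    feats: one vector per dictionary feature (M vectors in total)."""
    # Map the hidden layer vector into the dictionary feature space (W h).
    projected = [sum(w * x for w, x in zip(row, h)) for row in W]
    # Correlation of the projected vector with each dictionary feature vector.
    scores = [sum(p * f for p, f in zip(projected, feat)) for feat in feats]
    # Assumed normalisation: softmax, so the weights sum to 1.
    weights = softmax(scores)
    # Weighted sum of the dictionary feature vectors.
    dim = len(feats[0])
    attended = [
        sum(weights[j] * feats[j][k] for j in range(len(feats)))
        for k in range(dim)
    ]
    return weights, attended


# Toy example: a 3-dimensional hidden vector and two 2-dimensional features.
h = [1.0, 0.0, 2.0]
W = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
feats = [[1.0, 0.0], [0.0, 1.0]]

weights, attended = attention_feature(h, W, feats)
print(weights, attended)
```

In this toy setup the second dictionary feature correlates more strongly with the projected hidden vector, so it receives the larger weight and dominates the attention feature vector.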

Based on any of the above embodiments, the method further includes, before step 130: training an initial model based on a loss function to obtain the named entity recognition model, wherein the loss function includes a recognition result loss function corresponding to the named entity recognition result and a weight loss function corresponding to the weight of a word relative to each dictionary feature.

Specifically, during training of the named entity recognition model, the dictionary features in the dictionary feature vector of a sample word are correlated with the named entity label of that sample word. To make the weights of the dictionary features learn in the direction indicated by the named entity labels, this correlation can be used to supervise the learning of the dictionary feature weights in the model, which defines the weight loss function. In the embodiment of the invention, the sum of the recognition result loss function and the weight loss function can be used as the loss function for model training.

Further, a weight loss function is defined as follows:

In the above formula, $M$ is the number of dictionary features characterized by the dictionary feature vector. $y_i$ indicates whether each dictionary feature represented by the dictionary feature vector of the i-th sample word is the same as the named entity label of that word: if they are the same, the position of the corresponding dictionary feature in $y_i$ is set to 1; if they are different, the position is set to 0. For example, in the sample text "I want to watch Infernal Affairs starring Liu Dehua", the dictionary features represented by the dictionary feature vector of the sample word "De" are "I_artist" and "E_artist", and the named entity label of "De" is "I_artist". The dictionary feature I_artist is the same as the label, so the corresponding value of y is 1; the other dictionary feature E_artist differs from the label, so the corresponding value of y is 0. $\mathrm{weight}_i$ is the weight of the i-th word relative to each dictionary feature. Therefore, in the model training stage, for the loss function $\mathrm{loss}_{weight}$ to decrease, whenever the value in $y_i$ corresponding to the current dictionary feature is 1, that is, whenever the current dictionary feature is the same as the named entity label, the corresponding weight in $\mathrm{weight}_i$ should increase.

Assume that the recognition result loss function is:

where $N$ is the number of words in the sample text, and $p(y_i \mid s_i)$ represents the probability that the model's prediction $y_i$ for the i-th sample word is the same as the named entity label $s_i$. The loss function for training the initial model is thus obtained as:

wherein $\alpha$ is a preset parameter for regulating $\mathrm{loss}_{weight}$; preferably, $\alpha$ is 0.25.
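The way the two loss terms could be combined can be sketched as follows. The exact formulas are not reproduced in the text above, so this sketch assumes a cross-entropy form for the weight loss and a negative log-likelihood for the recognition result loss; only the weighted-sum structure and the value α = 0.25 come from the description. The function names `weight_loss`, `recognition_loss`, and `total_loss` are hypothetical.

```python
import math


def weight_loss(weights, targets, eps=1e-12):
    """Assumed cross-entropy between attention weights and 0/1 targets.

    weights[i][j]: weight of word i relative to dictionary feature j.
    targets[i][j]: 1 if feature j matches the named entity label of word i.
    """
    return -sum(
        t * math.log(w + eps)
        for ws, ts in zip(weights, targets)
        for w, t in zip(ws, ts)
    )


def recognition_loss(gold_probs):
    """Assumed negative log-likelihood of the gold label of each word."""
    return -sum(math.log(p) for p in gold_probs)


def total_loss(gold_probs, weights, targets, alpha=0.25):
    """Recognition loss plus alpha times the weight loss."""
    return recognition_loss(gold_probs) + alpha * weight_loss(weights, targets)


# One word with features [I_artist, E_artist] whose label is I_artist.
targets = [[1, 0]]
good = weight_loss([[0.9, 0.1]], targets)  # weight concentrated on the match
bad = weight_loss([[0.1, 0.9]], targets)   # weight concentrated on the noise
total = total_loss([0.8], [[0.9, 0.1]], targets)
print(good, bad, total)
```

Under this assumed form, putting more weight on the dictionary feature that matches the label lowers the weight loss, which is the behaviour the description requires.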

Existing named entity recognition methods based on a domain dictionary consider only whether an entry is hit and ignore the type of the entry in the domain dictionary. For example, in the music vertical domain, "Liu Dehua" and "Forgetting Water" are both domain-specific terms, but "Liu Dehua" is a singer name while "Forgetting Water" is a song title. The two entries are of different types, and entries of different types clearly bring different information to the model. Based on any of the above embodiments, in the method, the domain dictionary includes dictionaries corresponding to different named entity types.

Here, in the domain dictionary, entries of different named entity types are stored in the dictionary of the corresponding type. For example, in the field of television on demand, the domain dictionary may be divided by named entity type into an actor dictionary, a movie name dictionary, a movie type dictionary, and so on, where the actor dictionary contains entries such as Liu Dehua, Liang Chaowei and Zeng Zhiwei, the movie name dictionary contains entries such as "Infernal Affairs" (无间道), and the movie type dictionary contains entries such as "crime" and "police".

Fig. 4 is a flowchart of a method for determining a dictionary feature vector according to an embodiment of the present invention, as shown in fig. 4, step 120 specifically includes:

Step 121, determining the dictionary feature of each word in the text to be recognized corresponding to any named entity type based on the dictionary corresponding to that named entity type.

Here, the dictionary feature of any word corresponding to any named entity type indicates whether the word hits an entry of that named entity type and, when it does, may also indicate the position of the word within the hit entry.

Step 122, determining a dictionary feature vector for any word based on the dictionary features of the word corresponding to each named-body type.

Specifically, after the dictionary feature of any word relative to each named-body type is obtained, the dictionary feature vector of the word can be obtained by vectorizing and splicing the dictionary feature of the word relative to each named-body type.

According to the method provided by the embodiment of the invention, the dictionary feature vector contains the named-body type of the hit entry, so that richer information is provided for named-body identification, and the accuracy of named-body identification is improved.

Based on any of the above embodiments, in the method, step 122 specifically includes: vectorizing the dictionary features of any word corresponding to each named-body type to obtain a feature vector of the word; and sparsifying the feature vector of the word to obtain the dictionary feature vector of the word.

Specifically, after obtaining dictionary features of any word relative to each named-body type, vectorizing the dictionary features to obtain vectorized dictionary features represented by 0 or 1, and then stitching the vectorized dictionary features relative to each named-body type to obtain feature vectors of the word.

Fig. 5 is a schematic diagram of dictionary feature generation provided in an embodiment of the present invention. As shown in Fig. 5, the text to be recognized is "I want to watch 'Wu Jian Dao' starring Liu Dehua", where "Wu Jian Dao" is the film Infernal Affairs. Based on the domain dictionary, the entries "Liu Dehua" and "Liu De" in the actor dictionary and the entries "Wu Jian Dao" and "Dao" in the movie name dictionary are hit for the text to be recognized. Thus, the dictionary feature corresponding to the word "Liu" is B_artist, indicating that "Liu" is the beginning character of an actor name; the dictionary features corresponding to the word "De" are I_artist and E_artist, indicating respectively that "De" is an intermediate character of an actor name and that "De" is the ending character of an actor name; the dictionary feature corresponding to the word "Hua" is E_artist, indicating that "Hua" is the ending character of an actor name. The dictionary feature corresponding to the word "Wu" is B_name, indicating that "Wu" is the beginning character of a movie name; the dictionary feature corresponding to the word "Jian" is I_name, indicating that "Jian" is an intermediate character of a movie name; the dictionary features corresponding to the word "Dao" are E_name and S_name, indicating respectively that "Dao" is the ending character of a movie name and that "Dao" is a single-character movie name.
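The position tagging described for Fig. 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: each Chinese character is represented by one pinyin token, entries are stored as tuples of such tokens, only the six entity characters of the example sentence are used, and the function name dictionary_features is hypothetical.

```python
def dictionary_features(chars, domain_dictionary):
    """Collect, for every character, position tags such as B_artist for each hit entry."""
    features = [set() for _ in chars]
    for type_name, entries in domain_dictionary.items():
        for entry in entries:
            n = len(entry)
            for start in range(len(chars) - n + 1):
                if tuple(chars[start:start + n]) != entry:
                    continue
                if n == 1:
                    # A one-character entry is tagged as a single-character hit.
                    features[start].add("S_" + type_name)
                    continue
                features[start].add("B_" + type_name)          # beginning character
                for k in range(start + 1, start + n - 1):
                    features[k].add("I_" + type_name)          # intermediate characters
                features[start + n - 1].add("E_" + type_name)  # ending character
    return features


# Assumed transliteration: one pinyin token per character.
dictionary = {
    "artist": {("Liu", "De", "Hua"), ("Liu", "De")},
    "name": {("Wu", "Jian", "Dao"), ("Dao",)},
}
chars = ["Liu", "De", "Hua", "Wu", "Jian", "Dao"]
feats = dictionary_features(chars, dictionary)
```

Because overlapping entries are all recorded, "De" receives both I_artist (from "Liu Dehua") and E_artist (from "Liu De"), which is the kind of entry collision the later attention layer is meant to resolve.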

Assuming that the number of dictionary features is M, for any word, an all-zero vector of length M is initialized, and then a 1 is set at the index position of each dictionary feature to which the word corresponds. For the word "De", vectorizing its dictionary features relative to each named-body type gives the feature vector of the word shown in the following table:

Dictionary feature list	B_artist	I_artist	E_artist	S_artist	……	……	S_name
Dictionary feature index position	0	1	2	3	……	……	M-1
Feature vector of "De"	0	1	1	0	0	0	0
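The vectorization in the table can be sketched as a multi-hot encoding. The example assumes only two named-body types ("artist" and "name"), so M = 8, and assumes the feature order B, I, E, S per type as suggested by the table header; FEATURE_LIST and multi_hot are hypothetical names.

```python
# Assumed feature list: B/I/E/S tags for each named-body type, in table order.
TYPES = ["artist", "name"]
FEATURE_LIST = [tag + "_" + type_name for type_name in TYPES for tag in "BIES"]
FEATURE_INDEX = {feature: i for i, feature in enumerate(FEATURE_LIST)}


def multi_hot(word_features):
    """Start from an all-zero vector of length M and set a 1 for each feature of the word."""
    vector = [0] * len(FEATURE_LIST)
    for feature in word_features:
        vector[FEATURE_INDEX[feature]] = 1
    return vector


de_vector = multi_hot({"I_artist", "E_artist"})
```

With these assumptions, index positions 1 and 2 are set for "De", matching the row of the table.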

Because the feature vector thus obtained contains only 0s and 1s and its information content is too large, the feature vector needs to be sparsified, and the sparsified feature vector is then used as the dictionary feature vector for named-body recognition.

Further, the feature vector sparsification processing may be implemented by:

Initializing a matrix of shape [M, label_dim], where M is the number of dictionary features and label_dim is the dimension of the dictionary feature vector; the embodiment of the present invention denotes this matrix as L.

Fig. 6 is a schematic diagram of the sparsified dictionary feature vector according to an embodiment of the present invention. As shown in Fig. 6, the feature vector is multiplied by the matrix L to obtain the sparsified dictionary feature vector. In Fig. 6, the dictionary feature vector of the word "De" is a matrix of shape [M, label_dim]; the columns corresponding to the dictionary features I_artist and E_artist are drawn as circles filled with diagonal lines, indicating that these two columns have values, and the remaining blank circles are 0.

On this basis, the matrix in Fig. 6 may be further flattened into a one-dimensional column vector to facilitate concatenation with the word vector of the word, which is not particularly limited in the embodiment of the present invention.
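One possible reading of the multiplication by L and the subsequent flattening is sketched below. It is an interpretation, not the confirmed operation: the sketch assumes that each dictionary feature owns one slice of L (stored here as a row), that the slice is kept when the feature's bit is 1 and zeroed otherwise, and that L is filled with small random placeholder values. The names init_matrix, select_rows and flatten are hypothetical.

```python
import random


def init_matrix(m, label_dim, seed=0):
    """Create an [m, label_dim] matrix L with small random placeholder values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(label_dim)] for _ in range(m)]


def select_rows(feature_vector, matrix):
    """Keep row j of L where feature_vector[j] is 1 and zero it out otherwise."""
    return [
        list(row) if bit else [0.0] * len(row)
        for bit, row in zip(feature_vector, matrix)
    ]


def flatten(rows):
    """Flatten an [M, label_dim] matrix into a vector of length M * label_dim."""
    return [value for row in rows for value in row]


M, LABEL_DIM = 8, 4
L = init_matrix(M, LABEL_DIM)
de_vector = [0, 1, 1, 0, 0, 0, 0, 0]     # feature vector of "De" (I_artist, E_artist)
de_matrix = select_rows(de_vector, L)    # only the two hit features carry values
de_flat = flatten(de_matrix)             # ready to concatenate with the word vector
```

Under this reading the result stays mostly zero, which matches the blank circles in Fig. 6, while the hit features contribute label_dim values each instead of a single bit.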

Based on any one of the above embodiments, Fig. 7 is a flow chart of a training method of a named-body recognition model according to an embodiment of the present invention. As shown in Fig. 7, the method includes:

First, a domain dictionary is established.

Second, a large number of sample texts to be recognized are collected in advance and used as training sample data for the named-body recognition model.

Then, domain dictionary feature matching is performed on each sample text to be recognized in the training sample data to obtain the dictionary features of the sample text. In general, dictionary feature matching refers to finding all possible entries of the domain dictionary in the text by a character string matching method. Assuming that a sample text is "I want to watch 'Wu Jian Dao' starring Liu Dehua", 2 entries are hit after character string matching against the domain dictionary: "Liu Dehua" in the actor dictionary and "Wu Jian Dao" in the movie name dictionary.
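The entry-level matching step can be sketched with plain substring search. The sketch assumes an English rendering of the sample text, a two-type dictionary, and the hypothetical function name find_hits; a production system would more likely use a trie or an Aho-Corasick automaton, which the source does not specify.

```python
def find_hits(text, domain_dictionary):
    """Return (start, end, entry, type) for every occurrence of every dictionary entry."""
    hits = []
    for type_name, entries in domain_dictionary.items():
        for entry in entries:
            start = text.find(entry)
            while start != -1:
                hits.append((start, start + len(entry), entry, type_name))
                # Continue one character further so overlapping occurrences are found.
                start = text.find(entry, start + 1)
    return sorted(hits)


# Assumed example data; the title is written in pinyin for illustration.
dictionary = {
    "artist": {"Liu Dehua"},
    "name": {"Wu Jian Dao"},
}
sample_text = "I want to watch Wu Jian Dao starring Liu Dehua"
hits = find_hits(sample_text, dictionary)
```

Each hit keeps its span and its named-body type, so the per-character position features can be derived from it afterwards.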

Next, based on the sample text and the dictionary features of the sample text, the word vector and the dictionary feature vector of each sample word in the sample text are determined, and the named-body label of each sample word is annotated.

Finally, the named-body recognition model is constructed and trained according to the word vector, the dictionary feature vector and the named-body label of each sample word in the sample text.

Based on any of the above embodiments, Fig. 8 is a schematic structural diagram of a named-body recognition model according to an embodiment of the present invention. As shown in Fig. 8, the named-body recognition model includes an input encoding layer, a dictionary feature selection layer, and a label prediction layer. The input of the input encoding layer is the word vector and the dictionary feature vector of each word, the latter being the sparsified form of the feature vector. The input encoding layer comprises a Bi-LSTM network, which encodes the word vector and the dictionary feature vector of each word and outputs the hidden layer vector of each word. The dictionary feature selection layer uses an attention mechanism to compute, from the hidden layer vector and the dictionary feature vector of any word, the correlation between each dictionary feature in the dictionary feature vector and the hidden layer vector, and adjusts the weights of the different dictionary features in the dictionary feature vector accordingly, so as to select important dictionary features and reduce the influence of noisy dictionary features. The label prediction layer outputs the probability that each word corresponds to each dictionary feature, and selects the dictionary feature with the largest probability value as the named-body recognition result of the current word.
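The weighting performed by the dictionary feature selection layer can be illustrated with a small attention sketch. The scoring function is an assumption: this passage does not state how the correlation is computed, so the example uses a plain dot product between the hidden layer vector and each dictionary feature's slice, followed by a softmax. It also assumes that the hidden dimension equals label_dim; the function name attend is hypothetical.

```python
import math


def attend(hidden, feature_rows):
    """Weight each dictionary feature slice by its softmax-normalized relevance to the hidden vector."""
    # Assumed relevance score: dot product of the hidden vector with each feature slice.
    scores = [sum(h * v for h, v in zip(hidden, row)) for row in feature_rows]
    peak = max(scores)                      # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    weighted = [[w * v for v in row] for w, row in zip(weights, feature_rows)]
    return weights, weighted


hidden = [1.0, 0.0]                         # toy hidden layer vector of one word
feature_rows = [
    [0.0, 0.0],                             # feature not hit
    [2.0, 0.0],                             # hit feature aligned with the hidden vector
    [0.0, 2.0],                             # hit feature unrelated to the hidden vector
    [0.0, 0.0],                             # feature not hit
]
weights, attention_features = attend(hidden, feature_rows)
```

In this toy case the feature aligned with the hidden vector receives the largest weight, while the unrelated hit is down-weighted, which is the noise-suppression behaviour the paragraph describes.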

Based on any of the above embodiments, fig. 9 is a schematic structural diagram of a named entity recognition device according to an embodiment of the present invention, and as shown in fig. 9, the device includes a text determining unit 910, a dictionary matching unit 920, and a named entity recognition unit 930;

wherein, the text determining unit 910 is configured to determine a text to be recognized;

the dictionary matching unit 920 is configured to determine a dictionary feature vector of each word in the text to be recognized based on a domain dictionary of the domain corresponding to the text to be recognized;

the named-body recognition unit 930 is configured to input a word vector and a dictionary feature vector of each word in the text to be recognized into a named-body recognition model, so as to obtain a named-body recognition result output by the named-body recognition model;

The device provided by the embodiment of the invention outputs the named body recognition result through the named body recognition model, and can weaken the influence of noise entries on named body recognition, thereby overcoming the problem of low recognition accuracy caused by entry collision in the named body recognition method based on the domain dictionary and improving the accuracy of named body recognition.

Based on any one of the above embodiments, in the apparatus, the named object recognition model includes an input encoding layer, a dictionary feature selection layer, and a label prediction layer;

correspondingly, the named-body recognition unit 930 includes:

the coding subunit is used for inputting the word vector and the dictionary feature vector of each word in the text to be recognized into the input coding layer to obtain the hidden layer vector of each word output by the input coding layer;

the feature selection subunit is used for inputting the hidden layer vector and the dictionary feature vector of each word to the dictionary feature selection layer to obtain the attention feature vector of each word output by the dictionary feature selection layer;

and the prediction subunit is used for inputting the attention feature vector of each word into the label prediction layer to obtain the named body recognition result output by the label prediction layer.

Based on any of the above embodiments, in the apparatus, the feature selection subunit includes:

the weight adjusting module is used for determining the weight of any word relative to each dictionary feature based on the hidden layer vector and the dictionary feature vector of the any word;

and the weighting module is used for weighting the dictionary feature vector of any word based on the weight of the any word relative to each dictionary feature to obtain the attention feature vector of any word.

Based on any of the above embodiments, the apparatus further comprises:

the model training unit is used for training the initial model based on the loss function to obtain the named body recognition model;

Based on any of the above embodiments, in the apparatus, the domain dictionary includes dictionaries corresponding to different named-body types;

correspondingly, the dictionary matching unit 920 includes:

a dictionary feature determining subunit, configured to determine, based on a dictionary corresponding to any named-body type, a dictionary feature of each word in the text to be recognized corresponding to the any named-body type;

and the dictionary vector determination subunit is used for determining the dictionary feature vector of any word based on the dictionary feature of the word corresponding to each named body type.

Based on any one of the above embodiments, in the apparatus, the dictionary vector determination subunit is specifically configured to:

Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 10, the electronic device may include: a processor 1010, a communication interface (Communications Interface) 1020, a memory 1030, and a communication bus 1040, wherein the processor 1010, the communication interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. Processor 1010 may call logic instructions in memory 1030 to perform the following methods: determining a text to be identified; determining dictionary feature vectors of each word in the text to be recognized based on a domain dictionary of the corresponding domain of the text to be recognized; inputting the word vector and the dictionary feature vector of each word in the text to be recognized into a named-body recognition model to obtain a named-body recognition result output by the named-body recognition model; the named-body recognition model is trained based on word vectors and dictionary feature vectors of each sample word in the sample text and named-body marks of each sample word.

Further, the logic instructions in the memory 1030 described above may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

Embodiments of the present invention also provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the methods provided by the above embodiments, for example, comprising: determining a text to be identified; determining dictionary feature vectors of each word in the text to be recognized based on a domain dictionary of the corresponding domain of the text to be recognized; inputting the word vector and the dictionary feature vector of each word in the text to be recognized into a named-body recognition model to obtain a named-body recognition result output by the named-body recognition model; the named-body recognition model is trained based on word vectors and dictionary feature vectors of each sample word in the sample text and named-body marks of each sample word.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A named entity recognition method, comprising:

determining a text to be identified;

the named body recognition model is obtained by training based on a word vector and a dictionary feature vector of each sample word in the sample text and a named body mark of each sample word;

the named-body recognition model determines attention feature vectors corresponding to dictionary feature vectors of each word in the text to be recognized based on an attention mechanism, and outputs named-body recognition results based on the attention feature vectors of each word, wherein the attention mechanism is used for determining correlations between different dictionary features and the semantics of any word based on the semantics of any word, adjusting weights of different dictionary features in the dictionary feature vectors based on the correlations, and calculating the attention feature vectors with adjusted weights.

2. The named-body recognition method of claim 1, wherein the named-body recognition model comprises an input encoding layer, a dictionary feature selection layer, and a label prediction layer;

3. The named-body recognition method according to claim 2, wherein the inputting the hidden layer vector and the dictionary feature vector of each word to the dictionary feature selection layer, to obtain the attention feature vector of each word output by the dictionary feature selection layer, specifically comprises:

4. The named-body recognition method according to claim 3, wherein the step of inputting the word vector and the dictionary feature vector of each word in the text to be recognized into a named-body recognition model to obtain a named-body recognition result output by the named-body recognition model further comprises the steps of:

5. The named-body recognition method of claim 1, wherein the domain dictionary comprises dictionaries corresponding to different named-body types;

6. The named-body recognition method of claim 5, wherein the determining the dictionary feature vector of any word based on the dictionary feature of the any word corresponding to each named-body type specifically comprises:

7. A named entity recognition device, comprising:

a text determining unit for determining a text to be recognized;

8. The named entity recognition device of claim 7, wherein the named entity recognition model comprises an input encoding layer, a dictionary feature selection layer, and a label prediction layer;

correspondingly, the named-body recognition unit comprises:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the named entity recognition method of any one of claims 1 to 6 when the program is executed by the processor.

10. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the named-body recognition method according to any one of claims 1 to 6.