CN110502738A

CN110502738A - Chinese named entity recognition method, device, equipment and query system

Info

Publication number: CN110502738A
Application number: CN201810482265.8A
Authority: CN
Inventors: 胡于响; 张帆; 姜飞俊
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2018-05-18
Filing date: 2018-05-18
Publication date: 2019-11-26

Abstract

An embodiment of the present invention provides a Chinese named entity recognition method and device. The method comprises: obtaining a sentence to be recognized; encoding each word in the sentence to obtain a coding vector corresponding to each word; performing classification and recognition on the coding vector corresponding to each word to determine the entity category corresponding to each word; and determining the Chinese named entity contained in the sentence according to the entity category corresponding to each word. By performing classification and recognition on the sentence at the word level, this scheme avoids the influence of word segmentation errors on the entity category classification results and improves recognition accuracy.

Description

Chinese named entity recognition method, device, equipment and query system

Technical Field

The invention relates to the technical field of internet, in particular to a method, a device, equipment and a query system for identifying a Chinese named entity.

Background

Chinese named entity recognition is a basic problem in the field of natural language processing and belongs to the category of sequence labeling problems. Briefly, the problem of Chinese named entity recognition is to identify and classify entities in a text sequence that are of interest to us, such as names of people, places, and organizations. The Chinese named entity recognition technology is an indispensable component of various natural language processing technologies such as information extraction, information retrieval, machine translation, question and answer systems and the like.

The role of Chinese named entity recognition is explained by taking a question-answering system as an example. When a user submits a consultation statement, user intention recognition is performed first, and its result decides which service is called to respond to the user. Chinese named entity recognition is then performed on the consultation statement to identify the entities of specific categories contained in it, such as time and place. These entities serve as query keywords and are used as input to the invoked service, which produces an output response based on them.

For example, suppose the user's consultation statement asks about the weather in Hangzhou tomorrow, and the user intention recognition result shows that the user intends to query the weather. Assuming that the predefined entity categories corresponding to the weather query service include time and place, entity recognition on the statement yields two entities, namely "tomorrow" and "Hangzhou". The two entities are used as query keywords to call the weather query service, and the corresponding weather query result is returned as the output response.

Currently, methods for Chinese named entity recognition can be divided into two broad categories: rule-based methods and statistics-based methods. Rule-based methods rely on manually established recognition rules, which is costly. Moreover, different users have different expression habits and the same entity can be expressed in many ways, so manually defined rules cannot cover all cases and the accuracy of the recognition results is poor. Statistics-based methods generally require a corpus for training; commonly used methods include Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), neural networks, and the like. However, these statistical methods all operate on the word segmentation result, that is, they identify the entity category of each participle, taking the participle as the unit. If the word segmentation is wrong, and especially if part of an entity word is segmented into a non-entity word, the recognition result will be wrong.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method, an apparatus, a device and a query system for identifying a Chinese named entity, so as to improve the accuracy of Chinese named entity recognition results.

In a first aspect, an embodiment of the present invention provides a method for identifying a Chinese named entity, which is applied to a server, and the method includes:

obtaining a sentence to be identified;

coding each word in the statement to obtain a coding vector corresponding to each word;

classifying and identifying the coding vectors to determine an entity class corresponding to each word;

and determining the Chinese named entity contained in the sentence according to the entity category corresponding to each word.

In a second aspect, an embodiment of the present invention provides a device for identifying a Chinese named entity, including:

the acquisition module is used for acquiring the sentences to be identified;

the coding module is used for coding each word in the statement to obtain a coding vector corresponding to each word;

the classification identification module is used for performing classification identification on the coding vectors so as to determine the entity class corresponding to each word;

and the determining module is used for determining the Chinese named entity contained in the sentence according to the entity category corresponding to each word.

In a third aspect, an embodiment of the present invention provides an electronic device, including a first processor and a first memory, where the first memory is used to store one or more computer instructions, and when the one or more computer instructions are executed by the first processor, the method for identifying a Chinese named entity in the first aspect is implemented. The electronic device may further comprise a first communication interface for communicating with other devices or a communication network.

An embodiment of the present invention provides a computer storage medium for storing a computer program, where the computer program, when executed by a computer, implements the method for identifying a Chinese named entity according to the first aspect.

In a fourth aspect, an embodiment of the present invention provides a method for identifying a Chinese named entity, which is applied to a user terminal, and the method includes:

receiving a sentence to be recognized input by a user;

and sending the sentence to be recognized to a server so that the server performs coding processing on each word in the sentence and determines the Chinese named entity contained in the sentence to be recognized according to the entity category corresponding to each word obtained by classifying and recognizing the coding vector obtained by the coding processing.

In a fifth aspect, an embodiment of the present invention provides a device for identifying a Chinese named entity, including:

the receiving module is used for receiving the sentence to be recognized input by the user;

and the sending module is used for sending the statement to be recognized to a server so that the server performs coding processing on each word in the statement and determines the Chinese named entity contained in the statement to be recognized according to the entity category corresponding to each word obtained by classifying and recognizing the coding vector obtained by the coding processing.

In a sixth aspect, an embodiment of the present invention provides an electronic device, including a second processor and a second memory, where the second memory is used to store one or more computer instructions, and where the one or more computer instructions, when executed by the second processor, implement the method for identifying a Chinese named entity in the fourth aspect. The electronic device may further comprise a second communication interface for communicating with other devices or a communication network.

An embodiment of the present invention provides a computer storage medium for storing a computer program, where the computer program, when executed by a computer, implements the method for identifying a Chinese named entity according to the fourth aspect.

In a seventh aspect, an embodiment of the present invention provides a method for identifying a Chinese named entity, where the method is applied to a server, and the method includes:

obtaining a sentence to be identified;

removing interference words in the statement;

coding the remaining words to obtain coding vectors corresponding to the remaining words;

classifying and identifying the coding vectors to determine the entity categories corresponding to the remaining words respectively;

and determining the Chinese named entities contained in the sentence according to the entity categories corresponding to the remaining words.

In an eighth aspect, an embodiment of the present invention provides a query system, including:

a user terminal and a server;

the user terminal is used for responding to a query statement input by a user, sending the query statement to the server and receiving a query response sent by the server;

the server is used for determining a service program corresponding to the query statement; coding each word in the query statement to obtain a coding vector corresponding to each word, classifying and identifying the coding vectors to determine an entity category corresponding to each word, and determining a Chinese named entity contained in the query statement according to the entity category corresponding to each word; and querying the service program by taking the Chinese named entity as a query keyword to obtain the query response.

In the Chinese named entity recognition method provided by the embodiment of the present invention, after receiving a sentence to be recognized, the server encodes each word in the sentence, that is, performs encoding at the word level, to obtain a coding vector corresponding to each word. The coding vector of each word is then input to a classifier obtained by training in advance, which outputs the entity category corresponding to each word. The entity category indicates whether the word corresponds to one of the preset entity categories and, if so, to which one. Adjacent words of the same entity category are then spliced together to obtain the Chinese named entity corresponding to that entity category. Because the sentence to be recognized is classified and recognized at the word level, the influence of word segmentation errors on the entity category classification results is avoided and the recognition accuracy is improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flow chart of a method for identifying a named entity in Chinese according to an embodiment of the present invention;

FIG. 2 is a flow chart of another method for identifying a named entity in Chinese according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating a word encoding process according to an embodiment of the present invention;

FIG. 4 is a flowchart of another method for identifying a named entity in Chinese according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a local semantic information extraction process according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a process for extracting local semantic information and global association information according to an embodiment of the present invention;

FIG. 7 is a partial diagram illustrating a process for identifying a named entity in Chinese according to an embodiment of the present invention;

FIG. 8 is a flowchart illustrating a method for identifying a named entity in Chinese according to an embodiment of the present invention;

FIG. 9 is an interaction flow chart of a method for identifying a named entity in Chinese according to an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of a Chinese named entity recognition device according to an embodiment of the present invention;

FIG. 11 is a schematic structural diagram of an electronic device corresponding to the Chinese named entity recognition apparatus shown in FIG. 10;

FIG. 12 is a schematic structural diagram of another apparatus for identifying a named entity according to an embodiment of the present invention;

FIG. 13 is a schematic structural diagram of an electronic device corresponding to the Chinese named entity recognition apparatus shown in FIG. 12;

FIG. 14 is a schematic composition diagram of a query system according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the embodiments of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise; "a plurality of" generally includes at least two, but does not exclude the case of at least one.

It should be understood that the term "and/or" as used herein merely describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.

The word "if", as used herein, may be interpreted as "when" or "at the time of" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.

It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a commodity or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such commodity or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the commodity or system that includes the element.

In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.

The following illustrates the problems of the traditional Chinese named entity recognition process based on word segmentation:

Take a weather query scenario as an example. In this scenario, a user often needs to query the weather at a certain place on a certain day, so two entity categories, namely time and place, are involved. The server therefore needs to identify these two entity categories in the user's question, that is, to identify the Chinese named entities corresponding to the two categories. Chinese named entity recognition is in fact a sequence labeling problem, and the categories of Chinese named entities can be labeled using the BIO labeling system. Taking Table 1 below as an example, the following labels are required:

Table 1:

Segment | may I ask | tomorrow | Wulu | muqi | of | weather | how
Label | O | B-day | B-Place | I-Place | O | O | O

The first line in the table is the result of word segmentation of the user sentence, and the second line is the labeling result, i.e. the Chinese named entity type recognition result.

The BIO labeling system is adopted, wherein B represents the beginning of an entity, I represents the inside of an entity, and O indicates that the word does not belong to any entity. Thus, "tomorrow" is the Chinese named entity corresponding to the time category (day), and "Urumqi" (Wulumuqi), which spans the two segments labeled B-Place and I-Place, is the Chinese named entity corresponding to the place category (Place). After the user sentence is labeled as above, that is, after Chinese named entity recognition is performed, the Chinese named entities corresponding to the two categories of time and place are known: tomorrow and Urumqi. A weather query service can then be called with these two Chinese named entities as query keywords to obtain the weather in Urumqi tomorrow, and the following response can be made:

Q: May I ask what the weather will be like in Urumqi tomorrow?

A: Urumqi will turn fine tomorrow, with a temperature of 12 to 20 °C.
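The BIO labels described above can be turned back into entities with a short routine. The following is a minimal sketch rather than the patented implementation: the function name `decode_bio` is hypothetical, and the tokens are romanized glosses of the segments in Table 1, chosen only for illustration.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (category, entity_text) pairs."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    entities: List[Tuple[str, str]] = []
    current_category: Optional[str] = None
    current_parts: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; flush any entity still being collected.
            if current_category is not None:
                entities.append((current_category, "".join(current_parts)))
            current_category = tag[2:]
            current_parts = [token]
        elif tag.startswith("I-") and current_category == tag[2:]:
            # Continuation of the entity that is currently open.
            current_parts.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current_category is not None:
                entities.append((current_category, "".join(current_parts)))
            current_category = None
            current_parts = []
    if current_category is not None:
        entities.append((current_category, "".join(current_parts)))
    return entities


# Illustrative glosses of the Table 1 segments (assumed, not original text).
tokens = ["may I ask", "tomorrow", "Wulu", "muqi", "of", "weather", "how"]
tags = ["O", "B-day", "B-Place", "I-Place", "O", "O", "O"]
entities = decode_bio(tokens, tags)
print(entities)
```

The two recovered entities, one of category day and one of category Place, are what would be passed to the weather query service as query keywords.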

As illustrated in Table 1 above, conventional Chinese named entity recognition is based on word segmentation, that is, categories are labeled on the word segmentation results as in Table 1. However, if the word segmentation is wrong, and especially if part of an entity word is segmented into a non-entity word, the entity cannot be recalled, as in the scenario illustrated in Table 2:

Table 2:

Segment | please | ask-tomorrow | Wulu | muqi | of | weather | how
Label | O | O | B-Place | I-Place | O | O | O

Assuming that the word segmenter wrongly splits "may I ask tomorrow" so that "tomorrow" does not form a segment of its own, the segmentation-based Chinese named entity recognition process, that is, the segmentation-based labeling process, cannot label "tomorrow" as B-day, and a correct classification result cannot be obtained.
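The recall failure described above can be stated as a boundary condition: a segment-level labeler can only output an entity whose start and end coincide with segment boundaries. The sketch below checks that condition. It is an illustration under stated assumptions: the helper names are hypothetical, and single letters stand in for Chinese characters ("pq" for "may I ask", "td" for "tomorrow", "UVWX" for the place name).

```python
from typing import List, Tuple


def token_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return the (start, end) character offsets of each token in the joined text."""
    spans = []
    offset = 0
    for token in tokens:
        spans.append((offset, offset + len(token)))
        offset += len(token)
    return spans


def entity_is_recoverable(tokens: List[str], entity: str) -> bool:
    """True if the entity equals the concatenation of one or more whole tokens.

    Only the first occurrence of the entity in the text is checked, which is
    enough for this small illustration.
    """
    text = "".join(tokens)
    start = text.find(entity)
    if start < 0:
        return False
    end = start + len(entity)
    spans = token_spans(tokens)
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    return start in starts and end in ends


# Letters are stand-ins for characters; both segmentations are hypothetical.
correct_segmentation = ["pq", "td", "UV", "WX"]
wrong_segmentation = ["p", "qtd", "UV", "WX"]

print(entity_is_recoverable(correct_segmentation, "td"))  # True
print(entity_is_recoverable(wrong_segmentation, "td"))    # False
```

With the wrong segmentation, no assignment of labels to whole segments can isolate "td", whereas labeling each character individually is not subject to this limitation.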

Based on the above, the embodiment of the present invention provides a character-based Chinese named entity recognition method, which overcomes the influence of word segmentation errors on the accuracy of the recognition result. The recognition method is described below with reference to some embodiments.

FIG. 1 is a flowchart of a method for identifying a Chinese named entity according to an embodiment of the present invention, which may be executed by a server. As shown in FIG. 1, the method comprises the steps of:

101. Acquire the sentence to be recognized.

102. Encode each word in the sentence to obtain a coding vector corresponding to each word.

103. Perform classification and recognition on the coding vector corresponding to each word to determine the entity category corresponding to each word.

104. Determine the Chinese named entity contained in the sentence according to the entity category corresponding to each word.

In real life, users may have various query requirements, such as weather query, path query, query for size during online shopping, and the like. Thus, the user terminal may have installed therein one or more applications similar to the voice assistant, which may be a functional component in an APP or a separate application, such as an intelligent voice search engine.

When a user needs to trigger a certain query, the user can trigger the input of a query statement through manual input, voice input, image recognition, scanning of a two-dimensional code and the like, and the query statement can be regarded as the statement to be recognized.

After receiving the sentence to be recognized, the server needs to identify the Chinese named entity contained in the sentence, that is, to perform Chinese named entity labeling on the sentence. In the query scenario, this is equivalent to extracting the query keywords contained in the query statement, so that the corresponding service program, such as a service program providing a weather query function, can be queried based on the query keywords and the resulting weather query response fed back to the user.
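The hand-off from recognized entities to a service program might look like the following sketch. Everything named here is hypothetical: the `SERVICES` registry, the `weather_service` stand-in and the `answer` function are assumptions for illustration, and the intention recognition step is assumed to have already produced the intent string.

```python
from typing import Callable, Dict, List, Tuple

Entities = Dict[str, str]


def weather_service(keywords: Entities) -> str:
    """Hypothetical stand-in for a weather query service program."""
    return f"weather lookup for place={keywords['place']}, time={keywords['time']}"


# Hypothetical registry: intent -> (required entity categories, service program).
SERVICES: Dict[str, Tuple[List[str], Callable[[Entities], str]]] = {
    "weather": (["time", "place"], weather_service),
}


def answer(intent: str, recognized: List[Tuple[str, str]]) -> str:
    """Use recognized (category, text) entities as query keywords for the service."""
    required, service = SERVICES[intent]
    keywords = {category: text for category, text in recognized if category in required}
    missing = [category for category in required if category not in keywords]
    if missing:
        # Without all required keywords the service cannot be queried.
        return "missing: " + ", ".join(missing)
    return service(keywords)


response = answer("weather", [("time", "tomorrow"), ("place", "Hangzhou")])
print(response)
```

Keeping the required entity categories next to each service mirrors the idea that each service program has its own predefined entity categories.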

It is understood that the sentence may be subjected to Chinese named entity recognition based on a recognition model obtained by training in advance, and the recognition model may be a model obtained by performing supervised training on a neural network such as a convolutional neural network through a large number of training samples.

In the embodiment of the invention, the Chinese named entity recognition of the statement is carried out in a word level, namely, the Chinese named entity contained in the statement is determined by recognizing the entity category corresponding to each word in the statement.

Therefore, after receiving the sentence, the server may use the recognition model to encode each word in the sentence to obtain a coding vector corresponding to each word, and then perform classification and recognition on the coding vector of each word to determine the entity category corresponding to each word. Finally, adjacent words having the same entity category may be spliced together to obtain the Chinese named entity corresponding to that entity category.
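The final splicing step can be sketched as grouping runs of adjacent words that share a category. This is a minimal illustration, assuming that the classifier emits one plain category name per word and uses "O" for words outside any entity; the function name `merge_adjacent` and the sample labels are hypothetical, and letters stand in for Chinese characters.

```python
from itertools import groupby
from typing import List, Tuple


def merge_adjacent(chars: List[str], categories: List[str], outside: str = "O") -> List[Tuple[str, str]]:
    """Splice adjacent characters with the same category into one entity."""
    if len(chars) != len(categories):
        raise ValueError("chars and categories must have the same length")
    entities: List[Tuple[str, str]] = []
    pairs = zip(chars, categories)
    # groupby yields one group per run of consecutive equal categories.
    for category, group in groupby(pairs, key=lambda pair: pair[1]):
        if category != outside:
            entities.append((category, "".join(ch for ch, _ in group)))
    return entities


# Letters stand in for characters; the labels are hypothetical classifier output.
chars = ["t", "d", "H", "Z", "w", "q"]
categories = ["time", "time", "place", "place", "O", "O"]
merged = merge_adjacent(chars, categories)
print(merged)
```

One limitation of plain category runs is that two distinct entities of the same category standing side by side would be merged into one; separate begin and inside labels, as in the BIO system, avoid this.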

In particular, the recognition model can be functionally subdivided into an encoder (also referred to as an encoding module) and a classifier (also referred to as a classification module). The encoder outputs the coding vector corresponding to each word, and these coding vectors are input to the classifier, which performs classification and recognition on them and outputs a classification output vector for each word. Different elements in each classification output vector are the probability values of different preset entity categories. For example, assuming that the classification output vector corresponding to a word is (0.9, 0.1, 0) and that the three elements correspond to entity categories A, B and C respectively, the probability that the word corresponds to entity category A is 0.9, the probability that it corresponds to entity category B is 0.1, and the probability that it corresponds to entity category C is 0. Optionally, the entity category corresponding to the maximum probability value, here A, may be used as the entity category recognition result of the word.

In practical applications, to improve the accuracy of the recognition result, the classification output vector of each word may optionally be further input to a CRF model, which determines the category sequence with the highest transition probability score according to its transition matrix and the classification output vector corresponding to each word. This category sequence is the final entity category recognition result for the words of the sentence to be recognized: the m-th category in the sequence is the entity category corresponding to the m-th word in the sentence, where m ranges from 1 to the total number of words in the sentence.

The CRF model is a functional module located after the classifier, and its parameters, such as the transition matrix, are obtained in the training phase. The CRF takes into account the dependencies between the categories of the words. The specific principle of the CRF is not described in detail in the embodiments of the present invention; only the meaning of the highest-scoring category sequence is illustrated here. Assume that the sentence to be recognized contains three words C1, C2 and C3 in sequence, with classification output vectors (p1, p2, p3), (p4, p5, p6) and (p7, p8, p9) respectively, and that the elements of each classification output vector correspond to entity categories A, B and C in order. If the category sequence with the highest transition probability score is p1, p4, p8, it is finally determined that C1 corresponds to entity category A, C2 corresponds to entity category A, and C3 corresponds to entity category B.
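The difference between taking the per-word maximum and selecting the best category sequence can be sketched as follows. The text does not specify how the CRF is decoded; this sketch uses Viterbi decoding, the standard technique for linear-chain models, and assumes that the per-word scores and the transition scores are simply added. All numbers and function names are made up for illustration.

```python
from typing import List, Sequence


def argmax_decode(scores: Sequence[Sequence[float]]) -> List[int]:
    """Pick, independently for each word, the index of the largest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def viterbi_decode(scores: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the category index sequence with the highest total score.

    Total score = per-word scores + transition scores between consecutive
    categories, where transitions[i][j] scores moving from category i to j.
    """
    if not scores:
        return []
    num_labels = len(scores[0])
    best = list(scores[0])  # best[j]: best score of any path ending in category j
    backpointers: List[List[int]] = []
    for row in scores[1:]:
        new_best = []
        pointers = []
        for j in range(num_labels):
            prev = max(range(num_labels), key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + row[j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    last = max(range(num_labels), key=best.__getitem__)
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path


labels = ["A", "B", "C"]
# Hypothetical classification output vectors for three words C1, C2, C3.
scores = [[0.9, 0.1, 0.0], [0.5, 0.4, 0.1], [0.2, 0.5, 0.3]]
# Hypothetical transition scores: staying in a category is rewarded, A -> B is penalized.
transitions = [[0.5, -1.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]

independent = [labels[i] for i in argmax_decode(scores)]
sequence = [labels[i] for i in viterbi_decode(scores, transitions)]
print(independent, sequence)
```

In this toy setting the per-word maximum yields A, A, B, while sequence decoding yields A, A, A because the transition from A to B is penalized, which shows how dependencies between categories can change the final result.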

According to the scheme, the sentence to be recognized is classified and recognized in a word level, so that the influence of word segmentation errors on the classification result of the entity category is avoided, and the recognition accuracy is improved.

FIG. 2 is a flowchart of another method for identifying a Chinese named entity according to an embodiment of the present invention. As shown in FIG. 2, the method may include the following steps:

201. Acquire the sentence to be recognized.

202. Perform word segmentation on the sentence.

203. For any character i in the sentence, encode the character i to obtain a character vector, encode the participle to which the character i belongs to obtain a word vector, encode the position number of the character i in the sentence to obtain a position vector, and splice the character vector, the word vector and the position vector to obtain the coding vector corresponding to the character i.

204. Perform classification and recognition on the coding vector corresponding to each word to determine the entity category corresponding to each word, and determine the Chinese named entity contained in the sentence according to the entity category corresponding to each word.

In this embodiment, in order to further improve the accuracy of the Chinese named entity recognition result, the context of each word in the sentence is also considered when the sentence to be recognized is encoded at the word level.

Specifically, after receiving the sentence to be recognized, the server may first perform word segmentation on the sentence and then input the segmented sentence into the recognition model for subsequent processing, the first step of which is to encode each word in the sentence.

Optionally, each word may be encoded in three dimensions: the character itself, the participle to which the character belongs, and the position number of the character in the sentence.

This encoding process is illustrated with reference to FIG. 3, which assumes that the sentence to be recognized is "i want to leave" and that the word segmentation result is "i/want/leave". It is further assumed that the correct Chinese named entity recognition result for this sentence is the entity "leave", which corresponds to the entity type "endPoint". That is, the correct labeling result is: "i" (O), "want" (O), the first character of "leave" (B-endPoint) and the second character of "leave" (I-endPoint), where the first character is the starting character of the endPoint entity and the second character is its ending character.

As can be seen from FIG. 3, based on the word segmentation result, the participle to which the character "i" belongs is "i", so its coding vector is formed by splicing the coding results of the participle "i", the character "i" and the position number "1"; the character "want" is encoded in the same way. For the first character of the participle "leave", the coding vector is formed by splicing the coding results of the participle "leave", that character and the position number "3"; similarly, for the second character of "leave", the coding vector is formed by splicing the coding results of the participle "leave", that character and the position number "4". In the figure, three rectangular blocks of different styles represent, respectively, the word vector obtained by encoding the participle, the character vector obtained by encoding the character, and the position vector obtained by encoding the position number.

To illustrate the above splicing process, assume that any character is denoted by char and that the coding vector corresponding to the character is denoted by embedding(char). Then embedding(char) = concat(word, char, position), where concat denotes the splicing operation, word is the word vector of the participle to which the character belongs, char is the character vector, and position is the position vector. That is, assuming that the word vector has N1 dimensions, the character vector has N2 dimensions, and the position vector has N3 dimensions, the spliced coding vector has N1 + N2 + N3 dimensions.

The participle, the character and the position number may each be encoded using an encoding scheme such as word2vec or one-hot.
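The three-part splicing can be sketched with one-hot encodings, one of the schemes mentioned above. This is a toy illustration under stated assumptions: the vocabularies are built from the sentence itself, the maximum sentence length `max_len` is an arbitrary choice, letters stand in for Chinese characters, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence


def one_hot(index: int, size: int) -> List[float]:
    """Return a vector of the given size with a single 1.0 at index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_vocab(items: Sequence[str]) -> Dict[str, int]:
    """Assign consecutive indices to distinct items in first-seen order."""
    vocab: Dict[str, int] = {}
    for item in items:
        vocab.setdefault(item, len(vocab))
    return vocab


def encode_sentence(participles: List[str], max_len: int = 8) -> List[List[float]]:
    """Return one spliced vector per character: word part, character part, position part."""
    chars = [ch for participle in participles for ch in participle]
    if len(chars) > max_len:
        raise ValueError("sentence is longer than max_len")
    word_vocab = build_vocab(participles)
    char_vocab = build_vocab(chars)
    vectors: List[List[float]] = []
    position = 0  # zero-based index; position number 1 maps to index 0
    for participle in participles:
        for ch in participle:
            word_vec = one_hot(word_vocab[participle], len(word_vocab))  # N1 dimensions
            char_vec = one_hot(char_vocab[ch], len(char_vocab))          # N2 dimensions
            pos_vec = one_hot(position, max_len)                         # N3 dimensions
            vectors.append(word_vec + char_vec + pos_vec)                # N1 + N2 + N3
            position += 1
    return vectors


# Letters stand in for characters: three participles, the last with two characters.
vectors = encode_sentence(["a", "b", "cd"])
print(len(vectors), len(vectors[0]))
```

Here N1 = 3, N2 = 4 and N3 = 8, so each spliced vector has 15 dimensions, and the two characters of the last participle share the same word part while differing in their character and position parts. In practice the vocabularies would be fixed in advance, or dense vectors such as word2vec embeddings would replace the one-hot parts.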

For any character, encoding the participle to which the character belongs establishes the correlation between characters well, so that related characters have more similar coding vectors; encoding the position number of the character prevents identical characters at different positions from interfering with the recognition result.

Therefore, after the encoding processing is carried out on each word, the encoding vector corresponding to each word can be sent to a classifier for classification and identification, so as to determine the entity category corresponding to each word. In the process of word coding processing, the related information of multiple dimensionalities of the word is considered, so that the accuracy of the recognition result is improved.

FIG. 4 is a flowchart of another method for identifying a Chinese named entity according to an embodiment of the present invention. As shown in FIG. 4, the method may include the following steps:

401. Acquire the sentence to be recognized.

402. Perform word segmentation on the sentence.

403. For any character i in the sentence, encode the character i to obtain a character vector, encode the participle to which the character i belongs to obtain a word vector, encode the position number of the character i in the sentence to obtain a position vector, and splice the character vector, the word vector and the position vector to obtain the coding vector corresponding to the character i.

404. Extract the local semantic information vector corresponding to each word by using a convolutional neural network model, wherein the width of the convolution kernel adopted by the convolutional neural network model is equal to the dimension of the coding vector, and the height of the convolution kernel is greater than or equal to 2.

In order to further improve the accuracy of the Chinese named entity recognition result, the influence of the context information of each word on the category recognition result of the word is also considered. To understand the influence of context, suppose that one sentence asks "which songs were sung by Zhou Jielun at the Beijing concert?", that another sentence asks to query the TV plays performed by Zhou Jielun, and that the two categories of singer and actor are preset. The same name appears in both sentences, and only the surrounding context indicates that the singer category fits the first sentence while the actor category fits the second.

Therefore, after obtaining the coding vector corresponding to each word, it is also possible to further extract the context information corresponding to each word, that is, the local semantic information, and the extraction result of the local semantic information is referred to as a local semantic information vector, that is, the sentence semantic information is expressed in a vector form.

Specifically, a convolutional neural network model can be used to extract the local semantic information vector corresponding to each word, wherein the convolution kernel adopted by the model has a width equal to the dimension, i.e., the length, of the coding vector and a height greater than or equal to 2. The extraction process can be understood with reference to fig. 5, in which a plurality of convolution kernels are illustrated. A convolutional neural network may include a plurality of hidden layers, also referred to as convolutional layers; each convolutional layer may include one or more convolution kernels, and the number of convolution kernels may be set according to the actual scene or experience. In the figure, the width of the convolution kernel is the length of the coding vector corresponding to a word, and the height of the convolution kernel is 3. In theory the height only needs to be greater than or equal to 2, but in general it takes an odd value greater than or equal to 3, for example 3, 5 or 7. This value determines how far the kernel reaches to the left and right of the word at its center: when the height is 3, 1 word is covered on each side, and when the height is 5, 2 words are covered on each side. It should be noted that the first word has no word on its left and the last word has no word on its right, so these positions need to be padded with 0. After the convolution operation is finished, the coding vector of each word has been convolved to obtain deeper semantic information of the word. Because this semantic information is mainly local, it is called local semantic information, and the convolution result is called a local semantic information vector.

In an alternative embodiment, after obtaining the local semantic information vector corresponding to each word, classification and identification may be performed on the obtained coding vector and local semantic information vector corresponding to each word to determine the entity category corresponding to each word.
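The convolution of step 404 can be sketched as follows, under stated assumptions: each kernel has an odd height and a width equal to the coding vector dimension, the sequence is zero-padded at both ends, and no activation function or bias is applied. The function names and the toy kernels are hypothetical; a real model would learn the kernel weights during training.

```python
def conv_over_words(coding_vectors, kernel):
    """Slide one kernel of shape (height, dim) over the word sequence.

    Zero rows are added before the first word and after the last word so
    that every word yields one output value (the height is assumed odd).
    """
    height = len(kernel)
    dim = len(coding_vectors[0])
    pad = height // 2
    zero = [0.0] * dim
    padded = [zero] * pad + coding_vectors + [zero] * pad
    outputs = []
    for center in range(len(coding_vectors)):
        window = padded[center:center + height]
        total = 0.0
        for row, weights in zip(window, kernel):
            total += sum(x * w for x, w in zip(row, weights))
        outputs.append(total)
    return outputs


def local_semantic_vectors(coding_vectors, kernels):
    """Stack the outputs of several kernels into one vector per word."""
    per_kernel = [conv_over_words(coding_vectors, k) for k in kernels]
    return [list(values) for values in zip(*per_kernel)]


# Three words with 2-dimensional coding vectors and two kernels of height 3.
coding_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sum_kernel = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]     # sums the whole window
center_kernel = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]  # looks at the center word only
local = local_semantic_vectors(coding_vectors, [sum_kernel, center_kernel])
```

With a kernel height of 3, each output value depends on one word to the left and one word to the right of the center word, and the zero rows stand in for the missing neighbors of the first and last words.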

405. Determine the global association information vector corresponding to each word according to the attention weights between the words.

406. Perform classification and recognition on the coding vector, the local semantic information vector and the global association information vector corresponding to each word by using a classifier to determine the classification output vector corresponding to each word, wherein different elements in the classification output vector correspond to the probability values of different preset entity classes.

407. Determine the category sequence with the highest transition probability score according to the transition matrix of the CRF model and the classification output vector corresponding to each word, wherein the mth category in the category sequence is the entity category corresponding to the mth word in the sentence.

Here, m ranges from 1 to the total number of words contained in the sentence.

408. Determine the Chinese named entity contained in the sentence according to the entity category corresponding to each word.
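The search in step 407 for the highest-scoring category sequence can be sketched with Viterbi-style dynamic programming. The original text does not name the search algorithm, so this is one common choice rather than the method of the embodiment. The sketch assumes that the classification outputs and the transition matrix are additive scores (for example log-probabilities); the function name and the toy numbers are hypothetical.

```python
def viterbi_decode(emissions, transitions):
    """Return the category index sequence with the highest total score.

    emissions[m][c]: score of category c for the m-th word (classifier output).
    transitions[p][c]: score of moving from category p to category c.
    """
    num_classes = len(transitions)
    best = list(emissions[0])
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = [], []
        for c in range(num_classes):
            # Best previous category when the current word takes category c.
            prev = max(range(num_classes), key=lambda p: best[p] + transitions[p][c])
            new_best.append(best[prev] + transitions[prev][c] + scores[c])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Trace the best path backwards from the best final category.
    path = [max(range(num_classes), key=lambda c: best[c])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


emissions = [[2.0, 1.0], [1.0, 1.2], [2.0, 1.0]]
transitions = [[0.5, -1.0], [-1.0, 0.5]]   # staying in a category is rewarded
no_transitions = [[0.0, 0.0], [0.0, 0.0]]
with_crf = viterbi_decode(emissions, transitions)
without_crf = viterbi_decode(emissions, no_transitions)
```

In this toy case the per-word maximum alone would pick category 1 for the middle word, while the transition scores make the consistent sequence win, which illustrates why the transition matrix is combined with the classification output vectors.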

In order to further improve the accuracy of the classification result, after the local semantic information vector corresponding to each word is obtained, the global association information vector corresponding to each word can be further extracted. For a word, the corresponding global association information vector reflects the degree of influence, namely the weight, of the other words in the sentence to be recognized on the category recognition result of that word. Still taking "which songs were sung by Zhou Jielun at the Beijing concert" as an example, when determining the category corresponding to the entity Zhou Jielun, the word "song" clearly has a larger influence on the result than the other words. Therefore, increasing the weight of "song" when determining this category has a positive effect on the accuracy of the final recognition result.

Based on this, after obtaining the local semantic information vector corresponding to each word, optionally, according to the local semantic information vector corresponding to each word, the global association information vector corresponding to each word may be determined according to the attention weight between the words. Specifically, for any word i in the sentence, attention weights between any word i and other words can be determined firstly; then, carrying out normalization processing on each determined attention weight; and then carrying out weighted summation processing on the local semantic information vector of any word i according to each attention weight after normalization processing, wherein the result of the weighted summation processing is a global associated information vector corresponding to any word i.

In order to intuitively understand the process of determining the global association information vector, it is explained in detail below with reference to the following formula:

Specifically, for any word i in a sentence, the global association information vector $c_i$ corresponding to the word i is determined according to the following formula:

$$c_i = \sum_{j=1}^{K} a_{ij} h_j$$

where $a_{ij} = \mathrm{softmax}(e_{ij})$ and $e_{ij} = h_i^{T} \cdot h_j$.

Here, $h_i$ is the local semantic information vector of any word i, $h_j$ is the local semantic information vector of any word j in the sentence, $e_{ij}$ is the attention weight between word i and word j, $\cdot$ is the inner product operator, K is the total number of words contained in the sentence, T is the vector transpose operator, and softmax is the normalization operator.

It should be noted that j may take any value from 1 to K, i.e., j may be equal to i.

In practical application, for any word i, the attention weights between the words of the sentence and the word i can be obtained according to $e_{ij}$, and these attention weights form a weight sequence. A softmax operation is performed on the weight sequence, and the processed weight sequence is used to calculate $c_i$.
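The attention computation can be sketched as follows. The sketch assumes, as the symbol definitions above suggest, that the attention weight between word i and word j is the inner product of their local semantic information vectors and that the weighted sum runs over the local semantic information vectors of all words j from 1 to K, including j = i. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize a list of scores so that they are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def global_association_vectors(h):
    """Compute one global association information vector c_i per word.

    h: list of local semantic information vectors, one per word.
    """
    result = []
    for h_i in h:
        e_i = [dot(h_i, h_j) for h_j in h]   # attention weights e_ij
        a_i = softmax(e_i)                   # normalized weights a_ij
        c_i = [sum(a * h_j[d] for a, h_j in zip(a_i, h))
               for d in range(len(h_i))]     # weighted sum over all words j
        result.append(c_i)
    return result


h = [[1.0, 0.0], [0.0, 1.0]]
c = global_association_vectors(h)
```

For the first word in this toy input, the weights are softmax([1, 0]), so its own vector contributes more to c_1 than the vector of the other word.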

Taking the sentence "i want to leave" shown in fig. 3 as an example, when the global information of the "next" word is calculated, the attention weight calculation shows that the "next" word is strongly influenced by the "going" word, so that more information about "going" (with a greater weight) is applied to the category identification process of "next".

It is understood that the above-mentioned extraction of the global correlation information can also be implemented based on a convolutional neural network, because the above-mentioned operations such as convolutional operation can be conveniently implemented based on a convolutional kernel structure in the convolutional neural network.

However, in practical application, a convolutional neural network may suffer from vanishing gradients during back propagation. Therefore, in the embodiment of the present invention, in order to improve the efficiency of identifying the Chinese named entity class, a deep residual network (which may be regarded as a special convolutional neural network) may be used to extract the local semantic information vector and the global association information vector.

The extraction of the local semantic information vector and the global association information vector by the deep residual network can be understood with reference to fig. 6. The deep residual network mainly comprises a local semantic information extraction layer for extracting the local semantic information vector and a global association information extraction layer for extracting the global association information vector, wherein x represents the coding vector corresponding to a certain word, f (x) represents the process of extracting the local semantic information vector through the convolution operation of a convolution kernel, and g (y) represents the process of extracting the global association information vector through the convolution operation.

Based on this, after the local semantic information vector and the global association information vector of each word are extracted through the deep residual network, the coding vector, the local semantic information vector and the global association information vector corresponding to each word are in effect organized together, and the organized result is embodied as the output vector of the deep residual network, namely z indicated in fig. 6.

The process of classifying and recognizing the output vector of the deep residual network can be understood with reference to fig. 7. In fig. 7, it is assumed that the received sentence to be recognized is "i want to leave". After the coding vector of each word is obtained (in the figure, the five words of the sentence stand for their respective coding vectors), the coding vectors are input to the deep residual network, whose output for each word is the vector z in fig. 6. These vectors are then input to a fully connected layer, which is the trained classifier. The classifier outputs the classification output vector corresponding to each word, in which different elements correspond to the probability values of different preset entity classes. Each classification output vector is then input to a CRF layer (i.e., a CRF model), which finally outputs the category sequence with the highest transition probability score, where the mth category in the category sequence is the entity category corresponding to the mth word in the sentence. The entity category corresponding to each word is thus obtained, and adjacent words with the same entity category are then spliced together to obtain the Chinese named entity corresponding to that entity category, which completes the Chinese named entity recognition of the sentence. The category sequence corresponds to the Chinese named entity labeling result in fig. 7. As can be seen from the labeling result, the sentence includes one entity word corresponding to the category endPoint: "leaving behind".
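The final splicing step, in which adjacent words with the same entity category are joined into one entity, can be sketched as follows. The sketch assumes that words that belong to no entity carry a dedicated category, here called "O"; this label and the function name are assumptions for illustration, while endPoint is the category name used in the example above.

```python
def merge_entities(words, categories, outside="O"):
    """Join runs of adjacent words that share an entity category.

    Returns a list of (entity text, category) pairs in sentence order.
    Words labeled with `outside` (an assumed non-entity label) are skipped.
    """
    entities = []
    current_words, current_cat = [], None
    for word, cat in zip(words, categories):
        if cat == current_cat and cat != outside:
            current_words.append(word)       # extend the running entity
            continue
        if current_words:
            entities.append(("".join(current_words), current_cat))
        if cat != outside:
            current_words, current_cat = [word], cat
        else:
            current_words, current_cat = [], None
    if current_words:
        entities.append(("".join(current_words), current_cat))
    return entities


# Five words; the last two share the category endPoint.
single = merge_entities(list("abcde"), ["O", "O", "O", "endPoint", "endPoint"])
# Two neighboring entities of different categories.
double = merge_entities(list("xxyyz"), ["time", "time", "place", "place", "O"])
```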

Fig. 8 is a flowchart of another Chinese named entity recognition method according to an embodiment of the present invention. As shown in fig. 8, the method may include the following steps:

801. Acquire the sentence to be recognized.

802. Remove the interfering words in the sentence.

803. Perform encoding processing on the remaining words to obtain the coding vectors respectively corresponding to the remaining words.

804. Perform classification and identification on the coding vectors to determine the entity categories respectively corresponding to the remaining words.

805. Determine the Chinese named entities contained in the sentence according to the entity categories corresponding to the remaining words.

Different from the foregoing embodiment in which encoding processing is performed on each word in the sentence to be recognized, in this embodiment, encoding processing may be performed on only a part of words in the sentence to be recognized, so that the amount of calculation for encoding processing may be reduced, and recognition efficiency may be improved.

Words with certain parts of speech can be preset as interfering words, such as function words, adjectives, numerals and pronouns. Therefore, the interfering words in the sentence to be recognized can be removed by performing part-of-speech recognition on the sentence, wherein the part-of-speech recognition process can be implemented with reference to existing related technology.
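The removal of interfering words can be sketched as a simple filter over part-of-speech tagged tokens. The sketch assumes that an existing part-of-speech tagger has already produced (token, tag) pairs; the tag names, the preset tag set and the function name are hypothetical.

```python
# Preset parts of speech treated as interfering (assumed tag names).
INTERFERING_POS = {"function", "adjective", "numeral", "pronoun"}


def remove_interfering_words(tagged_tokens, interfering_pos=INTERFERING_POS):
    """Keep only the tokens whose part of speech is not preset as interfering."""
    return [token for token, pos in tagged_tokens if pos not in interfering_pos]


tagged = [
    ("I", "pronoun"),
    ("want", "verb"),
    ("to", "function"),
    ("visit", "verb"),
    ("Hangzhou", "noun"),
]
remaining = remove_interfering_words(tagged)
```

Only the remaining tokens would then be passed on to the encoding step, which is where the reduction in computation comes from.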

The following briefly describes, with reference to the embodiment shown in fig. 9, the execution logic of the Chinese named entity recognition method according to the embodiment of the present invention in practical application.

Fig. 9 is an interaction flowchart of a Chinese named entity recognition method according to an embodiment of the present invention. As shown in fig. 9, the method includes the following steps:

901. The user terminal receives the sentence to be recognized input by the user.

The sentence to be recognized may be, for example, a sentence triggered by the user to inquire about weather conditions of a certain day and a certain place.

902. The user terminal sends the sentence to be recognized to the server.

The sentence to be recognized is sent to the server so that the server performs encoding processing on each word in the sentence and performs classification and recognition on the coding vectors obtained through the encoding processing to determine the entity category corresponding to each word.

In an optional embodiment, after receiving the sentence to be recognized, the user terminal may further remove the interfering words in the sentence, and then sequentially send the words remaining in the sentence after the interfering words are removed to the server, so that the server performs encoding processing on the remaining words and performs classification and recognition on the coding vectors obtained by the encoding processing to determine the entity categories corresponding to the remaining words. The interfering words may be function words and specific types of content words such as pronouns, adjectives and numerals.

By filtering the interference words of the sentence to be recognized, the calculation amount of subsequent coding processing and other processes can be reduced.

903. The server determines a service corresponding to the statement.

The server may determine a service program corresponding to the sentence by extracting a keyword from the sentence and matching the extracted keyword with a keyword included in a keyword database corresponding to each preset service program, for example, a weather query service program.

904. The server carries out coding processing on each word in the received sentence to obtain a coding vector corresponding to each word, carries out classification and identification on the coding vector to determine the entity category corresponding to each word, and determines the Chinese named entity contained in the sentence according to the entity category corresponding to each word.

905. The server queries the service program with the Chinese named entity as a query key to obtain a query response.

Assuming that the user's sentence is "what will the weather be like in Hangzhou tomorrow", the recognition result identifies the two categories of entities contained in it: the entity "tomorrow" of the time category and the entity "Hangzhou" of the place category. These two entities are used as the query keywords and input to the weather query service to obtain the weather query result for Hangzhou tomorrow.

906. The server sends the query response to the user terminal.
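Steps 903 and 905 can be sketched as follows. This is an illustration under stated assumptions: each service program has a small keyword database, the service with the most matching keywords is chosen, and the recognized entities are passed to the service as a category-to-text mapping. All names, the keyword databases and the stub weather service are hypothetical.

```python
# Hypothetical keyword databases, one per preset service program.
SERVICE_KEYWORDS = {
    "weather_query": {"weather", "temperature", "rain"},
    "music_query": {"song", "singer", "album"},
}


def determine_service(sentence_keywords, service_keywords=SERVICE_KEYWORDS):
    """Pick the service whose keyword database matches the most keywords."""
    best_service, best_hits = None, 0
    for service, keywords in service_keywords.items():
        hits = len(keywords & set(sentence_keywords))
        if hits > best_hits:
            best_service, best_hits = service, hits
    return best_service


def weather_query(query):
    """Stub service program; a real one would look up an actual forecast."""
    return "forecast for {place} on {time}".format(**query)


# Only the weather service is implemented in this sketch.
SERVICES = {"weather_query": weather_query}


def answer(sentence_keywords, entities):
    """Route the sentence to a service and query it with the named entities."""
    service = determine_service(sentence_keywords)
    if service not in SERVICES:
        return None
    query = {category: text for text, category in entities}
    return SERVICES[service](query)


response = answer(
    ["weather", "Hangzhou", "tomorrow"],
    [("tomorrow", "time"), ("Hangzhou", "place")],
)
```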

The Chinese named entity recognition device of one or more embodiments of the present invention is described in detail below. Those skilled in the art will appreciate that these chinese named entity recognition devices can each be constructed using commercially available hardware components configured through the steps taught in this scheme.

Fig. 10 is a schematic structural diagram of a Chinese named entity recognition device according to an embodiment of the present invention. As shown in fig. 10, the device includes: an obtaining module 11, an encoding module 12, a classification identification module 13 and a determining module 14.

The obtaining module 11 is configured to obtain a sentence to be identified.

The encoding module 12 is configured to perform encoding processing on each word in the sentence to obtain a coding vector corresponding to each word.

The classification identification module 13 is configured to perform classification and identification on the coding vectors to determine the entity category corresponding to each word.

The determining module 14 is configured to determine the Chinese named entity included in the sentence according to the entity category corresponding to each word.

Optionally, the classification identifying module 13 may be configured to: classifying and identifying the coding vectors by using a classifier to determine a classification output vector corresponding to each word, wherein different elements in the classification output vector correspond to the preset probability values of different entity classes; determining a category sequence with the highest transition probability score according to a transition matrix of a CRF model and a classification output vector corresponding to each word, wherein the mth category in the category sequence is an entity category corresponding to the mth word in the statement, and the value range of m is 1 to the total word number contained in the statement.

Optionally, the encoding module 12 may be configured to: perform word segmentation processing on the sentence; for any character i in the sentence, encode the character i to obtain a character vector, encode the segmented word to which the character i belongs to obtain a word vector, and encode the position number of the character i in the sentence to obtain a position vector; and splice the character vector, the word vector and the position vector to obtain the coding vector corresponding to the character i.

Optionally, the apparatus may further include: an extraction module 15.

The extraction module 15 may be configured to: extract the local semantic information vector corresponding to each word by using a convolutional neural network model, wherein the width of the convolution kernel adopted by the convolutional neural network model is equal to the dimension of the coding vector, and the height of the convolution kernel is greater than or equal to 2.

Accordingly, the classification identification module 13 may be configured to: perform classification and identification on the coding vector and the local semantic information vector to determine the entity category corresponding to each word.

Optionally, the extraction module 15 may be further configured to: determine the global association information vector corresponding to each word according to the attention weights between the words.

Accordingly, the classification identification module 13 may be configured to: perform classification and identification on the coding vector, the local semantic information vector and the global association information vector to determine the entity category corresponding to each word.

In the process of determining the global association information vector corresponding to each word according to the attention weights between the words, the extraction module 15 may be configured to: for any word i in the sentence, determine the attention weights between the word i and other words; perform normalization processing on each determined attention weight; and perform weighted summation processing on the local semantic information vector of any word i according to each attention weight after normalization processing, wherein the result of the weighted summation processing is the global association information vector corresponding to the word i.

Optionally, the extraction module 15 may be configured to: extract the local semantic information vector and the global association information vector through a deep residual network.

Accordingly, the classification identification module 13 may be configured to: perform classification and identification on the output vector of the deep residual network to determine the entity category corresponding to each word.

Optionally, the apparatus may further include: the service processing module is used for determining a service program corresponding to the statement; and querying the service program by taking the determined Chinese named entity contained in the sentence as a query keyword to obtain a query response.

The apparatus shown in fig. 10 can perform the method of the embodiment shown in fig. 1 to 9, and reference may be made to the related description of the embodiment shown in fig. 1 to 9 for a part not described in detail in this embodiment. The implementation process and technical effect of the technical solution refer to the descriptions in the embodiments shown in fig. 1 to fig. 9, and are not described herein again.

Having described the internal functions and structure of the Chinese named entity recognition device, in one possible design, the structure of the Chinese named entity recognition device may be implemented as an electronic device, which may be a server. As shown in fig. 11, the electronic device may include: a first processor 21 and a first memory 22. The first memory 22 is used for storing a program that supports the electronic device in executing the Chinese named entity recognition method provided in the embodiments shown in fig. 1 to 9, and the first processor 21 is configured to execute the program stored in the first memory 22.

The program comprises one or more computer instructions which, when executed by the first processor 21, are capable of performing the steps of:

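obtaining a sentence to be identified;

performing encoding processing on each word in the sentence to obtain a coding vector corresponding to each word;

and carrying out classified identification on the coding vectors to determine the corresponding category of each word.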

Optionally, the first processor 21 is further configured to perform all or part of the steps in the foregoing embodiments shown in fig. 1 to 9.

The electronic device may further include a first communication interface 23, which is used for the electronic device to communicate with other devices or a communication network.

In addition, an embodiment of the present invention provides a computer storage medium for storing computer software instructions for an electronic device, which includes a program for executing the Chinese named entity recognition method according to the method embodiments shown in fig. 1 to 9.

Fig. 12 is a schematic structural diagram of another Chinese named entity recognition device according to an embodiment of the present invention. As shown in fig. 12, the device includes: a receiving module 31 and a sending module 32.

The receiving module 31 is configured to receive a sentence to be recognized input by a user.

The sending module 32 is configured to send the sentence to be recognized to a server, so that the server performs encoding processing on each word in the sentence, and determines the chinese named entity included in the sentence to be recognized according to the entity category corresponding to each word obtained by performing classification and recognition on the encoding vector obtained by the encoding processing.

Optionally, the apparatus further comprises: a filtering module, configured to remove the interfering words in the sentence to be identified. Correspondingly, the sending module 32 is specifically configured to: sequentially send the words remaining in the sentence to be identified after the interfering words are removed to the server.

Optionally, the receiving module 31 may be further configured to: receive a query response sent by the server, wherein the query response is obtained by the server by using the Chinese named entity as a query keyword to query the service program corresponding to the sentence to be identified.

The apparatus shown in fig. 12 can execute the method of the embodiment shown in fig. 9, and reference may be made to the related description of the embodiment shown in fig. 9 for a part of this embodiment that is not described in detail. The implementation process and technical effect of the technical solution are described in the embodiment shown in fig. 9, and are not described herein again.

Having described the internal functions and structure of the Chinese named entity recognition device, in one possible design, the structure of the Chinese named entity recognition device may be implemented as an electronic device, which may be a user terminal. As shown in fig. 13, the electronic device may include: a second processor 41 and a second memory 42. The second memory 42 is used for storing a program that supports the electronic device in executing the Chinese named entity recognition method provided in the embodiment shown in fig. 9, and the second processor 41 is configured to execute the program stored in the second memory 42.

The program comprises one or more computer instructions which, when executed by the second processor 41, are capable of performing the steps of:

receiving a sentence to be recognized input by a user;

and sending the statement to be recognized to a server so that the server performs coding processing on each word in the statement and performs classification recognition on a coding vector obtained by the coding processing to determine the category corresponding to each word.

Optionally, the second processor 41 is further configured to perform all or part of the steps in the foregoing embodiment shown in fig. 9.

The electronic device may further include a second communication interface 43 for communicating with other devices or a communication network.

In addition, an embodiment of the present invention provides a computer storage medium for storing computer software instructions for an electronic device, which includes a program for executing the Chinese named entity recognition method according to the method embodiment shown in fig. 9.

Fig. 14 is a schematic composition diagram of a query system according to an embodiment of the present invention, and as shown in fig. 14, the query system includes: user terminal and server.

The user terminal is used for responding to a query statement input by a user, sending the query statement to the server, and receiving a query response sent by the server.

The query statement may be a statement to be identified in the foregoing embodiment, and the execution process of the user terminal and the server may refer to the relevant description in the foregoing embodiment, which is not described herein again. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by means of a necessary general-purpose hardware platform, and of course can also be implemented by a combination of hardware and software. With this understanding in mind, the essence of the above technical solutions, or the part of them that contributes to the prior art, may be embodied in the form of a computer program product, which may be embodied on one or more computer-usable storage media having computer-usable program code embodied therein, including without limitation disk storage, CD-ROM, optical storage, and the like.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer-readable medium does not include a transitory computer-readable medium such as a modulated data signal or a carrier wave.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A Chinese named entity recognition method, characterized by comprising the following steps:

obtaining a sentence to be identified;

2. The method of claim 1, further comprising:

performing word segmentation processing on the sentence;

wherein encoding each word in the sentence to obtain an encoding vector corresponding to each word comprises:

for any character i in the sentence, encoding the character i to obtain a character vector, encoding the segmented word to which the character i belongs to obtain a word vector, and encoding the position number of the character i in the sentence to obtain a position vector;

and concatenating the character vector, the word vector and the position vector to obtain the encoding vector corresponding to the character i.
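By way of illustration only, the concatenation described in claim 2 can be sketched in Python as follows. The sketch assumes that the "word" being encoded is a single Chinese character, as the reference to "character i" suggests. The helper names `make_table` and `encode_sentence`, the vector dimensions, and the randomly initialised lookup tables are hypothetical; in practice the tables would be learned, and the claim does not prescribe how.

```python
import random

def make_table(keys, dim, seed):
    """Hypothetical embedding table: maps each key to a vector of `dim` floats."""
    rng = random.Random(seed)
    return {key: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for key in keys}

def encode_sentence(segments, char_table, word_table, pos_table):
    """Return one encoding vector per character of the segmented sentence."""
    vectors = []
    position = 0
    for word in segments:
        for ch in word:
            char_vec = char_table[ch]      # vector of the character itself
            word_vec = word_table[word]    # vector of the segmented word containing it
            pos_vec = pos_table[position]  # vector of its position number in the sentence
            vectors.append(char_vec + word_vec + pos_vec)  # list concatenation
            position += 1
    return vectors

# Assumed segmentation result for an example sentence.
segments = ["我", "想听", "周杰伦"]
chars = [ch for word in segments for ch in word]
char_table = make_table(sorted(set(chars)), 4, seed=0)
word_table = make_table(segments, 3, seed=1)
pos_table = make_table(range(len(chars)), 2, seed=2)

encoded = encode_sentence(segments, char_table, word_table, pos_table)
```

Each of the six characters receives a 9-dimensional vector (4 + 3 + 2), and characters belonging to the same segmented word share the same word-vector portion.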

3. The method according to claim 1 or 2, characterized in that the method further comprises:

extracting a local semantic information vector corresponding to each word by using a convolutional neural network model, wherein the width of the convolution kernel adopted by the convolutional neural network model is equal to the dimension of the encoding vector, and the height of the convolution kernel is greater than or equal to 2;

wherein classifying and identifying the encoding vectors to determine the entity category corresponding to each word comprises:

classifying and identifying the encoding vector and the local semantic information vector to determine the entity category corresponding to each word.
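As a non-limiting sketch of the convolution in claim 3, the following Python function slides kernels over the sequence of encoding vectors. Each kernel spans the full encoding dimension (its width) and at least two consecutive positions (its height). The function name `local_semantic_vectors`, the zero padding that keeps one output per position, and the ReLU activation are assumptions made for the example and are not stated in the claim.

```python
def local_semantic_vectors(encoded, kernels):
    """encoded: n vectors of dimension d.
    kernels: list of kernels, each a list of h rows of d weights, with h >= 2.
    Returns n vectors, each holding one feature per kernel."""
    n, d = len(encoded), len(encoded[0])
    outputs = [[] for _ in range(n)]
    for kernel in kernels:
        h = len(kernel)
        # Kernel width must equal the encoding dimension; height must be at least 2.
        assert h >= 2 and all(len(row) == d for row in kernel)
        pad_top = (h - 1) // 2
        zero = [0.0] * d
        # Assumed zero padding so that every position yields exactly one output.
        padded = [zero] * pad_top + list(encoded) + [zero] * (h - 1 - pad_top)
        for i in range(n):
            total = 0.0
            for r in range(h):
                total += sum(w * x for w, x in zip(kernel[r], padded[i + r]))
            outputs[i].append(max(total, 0.0))  # ReLU activation (assumed)
    return outputs

toy_encoded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_kernels = [
    [[1, 1], [1, 1]],             # height 2
    [[1, 0], [0, 1], [-1, -1]],   # height 3
]
local_vecs = local_semantic_vectors(toy_encoded, toy_kernels)
```

Because each kernel covers the whole width of the encoding vector, it only moves along the sentence, so each kernel contributes one scalar feature per position.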

4. The method of claim 3, further comprising:

determining a global association information vector corresponding to each word according to attention weights among the words;

and classifying and identifying the encoding vector, the local semantic information vector and the global association information vector to determine the entity category corresponding to each word.

5. The method of claim 4, wherein the determining the global association information vector corresponding to each word according to the attention weight between the words comprises:

for any word i in the sentence, determining the attention weights between the word i and the other words;

normalizing each of the determined attention weights;

and performing weighted summation on the local semantic information vector of the word i according to the normalized attention weights, wherein the result of the weighted summation is the global association information vector corresponding to the word i.
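The three steps of claim 5 can be illustrated, under stated assumptions, by the Python sketch below. The claim does not say how the attention weight between two words is computed or normalized; the sketch assumes a dot product of local semantic information vectors as the raw weight and a softmax as the normalization. It also reads the weighted summation as combining the local semantic information vectors of the other words. The function name `global_association_vectors` is hypothetical.

```python
import math

def global_association_vectors(local_vecs):
    """Return one global association vector per position."""
    n = len(local_vecs)
    dim = len(local_vecs[0])
    result = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        if not others:
            # A one-word sentence has no other words to attend to.
            result.append([0.0] * dim)
            continue
        # Step 1: raw attention weight between word i and each other word (assumed dot product).
        scores = [sum(a * b for a, b in zip(local_vecs[i], local_vecs[j])) for j in others]
        # Step 2: normalization (assumed softmax, shifted by the maximum for numerical stability).
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Step 3: weighted summation of local semantic information vectors.
        result.append([
            sum(w * local_vecs[j][k] for w, j in zip(weights, others))
            for k in range(dim)
        ])
    return result

toy_local = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
global_vecs = global_association_vectors(toy_local)
```

In this toy input, the first and third vectors are identical, so the first position attends more strongly to the third than to the second.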

6. The method of claim 4, further comprising:

extracting the local semantic information vector and the global association information vector through a deep residual network;

wherein classifying and identifying the encoding vector, the local semantic information vector and the global association information vector to determine the entity category corresponding to each word comprises:

classifying and identifying the output vector of the deep residual network to determine the entity category corresponding to each word.

7. The method according to any one of claims 1 to 6, wherein the classifying and identifying the encoding vectors to determine the entity class corresponding to each word comprises:

classifying and identifying the encoding vectors by using a classifier to determine a classification output vector corresponding to each word, wherein different elements in the classification output vector correspond to the probability values of different preset entity categories;

determining the category sequence with the highest transition probability score according to a transition matrix of a CRF model and the classification output vector corresponding to each word, wherein the m-th category in the category sequence is the entity category corresponding to the m-th word in the sentence, and m ranges from 1 to the total number of words contained in the sentence.
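The search for the highest-scoring category sequence in claim 7 is commonly carried out with Viterbi decoding. The following Python sketch is one possible illustration, not necessarily the procedure used in the embodiments. It assumes that the classification outputs and the transition matrix are expressed as additive scores (for example, log-probabilities). The function name `viterbi_decode`, the label set and all numeric values are hypothetical.

```python
def viterbi_decode(emissions, transitions):
    """emissions[m][c]: score of category c for the m-th word.
    transitions[p][c]: score of moving from category p to category c.
    Returns the index sequence of categories with the highest total score."""
    num_classes = len(transitions)
    best = list(emissions[0])  # best score of any path ending in each category
    backpointers = []
    for scores in emissions[1:]:
        step_best, step_back = [], []
        for c in range(num_classes):
            prev = max(range(num_classes), key=lambda p: best[p] + transitions[p][c])
            step_back.append(prev)
            step_best.append(best[prev] + transitions[prev][c] + scores[c])
        best = step_best
        backpointers.append(step_back)
    # Trace back from the best final category.
    path = [max(range(num_classes), key=lambda c: best[c])]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path

labels = ["O", "B-PER", "I-PER"]   # hypothetical entity categories
emissions = [
    [0.1, 2.0, 0.0],
    [1.0, 0.0, 0.9],   # taken alone, this word would be labelled "O"
    [0.0, 0.0, 2.0],
]
transitions = [
    [0.0, 0.0, -10.0],  # O -> I-PER is heavily penalised
    [0.0, -1.0, 1.0],   # B-PER -> I-PER is favoured
    [0.0, 0.0, 1.0],    # I-PER -> I-PER is favoured
]
decoded = [labels[c] for c in viterbi_decode(emissions, transitions)]
```

With these values the second word is decoded as `I-PER` rather than `O`, which shows how the transition matrix can override a per-word classification that would otherwise produce an inconsistent sequence.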

8. The method according to any one of claims 1 to 6, further comprising:

determining a service program corresponding to the sentence;

and querying the service program by taking the determined Chinese named entity contained in the sentence as a query keyword to obtain a query response.

9. A Chinese named entity recognition method, characterized by comprising the following steps:

obtaining a sentence to be identified;

removing interference words from the sentence;

encoding the remaining words to obtain encoding vectors corresponding to the remaining words;

10. A Chinese named entity recognition device, comprising:

an acquisition module, configured to obtain a sentence to be identified;

11. An electronic device, comprising: a memory, a processor; wherein,

the memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the processor, implement the Chinese named entity recognition method according to any one of claims 1 to 8.

12. A Chinese named entity recognition method, characterized by comprising the following steps:

receiving a sentence to be recognized input by a user;

13. The method of claim 12, further comprising:

removing interference words from the sentence to be identified;

wherein sending the sentence to be identified to a server comprises:

sequentially sending to the server the words remaining in the sentence to be identified after the interference words are removed.

14. The method of claim 12, further comprising:

receiving a query response sent by the server, wherein the query response is obtained by the server by querying a service program corresponding to the sentence to be identified, using the Chinese named entity as a query keyword.

15. A Chinese named entity recognition device, comprising:

16. A query system, comprising:

a user terminal and a server;