CN116702780A - Chinese named entity recognition method, device, electronic equipment and storage medium - Google Patents

Chinese named entity recognition method, device, electronic equipment and storage medium

Info

Publication number
CN116702780A
CN116702780A (application CN202310728860.6A)
Authority
CN
China
Prior art keywords
graph
text
node
edge
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310728860.6A
Other languages
Chinese (zh)
Inventor
刘羲
马英宁
舒畅
陈又新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310728860.6A priority Critical patent/CN116702780A/en
Publication of CN116702780A publication Critical patent/CN116702780A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/901 Indexing; Data structures therefor; Storage structures
    • G06F 16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Character Discrimination (AREA)

Abstract

The invention relates to the medical field and to the field of natural language processing, and discloses a Chinese named entity recognition method comprising the following steps: generating a forward graph from all the nodes of the character sequence of a text to be recognized together with a first edge, a second edge and a third edge derived with the help of a preset entity dictionary; performing vector extraction on the forward graph with the feature extraction module of a preset graph neural network to obtain a first vector feature of the forward graph; and recognizing the first vector feature with the recognition module of the graph neural network to obtain the named entity result of the text to be recognized. Applied to the digital medical field, the method processes the first vector representation of the forward graph by combining the graph neural network with the entity dictionary to obtain the named entity result of the text to be recognized, which improves the accuracy of recognizing named entities in electronic medical records and medical texts and reduces unnecessary problems in medical retrieval and medical data archiving.

Description

Chinese named entity recognition method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of medical treatment and natural language processing, and in particular, to a method and apparatus for identifying a Chinese named entity, an electronic device, and a storage medium.
Background
Named Entity Recognition (NER) is a technique for identifying, from text, the named entities related to specified information. With the rapid development of internet technology, medical documents in the digital medical field are gradually becoming electronic, and these medical electronic documents are specialized and may contain a large number of medical nouns and professional terms.
The Chinese embedding features (embeddings) of a medical electronic document are typically processed with a character-granularity recognition method to derive the medical named entities.
For example, in the medical field a patient's medical electronic document needs to be filed in the database of a medical institution. Because the existing recognition methods are inaccurate, the sentence "the patient has a flat lower abdomen after rectal surgery, no abdominal pain, abdominal distension or discomfort, and no varicose veins on the abdominal wall" may yield named entities such as "surgery", "abdominal pain" and "abdominal distension", so the filed medical electronic document is frequently misclassified. The character-granularity recognition method requires neither Chinese word segmentation nor any processing step that considers word-segmentation boundaries, and directly represents each Chinese character of the input as a token, which easily blurs the entity boundaries of the medical nouns and professional terms in medical electronic documents.
Disclosure of Invention
In view of the above, it is necessary to provide a Chinese named entity recognition method that solves the technical problem of blurred entity boundaries in prior-art character-granularity named entity recognition methods.
The invention provides a Chinese named entity recognition method, which comprises the following steps:
acquiring a character sequence of a text to be recognized, taking each character contained in the character sequence as a node, and generating a first edge of a forward graph to be constructed according to a preset relationship between adjacent nodes in the character sequence;
generating a second edge of the forward graph to be constructed according to a preset entity dictionary and the preset relationship between the first node of the character sequence and the other nodes except the first node, and generating a third edge of the forward graph to be constructed according to the entity dictionary and the preset relationship between the last node of the character sequence and the other nodes except the last node;
generating the forward graph according to all the nodes, the first edge, the second edge and the third edge, and performing vector extraction on the forward graph with the feature extraction module of a preset graph neural network to obtain a first vector feature of the forward graph;
and recognizing the first vector feature with the recognition module of the graph neural network to obtain the named entity result of the text to be recognized.
Optionally, before the acquiring the character sequence of the text to be recognized, the method further includes:
and cutting characters of the text to be recognized to obtain a character sequence of the text to be recognized.
Optionally, the generating the first edge of the forward graph to be constructed according to the preset relationship between the adjacent nodes in the character sequence includes:
calculating a first similarity between every two adjacent nodes;
and if the calculated value of the first similarity is greater than the threshold of the corresponding entry in a preset vocabulary, generating the first edge between the adjacent nodes.
Optionally, the entity dictionary is obtained according to the following method:
performing text labeling on the unlabeled text data set to obtain a labeled text data set;
and performing word segmentation and screening on the marked text data set to obtain more than one phrase, and performing vector embedding operation on all phrases to obtain the entity dictionary.
Optionally, the generating the second edge of the forward graph to be constructed according to the preset entity dictionary and the preset relationship between the first node of the character sequence and the other nodes except the first node includes:
calculating the second similarity between the first node and the other nodes, and, if the calculated value of the second similarity is greater than the threshold of the corresponding phrase in the entity dictionary, generating the second edge between the first node and the corresponding node.
Optionally, the generating the third edge of the forward graph to be constructed according to the entity dictionary and the preset relationship between the last node of the character sequence and the other nodes except the last node includes:
calculating the third similarity between the last node and the other nodes, and, if the calculated value of the third similarity is greater than the threshold of the corresponding phrase in the entity dictionary, generating the third edge between the last node and the corresponding node.
Optionally, the vector extraction of the forward graph by using a feature extraction module of a preset graph neural network, to obtain a first vector feature of the forward graph, includes:
constructing a relay node of the forward graph by using a self-attention feature extraction layer of the feature extraction module, and updating the forward graph by using the relay node to generate a reverse graph;
and splicing the vector features of the forward graph and the vector features of the reverse graph by using a feature fusion layer of the feature extraction module to obtain the first vector feature.
In order to solve the above problems, the present invention further provides a device for identifying a chinese named entity, the device comprising:
the first generation module is used for acquiring a character sequence of a text to be recognized, taking each character contained in the character sequence as a node, and generating a first edge of a forward graph to be constructed according to a preset relationship between adjacent nodes in the character sequence;
the second generating module is used for generating a second edge of the forward graph to be constructed according to a preset entity dictionary and a preset relationship between the first node of the character sequence and the other nodes except the first node, and generating a third edge of the forward graph to be constructed according to the entity dictionary and a preset relationship between the last node of the character sequence and the other nodes except the last node;
the extraction module is used for generating the forward graph according to all the nodes, the first edge, the second edge and the third edge, and extracting vectors from the forward graph by utilizing a characteristic extraction module of a preset graph neural network to obtain a first vector characteristic of the forward graph;
and the output module is used for recognizing the first vector feature by using the recognition module of the graph neural network to obtain the named entity result of the text to be recognized.
In order to solve the above-mentioned problems, the present invention also provides an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a Chinese named entity recognition program executable by the at least one processor, and the Chinese named entity recognition program is executed by the at least one processor to enable the at least one processor to perform the Chinese named entity recognition method described above.
In order to solve the above problems, the present invention further provides a computer readable storage medium, where a chinese named entity recognition program is stored, where the chinese named entity recognition program may be executed by one or more processors to implement the above chinese named entity recognition method.
Compared with the prior art, the method and the device acquire the character sequence of the text to be recognized, take each character contained in the character sequence as a node, and generate the first edge of the forward graph to be constructed according to the preset relationship between adjacent nodes in the character sequence; generate the second edge of the forward graph to be constructed according to the preset entity dictionary and the preset relationship between the first node of the character sequence and the other nodes except the first node, and generate the third edge of the forward graph to be constructed according to the entity dictionary and the preset relationship between the last node of the character sequence and the other nodes except the last node; and generate the forward graph according to all the nodes, the first edge, the second edge and the third edge. When the text to be recognized is converted into the forward graph, the vocabulary information of the entity dictionary is merged into the forward graph, so that the spatial features of the forward graph and the relationship features of adjacent nodes can be presented accurately.
Vector extraction is then performed on the forward graph with the feature extraction module of the preset graph neural network to obtain the first vector feature of the forward graph, and the first vector feature is recognized with the recognition module of the graph neural network to obtain the named entity result of the text to be recognized. By exploiting the characteristics of the graph neural network, the fused information of the forward graph and the entity dictionary is learned, which effectively solves the prior-art problems that character-granularity recognition methods lack word information and that word-granularity methods suffer from word-segmentation boundary errors.
This improves the accuracy of recognizing named entities in electronic medical records and medical texts, and reduces unnecessary problems in medical retrieval and medical data archiving.
Drawings
FIG. 1 is a flowchart illustrating a method for identifying a Chinese named entity according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a device for identifying Chinese named entities according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an electronic device for implementing a method for identifying a Chinese named entity according to an embodiment of the present invention;
the achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that the descriptions "first", "second" and the like in this disclosure are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, provided that the combination can be realized by those skilled in the art; when the technical solutions are contradictory or cannot be realized, the combination should be regarded as absent and as outside the scope of protection claimed in the present invention.
With the rapid development of the digital medical field in mind, the invention provides a Chinese named entity recognition method applicable to that field. A forward graph is generated by combining the character sequence of the text to be recognized with a preset entity dictionary, ensuring that each node of the forward graph can represent a nested phrase of the entity dictionary, so that the spatial features of the forward graph and the relationship features of adjacent nodes are presented accurately.
The feature extraction module of the graph neural network can process the interactions between entities, and the recognition module of the graph neural network has strong characterization capability and takes global information into account, which mitigates the label-bias problem; processing the first vector representation of the forward graph in this way to obtain the named entity result therefore improves the accuracy of the entity boundaries.
In the medical field, entity mentions related to medical clinical work are identified and extracted from the plain-text documents of electronic medical records and classified into predefined categories, which improves the accuracy of recognizing named entities in electronic medical records or medical texts and reduces unnecessary problems in medical retrieval and medical data archiving.
Referring to fig. 1, a flow chart of a method for identifying a chinese named entity according to an embodiment of the invention is shown. The method is performed by an electronic device.
In this embodiment, the method for identifying the Chinese named entity includes:
s1, acquiring a character sequence of a text to be recognized, taking each character contained in the character sequence as a node, and generating a first edge of a forward graph to be constructed according to a preset relation between adjacent nodes in the character sequence.
In this embodiment, the text to be recognized refers to a document awaiting Chinese named entity recognition, or to a Chinese electronic medical record in the database of a medical institution or an insurance institution; the text to be recognized may include documents from various fields, and the way of obtaining it (for example, from databases of various fields) is not limited here. For example, in medical institution A, after a doctor enters a patient's symptoms into the system on a computer, a preset computer program turns the entered symptoms into a text to be recognized; alternatively, documents such as medical papers published on the internet in the digital medical field can be obtained from the databases of insurance or medical institutions.
In this embodiment, the text to be recognized corresponding to a text picture may also be obtained by performing character recognition on the textual content of the picture.
The textual content of the text to be recognized is converted into a character sequence, each character of the character sequence is taken as a node of the forward graph to be constructed, and a preset relationship (for example, a similarity) between adjacent nodes is calculated to generate the first edges of the forward graph to be constructed.
In one embodiment, before the obtaining the character sequence of the text to be recognized, the method further comprises:
and cutting characters of the text to be recognized to obtain a character sequence of the text to be recognized.
In one embodiment, the performing character cutting on the text to be recognized to obtain a character sequence of the text to be recognized includes:
preprocessing the text to be identified to obtain a text sentence;
and cutting each word in the text sentence in turn to obtain the character sequence.
The preprocessing comprises segmenting the text to be recognized with a preset word-segmentation algorithm and deleting the stop words produced by the segmentation to obtain text sentences.
The dictionary library of the word-segmentation algorithm may also be a general-domain dictionary, for example one derived from Wikipedia; the word-segmentation algorithm itself is not described further here. In the step of deleting stop words after segmentation, the stop words corresponding to the segmented words are looked up in the dictionary library of the word-segmentation algorithm and then deleted.
Stop words are certain characters or words that, in text retrieval, are automatically filtered out before or after processing natural-language data (or text) in order to save storage space and improve search efficiency. Segmentation produced by the word-segmentation algorithm achieves a better natural-language-processing effect and helps the computer understand complex Chinese.
Each character in the text sentence is then cut in turn to obtain a number of individual characters, and all the characters together form the character sequence of the text to be recognized.
For example, if the text to be recognized is the greeting "Xiao Huang, have you caught a cold?", character cutting splits it into its individual Chinese characters in order, one character per node, punctuation included.
Likewise, if the text to be recognized is "no abdominal pain, abdominal distension or discomfort" (无腹痛腹胀不适), character cutting yields the character sequence 无 -> 腹 -> 痛 -> 腹 -> 胀 -> 不 -> 适.
Character cutting is therefore applied to the text to be recognized so that every character in it is separated, which improves the sensitivity of the recognition to individual characters in the text.
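As a minimal illustration of the preprocessing and character-cutting step described above, the following Python sketch splits a sentence into its character sequence; the function names and the tiny stop-word list are illustrative assumptions and are not part of the claimed method.

```python
# Illustrative sketch of character cutting; names and the stop-word list are
# assumptions for demonstration only (real stop-word removal would follow
# word segmentation rather than work character by character).
STOP_WORDS = {"的", "了", "和"}  # example stop words; a real list would be larger

def preprocess(text: str) -> str:
    """Drop stop words from the text to be recognized before cutting."""
    return "".join(ch for ch in text if ch not in STOP_WORDS)

def cut_characters(text: str) -> list[str]:
    """Cut the text sentence into an ordered character sequence, one node per character."""
    return [ch for ch in preprocess(text) if not ch.isspace()]

print(cut_characters("无腹痛腹胀不适"))  # ['无', '腹', '痛', '腹', '胀', '不', '适']
```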
In one embodiment, the generating the first edge of the forward graph to be constructed according to the preset relationship between the adjacent nodes in the character sequence includes:
calculating a first similarity between every two adjacent nodes;
and if the calculated value of the first similarity is greater than the threshold of the corresponding entry in a preset vocabulary, generating the first edge between the adjacent nodes.
A first similarity (for example, a cosine similarity) is calculated between every two adjacent nodes in the character sequence, yielding a set of first-similarity values.
For each pair of adjacent nodes, the corresponding entry is looked up in the preset vocabulary, and if the calculated first similarity of the pair is greater than the threshold of that entry, a connecting first edge is generated between the two adjacent nodes.
The vocabulary is constructed on the basis of a prefix dictionary, that is, a text file; unregistered new words that cannot be segmented in the application can be added to or modified in the dictionary in a user-defined manner. The vocabulary has the same form as the dictionary text file: one word per line, each line divided into three parts (the word, the word frequency, which may be omitted, and the part of speech, which may be omitted) separated by spaces; when the word frequency is omitted it is calculated automatically, so that a word frequency is available for every entry, which improves the accuracy of judging whether adjacent nodes are related.
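The first-edge rule above can be sketched as follows; the character vectors, the cosine similarity measure and the per-entry thresholds are assumptions used only to illustrate the comparison against the preset vocabulary.

```python
# Hedged sketch of first-edge generation between adjacent character nodes.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def first_edges(char_vecs: list[np.ndarray], thresholds: list[float]):
    """Connect node i and node i+1 when their first similarity exceeds the
    threshold of the corresponding vocabulary entry (thresholds[i])."""
    edges = []
    for i in range(len(char_vecs) - 1):
        if cosine(char_vecs[i], char_vecs[i + 1]) > thresholds[i]:
            edges.append((i, i + 1))
    return edges
```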
S2, generating a second edge of the forward graph to be constructed according to a preset relation between a preset entity dictionary and a first node of the character sequence and other nodes except the first node, and generating a third edge of the forward graph to be constructed according to a preset relation between the entity dictionary and a last node of the character sequence and other nodes except the last node.
In this embodiment, the preset entity dictionary is constructed based on a large number of labeled text data, and the labeled text data is screened for phrases having a critical meaning.
The second edge is generated according to a preset relationship (for example, the preset relationship is a similarity) between the first node and other nodes except the first node. And generating a third side according to the preset relation between the last node and other nodes except the last node.
For example, an entity dictionary preset by a medical institution contains manually labeled entities (covering six major classes such as disease, diagnosis, examination, operation, medicine and anatomical site, each major class containing a number of minor classes).
According to the entity dictionary preset and constructed by the medical institution and the character sequence of the text to be recognized, for example 无 -> 腹 -> 痛 -> 腹 -> 胀 -> 不 -> 适 ("no abdominal pain, abdominal distension or discomfort"), the forward graph to be constructed for the text to be recognized is generated.
The forward graph carries the knowledge-graph, information-retrieval and text-understanding information of the symptom keywords of the medical text. That is, any two non-adjacent nodes of the character sequence are connected through the second and third edges, and these edges represent features of phrases of the entity dictionary that may occur in the text to be recognized, which improves the accuracy of judging whether any two non-adjacent nodes are related.
In one embodiment, the entity dictionary is obtained as follows:
performing text labeling on the unlabeled text data set to obtain a labeled text data set;
and performing word segmentation and screening on the marked text data set to obtain more than one phrase, and performing vector embedding operation on all phrases to obtain the entity dictionary.
A preset number (for example, 100,000) of unlabeled text items are collected as the unlabeled text data set. A small amount of the unlabeled text data can be labeled manually, a deep network model learns the text semantic representation of this small manually labeled portion, and the trained deep network model then labels the remaining unlabeled text data, which yields the labeled text data set.
The labeled text data set is segmented and screened with a preset corpus (such as a UGC corpus) to obtain phrases with key meaning, the types of the phrases are predicted with a preset prediction model (such as AutoNER), and a vector embedding operation is performed on the typed phrases to construct the entity dictionary, which extends the searchability and practicality of the entity dictionary.
After the entity dictionary is obtained, the use mode of the entity dictionary is selected according to the task characteristics and the data characteristics.
Sometimes, quality evaluation, screening and classification are required to be performed on entity dictionaries, for example, entities with high ambiguity form a dictionary, entities with low ambiguity form a dictionary, new words form a dictionary, and the like, so that the dictionary is improved to adapt to different practical scenes.
Quality assessment of a dictionary covers dictionary size, dictionary accuracy, coverage, entity ambiguity and entity frequency.
As for the way it is applied, the dictionary can be merged into the NER model, used for stand-alone matching, or matched first and then combined with the model.
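The quality indicators listed above could be computed along the following lines; the concrete formulas are assumptions, since the patent only names the indicators.

```python
# Illustrative computation of dictionary quality indicators (size, accuracy,
# coverage, entity ambiguity, entity frequency); formulas are assumptions.
from collections import Counter, defaultdict

def dictionary_quality(dictionary: dict[str, str],
                       corpus_mentions: list[tuple[str, str]]) -> dict:
    """dictionary: phrase -> label; corpus_mentions: (mention, gold label) pairs."""
    covered = [(m, g) for m, g in corpus_mentions if m in dictionary]
    labels_seen = defaultdict(set)
    for m, g in corpus_mentions:
        labels_seen[m].add(g)
    return {
        "size": len(dictionary),
        "coverage": len(covered) / max(len(corpus_mentions), 1),
        "accuracy": sum(dictionary[m] == g for m, g in covered) / max(len(covered), 1),
        "entity_frequency": Counter(m for m, _ in covered),
        "ambiguity": sum(len(labels_seen[p]) > 1 for p in dictionary) / max(len(dictionary), 1),
    }
```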
In one embodiment, the generating the second edge of the forward graph to be constructed according to the preset entity dictionary and the preset relationship between the first node of the character sequence and the other nodes except the first node includes:
calculating the second similarity between the first node and the other nodes, and, if the calculated value of the second similarity is greater than the threshold of the corresponding phrase in the entity dictionary, generating the second edge between the first node and the corresponding node.
In one embodiment, the generating the third edge of the forward graph to be constructed according to the entity dictionary and the preset relationship between the last node of the character sequence and the other nodes except the last node includes:
calculating the third similarity between the last node and the other nodes, and, if the calculated value of the third similarity is greater than the threshold of the corresponding phrase in the entity dictionary, generating the third edge between the last node and the corresponding node.
That is, any two non-adjacent nodes of the character sequence are connected through the second edge and the third edge, which represent features of phrases of the entity dictionary that may occur in the text to be recognized, and this improves the accuracy of judging whether any two non-adjacent nodes are related.
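The following sketch mirrors the second- and third-edge rules: the first node and the last node of the sequence are compared with every other node, and an edge is added when the similarity exceeds the threshold of the matching entity-dictionary phrase. The single shared threshold used here is a simplifying assumption.

```python
# Hedged sketch of second- and third-edge generation (step S2).
import numpy as np

def boundary_edges(char_vecs: list[np.ndarray], phrase_threshold: float = 0.5):
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
    n = len(char_vecs)
    # second edges: first node of the sequence vs. every other node
    second = [(0, j) for j in range(1, n)
              if cosine(char_vecs[0], char_vecs[j]) > phrase_threshold]
    # third edges: last node of the sequence vs. every other node
    third = [(n - 1, j) for j in range(n - 1)
             if cosine(char_vecs[n - 1], char_vecs[j]) > phrase_threshold]
    return second, third
```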
And S3, generating the forward graph according to all the nodes, the first edge, the second edge and the third edge, and extracting vectors from the forward graph by using a characteristic extraction module of a preset graph neural network to obtain a first vector characteristic of the forward graph.
In this embodiment, the forward graph is generated according to all the nodes, the first edge, the second edge and the third edge; that is, the text to be recognized is converted into a forward (directed) graph in which each character of the text to be recognized corresponds to one node, and each edge connects the first character and the last character of a word.
The state of the i-th node of the forward graph represents the feature of the i-th character in the character sequence, and the state of each edge represents the feature of a potential phrase that may appear in the entity dictionary.
The graph neural network is trained on a large number of training samples from a medical institution (for example, 50,000 medical records). It is oriented toward medical entity recognition and attribute extraction for Chinese electronic medical records, and can identify and extract the entity mentions related to medical clinical work from a given plain-text electronic medical record and classify them into predefined categories.
The feature extraction module is a graph neural network capable of processing the interaction information between entities; a GNN can extract the spatial features of the forward graph (both global and local features), find the relationship features of adjacent nodes, and convolve the spatial features with the relationship features to obtain the first vector representation.
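A minimal sketch of assembling the forward graph from the nodes and the three edge sets, together with one graph-convolution step over it, is shown below; the adjacency-matrix representation and the simple normalized convolution are assumptions standing in for the unspecified GNN architecture.

```python
# Hedged sketch: forward-graph assembly and one spatial-feature convolution step.
import numpy as np

def build_forward_graph(num_nodes: int,
                        first_edges, second_edges, third_edges) -> np.ndarray:
    adj = np.zeros((num_nodes, num_nodes), dtype=np.float32)
    for i, j in list(first_edges) + list(second_edges) + list(third_edges):
        adj[i, j] = 1.0          # directed edge i -> j of the forward graph
    return adj

def gcn_layer(adj: np.ndarray, node_feats: np.ndarray, weight: np.ndarray):
    """One normalized graph-convolution step over the given graph."""
    adj_hat = adj + np.eye(adj.shape[0], dtype=adj.dtype)   # add self-loops
    deg = adj_hat.sum(axis=1, keepdims=True)
    return np.tanh((adj_hat / deg) @ node_feats @ weight)
```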
In one embodiment, the vector extraction of the forward graph by the feature extraction module of the preset graph neural network, to obtain a first vector feature of the forward graph, includes:
Constructing a relay node of the forward graph by using a self-attention feature extraction layer of the feature extraction module, and updating the forward graph by using the relay node to generate a reverse graph;
and splicing the vector features of the forward graph and the vector features of the reverse graph by using a feature fusion layer of the feature extraction module to obtain the first vector features.
The feature extraction module comprises at least one self-attention feature extraction layer and at least one feature fusion layer; the self-attention feature extraction layer consists of M parallel feature extraction sub-networks SA based on different distances of multi-head self-attention, and each feature extraction sub-network SA consists of an H-head self-attention model.
The relationship features of adjacent nodes of the forward graph are extracted with the self-attention feature extraction layer of the feature extraction module, and the node with the largest relationship-feature value is selected as the global relay node of the forward graph; the relay node can capture the spatial features of the forward graph (both global and local features).
The relay node is connected to every edge and every node of the forward graph so as to gather the information of all edges and nodes as global information, and this global information is used to eliminate boundary ambiguity between the words of the text to be recognized.
Because of the global relay node, any two non-adjacent nodes of the forward graph are second-order neighbours of each other and can receive each other's non-local information through two node updates; in other words, the relay node captures both the global context information and the local information of the forward graph.
The graph structure of the forward graph is then transposed to obtain a reverse graph in which all edges are reversed, and the feature fusion layer of the feature extraction module convolves and splices the vector features corresponding to the forward graph and the reverse graph to obtain the first vector feature of the forward graph.
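The data flow of this embodiment (relay node, reverse graph, feature fusion) can be sketched as follows, reusing the gcn_layer helper from the previous sketch as a stand-in for the self-attention layers; the mean-initialized relay-node feature and the concatenation-based fusion are assumptions.

```python
# Hedged sketch of the feature extraction module's data flow.
import numpy as np

def add_relay_node(adj: np.ndarray) -> np.ndarray:
    n = adj.shape[0]
    out = np.zeros((n + 1, n + 1), dtype=adj.dtype)
    out[:n, :n] = adj
    out[n, :n] = 1.0   # relay node gathers information from every node
    out[:n, n] = 1.0   # and redistributes global context back to every node
    return out

def first_vector_feature(adj, node_feats, weight, layer):
    """layer: a graph layer such as gcn_layer(adj, feats, weight) from above."""
    forward = add_relay_node(adj)
    reverse = forward.T                                     # reverse graph: all edges flipped
    feats = np.vstack([node_feats, node_feats.mean(0, keepdims=True)])
    h_fwd = layer(forward, feats, weight)
    h_rev = layer(reverse, feats, weight)
    return np.concatenate([h_fwd, h_rev], axis=-1)          # feature-fusion splice
```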
In the prior art, character-granularity named entity recognition methods generally use an RNN model, and strategies such as longest match or shortest match run into the technical problem of blurred entity boundaries. In step S3, the graph structure of the forward graph is processed by a graph neural network capable of handling the interaction information between entities, so the first vector representation of the forward graph is obtained effectively; the graph structure also breaks the restriction of the sequential structure of an RNN model, associating characters and dictionary words more fully. This solves the technical problem of blurred entity boundaries in prior-art character-granularity named entity recognition methods.
And S4, identifying the first vector features by utilizing an identification module of the graph neural network to obtain a named entity result of the text to be identified.
In this embodiment, the preset recognition module of the graph neural network includes a long-short-term memory layer and a conditional random field layer, the long-short-term memory layer adopts a deep neural network LSTM model with a text feature extraction function, and the conditional random field layer adopts a deep neural network CRF model with a text decoding function.
For example, if the text to be recognized is "no abdominal pain, abdominal distension or discomfort", character cutting yields the character sequence 无 -> 腹 -> 痛 -> 腹 -> 胀 -> 不 -> 适, and recognizing the first vector feature of the forward graph with the recognition module of the graph neural network yields the named entity results "no abdominal pain" and "no abdominal distension or discomfort". The patient's electronic medical record can then be filed under the correct classification in the database of the medical institution according to the obtained named entity result, which improves the accuracy and speed of recognizing named entities in electronic medical records or medical texts and reduces unnecessary problems in medical retrieval and medical data archiving.
The LSTM model may include at least one LSTM cell, each formed by combining a forward LSTM layer and a reverse LSTM layer, and each LSTM cell includes an input gate, an output gate, a forget gate and a memory cell. During recognition of the first vector representation, information can be added or removed through the gate structure of the LSTM model; through the gates acting on the cell state, the network decides which relevant information to remember or forget, thereby obtaining the vocabulary information of the text to be recognized. In other embodiments, other models with a text feature extraction function may be employed, which is not limited here.
The CRF model can solve the problem of prediction errors of entity labeling, and fully considers the combination relation among labels. In other embodiments, other models with text decoding capabilities may be employed, not limited herein.
In one embodiment, the identifying module includes a long-short term memory layer and a conditional random field layer, and the identifying module that uses the graph neural network identifies the first vector feature to obtain a named entity result of the text to be identified includes:
and extracting the vocabulary information from the first vector representation by using the long-short-term memory layer, inputting the obtained vocabulary information into the conditional random field layer for decoding, and obtaining a named entity result of the text to be identified.
In one embodiment, the extracting the vocabulary information from the first vector representation by using the long-short term memory layer includes:
performing sequence splitting on the first vector representation by using a first network of the long-period and short-period memory layer to obtain a forward sequence and a reverse sequence;
and carrying out characterization processing on the forward sequence and the reverse sequence by using a second network of the long-short-term memory layer to obtain vocabulary information.
In one embodiment, the sequence splitting of the first vector representation by the first network of the long-short-term memory layer to obtain a forward sequence and a reverse sequence includes:
forward sequencing the forward LSTM layer of the first network input by the first vector representation to obtain the forward sequence;
and reversely sequencing the reverse LSTM layer input by the first vector representation to the first network to obtain the reverse sequence.
The first network refers to a first LSTM unit of a long-term memory layer, the LSTM unit is formed by combining a forward LSTM layer and a reverse LSTM layer, and each LSTM unit comprises an input gate, an output gate, a forget gate and a memory unit.
Let c denote the memory cell of the first network, x the input gate, f the forget gate and h the output gate of the first network. A forward LSTM layer is constructed, the first vector representation is taken as its input, and forward sequencing of the first vector representation yields the forward sequence; similarly, a reverse LSTM layer is constructed, the first vector representation is taken as its input, and reverse sequencing of the first vector representation yields the reverse sequence.
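A minimal sketch of this first network is given below, using PyTorch as an assumed implementation framework (the patent does not name a library): a bidirectional LSTM produces the forward and reverse sequences in one pass.

```python
# Hedged sketch of the first network of the long short-term memory layer.
import torch
import torch.nn as nn

class FirstNetwork(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim,
                              bidirectional=True, batch_first=True)

    def forward(self, first_vector_feature: torch.Tensor):
        # first_vector_feature: (batch, seq_len, feat_dim)
        out, _ = self.bilstm(first_vector_feature)
        hidden = out.size(-1) // 2
        forward_seq = out[..., :hidden]    # forward-ordered hidden states
        reverse_seq = out[..., hidden:]    # reverse-ordered hidden states
        return forward_seq, reverse_seq
```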
In one embodiment, the characterizing the forward sequence and the reverse sequence by using the second network of the long-short-term memory layer to obtain lexical information includes:
respectively carrying out weight calculation on the forward sequence and the reverse sequence by using a weight matrix of the second network to obtain a weight value represented by the first vector;
and selecting a first vector representation corresponding to the weight value larger than or equal to a preset first threshold value as vocabulary information.
The second network is a Multi-head attention network, which is a special structure embedded in the machine learning model, based on which strong features can be given greater weight, whereas weak features are given smaller weight, even 0 weight.
The weight matrices of the second network are Q, K and V, which are the query, key and value projections applied to the input, respectively.
After the LSTM model splits the first vector representation into a sequence of two independent hidden states, the weight matrices of the second network are used to compute weights for the forward semantic features (the historical information) of the forward sequence and for the reverse semantic features of the reverse sequence, and the first vector representations whose weight value is greater than or equal to a preset first threshold (for example, 0.5) are selected as the text semantic information.
Based on the recognition module that combines the first network and the second network (LSTM plus multi-head attention), and taking the context around each entity into account, the key entity information of the text to be recognized can be extracted by assigning different weights while the irrelevant information in the text is ignored, which improves the efficiency of obtaining the feature representation of the text to be recognized.
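The second network can be sketched as below; PyTorch's multi-head attention is an assumed stand-in, and pooling the attention weights into one score per position before applying the first threshold is a simplifying assumption.

```python
# Hedged sketch of the second network (multi-head attention weighting).
import torch
import torch.nn as nn

class SecondNetwork(nn.Module):
    def __init__(self, fused_dim: int, num_heads: int = 4, threshold: float = 0.5):
        # fused_dim = forward dim + reverse dim; must be divisible by num_heads
        super().__init__()
        self.attn = nn.MultiheadAttention(fused_dim, num_heads, batch_first=True)
        self.threshold = threshold

    def forward(self, forward_seq: torch.Tensor, reverse_seq: torch.Tensor):
        x = torch.cat([forward_seq, reverse_seq], dim=-1)  # fuse both directions
        out, weights = self.attn(x, x, x)                  # Q = K = V = x
        score = weights.mean(dim=1)                        # one weight per position
        keep = score >= self.threshold                     # strong features only
        return out, keep
```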
In one embodiment, the inputting the obtained vocabulary information into the conditional random field layer for decoding to obtain a named entity result of the text to be recognized includes:
labeling the vocabulary information by using a decoder of the conditional random field layer to obtain a label of the character;
and calculating the score of each label, and selecting, from the scored results, the labels whose scores are greater than a preset second threshold as the recognition result of the text to be recognized.
During labeling, the state probability matrix of the conditional random field layer is used to evaluate the probability of each phrase corresponding to the text to be recognized, and the combination of labels is checked against the labeling scheme. For example, under the BIO scheme each element (character) is labeled "B-X", "I-X" or "O": "B-X" means the segment containing the element is of type X and the element is at the beginning of the segment, "I-X" means the segment is of type X and the element is in the middle of the segment, and "O" means the element belongs to no type. In this way it is judged whether the combination of adjacent labeled entities is a correct or an incorrect labeling. The conditional random field layer therefore fully considers the combination relationships between labels: instead of a single per-position optimum, it selects the global optimum over the text formed by all the phrases.
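The global decoding behaviour of the conditional random field layer is illustrated by the following Viterbi-style sketch; the BIO tag set and the score matrices are illustrative assumptions.

```python
# Hedged sketch: global (Viterbi) decoding over BIO tags with a transition matrix,
# so the best label sequence is chosen jointly rather than per character.
import numpy as np

TAGS = ["O", "B-SYMPTOM", "I-SYMPTOM"]  # illustrative tag set

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list[str]:
    """emissions: (seq_len, num_tags) per-character tag scores from the LSTM layer;
    transitions[i, j]: score of moving from tag i to tag j."""
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()
    backpointers = []
    for t in range(1, seq_len):
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(total.argmax(axis=0))   # best previous tag per tag
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for bp in reversed(backpointers):
        best.append(int(bp[best[-1]]))
    best.reverse()
    return [TAGS[i] for i in best]
```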
In step S4, the long short-term memory layer and the conditional random field layer of the recognition module are used together, that is, the LSTM model is combined with the CRF model. The LSTM model provides the characterization capability for the forward graph, while the CRF model takes global information into account and resolves the label-bias problem.
In steps S1 to S4, the technical scheme combines the graph neural network with the entity dictionary: when the text to be recognized is converted into the forward graph, the vocabulary information of the entity dictionary is fused into the forward graph, and the characteristics of the graph neural network are used to learn the fused information of the forward graph and the entity dictionary, which effectively solves the prior-art problems that character-granularity recognition methods lack vocabulary information and that word segmentation introduces boundary errors.
Fig. 2 is a schematic block diagram of a device for identifying chinese named entities according to an embodiment of the invention.
The Chinese named entity recognition device 100 of the present invention may be installed in an electronic apparatus. Depending on the implementation, the chinese named entity recognition device 100 may include a first generation module 110, a second generation module 120, an extraction module 130, and an output module 140. The module of the invention, which may also be referred to as a unit, refers to a series of computer program segments, which are stored in the memory of the electronic device, capable of being executed by the processor of the electronic device and of performing a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
the first generating module 110 is configured to obtain a character sequence of a text to be recognized, and generate a first edge of a forward graph to be constructed according to a preset relationship between adjacent nodes in the character sequence by using each character included in the character sequence as a node;
a second generating module 120, configured to generate a second edge of the forward graph to be constructed according to a preset entity dictionary, a preset relationship between a first node of the character sequence and other nodes except the first node, and generate a third edge of the forward graph to be constructed according to a preset relationship between the entity dictionary, a last node of the character sequence and other nodes except the last node;
the extracting module 130 is configured to generate the forward graph according to all the nodes, the first edge, the second edge, and the third edge, and perform vector extraction on the forward graph by using a feature extracting module of a preset graph neural network to obtain a first vector feature of the forward graph;
and the output module 140 is configured to recognize the first vector feature by using the recognition module of the graph neural network to obtain the named entity result of the text to be recognized.
In one embodiment, before the obtaining the character sequence of the text to be recognized, the method further comprises:
and cutting characters of the text to be recognized to obtain a character sequence of the text to be recognized.
In one embodiment, the generating the first edge of the forward graph to be constructed according to the preset relationship between the adjacent nodes in the character sequence includes:
calculating a first similarity between every two adjacent nodes;
and if the calculated value of the first similarity is larger than a threshold value of corresponding vocabulary of a preset vocabulary, generating the first edge between the adjacent nodes.
In one embodiment, the entity dictionary is obtained as follows:
performing text labeling on the unlabeled text data set to obtain a labeled text data set;
and performing word segmentation and screening on the marked text data set to obtain more than one phrase, and performing vector embedding operation on all phrases to obtain the entity dictionary.
In one embodiment, the generating the second edge of the forward graph to be constructed according to the preset entity dictionary, the preset relation between the first node of the character sequence and other nodes except the first node includes:
And calculating the second similarity of the first node and the other nodes, and if the calculated value of the second similarity is larger than the threshold value of the phrase corresponding to the entity dictionary, generating the second edge between the first node and the corresponding node.
In one embodiment, the generating the third side of the forward graph to be constructed according to the entity dictionary, the preset relationship between the last node of the character sequence and other nodes except the last node includes:
and calculating the third similarity between the last node and the other nodes, and if the calculated value of the third similarity is larger than the threshold value of the phrase corresponding to the entity dictionary, generating the third side between the last node and the corresponding node.
In one embodiment, the vector extraction of the forward graph by the feature extraction module of the preset graph neural network, to obtain a first vector feature of the forward graph, includes:
constructing a relay node of the forward graph by using a self-attention feature extraction layer of the feature extraction module, and updating the forward graph by using the relay node to generate a reverse graph;
And splicing the vector features of the forward graph and the vector features of the reverse graph by using a feature fusion layer of the feature extraction module to obtain the first vector features.
Fig. 3 is a schematic structural diagram of an electronic device for implementing a method for identifying a chinese named entity according to an embodiment of the present invention.
In the present embodiment, the electronic device 1 includes, but is not limited to, a memory 11, a processor 12, and a network interface 13, which are communicatively connected to each other via a system bus, wherein the memory 11 stores a chinese named entity recognition program 10, and the chinese named entity recognition program 10 is executable by the processor 12. Fig. 3 shows only the electronic device 1 with the components 11-13 and the chinese named entity recognition program 10, it will be appreciated by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may include fewer or more components than shown, or may combine certain components, or a different arrangement of components.
The memory 11 comprises an internal memory and at least one type of readable storage medium. The internal memory provides a buffer for the operation of the electronic device 1; the readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk or an optical disk. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1; in other embodiments, it may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card provided on the electronic device 1. In this embodiment, the readable storage medium of the memory 11 is generally used for storing the operating system and the various application software installed on the electronic device 1, for example the code of the Chinese named entity recognition program 10 in an embodiment of the present invention. Further, the memory 11 may be used to temporarily store various types of data that have been output or are to be output.
Processor 12 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 12 is typically used to control the overall operation of the electronic device 1, such as performing control and processing related to data interaction or communication with other devices, etc. In this embodiment, the processor 12 is configured to execute the program code or process data stored in the memory 11, for example, execute the chinese named entity recognition program 10.
The network interface 13 may comprise a wireless network interface or a wired network interface, the network interface 13 being used for establishing a communication connection between the electronic device 1 and a terminal (not shown).
Optionally, the electronic device 1 may further comprise a user interface, which may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device 1 and for displaying a visual user interface.
It should be understood that the embodiments described are for illustrative purposes only, and the scope of the patent application is not limited to this configuration.
The chinese named entity recognition program 10 stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 12, may implement:
acquiring a character sequence of a text to be recognized, taking each character contained in the character sequence as a node, and generating a first edge of a forward graph to be constructed according to a preset relationship between adjacent nodes in the character sequence;
generating a second edge of the forward graph to be constructed according to a preset entity dictionary, a first node of the character sequence and preset relations of other nodes except the first node, and generating a third edge of the forward graph to be constructed according to the entity dictionary, a last node of the character sequence and preset relations of other nodes except the last node;
generating the forward graph according to all the nodes, the first edge, the second edge and the third edge, and extracting vectors from the forward graph by using a feature extraction module of a preset graph neural network to obtain a first vector feature of the forward graph;
and identifying the first vector features by utilizing an identification module of the graph neural network to obtain a named entity result of the text to be identified.
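For illustration only, the graph-construction steps above might be sketched as follows. This is a minimal sketch under assumptions the disclosure does not fix: the similarity measure, the per-entry thresholds, the dictionary format, and all identifiers (ForwardGraph, char_similarity, build_forward_graph) are hypothetical stand-ins, not the actual embodiment.

```python
# Minimal illustrative sketch of the forward-graph construction; not the
# patented implementation. The similarity measure, the thresholds and all
# names below are assumptions.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ForwardGraph:
    nodes: List[str]                                  # one node per character
    edges: List[Tuple[int, int, str]] = field(default_factory=list)


def char_similarity(a: str, b: str) -> float:
    # Stand-in for the preset similarity; the disclosure leaves the concrete
    # measure to the preset vocabulary / entity dictionary.
    return 1.0


def build_forward_graph(text: str, entity_dict: Dict[str, float]) -> ForwardGraph:
    chars = list(text)                                # character sequence of the text to be recognized
    graph = ForwardGraph(nodes=chars)
    n = len(chars)

    # First edges: between adjacent nodes whose similarity exceeds a preset threshold.
    for i in range(n - 1):
        if char_similarity(chars[i], chars[i + 1]) > 0.5:          # assumed threshold
            graph.edges.append((i, i + 1, "first"))

    # Second edges: from the first node to later nodes, gated by the entity dictionary.
    for j in range(1, n):
        threshold = entity_dict.get(text[0:j + 1])
        if threshold is not None and char_similarity(chars[0], chars[j]) > threshold:
            graph.edges.append((0, j, "second"))

    # Third edges: from earlier nodes to the last node, gated by the entity dictionary.
    for j in range(n - 1):
        threshold = entity_dict.get(text[j:n])
        if threshold is not None and char_similarity(chars[j], chars[n - 1]) > threshold:
            graph.edges.append((j, n - 1, "third"))

    return graph


if __name__ == "__main__":
    # Toy dictionary mapping candidate phrases to per-phrase thresholds (assumed format).
    toy_dict = {"平安科技": 0.5, "深圳": 0.5}
    print(build_forward_graph("平安科技深圳", toy_dict).edges)
```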
Specifically, for the implementation of the above Chinese named entity recognition program 10 by the processor 12, reference may be made to the description of the related steps in the embodiment corresponding to fig. 1, and details are not repeated herein.
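The subsequent feature-extraction and recognition stages (a self-attention relay node, splicing of forward- and reverse-graph features, and tag prediction) could be pictured, very roughly, as below. Everything here is an assumption: the attention formula, the reverse-graph update, the tag set TAGS, and the linear recognition head are stand-ins for whatever graph neural network the embodiment actually uses.

```python
# Hedged sketch of feature extraction and recognition; every function and the
# tag set are assumptions, not the embodiment's actual graph neural network.
import numpy as np

TAGS = ["O", "B-ENT", "I-ENT"]                       # assumed label scheme


def relay_node(node_feats: np.ndarray) -> np.ndarray:
    # Self-attention-style relay node: an attention-weighted summary of all
    # character-node features (one possible reading of the disclosure).
    scores = node_feats @ node_feats.mean(axis=0)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ node_feats


def fuse(forward_feats: np.ndarray, reverse_feats: np.ndarray) -> np.ndarray:
    # Feature fusion layer (simplified): splice forward- and reverse-graph
    # node features along the feature dimension to form the first vector feature.
    return np.concatenate([forward_feats, reverse_feats], axis=-1)


def recognize(first_vector_feature: np.ndarray, tag_weights: np.ndarray) -> list:
    # Recognition module (simplified): per-node linear scoring plus argmax decoding.
    logits = first_vector_feature @ tag_weights
    return [TAGS[i] for i in logits.argmax(axis=-1)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fwd = rng.normal(size=(6, 8))                    # 6 character nodes, 8-dim features
    rev = fwd + relay_node(fwd)                      # toy "reverse graph" update via the relay node
    feats = fuse(fwd, rev)                           # shape (6, 16)
    print(recognize(feats, rng.normal(size=(16, len(TAGS)))))
```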
Further, the modules/units integrated in the electronic device 1 may be stored in a computer readable storage medium if they are implemented in the form of software functional units and sold or used as independent products. The computer readable medium may be non-volatile or volatile. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM).
The computer readable storage medium stores a Chinese named entity recognition program 10, where the Chinese named entity recognition program 10 may be executed by one or more processors, and the specific implementation of the computer readable storage medium is substantially the same as the embodiments of the above-mentioned Chinese named entity recognition method, and will not be described herein.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of the modules is merely a logical function division, and there may be other manners of division in actual implementation.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in the form of hardware, or in the form of hardware combined with software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the system claims may also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names and do not denote any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. A method for identifying a Chinese named entity, the method comprising:
acquiring a character sequence of a text to be recognized, taking each character contained in the character sequence as a node, and generating a first edge of a forward graph to be constructed according to a preset relationship between adjacent nodes in the character sequence;
generating a second edge of the forward graph to be constructed according to a preset entity dictionary and a preset relationship between a first node of the character sequence and other nodes except the first node, and generating a third edge of the forward graph to be constructed according to the entity dictionary and a preset relationship between a last node of the character sequence and other nodes except the last node;
generating the forward graph according to all the nodes, the first edge, the second edge and the third edge, and extracting vectors from the forward graph by using a feature extraction module of a preset graph neural network to obtain a first vector feature of the forward graph;
and identifying the first vector features by utilizing an identification module of the graph neural network to obtain a named entity result of the text to be identified.
2. The method for identifying a Chinese named entity of claim 1, wherein prior to the acquiring of the character sequence of the text to be recognized, the method further comprises:
performing character segmentation on the text to be recognized to obtain the character sequence of the text to be recognized.
3. The method for identifying a Chinese named entity according to claim 1, wherein generating a first edge of a forward graph to be constructed according to a preset relationship between adjacent nodes in the character sequence comprises:
calculating a first similarity between every two adjacent nodes;
and if the calculated value of the first similarity is larger than the threshold value of the corresponding word in a preset vocabulary, generating the first edge between the adjacent nodes.
4. The method of claim 1, wherein the entity dictionary is obtained by:
performing text labeling on an unlabeled text data set to obtain a labeled text data set;
and performing word segmentation and screening on the labeled text data set to obtain one or more phrases, and performing a vector embedding operation on all the phrases to obtain the entity dictionary.
5. The method of claim 1, wherein generating the second edge of the forward graph to be constructed according to the preset entity dictionary and the preset relationship between the first node of the character sequence and the other nodes except the first node comprises:
calculating a second similarity between the first node and the other nodes, and if the calculated value of the second similarity is larger than the threshold value of the corresponding phrase in the entity dictionary, generating the second edge between the first node and the corresponding node.
6. The method for identifying a Chinese named entity according to claim 1, wherein generating the third edge of the forward graph to be constructed according to the entity dictionary and the preset relationship between the last node of the character sequence and the other nodes except the last node comprises:
calculating a third similarity between the last node and the other nodes, and if the calculated value of the third similarity is larger than the threshold value of the corresponding phrase in the entity dictionary, generating the third edge between the last node and the corresponding node.
7. The method for identifying a Chinese named entity according to claim 1, wherein extracting vectors from the forward graph by using the feature extraction module of the preset graph neural network to obtain the first vector feature of the forward graph comprises:
constructing a relay node of the forward graph by using a self-attention feature extraction layer of the feature extraction module, and updating the forward graph by using the relay node to generate a reverse graph;
and splicing the vector features of the forward graph and the vector features of the reverse graph by using a feature fusion layer of the feature extraction module to obtain the first vector features.
8. A Chinese named entity recognition device, the device comprising:
the first generation module is used for acquiring a character sequence of a text to be recognized, taking each character contained in the character sequence as a node, and generating a first edge of a forward graph to be constructed according to a preset relationship between adjacent nodes in the character sequence;
the second generation module is used for generating a second edge of the forward graph to be constructed according to a preset entity dictionary and a preset relationship between a first node of the character sequence and other nodes except the first node, and for generating a third edge of the forward graph to be constructed according to the entity dictionary and a preset relationship between a last node of the character sequence and other nodes except the last node;
the extraction module is used for generating the forward graph according to all the nodes, the first edge, the second edge and the third edge, and extracting vectors from the forward graph by utilizing a feature extraction module of a preset graph neural network to obtain a first vector feature of the forward graph;
and the output module is used for identifying the first vector feature by utilizing the identification module of the graph neural network to obtain a named entity result of the text to be identified.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a Chinese named entity recognition program executable by the at least one processor to enable the at least one processor to perform the Chinese named entity recognition method of any one of claims 1 to 7.
10. A computer readable storage medium having stored thereon a Chinese named entity recognition program executable by one or more processors to implement the Chinese named entity recognition method of any one of claims 1 to 7.
CN202310728860.6A 2023-06-16 2023-06-16 Chinese named entity recognition method, device, electronic equipment and storage medium Pending CN116702780A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310728860.6A CN116702780A (en) 2023-06-16 2023-06-16 Chinese named entity recognition method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310728860.6A CN116702780A (en) 2023-06-16 2023-06-16 Chinese named entity recognition method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116702780A true CN116702780A (en) 2023-09-05

Family

ID=87825365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310728860.6A Pending CN116702780A (en) 2023-06-16 2023-06-16 Chinese named entity recognition method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116702780A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination