WO2023109436A1 - Part-of-speech-aware nested named entity recognition method, system, device and storage medium - Google Patents

Part-of-speech-aware nested named entity recognition method, system, device and storage medium

Info

Publication number
WO2023109436A1
WO2023109436A1 (application PCT/CN2022/133113, CN2022133113W)
Authority
WO
WIPO (PCT)
Prior art keywords
text
word
speech
graph
node
Prior art date
Application number
PCT/CN2022/133113
Other languages
English (en)
French (fr)
Inventor
仇晶
周玲
郭晨
陈豪
林杨
顾钊铨
田志宏
贾焰
方滨兴
Original Assignee
广州大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州大学 (Guangzhou University)
Publication of WO2023109436A1
Priority to US18/520,629 (publication US20240111956A1)


Classifications

    • G06F40/30 Semantic analysis (G PHYSICS · G06 COMPUTING; CALCULATING OR COUNTING · G06F ELECTRIC DIGITAL DATA PROCESSING · G06F40/00 Handling natural language data)
    • G06F40/295 Named entity recognition (G06F40/20 Natural language analysis · G06F40/279 Recognition of textual entities · G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking)
    • G06F40/253 Grammatical analysis; Style critique (G06F40/20 Natural language analysis)
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates (G06F40/279 Recognition of textual entities)
    • G06N3/04 Architecture, e.g. interconnection topology (G PHYSICS · G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS · G06N3/00 Computing arrangements based on biological models · G06N3/02 Neural networks)
    • G06N3/08 Learning methods (G06N3/02 Neural networks)

Definitions

  • The present invention relates to the technical field of natural language processing and knowledge graph construction, and in particular to a part-of-speech-aware nested named entity recognition method, system, computer device and storage medium based on a heterogeneous graph attention neural network.
  • Named entity recognition is one of the basic tasks in constructing knowledge graphs in the field of natural language processing. It is mainly used to extract entities with specific meaning that make up a knowledge graph, and it is an important basic tool for application fields such as syntactic analysis, machine translation, and metadata annotation for the Semantic Web, playing an important role in the process of natural language processing technology becoming practical. In real natural language sequences there is a nesting phenomenon in which one entity contains one or more other entities: in the text "Activation of the cd28 surface receptor provides", "cd28 surface" is a protein-type entity, and "cd28 surface receptor" is also a protein-type entity.
  • Nested named entity recognition (Nested NER) is an important and difficult problem in named entity recognition tasks. Its role is to identify nested entities in text, and the key to such recognition is how to determine the boundaries of entities and predict their categories.
  • Existing nested entity recognition methods are mainly divided into three categories: (1) extracting entities from natural language by designing text matching rules, e.g. matching entities in text with rules hand-written by domain experts; (2) supervised learning methods based on feature engineering, e.g. predicting the text categories in a text sequence by designing feature templates combined with the Viterbi algorithm; (3) deep learning methods based on entity spans, e.g. deep learning that uses neural networks to extract character-level features of text, and exhaustive candidate-entity methods that directly enumerate the subsequences that may be entities and then predict on those subsequences.
  • Although the existing technology can solve the nested entity recognition problem to a certain extent, it also has obvious defects.
  • The second category belongs to statistical machine learning, is easily affected by the distribution of the text corpus, and generalizes poorly; the deep learning methods of the third category can extract character and word features of text but have many learnable parameters and high computational complexity, the exhaustive candidate-entity method further increases the time complexity of the model, and simply enumerating text subsequences does not help improve model performance.
  • The purpose of the present invention is to provide a part-of-speech-aware nested named entity recognition method, system, device and storage medium that apply heterogeneous graph representation learning to nested entity recognition, introduce part-of-speech knowledge to initialize text features, and combine a specially designed dilated random walk algorithm based on part-of-speech paths that samples more neighbor node information, relying on the DGL (Deep Graph Library) framework.
  • An embodiment of the present invention provides a part-of-speech-aware nested named entity recognition method, the method comprising the following steps:
  • obtaining text word data of a text to be recognized; the text word data includes a text sequence ID, part-of-speech category, word frequency and word vector representation;
  • using a BiLSTM model to perform feature extraction on the text word data to obtain corresponding text word depth features, and initializing each text word of the text to be recognized as a corresponding graph node according to the text word depth features;
  • constructing a text heterogeneous graph of the text to be recognized, updating the text word depth features of its graph nodes through an attention mechanism, and extracting features from all graph nodes with the BiLSTM model to obtain the word vector representations of the text to be decoded;
  • decoding and labeling the word vector representations of the text to be decoded to obtain a nested named entity recognition result.
  • The step of obtaining the text word data of the text to be recognized includes:
  • setting a corresponding text sequence ID for each text word according to its position order;
  • performing part-of-speech tagging and word frequency statistics on the text to be recognized, and generating a word vector representation of each text word in the text to be recognized through the BERT model.
  • The step of obtaining the corresponding text word depth features includes:
  • using the BiLSTM model to perform feature extraction on the text word initial features to obtain the text word depth features, expressed as h(x_i) = BiLSTM(F(x_i));
  • where x_i, F(x_i) and h(x_i) respectively denote the text word data, text word initial feature and text word depth feature of the i-th text word, and F(x_i) is the concatenation of the text sequence ID, part-of-speech category, word frequency and word vector representation in the i-th piece of text word data.
  • In the text heterogeneous graph G = (V, E, Ov, Path), V denotes the node set composed of text words of different parts of speech, the value of each node being a text word depth feature;
  • E denotes the edge set formed between nodes;
  • Ov denotes the set of node part-of-speech types;
  • Path denotes the preset part-of-speech paths, which include verb-noun paths, noun modifier-noun paths, connective-noun paths, and verb modifier-verb paths.
  • The step of updating the text word depth features of the graph nodes in the text heterogeneous graph through the attention mechanism includes:
  • performing neighbor node sampling on each graph node in each preset part-of-speech path to obtain a corresponding neighbor node set;
  • integrating the node information of the neighbor node set of each graph node in each preset part-of-speech path to obtain the corresponding graph node representation, where:
  • v denotes a graph node in the i-th preset part-of-speech path, its value being the corresponding text word depth feature;
  • k denotes the number of attention heads;
  • exp(·) denotes the exponential function with base e;
  • LeakyReLU(·) denotes the activation function;
  • u^T is the weight matrix of the edge;
  • updating the word frequency and word vector representation of the corresponding graph node in the text heterogeneous graph.
  • The step of sampling neighbor nodes for each graph node in each preset part-of-speech path to obtain the corresponding neighbor node set includes:
  • randomly obtaining, according to a preset sampling probability and a sampling stop condition, within the graph node sequence and with integer multiples of the base sampling interval as the moving step size, a number of neighbor nodes for each graph node in the preset part-of-speech path, to obtain the corresponding neighbor node set;
  • the sampling stop condition being that the total number of sampled neighbor nodes meets a preset number requirement and the number of nodes of each part-of-speech category among the neighbor nodes meets a preset proportion requirement.
  • The step of decoding and labeling the word vector representations of the text to be decoded to obtain the nested named entity recognition result includes:
  • using an improved LSTM unit to perform boundary detection on the first labeled text word vector representations and judging whether entity boundary words exist in them; the improved LSTM unit is obtained by adding a multi-layer perceptron MLP on the output hidden layer of an LSTM unit;
  • if entity boundary words exist in the first labeled text word vector representations,
  • merging the first labeled text word vector representations between adjacent entity boundary words to obtain second labeled text word vector representations,
  • performing decoding, labeling and boundary detection on the second labeled text word vector representations and starting the next round of entity recognition iteration; otherwise stopping the iteration and taking the named entity recognition result as the forward named entity recognition result;
  • decoding and labeling the fourth labeled text word vector representations with a conditional random field to obtain the nested named entity recognition result.
  • An embodiment of the present invention provides a part-of-speech-aware nested named entity recognition system, the system comprising:
  • a preprocessing module for obtaining text word data of a text to be recognized;
  • the text word data comprising a text sequence ID, part-of-speech category, word frequency and word vector representation;
  • a node initialization module for performing feature extraction on the text word data with a BiLSTM model to obtain corresponding text word depth features, and initializing each text word of the text to be recognized as a corresponding graph node according to the text word depth features;
  • a graph construction module configured to construct a text heterogeneous graph of the text to be recognized according to the transfer relationships between the graph nodes;
  • a node update module configured to update the text word depth features of graph nodes in the text heterogeneous graph through an attention mechanism according to the text heterogeneous graph and the preset part-of-speech paths;
  • a feature optimization module for extracting the features of all graph nodes in the updated text heterogeneous graph with the BiLSTM model to obtain the word vector representations of the text to be decoded;
  • a result generation module for decoding and labeling the word vector representations of the text to be decoded to obtain a nested named entity recognition result.
  • An embodiment of the present invention also provides a computer device including a memory, a processor, and a computer program stored in the memory and runnable on the processor, the steps of the above method being implemented when the processor executes the computer program.
  • An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, the steps of the above method being implemented when the computer program is executed by a processor.
  • The above application provides a part-of-speech-aware nested named entity recognition method, system, computer device and storage medium.
  • After the text word data of the text to be recognized is obtained, the BiLSTM model performs feature extraction on it to obtain the text word depth features, each text word of the text to be recognized is initialized as a corresponding graph node according to the depth features, a text heterogeneous graph of the text to be recognized is constructed according to the preset part-of-speech paths, the text word data of the graph nodes is updated through the attention mechanism, the BiLSTM model then extracts features from all graph nodes of the heterogeneous graph, and after the word vector representations of the text to be decoded are obtained, a conditional random field is used for decoding and labeling to obtain the nested named entity recognition result.
  • This part-of-speech-aware nested named entity recognition method applies heterogeneous graph representation learning to nested entity recognition, introduces part-of-speech knowledge to initialize text features, and combines a specially designed dilated random walk algorithm based on part-of-speech paths that samples more neighbor node information; relying on the DGL (Deep Graph Library) framework, it effectively recognizes ordinary entities and nested entities, improves the accuracy and learning efficiency of nested named entity recognition, and further enhances the performance of the nested named entity recognition model.
  • FIG. 1 is a schematic diagram of an application scenario of a part-of-speech-aware nested named entity recognition method in an embodiment of the present invention
  • Fig. 2 is a schematic structural diagram of a part-of-speech-aware nested named entity recognition model PANNER in an embodiment of the present invention
  • FIG. 3 is a schematic flow diagram of a part-of-speech-aware nested named entity recognition method in an embodiment of the present invention
  • Fig. 4 is a text heterogeneous graph including multiple different part-of-speech nodes
  • Fig. 5 is an example diagram of bottom-up forward decoding and top-down reverse decoding in an embodiment of the present invention
  • Fig. 6 is a schematic structural diagram of an LSTM unit used for boundary detection improvement in an embodiment of the present invention.
  • Figure 7 compares the performance and the time consumption, for the same number of sampled nodes, of different sampling algorithms;
  • FIG. 8 is a schematic diagram of the influence of the number of sampling nodes on the performance and time consumption of the PANNER model in an embodiment of the present invention.
  • Fig. 9 is a schematic diagram of the impact of the word vector dimension on the performance and time consumption of the PANNER model in the embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of a part-of-speech-aware nested named entity recognition system in an embodiment of the present invention.
  • Fig. 11 is an internal structure diagram of a computer device in an embodiment of the present invention.
  • The nested named entity recognition method provided by the present invention can be applied to a terminal or a server as shown in FIG. 1.
  • The terminal can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices;
  • the server can be implemented as an independent server or a server cluster composed of multiple servers.
  • The server can use the part-of-speech-aware nested named entity recognition method provided by the present invention to accurately and effectively recognize the ordinary named entities and nested named entities in the corresponding text corpus according to the part-of-speech-aware nested entity recognition model shown in Figure 2,
  • and apply the final recognition results for ordinary and nested named entities to other learning tasks on the server, or transmit them to the terminal for reception and use by end users.
  • a part-of-speech-aware nested named entity recognition method comprising the following steps:
  • Obtain the text word data of the text to be recognized, including the text sequence ID, part-of-speech category, word frequency and word vector representation; the corpus to be recognized is any English text sequence requiring nested named entity recognition, and the corresponding text word data is the data obtained after preprocessing each text word in the text to be recognized, i.e. the data features required for subsequent nested named entity recognition.
  • The text sequence ID and part-of-speech category of the text word data will not change in subsequent training and learning, whereas the word frequency and word vector representation are continuously updated.
  • The step of obtaining the text word data of the text to be recognized includes:
  • setting a corresponding text sequence ID for each text word;
  • part-of-speech tagging uses an existing English part-of-speech tagging tool such as nltk or Stanfordnlp, either of which can tag the text to be recognized, with no specific restriction here; in principle the part-of-speech categories can follow the existing part-of-speech categories of English words, but considering that there are too many neutral words, and in order to improve subsequent learning efficiency, this embodiment classifies the parts of speech of English words into the categories shown in Table 1; the word frequency indicates the frequency of occurrence of each text word: assuming the text to be recognized contains N different words in C different categories and has length T, and the t-th text word occurs f times with part-of-speech classification c, then its text sequence ID is id=t, its category is cat=c and its word frequency is freq=f/T;
  • the first group is the set of all nouns, including singular and plural common nouns and proper nouns;
  • the second group is the set of verbs, including their base forms, third-person singular, past tense, etc.;
  • the third group is noun modifiers, including cardinal numerals, adjectives, and comparatives and superlatives of adjectives, etc.;
  • the fourth group is verb modifiers, including adverbs, determiners, etc.;
  • the fifth group is relation words, including modal verbs, conjunctions, prepositions, etc.;
  • the sixth group is article identifiers, including commas, periods and paragraph separators, etc.; this group is removed in practical applications.
  • A word vector representation of each text word in the text to be recognized is generated through the BERT model.
  • The BERT (Bidirectional Encoder Representation from Transformers) model is an NLP pre-training technique, namely the encoder of a bidirectional Transformer; it can be used to train the vector representation of a text sequence, or to train the vector representation of each word in the text sequence.
  • The BERT model is selected to effectively train the word vector of each word in the text to be recognized, thereby obtaining the word vector representation in the text word data; this representation is updated accordingly in the subsequent random-batch training that accompanies neighbor node sampling.
  • A BiLSTM model is used to perform feature extraction on the text word data to obtain corresponding text word depth features, and each text word of the text to be recognized is initialized as a corresponding graph node according to the depth features; the BiLSTM model is a bidirectional LSTM network encoder.
  • This embodiment uses the model to perform forward feature extraction and backward feature extraction on the text word data, and concatenates the obtained forward and backward features into the corresponding text word depth features, proceeding as follows:
  • the text sequence ID, part of speech, word frequency and word vector representation of each piece of text word data are concatenated and integrated to obtain the text word initial feature, a d-dimensional word vector in which the text sequence ID and part of speech occupy 1 dimension each while the word frequency and word vector representation together occupy d-2 dimensions;
  • x_i and F(x_i) respectively denote the text word data and the text word initial feature of the i-th text word, with F(x_i) the concatenation of the text sequence ID, part-of-speech category, word frequency and word vector representation of the i-th piece of text word data;
  • the BiLSTM model extracts features from the text word initial features to obtain the text word depth features, expressed as h(x_i) = BiLSTM(F(x_i)), i.e. the hidden layer vector of text word x_i after BiLSTM encoding, with the BiLSTM gates computed as
  • f_i = σ(w_f·[h_{i-1}, F(x_i)] + b_f), i_i = σ(w_i·[h_{i-1}, F(x_i)] + b_i), C̃_i = tanh(w_C·[h_{i-1}, F(x_i)] + b_C), C_i = f_i·C_{i-1} + i_i·C̃_i, o_i = σ(w_o·[h_{i-1}, F(x_i)] + b_o), h_i = o_i·tanh(C_i);
  • where h_{i-1}, f_i, i_i, C̃_i, C_i and o_i respectively denote the previous cell state, forget gate output, memory gate output, temporary cell state, current cell state and output gate of the i-th text word, and w and b are the weights and bias terms of the three gates and three states;
  • Concat(·) denotes the function that concatenates two vectors horizontally by rows, and the forward and backward hidden states are concatenated as h_i = Concat(h_i→, h_i←);
  • the text word depth feature h_i obtained above can further be normalized.
  • The text heterogeneous graph of the text to be recognized is composed of vertices and edges, with multiple types of nodes and edges (at least one of nodes and edges has multiple types); such graphs are common in knowledge graph scenarios, and the simplest way to handle heterogeneous information is to one-hot encode the type information and concatenate it onto the node's original representation.
  • The text heterogeneous graph of the text to be recognized, composed of graph nodes initialized from the text word depth features h_i, is expressed as G = (V, E, Ov, Path), where:
  • V denotes the node set composed of text words of different parts of speech, the value of each node being a text word depth feature;
  • E denotes the edge set formed between nodes, the weight of an edge being the logarithm of the ratio of the word co-occurrence frequency of the nodes on the edge to the product of their word frequencies;
  • Ov denotes the part-of-speech type set of the nodes;
  • Path denotes the preset part-of-speech paths, including verb-noun paths, noun modifier-noun paths, connective-noun paths, and verb modifier-verb paths.
  • The preset part-of-speech paths can be designed for the actually recognized text content by the following steps: count the proportion of words of each part-of-speech category in the corpus, select the part-of-speech category with the highest proportion of words and a high degree of overlap with named entities, and design the corresponding part-of-speech paths around it as the center, thereby introducing grammatical knowledge; the heterogeneous graph is created on the Deep Graph Library (DGL), which facilitates subsequent computation on it.
  • Text sequences from the GENIA dataset are selected; considering that nouns account for the highest proportion in that dataset and that the overlap between nouns and named entities is high, the preset part-of-speech paths shown in Table 3 are designed with nouns as the center:

    Path_id  Path                   Path_node
    1        Verbs-Nouns            2-1
    2        Noun modifier-Nouns    3-1
    3        Verbs modifier-Verbs   4-2
    4        Relations-Nouns        5-1

  • The four paths in Table 3 are all centered on nouns: preset part-of-speech path 1 (the relationship between verbs and nouns), preset part-of-speech path 2 (the relationship between noun modifiers and nouns), preset part-of-speech path 3 (the relationship between verb modifiers and verbs) and preset part-of-speech path 4 (the relationship between connectives and nouns).
  • In neighbor node sampling, the nodes that need to be updated are the nodes on the part-of-speech paths.
  • The construction of the heterogeneous graph relies on the Deep Graph Library framework, with raw data format <src, edg, dst>, where src is the source node, edg the edge and dst the destination node.
  • A node's initial text word features include the node type (part-of-speech category), node position number (text sequence ID), node occurrence frequency (word frequency), and the word vector representation obtained by BERT pre-training; nodes are continuously updated in subsequent training;
  • an edge's initial features include the edge weight (the logarithm of the ratio of the word co-occurrence frequency of the nodes on the edge to the product of their word frequencies), which is also continuously updated in subsequent training.
  • This embodiment designs a dilated random walk algorithm based on part-of-speech paths,
  • with which the neighbor nodes of each node are sampled; according to the sampled neighbor node set, the attention mechanism is used to compute the node representation of the corresponding graph node so as to update the text word depth features.
  • The step of updating the text word depth features of the graph nodes in the text heterogeneous graph through the attention mechanism includes:
  • performing neighbor node sampling on each graph node in each preset part-of-speech path to obtain a corresponding neighbor node set; wherein, according to the graph node sequence, the steps of sampling the neighbor nodes of each graph node in each preset part-of-speech path include:
  • determining the base sampling interval, which is the "hole" (dilation) in the above dilated random walk algorithm: during sampling there is an interval whose length equals the number of nodes on the preset part-of-speech path, and randomness means that the selection of sampled nodes is random;
  • according to the preset sampling probability and the sampling stop condition, randomly obtaining, within the graph node sequence and with integer multiples of the base sampling interval as the moving step size, a number of neighbor nodes for each graph node in the preset part-of-speech path, giving the corresponding neighbor node set;
  • the sampling stop condition is that the total number of sampled neighbor nodes meets the preset number requirement and the number of nodes of each part-of-speech category among the neighbor nodes meets the preset proportion requirement.
  • The stop condition of the neighbor sampling process for the graph nodes on each preset path is: the total number of obtained neighbor nodes reaches the preset number, and the numbers of the other categories of neighbor nodes also reach specific proportions.
  • The preset sampling probability p in this embodiment is generated randomly, and the proportions of the different part-of-speech categories among the neighbor nodes are the proportions of the respective categories in the original corpus.
  • The following attention mechanism is then used to determine the influence of the neighbor nodes on the current node,
  • updating each graph node reasonably and effectively according to this importance.
  • Through the attention mechanism, the node information of the neighbor node set of each graph node in each preset part-of-speech path is integrated to obtain the corresponding graph node representation:
  • h_v = σ( (1/k) · Σ over the k attention heads Σ_{n_j ∈ N_v^{Path_i}} α_{v,j} · h_{n_j} ), with weight coefficients α_{v,j} = exp( LeakyReLU( u^T · [h_v ; h_{n_j}] ) ) / Σ_{n_m ∈ N_v^{Path_i}} exp( LeakyReLU( u^T · [h_v ; h_{n_m}] ) );
  • v denotes a graph node in the i-th preset part-of-speech path, its value being the corresponding text word depth feature;
  • k denotes the number of attention heads;
  • exp(·) denotes the exponential function with base e;
  • LeakyReLU(·) denotes the activation function;
  • u^T is the weight matrix of the edge; this weight matrix is continuously updated during sampling as the word frequencies and word co-occurrence frequencies of the edge nodes change;
  • according to the graph node representations, the word frequency and word vector representation of the corresponding graph nodes in the text heterogeneous graph are updated.
  • v_f is the set of all nodes in the heterogeneous graph, including the nodes on the part-of-speech paths and the nodes node_P not on the part-of-speech paths; the process by which the BiLSTM model extracts features from graph nodes can be found in the earlier description of obtaining the text word depth features of each text word and is not repeated here.
  • The nested named entity recognition result is obtained by decoding layer by layer and jointly decoding the named entities in a combined bottom-up and top-down manner.
  • The step of decoding and labeling the word vector representations of the text to be decoded to obtain the nested named entity recognition result includes:
  • a conditional random field is used to decode and label the word vector representations of the text to be decoded, yielding the named entity recognition result and the first labeled text word vector representations; the conditional random field (CRF) model is p(y|x) = (1/Z(x)) · exp( Σ_j λ_j f_j(y, x) ), where:
  • y denotes the label;
  • f_j denotes a feature function;
  • λ_j denotes the weight of the feature function, and Z(x) denotes the normalization factor.
  • The improved LSTM unit is used to perform boundary detection on the first labeled text word vector representations and to judge whether entity boundary words exist in them; the improved LSTM unit is obtained by adding a multi-layer perceptron MLP on the output hidden layer of an LSTM unit; as shown in Figure 6, compared with the plain LSTM unit, two nonlinear activation layers and a softmax fully connected layer classifier are added on top of the output hidden layer.
  • The LSTM unit with the multi-layer perceptron MLP recognizes the boundary words of the text and fuses the boundary information into the hidden layer vector, providing a reliable and effective basis for subsequent ordinary entity recognition and nested entity recognition.
  • If entity boundary words exist in the first labeled text word vector representations,
  • the first labeled text word vector representations between adjacent entity boundary words are merged to obtain the second labeled text word vector representations,
  • which are decoded, labeled and boundary-detected to start the next round of entity recognition iteration; otherwise the iteration stops and the named entity recognition result is taken as the forward named entity recognition result;
  • the method for merging the first labeled text word vector representations between adjacent entity boundary words can be chosen according to actual application requirements.
  • A one-dimensional convolutional neural network Conv1d with a convolution kernel of 2 is preferably used.
  • The size of the sliding window n can be determined from the number of text words between the actually detected entity boundary words; that is, the one-dimensional convolutional neural network merges the entity boundary words and the sequence between them into a sequence of several words,
  • whose second labeled text word vector representation corresponds to the text region with starting range [t, t+n]; the one-dimensional convolutional neural network is expressed as h_t^(l+1) = Conv1d( h_t^(l), …, h_{t+n}^(l) ),
  • where Conv1d(·) is a one-dimensional convolutional neural network.
  • The process of obtaining the forward named entity recognition result by the above steps is referred to as the bottom-up decoding process.
  • To reduce recognition errors, after the bottom-up forward decoding is completed, the following top-down reverse decoding process is added for deviation correction.
  • Reverse filling uses the one-dimensional convolutional neural network Conv1d to refill the currently decoded sequence so that the total sequence length is consistent with the previous layer, yielding
  • third labeled text word vector representations of the same length as the text word vector representations of the previous round of entity recognition iteration.
  • The process of reverse filling and re-decoding can be referred to simply as the top-down decoding process.
  • The fourth labeled text word vector representations are decoded and labeled with a conditional random field to obtain the nested named entity recognition result.
  • The accuracy of nested named entity recognition is effectively guaranteed through layer-by-layer decoding combining bottom-up and top-down joint decoding.
  • The embodiments of the present application initialize the text word data by introducing part-of-speech knowledge, use the BiLSTM model to extract text word depth features from the text word data and initialize them as graph nodes, construct the text heterogeneous graph of the text to be recognized from grammatical relationships, combine the specially designed dilated random walk algorithm based on part-of-speech paths that samples more neighbor node information, and rely on the DGL (Deep Graph Library) framework; this effectively recognizes and processes ordinary entities and nested entities, improving the accuracy and learning efficiency of nested named entity recognition while further enhancing the performance of the nested named entity recognition model.
  • When the nested named entity recognition method of the present invention is applied to the GENIA dataset, the corresponding overall results are better than those of comparable named entity recognition models.
  • The running performance and time consumption of the dilated random walk algorithm for sampling neighbor nodes and of the entire nested named entity recognition model PANNER of the present application were also verified, with the results shown in Figures 7-9, further confirming that,
  • on top of the high accuracy of the present invention, the learning efficiency and running performance of the model have considerable advantages over existing models.
  • A part-of-speech-aware nested named entity recognition system comprising:
  • a preprocessing module 1 for obtaining text word data of a text to be recognized, the text word data including a text sequence ID, part-of-speech category, word frequency and word vector representation;
  • a node initialization module 2 for performing feature extraction on the text word data with a BiLSTM model to obtain corresponding text word depth features, and initializing each text word of the text to be recognized as a corresponding graph node according to the depth features;
  • a graph construction module 3 configured to construct a text heterogeneous graph of the text to be recognized according to the transfer relationships between the graph nodes;
  • a node update module 4 for updating the text word depth features of the graph nodes in the text heterogeneous graph through an attention mechanism according to the text heterogeneous graph and the preset part-of-speech paths;
  • a feature optimization module 5 for extracting the features of all graph nodes in the updated text heterogeneous graph with the BiLSTM model to obtain the word vector representations of the text to be decoded;
  • a result generation module 6 configured to decode and label the word vector representations of the text to be decoded to obtain a nested named entity recognition result.
  • Each module in the above part-of-speech-aware nested named entity recognition system can be realized fully or partially by software, hardware and combinations thereof.
  • The above modules can be embedded in or independent of the processor in the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the above modules.
  • Fig. 11 shows an internal structural diagram of a computer device in an embodiment, and the computer device may specifically be a terminal or a server.
  • the computer device includes a processor, a memory, a network interface, a display and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer programs.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection.
  • the display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen;
  • the input device of the computer device may be a touch layer covering the display screen, a button, trackball or touchpad provided on the casing of the computer device, or an external keyboard, touchpad or mouse.
  • FIG. 11 is only a block diagram of a partial structure related to the solution of this application and does not constitute a limitation on the computer device to which the solution is applied;
  • a specific computer device may include more or fewer components than shown in the figures, combine certain components, or have a different arrangement of components.
  • a computer device including a memory, a processor, and a computer program stored in the memory and operable on the processor, and the steps of the above method are implemented when the processor executes the computer program.
  • a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of the above method are implemented.
  • The part-of-speech-aware nested named entity recognition method realizes that, after the text word data of the text to be recognized is obtained, the BiLSTM model extracts the text word depth features from the text word data, each text word of the text to be recognized is initialized as a corresponding graph node according to the depth features, the text heterogeneous graph of the text to be recognized is constructed according to the preset part-of-speech paths, the text word data of the graph nodes is updated through the attention mechanism, the BiLSTM model then extracts features from all graph nodes of the heterogeneous graph, and once the vector representations of the text to be decoded are obtained, a conditional random field is used for decoding and labeling,
  • obtaining the nested named entity recognition result; this technical solution applies heterogeneous graph representation learning to nested entity recognition, introduces part-of-speech knowledge to initialize text features, and combines the specially designed dilated random walk algorithm based on part-of-speech paths that samples more neighbor node information, relying on the DGL (Deep Graph Library) framework to effectively recognize ordinary and nested entities, improving the accuracy and learning efficiency of nested named entity recognition while further enhancing the performance of the nested named entity recognition model.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

A part-of-speech-aware nested named entity recognition method, system, device and storage medium. After the text word data of a text to be recognized is obtained, a BiLSTM model performs feature extraction on the text word data to obtain text word depth features, and each text word of the text to be recognized is initialized as a corresponding graph node according to the depth features; a text heterogeneous graph of the text to be recognized is constructed according to preset part-of-speech paths, the text word data of the graph nodes is updated through an attention mechanism, the BiLSTM model then extracts features from all graph nodes of the heterogeneous graph to obtain word vector representations of the text to be decoded, and a conditional random field decodes and labels them to obtain the nested named entity recognition result. The method recognizes ordinary entities and nested entities accurately and effectively, improving the efficiency of nested named entity recognition while further enhancing the performance of the nested named entity recognition model.

Description

Part-of-speech-aware nested named entity recognition method, system, device and storage medium

Technical Field

The present invention relates to the technical field of natural language processing and knowledge graph construction, and in particular to a part-of-speech-aware nested named entity recognition method, system, computer device and storage medium based on a heterogeneous graph attention neural network.

Background Art

Named Entity Recognition (NER) is one of the basic tasks in constructing knowledge graphs in the field of natural language processing. It is mainly used to extract entities with specific meaning that make up a knowledge graph, and it is an important basic tool for application fields such as information extraction, question answering systems, syntactic analysis, machine translation and metadata annotation for the Semantic Web, occupying an important position in the process of natural language processing technology becoming practical. In real natural language sequences there is a nesting phenomenon in which one entity contains one or more other entities. For example, in the text "Activation of the cd28 surface receptor provides", "cd28 surface" is an entity of type Protein, and "cd28 surface receptor" is also an entity of type Protein. Nested named entity recognition (Nested NER) is therefore an important and difficult problem in named entity recognition tasks; its role is to identify nested entities in text, and the key to such recognition lies in how to determine entity boundaries and predict entity categories.

Existing nested entity recognition methods fall into three main categories: (1) extracting entities from natural language by designing text matching rules, e.g. matching entities in text with rules hand-written by domain experts; (2) supervised learning methods based on feature engineering, e.g. predicting the text categories in a text sequence by designing feature templates combined with the Viterbi algorithm; (3) deep learning methods based on entity spans, e.g. deep learning that uses neural networks to extract character-level features of text, and exhaustive candidate-entity methods that directly enumerate the subsequences that may be entities and then make predictions on those subsequences. Although the existing technology can solve the nested entity recognition problem to a certain extent, it also has obvious defects: in the first category, rules hand-written by domain language experts are very time-consuming and labor-intensive and transfer poorly across domains; the second category belongs to statistical machine learning, is easily affected by the distribution of the text corpus, and generalizes poorly; the deep learning methods in the third category can extract character and word features of text but have many learnable parameters and high computational complexity, while the exhaustive candidate-entity method further increases the time complexity of the model, and simply enumerating text subsequences does not help improve model performance.

Summary of the Invention

The object of the present invention is to provide a part-of-speech-aware nested named entity recognition method, system, device and storage medium that apply heterogeneous graph representation learning to nested entity recognition, introduce part-of-speech knowledge to initialize text features, and combine a specially designed dilated random walk algorithm based on part-of-speech paths that samples more neighbor node information. Relying on the DGL (Deep Graph Library) framework, the heterogeneous graph is used to recognize ordinary entities and nested entities effectively, improving the accuracy and learning efficiency of nested named entity recognition while further enhancing the performance of the nested named entity recognition model.

To achieve the above object, and in view of the above technical problems, it is necessary to provide a part-of-speech-aware nested named entity recognition method, system, computer device and storage medium.
In a first aspect, an embodiment of the present invention provides a part-of-speech-aware nested named entity recognition method, the method comprising the following steps:

obtaining text word data of a text to be recognized; the text word data comprising a text sequence ID, a part-of-speech category, a word frequency and a word vector representation;

performing feature extraction on the text word data with a BiLSTM model to obtain corresponding text word depth features, and initializing each text word of the text to be recognized as a corresponding graph node according to the text word depth features;

constructing a text heterogeneous graph of the text to be recognized according to the transfer relationships between the graph nodes;

updating the text word depth features of the graph nodes in the text heterogeneous graph through an attention mechanism according to the text heterogeneous graph and preset part-of-speech paths;

performing feature extraction on all graph nodes in the updated text heterogeneous graph with the BiLSTM model to obtain word vector representations of the text to be decoded;

decoding and labeling the word vector representations of the text to be decoded to obtain a nested named entity recognition result.

Further, the step of obtaining the text word data of the text to be recognized comprises:

setting a corresponding text sequence ID for each text word according to the positional order of the text words in the text to be recognized;

performing part-of-speech tagging on the text to be recognized, and performing part-of-speech classification and word frequency statistics on each text word according to the tagging result to obtain the corresponding part-of-speech category and word frequency;

generating a word vector representation of each text word in the text to be recognized through a BERT model.
Further, the step of performing feature extraction on the text word data with the BiLSTM model to obtain the corresponding text word depth features comprises:

concatenating and integrating the text sequence ID, part of speech, word frequency and word vector representation of each piece of text word data to obtain text word initial features;

performing feature extraction on the text word initial features with the BiLSTM model to obtain the text word depth features, expressed as:

h(x_i) = BiLSTM(F(x_i))

where x_i, F(x_i) and h(x_i) respectively denote the text word data, text word initial feature and text word depth feature of the i-th text word, and F(x_i) is the concatenation of the text sequence ID, part-of-speech category, word frequency and word vector representation in the i-th piece of text word data.
Further, the text heterogeneous graph of the text to be recognized is expressed as:

G = (V, E, Ov, Path)

where V denotes the node set composed of text words of different parts of speech, the value of each node being a text word depth feature; E denotes the edge set formed between nodes; Ov denotes the set of node part-of-speech types; and Path denotes the preset part-of-speech paths, which include a verb-noun path, a noun modifier-noun path, a connective-noun path, and a verb modifier-verb path.
Further, the step of updating the text word depth features of the graph nodes in the text heterogeneous graph through the attention mechanism according to the text heterogeneous graph and the preset part-of-speech paths comprises:

performing a depth-first traversal of the text heterogeneous graph according to each preset part-of-speech path to obtain a corresponding graph node sequence;

performing neighbor node sampling on each graph node in each preset part-of-speech path according to the graph node sequence to obtain a corresponding neighbor node set;

integrating, through the attention mechanism, the node information of the neighbor node set of each graph node in each preset part-of-speech path to obtain the corresponding graph node representation:

h_v = σ( (1/k) · Σ over the k attention heads Σ_{n_j ∈ N_v^{Path_i}} α_{v,j} · h_{n_j} )

with weight coefficients

α_{v,j} = exp( LeakyReLU( u^T · [h_v ; h_{n_j}] ) ) / Σ_{n_m ∈ N_v^{Path_i}} exp( LeakyReLU( u^T · [h_v ; h_{n_m}] ) )

where v denotes a graph node in the i-th preset part-of-speech path, its value being the corresponding text word depth feature; N_v^{Path_i} denotes the neighbor node set of graph node v in the i-th preset part-of-speech path Path_i; n_j denotes the j-th neighbor node of graph node v in Path_i; k denotes the number of attention heads; α_{v,j} denotes the weight coefficient of the j-th neighbor node of graph node v in Path_i; h_v denotes the graph node representation of node v obtained from the attention computation over the k attention heads; exp(·) denotes the exponential function with base e; LeakyReLU(·) denotes the activation function; and u^T is the edge weight matrix;

updating the word frequency and word vector representation of the corresponding graph node in the text heterogeneous graph according to the graph node representations.
Further, the step of performing neighbor node sampling on each graph node in each preset part-of-speech path according to the graph node sequence to obtain the corresponding neighbor node set comprises:

obtaining the number of nodes on the preset part-of-speech path and determining a base sampling interval according to that number;

according to a preset sampling probability and a sampling stop condition, randomly obtaining, within the graph node sequence and with integer multiples of the base sampling interval as the moving step size, a number of neighbor nodes corresponding to each graph node in the preset part-of-speech path, to obtain the corresponding neighbor node set; the sampling stop condition being that the total number of sampled neighbor nodes meets a preset number requirement and the number of nodes of each part-of-speech category among the neighbor nodes meets a preset proportion requirement.
Further, the step of decoding and labeling the word vector representations of the text to be decoded to obtain the nested named entity recognition result comprises:

decoding and labeling the word vector representations of the text to be decoded with a conditional random field to obtain a named entity recognition result and first labeled text word vector representations;

performing boundary detection on the first labeled text word vector representations with an improved LSTM unit to judge whether entity boundary words exist in the first labeled text word vector representations, the improved LSTM unit being obtained by adding a multi-layer perceptron MLP on the output hidden layer of an LSTM unit;

if entity boundary words exist in the first labeled text word vector representations, merging the first labeled text word vector representations between adjacent entity boundary words to obtain second labeled text word vector representations, performing decoding, labeling and boundary detection on the second labeled text word vector representations and starting the next round of entity recognition iteration; otherwise stopping the iteration and taking the named entity recognition result as the forward named entity recognition result;

performing reverse filling according to the text word vector representations corresponding to the forward entity recognition result to obtain third labeled text word vector representations, and merging the third text word vector representations with the text word vector representations of the previous round of entity recognition iteration to obtain fourth labeled text word vector representations;

decoding and labeling the fourth labeled text word vector representations with the conditional random field to obtain the nested named entity recognition result.
In a second aspect, an embodiment of the present invention provides a part-of-speech-aware nested named entity recognition system, the system comprising:

a preprocessing module for obtaining text word data of a text to be recognized, the text word data comprising a text sequence ID, a part-of-speech category, a word frequency and a word vector representation;

a node initialization module for performing feature extraction on the text word data with a BiLSTM model to obtain corresponding text word depth features, and initializing each text word of the text to be recognized as a corresponding graph node according to the text word depth features;

a graph construction module for constructing a text heterogeneous graph of the text to be recognized according to the transfer relationships between the graph nodes;

a node update module for updating the text word depth features of the graph nodes in the text heterogeneous graph through an attention mechanism according to the text heterogeneous graph and preset part-of-speech paths;

a feature optimization module for performing feature extraction on all graph nodes in the updated text heterogeneous graph with the BiLSTM model to obtain word vector representations of the text to be decoded;

a result generation module for decoding and labeling the word vector representations of the text to be decoded to obtain a nested named entity recognition result.

In a third aspect, an embodiment of the present invention further provides a computer device comprising a memory, a processor and a computer program stored in the memory and runnable on the processor, the processor implementing the steps of the above method when executing the computer program.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the steps of the above method when executed by a processor.
The above application provides a part-of-speech-aware nested named entity recognition method, system, computer device and storage medium. Through the method, after the text word data of the text to be recognized is obtained, a BiLSTM model performs feature extraction on the text word data to obtain text word depth features, each text word of the text to be recognized is initialized as a corresponding graph node according to the depth features, a text heterogeneous graph of the text to be recognized is constructed according to preset part-of-speech paths, the text word data of the graph nodes is updated through an attention mechanism, the BiLSTM model then extracts features from all graph nodes of the heterogeneous graph, and after the word vector representations of the text to be decoded are obtained, a conditional random field decodes and labels them to obtain the nested named entity recognition result. Compared with the prior art, this part-of-speech-aware nested named entity recognition method applies heterogeneous graph representation learning to nested entity recognition, introduces part-of-speech knowledge to initialize text features, and combines a specially designed dilated random walk algorithm based on part-of-speech paths that samples more neighbor node information; relying on the DGL (Deep Graph Library) framework, it effectively recognizes ordinary entities and nested entities, improving the accuracy and learning efficiency of nested named entity recognition while further enhancing the performance of the nested named entity recognition model.
Brief Description of the Drawings

Fig. 1 is a schematic diagram of an application scenario of the part-of-speech-aware nested named entity recognition method in an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the part-of-speech-aware nested named entity recognition model PANNER in an embodiment of the present invention;
Fig. 3 is a schematic flowchart of the part-of-speech-aware nested named entity recognition method in an embodiment of the present invention;
Fig. 4 is a text heterogeneous graph containing nodes of multiple different parts of speech;
Fig. 5 is an example diagram of bottom-up forward decoding and top-down reverse decoding in an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the LSTM unit improved for boundary detection in an embodiment of the present invention;
Fig. 7 compares the performance and the time consumption, for the same number of sampled nodes, of different sampling algorithms;
Fig. 8 is a schematic diagram of the influence of the number of sampled nodes on the performance and time consumption of the PANNER model in an embodiment of the present invention;
Fig. 9 is a schematic diagram of the influence of the word vector dimension on the performance and time consumption of the PANNER model in an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of the part-of-speech-aware nested named entity recognition system in an embodiment of the present invention;
Fig. 11 is an internal structure diagram of a computer device in an embodiment of the present invention.
Detailed Description

To make the objects, technical solutions and beneficial effects of the present application clearer, the present invention is further described in detail below with reference to the drawings and embodiments. Obviously, the embodiments described below are only a part of the embodiments of the present invention, used to illustrate but not to limit its scope. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

The nested named entity recognition method provided by the present invention can be applied to a terminal or a server as shown in Fig. 1. The terminal may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices; the server may be implemented as an independent server or as a server cluster composed of multiple servers. Based on the English text corpus to be recognized, the server can use the part-of-speech-aware nested named entity recognition method provided by the present invention to accurately and effectively recognize the ordinary named entities and nested named entities in the corresponding text corpus according to the part-of-speech-aware nested entity recognition model shown in Fig. 2, and apply the final recognition results to other learning tasks on the server or transmit them to the terminal for reception and use by end users.
In one embodiment, as shown in Fig. 3, a part-of-speech-aware nested named entity recognition method is provided, comprising the following steps:

S11: obtain the text word data of the text to be recognized; the text word data includes the text sequence ID, part-of-speech category, word frequency and word vector representation. The corpus to be recognized is any English text sequence requiring nested named entity recognition, and the corresponding text word data is the data obtained by preprocessing each text word in the text to be recognized, i.e. the data features required for subsequent nested named entity recognition. The text sequence ID and part-of-speech category of the text word data do not change in subsequent training and learning, whereas the word frequency and word vector representation are continuously updated. Specifically, the step of obtaining the text word data of the text to be recognized includes:

setting a corresponding text sequence ID for each text word according to the positional order of the text words in the text to be recognized;

performing part-of-speech tagging on the text to be recognized, and performing part-of-speech classification and word frequency statistics on each text word according to the tagging results to obtain the corresponding part-of-speech category and word frequency. Part-of-speech tagging uses an existing English part-of-speech tagging tool; either nltk or Stanfordnlp can tag the text to be recognized, and no specific restriction is made here. In principle the part-of-speech categories could follow the existing part-of-speech categories of English words, but considering that there are too many neutral words, and in order to improve subsequent learning efficiency, this embodiment classifies the parts of speech of English words into the categories shown in Table 1. The word frequency indicates how often each text word occurs: suppose the text to be recognized contains N different words in C different categories and has length T, and the t-th text word occurs f times with part-of-speech classification c, where t∈T and c∈C; then the word's text sequence ID is id=t, its part-of-speech category is cat=c, and its word frequency is freq=f/T. For example, for the text to be recognized "Guangzhou University is located in Guangzhou, the capital of Guangdong", a part-of-speech tagging tool is used to obtain the parts of speech of the text, yielding the text sequence ID, part of speech, part-of-speech category and word frequency information shown in Table 2.
Table 1: Part-of-speech classification table
[Table 1 appears as an image in the original publication; its six groups are described below.]
In Table 1, the first group is the set of all nouns, including singular and plural common nouns and proper nouns; the second group is the set of verbs, including their base form, third-person singular, past tense, etc.; the third group is noun modifiers, including cardinal numerals, adjectives, and comparative and superlative adjectives, etc.; the fourth group is verb modifiers, including adverbs, determiners, etc.; the fifth group is relation words, including modal verbs, conjunctions, prepositions, etc.; the sixth group is article identifiers, including commas, periods and paragraph separators, etc., and this group is removed in practical applications.
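A minimal grouping sketch in Python, assuming Penn Treebank tags as produced by nltk; the exact tag-to-group assignment in the Table 1 image is not reproduced here, so this mapping follows the prose description above (the worked example in Table 2 groups a few tags slightly differently):

```python
# Hypothetical Penn-Treebank-to-group mapping following the six-group prose
# description; the patent's own Table 1 (an image) may assign tags differently.
POS_GROUPS = {
    1: {"NN", "NNS", "NNP", "NNPS"},               # nouns, incl. proper nouns
    2: {"VB", "VBZ", "VBD", "VBG", "VBN", "VBP"},  # verbs in all forms
    3: {"CD", "JJ", "JJR", "JJS"},                 # noun modifiers
    4: {"RB", "RBR", "RBS", "DT"},                 # verb modifiers, determiners
    5: {"MD", "CC", "IN"},                         # relation words
    6: {",", ".", ":"},                            # article identifiers (removed)
}

def pos_group(tag: str) -> int:
    """Map a fine-grained part-of-speech tag to one of the six coarse groups."""
    for group_id, tags in POS_GROUPS.items():
        if tag in tags:
            return group_id
    return 6  # treat anything unrecognized as a removable identifier
```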
Table 2: Corpus ID, part of speech, part-of-speech category and word frequency information

Corpus          Guangzhou  University  is     located  in         the            capital  of             Guangdong
Id              0          1           2      3        4          5              6        7              8
Part of speech  Noun       Noun        VBS    VBP      IN         DT             Noun     CC             Noun
Group           Noun       Noun        Verbs  Verbs    Relations  Noun modifier  Noun     Noun modifier  Noun
Frequency       2/10       1/10        1/10   1/10     1/10       1/10           1/10     1/10           1/10
Word vector representations of each text word in the text to be recognized are generated through the BERT model. The BERT (Bidirectional Encoder Representation from Transformers) model is an NLP pre-training technique, namely the encoder of a bidirectional Transformer; it can be used to train vector representations of a text sequence, or vector representations of each word in the sequence. In this embodiment, considering that the position of a text word in the text to be recognized is strongly correlated with its semantics, and to facilitate the random-batch training that accompanies subsequent neighbor node sampling, the BERT model is chosen to effectively train the word vector of each word in the text to be recognized, thereby obtaining the word vector representation in the text word data; this representation is updated accordingly during the random-batch training of subsequent neighbor node sampling.
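A minimal preprocessing sketch, assuming nltk for part-of-speech tagging and the HuggingFace transformers BERT encoder for word vectors; the checkpoint name, the first-subtoken pooling, and the `pos_group` helper (from the grouping sketch above) are illustrative assumptions, not taken from the patent:

```python
from collections import Counter

import nltk                          # assumes the perceptron tagger is downloaded
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
bert = AutoModel.from_pretrained("bert-base-cased")

def build_text_word_data(text: str):
    words = nltk.word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(words)]
    counts, total = Counter(words), len(words)

    # One BERT pass; use each word's first sub-token vector as its embedding.
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]
    first_subtoken = [enc.word_ids(0).index(i) for i in range(len(words))]

    return [
        {
            "id": t,                           # text sequence ID: position order
            "cat": pos_group(tags[t]),         # part-of-speech category (Table 1)
            "freq": counts[words[t]] / total,  # word frequency f/T
            "w2v": hidden[first_subtoken[t]],  # BERT word vector representation
        }
        for t in range(total)
    ]
```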
S12: use the BiLSTM model to perform feature extraction on the text word data to obtain the corresponding text word depth features, and initialize each text word of the text to be recognized as a corresponding graph node according to the depth features. The BiLSTM model is a bidirectional LSTM network encoder; this embodiment uses it to perform forward feature extraction and backward feature extraction on the text word data and concatenates the resulting forward and backward features into the corresponding text word depth features. The specific steps are as follows:

Concatenate and integrate the text sequence ID, part of speech, word frequency and word vector representation of each piece of text word data to obtain the text word initial features. The text word initial feature is the d-dimensional word vector formed by concatenating the text sequence ID, part-of-speech category, word frequency and word vector representation, where the text sequence ID and the part of speech occupy 1 dimension each and the word frequency and word vector representation together occupy d-2 dimensions. Here x_i and F(x_i) respectively denote the text word data and the text word initial feature of the i-th text word, F(x_i) being the concatenation of the text sequence ID, part-of-speech category, word frequency and word vector representation in the i-th piece of text word data.
Use the BiLSTM model to perform feature extraction on the text word initial features to obtain the text word depth features, expressed as:

h(x_i) = BiLSTM(F(x_i))

where h(x_i) denotes the text word depth feature corresponding to the i-th text word x_i, i.e. the hidden layer vector of the preprocessed text word x_i after BiLSTM encoding. The corresponding BiLSTM model is implemented as follows:

The BiLSTM network encoder first encodes the text words of the text to be recognized in forward order, obtaining the forward feature h_i→ of each text word according to the following formulas:

f_i = σ(w_f · [h_{i-1}, F(x_i)] + b_f)
i_i = σ(w_i · [h_{i-1}, F(x_i)] + b_i)
C̃_i = tanh(w_C · [h_{i-1}, F(x_i)] + b_C)
C_i = f_i · C_{i-1} + i_i · C̃_i
o_i = σ(w_o · [h_{i-1}, F(x_i)] + b_o)
h_i→ = o_i · tanh(C_i)

where F(x_i) denotes the i-th text word in the text to be recognized, and h_{i-1}, f_i, i_i, C̃_i, C_i and o_i respectively denote the previous cell state, forget gate output, memory gate output, temporary cell state, current cell state and output gate for the i-th text word; w and b are the weights and bias terms of the three gates and three states.

Likewise, the BiLSTM network encoder then encodes the text words of the text to be recognized in reverse order, obtaining the backward feature h_i← of each text word.

After the forward feature h_i→ and backward feature h_i← of a text word have been obtained by the above steps, the two are concatenated into the text word depth feature h_i:

h_i = Concat(h_i→, h_i←)

where Concat(·) denotes the function that concatenates two vectors horizontally by rows. To keep subsequent processing with the text word depth features efficient, the depth features h_i obtained above can further be normalized.
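A minimal sketch of step S12, assuming PyTorch: the four fields are concatenated into a d-dimensional initial feature F(x_i), the sequence is encoded with a bidirectional LSTM, and the resulting depth features h_i are L2-normalized; the hidden size is an illustrative assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthFeatureEncoder(nn.Module):
    def __init__(self, w2v_dim: int, hidden_dim: int = 128):
        super().__init__()
        # ID and category take 1 dimension each; frequency + vector take d - 2.
        d = 1 + 1 + 1 + w2v_dim
        self.bilstm = nn.LSTM(d, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, word_data):
        feats = torch.stack([
            torch.cat([
                torch.tensor([w["id"], w["cat"], w["freq"]], dtype=torch.float),
                w["w2v"],
            ])
            for w in word_data
        ]).unsqueeze(0)                    # (1, T, d) initial features F(x_i)
        h, _ = self.bilstm(feats)          # forward/backward states, concatenated
        return F.normalize(h.squeeze(0), dim=-1)  # depth features h_i, one per word
```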
S13: construct the text heterogeneous graph of the text to be recognized according to the transfer relationships between the graph nodes. A heterogeneous graph, as shown in Fig. 4, consists of vertices and edges, with more than one type of node and/or edge (at least one of nodes and edges has multiple types); it is common in knowledge graph scenarios, and the simplest way to handle heterogeneous information is to one-hot encode the type information and concatenate it onto the node's original representation. In this embodiment, the text heterogeneous graph of the text to be recognized, composed of the graph nodes initialized from the text word depth features h_i, is expressed as:

G = (V, E, Ov, Path)

where V denotes the node set composed of text words of different parts of speech, the value of each node being a text word depth feature; E denotes the edge set formed between nodes, the weight of an edge being the logarithm of the ratio of the word co-occurrence frequency of the nodes on the edge to the product of their word frequencies; Ov denotes the set of node part-of-speech types; Path denotes the preset part-of-speech paths, including a verb-noun path, a noun modifier-noun path, a connective-noun path and a verb modifier-verb path. It should be noted that the preset part-of-speech paths can be designed for the actually recognized text content by the following steps: count the proportion of words of each part-of-speech category in the corpus, select the part-of-speech category with the highest proportion of words and a high degree of overlap with named entities, and design the corresponding part-of-speech paths reasonably around it as the center, thereby introducing grammatical knowledge; the heterogeneous graph is created on top of the Deep Graph Library (DGL), which facilitates subsequent computation on it. This embodiment uses the text sequences of the GENIA dataset; considering that nouns account for the highest proportion in this dataset and that the overlap between nouns and named entities is high, the preset part-of-speech paths shown in Table 3 are designed with nouns as the center.
Table 3: Preset part-of-speech paths designed with nouns as the center

Path_id  Path                   Path_node
1        Verbs-Nouns            2-1
2        Noun modifier-Nouns    3-1
3        Verbs modifier-Verbs   4-2
4        Relations-Nouns        5-1
All four paths in Table 3 are centered on nouns: preset part-of-speech path 1 (the relationship between verbs and nouns), preset part-of-speech path 2 (the relationship between noun modifiers and nouns), preset part-of-speech path 3 (the relationship between verb modifiers and verbs) and preset part-of-speech path 4 (the relationship between connectives and nouns). In the subsequent neighbor node sampling process, the nodes to be updated are the nodes on the part-of-speech paths.

The heterogeneous graph is constructed on the Deep Graph Library framework; the raw data format is <src, edg, dst>, where src is the source node, edg the edge and dst the destination node. A node's initial text word features include the node type (part-of-speech category), node position number (text sequence ID), node occurrence frequency (word frequency), and the word vector representation obtained by BERT pre-training; nodes are continuously updated in subsequent training. An edge's initial features include the edge weight (the logarithm of the ratio of the word co-occurrence frequency of the nodes on the edge to the product of their word frequencies), which is also continuously updated in subsequent training.
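A minimal construction sketch using DGL's `dgl.heterograph` API; the node/edge type names, node counts and toy statistics are illustrative assumptions — the patent only fixes the <src, edg, dst> triple format and the edge weight log(co-occurrence / (freq_src · freq_dst)):

```python
import math

import dgl
import torch

# One relation per preset part-of-speech path, e.g. path 1: Verbs -> Nouns.
graph_data = {
    ("Verbs", "path1", "Nouns"): (torch.tensor([0]), torch.tensor([0])),
    ("Noun_modifier", "path2", "Nouns"): (torch.tensor([0]), torch.tensor([1])),
}
g = dgl.heterograph(graph_data)

# Node values are the BiLSTM depth features; edges carry the log-ratio weight.
g.nodes["Nouns"].data["h"] = torch.randn(g.num_nodes("Nouns"), 256)
cooc, f_src, f_dst = 2 / 10, 1 / 10, 2 / 10        # toy corpus statistics
g.edges["path1"].data["w"] = torch.tensor([math.log(cooc / (f_src * f_dst))])
```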
S14: update the text word depth features of the graph nodes in the text heterogeneous graph through the attention mechanism according to the text heterogeneous graph and the preset part-of-speech paths. Updating the text word depth features of the graph nodes can be understood as updating the word frequency and word vector representation within the text word depth features of the nodes on all preset part-of-speech paths. To obtain more neighbor node information and update the node features on each preset part-of-speech path more reliably and effectively, this embodiment designs a dilated random walk algorithm based on part-of-speech paths to sample the neighbor nodes of the graph nodes on each preset part-of-speech path, and uses the attention mechanism to compute the node representation of the corresponding graph node from the sampled neighbor node set so as to update the text word depth features. Specifically, the step of updating the text word depth features of the graph nodes in the text heterogeneous graph through the attention mechanism according to the text heterogeneous graph and the preset part-of-speech paths comprises:

performing a depth-first traversal of the text heterogeneous graph according to each preset part-of-speech path to obtain the corresponding graph node sequence;

performing neighbor node sampling on each graph node in each preset part-of-speech path according to the graph node sequence to obtain the corresponding neighbor node set, which comprises:

obtaining the number of nodes on the preset part-of-speech path and determining the base sampling interval from it. The base sampling interval is the "hole" (dilation) in the dilated random walk algorithm above: during sampling there is an interval whose length equals the number of nodes on the preset part-of-speech path, and randomness means that the selection of sampled nodes is random;

according to the preset sampling probability and the sampling stop condition, randomly obtaining, within the graph node sequence and with integer multiples of the base sampling interval as the moving step size, a number of neighbor nodes for each graph node on the preset part-of-speech path, to obtain the corresponding neighbor node set; the sampling stop condition is that the total number of sampled neighbor nodes meets the preset number requirement and the number of nodes of each part-of-speech category among the neighbor nodes meets the preset proportion requirement.

Specifically, suppose the text heterogeneous graph is G = (V, E, Ov, Path), with node set V, edge set E, node types Ov and preset part-of-speech paths Path including 2-1, 3-1, 4-2 and 5-1. Taking preset part-of-speech path 2-1 as an example, the neighbor node sampling process is as follows: starting from the nodes on path 2-1, a depth-first search (DFS) of the graph yields the graph node sequence {2, 1, 0, 4, 3, 5, 6, 7, 8}; the length (number of nodes) of path 2-1 is 2, so 2 is taken as the base sampling interval, and the 2nd-order neighbor nodes of node 2 on path 2-1 (i.e. nodes separated from node 2 by an integer multiple of 2 positions) are sampled: each 2nd-order neighbor of node 2 is selected as a neighbor of node 2 with preset probability p and discarded with probability 1-p, giving in turn 0, 3, 6 and 8. Similarly, the 2nd-order neighbor nodes of node 1 on path 2-1 (nodes separated from node 1 by an integer multiple of 2 positions) are sampled, each selected with probability p and discarded with probability 1-p, giving in turn 4, 5 and 7. It should be noted that the stop condition of the neighbor sampling process for the graph nodes on each preset path is that the total number of obtained neighbor nodes reaches the preset number and the numbers of the other categories of neighbor nodes also reach specific proportions. To guarantee the generalization ability of the model, the preset sampling probability p in this embodiment is generated randomly, and the proportions of the different part-of-speech categories among the neighbor nodes are the proportions of the respective categories in the original corpus.
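A minimal sketch of the dilated random walk sampler described above: from a DFS node sequence, positions at integer multiples of the base interval (the "hole", equal to the path length) are visited, and each visited node is kept with probability p until the quota is met; the simplified per-category quota handling is an assumption:

```python
import random

def dilated_random_walk(dfs_order, start_idx, interval, p, max_neighbors):
    """Sample neighbors of dfs_order[start_idx] at multiples of `interval`."""
    neighbors, step = [], interval
    while len(neighbors) < max_neighbors and step < len(dfs_order):
        candidate = dfs_order[(start_idx + step) % len(dfs_order)]
        if random.random() < p:      # keep with probability p, skip with 1 - p
            neighbors.append(candidate)
        step += interval             # move by integer multiples of the interval
    return neighbors

# Worked example from the text: path 2-1 gives interval 2; node 2 opens the
# DFS order {2, 1, 0, 4, 3, 5, 6, 7, 8}, so with p = 1 its sampled 2nd-order
# neighbors are 0, 3, 6 and 8.
print(dilated_random_walk([2, 1, 0, 4, 3, 5, 6, 7, 8], 0, 2, 1.0, 4))
```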
After the neighbor node set of each graph node on each preset part-of-speech path has been obtained by the above steps, and considering that neighbor node words of different categories influence a graph node differently, each graph node is updated reasonably and effectively by the following method, which uses the attention mechanism to determine the importance of the neighbor nodes to the current node.

Through the attention mechanism, the node information of the neighbor node set of each graph node on each preset part-of-speech path is integrated to obtain the corresponding graph node representation:

h_v = σ( (1/k) · Σ over the k attention heads Σ_{n_j ∈ N_v^{Path_i}} α_{v,j} · h_{n_j} )

with the weight coefficients given by

α_{v,j} = exp( LeakyReLU( u^T · [h_v ; h_{n_j}] ) ) / Σ_{n_m ∈ N_v^{Path_i}} exp( LeakyReLU( u^T · [h_v ; h_{n_m}] ) )

where v denotes a graph node on the i-th preset part-of-speech path, its value being the corresponding text word depth feature; N_v^{Path_i} denotes the neighbor node set of graph node v on the i-th preset part-of-speech path Path_i; n_j denotes the j-th neighbor node of graph node v on Path_i; k denotes the number of attention heads; α_{v,j} denotes the weight coefficient of the j-th neighbor node of graph node v on Path_i; h_v denotes the graph node representation of node v obtained from the attention computation over the k attention heads; exp(·) denotes the exponential function with base e; LeakyReLU(·) denotes the activation function; u^T is the edge weight matrix, which is continuously updated during sampling as the word frequencies and word co-occurrence frequencies of the edge nodes change.
The above dilated random walk algorithm DilatedRandomWalk is implemented as follows:
[Algorithm listing rendered as an image in the original publication.]
According to the graph node representations, the word frequency and word vector representation of the corresponding graph nodes in the text heterogeneous graph are updated.
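A minimal sketch of the neighbor-attention update, assuming PyTorch and following the formulas above: a LeakyReLU-scored softmax over the sampled neighbors of v on one part-of-speech path, aggregated over k attention heads (averaging the heads, rather than concatenating them, is an assumption):

```python
import torch
import torch.nn.functional as F

def update_node(h_v, h_neighbors, u, k_heads):
    """h_v: (d,); h_neighbors: (n, d); u: (k_heads, 2 * d) edge weight matrix."""
    outputs = []
    for k in range(k_heads):
        pairs = torch.cat([h_v.expand_as(h_neighbors), h_neighbors], dim=-1)
        scores = F.leaky_relu(pairs @ u[k])   # LeakyReLU(u^T [h_v ; h_{n_j}])
        alpha = torch.softmax(scores, dim=0)  # exp(...) / sum exp(...) weights
        outputs.append(alpha.unsqueeze(-1).mul(h_neighbors).sum(dim=0))
    return torch.sigmoid(torch.stack(outputs).mean(dim=0))  # aggregate k heads

h_v, h_nbrs = torch.randn(256), torch.randn(4, 256)
u = torch.randn(8, 512)                       # k = 8 attention heads
print(update_node(h_v, h_nbrs, u, 8).shape)   # torch.Size([256])
```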
S15: use the BiLSTM model to perform feature extraction on all graph nodes in the updated text heterogeneous graph to obtain the word vector representations of the text to be decoded. The word vector representations of the text to be decoded can be understood as obtained by BiLSTM feature extraction over the graph nodes on all preset part-of-speech paths and the graph nodes node_P not on any preset part-of-speech path, which can be expressed as:

h = BiLSTM(v_f)

where v_f is the set of all nodes in the heterogeneous graph, including the nodes on the part-of-speech paths and the nodes node_P not on the part-of-speech paths. The process by which the BiLSTM model extracts features from graph nodes can be found in the earlier description of obtaining the text word depth features of each text word and is not repeated here.
S16: decode and label the word vector representations of the text to be decoded to obtain the nested named entity recognition result. As shown in Fig. 5, the nested named entity recognition result is obtained by decoding layer by layer and jointly decoding the named entities in a combined bottom-up and top-down manner. Specifically, the step of decoding and labeling the word vector representations of the text to be decoded to obtain the nested named entity recognition result comprises:

decoding and labeling the word vector representations of the text to be decoded with a conditional random field to obtain the named entity recognition result and the first labeled text word vector representations. The conditional random field (CRF) model is as follows:

p(y | x) = (1/Z(x)) · exp( Σ_j λ_j · f_j(y, x) )

where y denotes the label; f_j denotes a feature function; λ_j denotes the weight of the feature function; and Z(x) denotes the normalization factor.
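A minimal sketch of CRF decoding, assuming per-position label scores (emissions) and a label-transition matrix in place of the patent's weighted feature functions; Viterbi search returns the highest-scoring label sequence, which is what the normalized CRF above is decoded for:

```python
import torch

def viterbi_decode(emissions, transitions):
    """emissions: (T, L) label scores; transitions: (L, L) label-pair scores."""
    T, L = emissions.shape
    score, backptr = emissions[0], []
    for t in range(1, T):
        # total[i, j]: best score ending in label j at t via label i at t - 1.
        total = score.unsqueeze(1) + transitions + emissions[t]
        score, idx = total.max(dim=0)
        backptr.append(idx)
    best = [int(score.argmax())]
    for idx in reversed(backptr):          # trace the best path backwards
        best.append(int(idx[best[-1]]))
    return best[::-1]

print(viterbi_decode(torch.randn(6, 5), torch.randn(5, 5)))  # e.g. 5 BIO-style tags
```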
Boundary detection is performed on the first labeled text word vector representations with the improved LSTM unit, judging whether entity boundary words exist in the first labeled text word vector representations; the improved LSTM unit is obtained by adding a multi-layer perceptron MLP on the output hidden layer of an LSTM unit. As shown in Fig. 6, compared with a plain LSTM unit, two nonlinear activation layers and a softmax fully connected layer classifier are added on top of the output hidden layer. In this embodiment, the LSTM unit with the multi-layer perceptron MLP is used to recognize the boundary words of the text and to fuse the boundary information into the hidden layer vector, providing a reliable and effective basis for the subsequent recognition of ordinary entities and nested entities.
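A minimal sketch of the improved LSTM unit for boundary detection, assuming PyTorch: an MLP with two nonlinear activation layers and a softmax classifier stacked on the LSTM output hidden layer, as Fig. 6 describes; the hidden sizes, activation choice and binary boundary/non-boundary label set are assumptions:

```python
import torch
import torch.nn as nn

class BoundaryLSTM(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 128, n_labels: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.mlp = nn.Sequential(              # MLP on the output hidden layer
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_labels), nn.Softmax(dim=-1),
        )

    def forward(self, x):                      # x: (B, T, in_dim) tagged vectors
        h, _ = self.lstm(x)
        return self.mlp(h)                     # per-word boundary probabilities

probs = BoundaryLSTM(256)(torch.randn(1, 6, 256))
print(probs.shape)                             # torch.Size([1, 6, 2])
```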
If entity boundary words exist in the first labeled text word vector representations, the first labeled text word vector representations between adjacent entity boundary words are merged to obtain the second labeled text word vector representations, which are then decoded, labeled and boundary-detected to start the next round of entity recognition iteration; otherwise the iteration stops and the named entity recognition result is taken as the forward named entity recognition result. The method for merging the first labeled text word vector representations between adjacent entity boundary words can be chosen according to actual application requirements; this embodiment preferably uses a one-dimensional convolutional neural network Conv1d with a convolution kernel of 2, whose sliding window size n can be determined from the number of text words actually detected between the entity boundary words. That is, the one-dimensional convolutional neural network merges the entity boundary words and the sequence between them; the second labeled text word vector representation of the resulting sequence of several words corresponds to the text region with starting range [t, t+n], and the one-dimensional convolutional neural network is expressed as:

h_t^(l+1) = Conv1d( h_t^(l), …, h_{t+n}^(l) )

where h_t^(l) and h_{t+n}^(l) are respectively the first labeled text word vector representations of the t-th and (t+n)-th words before merging in the l-th named entity recognition iteration, h_t^(l+1) is the second labeled text word vector representation of the t-th word after merging, and Conv1d(·) is the one-dimensional convolutional neural network.
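A minimal sketch of merging the span between two detected entity boundary words with a kernel-2 one-dimensional convolution, per the formula above; treating the span merge as repeated stride-1 Conv1d passes until a single vector remains is an interpretive assumption:

```python
import torch
import torch.nn as nn

d = 256
conv = nn.Conv1d(in_channels=d, out_channels=d, kernel_size=2)

def merge_span(span):                    # span: (n + 1, d), words in [t, t + n]
    x = span.t().unsqueeze(0)            # -> (1, d, n + 1) for Conv1d
    while x.size(-1) > 1:
        x = conv(x)                      # each pass shortens the span by one
    return x.squeeze()                   # single merged word vector (d,)

print(merge_span(torch.randn(3, d)).shape)   # torch.Size([256])
```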
The process of obtaining the forward named entity recognition result by the above steps is referred to for short as the bottom-up decoding process. To reduce recognition errors, after the bottom-up forward decoding is completed, the following top-down reverse decoding process is added for deviation correction.

Reverse filling is performed according to the text word vector representations corresponding to the forward entity recognition result to obtain the third labeled text word vector representations, and the third text word vector representations are merged with the text word vector representations of the previous round of entity recognition iteration to obtain the fourth labeled text word vector representations. Reverse filling uses the one-dimensional convolutional neural network Conv1d to refill the currently decoded sequence so that the total sequence length is consistent with the previous layer, yielding third labeled text word vector representations of the same length as the text word vector representations of the previous round of entity recognition iteration. This process of reverse filling and re-decoding can be referred to for short as the top-down decoding process.

The fourth labeled text word vector representations are decoded and labeled with the conditional random field to obtain the nested named entity recognition result. Through layer-by-layer decoding combined with bottom-up and top-down joint decoding, this embodiment effectively guarantees the accuracy of nested named entity recognition.
The implementation flow of the PANNER model corresponding to the part-of-speech-aware nested named entity recognition method of the present application is as follows:
[Model flow listing rendered as an image in the original publication.]
By introducing part-of-speech knowledge to initialize the text word data, using the BiLSTM model to extract text word depth features from the text word data and initializing them as graph nodes, constructing the text heterogeneous graph of the text to be recognized from grammatical relationships, combining the specially designed dilated random walk algorithm based on part-of-speech paths that samples more neighbor node information, and relying on the DGL (Deep Graph Library) framework, the embodiments of the present application recognize ordinary entities and nested entities effectively, improving the accuracy and learning efficiency of nested named entity recognition while further enhancing the performance of the nested named entity recognition model.

To verify the technical effect of the part-of-speech-aware nested named entity recognition method of the present invention, the PANNER model corresponding to the above method was trained and optimized on the obtained English text corpus with the negative log-likelihood loss and stochastic gradient descent (SGD), and nested named entity recognition was validated on the GENIA dataset. Comparisons were made with the recognition results of other models, across different entity layers of the PANNER model of the present application, and across preset part-of-speech paths centered on different parts of speech; the results are shown in Tables 4-6 respectively. From the reported precision, recall and F1 scores it can be seen that the overall results of the nested named entity recognition method of the present invention on the GENIA dataset improve on those of comparable named entity recognition models. In addition, the running performance and time consumption of the dilated random walk algorithm for neighbor node sampling and of the whole nested named entity recognition model PANNER of the present application were verified, with the results shown in Figs. 7-9, further confirming that, on top of the high accuracy of the nested named entity recognition method of the present invention, the learning efficiency and running performance of the model have considerable advantages over existing models.
Table 4: Comparison of experimental results on the GENIA dataset

Model         HMM    CRF    LSTM   LSTM-CRF  BENSC  Hypergraph  CGN    HAN    PANNER
Precision(%)  82.15  86.13  84.25  82.46     78.9   74.42       82.18  85.18  84.18
Recall(%)     56.73  58.39  64.29  65.39     72.7   67.58       67.81  70.53  73.98
F1-score(%)   67.74  69.60  72.93  72.94     75.67  70.84       74.31  77.17  78.75
Table 5: Performance of the PANNER model of the present application on different entity layers of the GENIA dataset

Layers        1      2      3      4      All layers
Precision(%)  84.13  84.91  84.37  82.46  84.18
Recall(%)     72.23  76.87  73.18  68.39  73.98
F1-score(%)   78.30  80.69  78.38  74.77  78.75
Table 6: Comparison results of the PANNER model of the present application on preset part-of-speech paths centered on different parts of speech
[Table 6 rendered as an image in the original publication.]
It should be noted that although the steps in the above flowcharts are shown in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly ordered, and they may be executed in other orders.
In one embodiment, as shown in Fig. 10, a part-of-speech-aware nested named entity recognition system is provided, the system comprising:

a preprocessing module 1 for obtaining text word data of a text to be recognized, the text word data comprising a text sequence ID, a part-of-speech category, a word frequency and a word vector representation;

a node initialization module 2 for performing feature extraction on the text word data with a BiLSTM model to obtain corresponding text word depth features, and initializing each text word of the text to be recognized as a corresponding graph node according to the text word depth features;

a graph construction module 3 for constructing a text heterogeneous graph of the text to be recognized according to the transfer relationships between the graph nodes;

a node update module 4 for updating the text word depth features of the graph nodes in the text heterogeneous graph through an attention mechanism according to the text heterogeneous graph and the preset part-of-speech paths;

a feature optimization module 5 for performing feature extraction on all graph nodes in the updated text heterogeneous graph with the BiLSTM model to obtain word vector representations of the text to be decoded;

a result generation module 6 for decoding and labeling the word vector representations of the text to be decoded to obtain a nested named entity recognition result.
关于一种词性感知嵌套命名实体识别系统的具体限定可以参见上文中对于一种词性感知嵌套命名实体识别方法的限定,在此不再赘述。上述一种词性感知嵌套命名实体识别系统中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
图11示出一个实施例中计算机设备的内部结构图,该计算机设备具体可以是终端或服务器。如图11所示,该计算机设备包括通过系统总线连接的处理器、存储器、网络接口、显示器和输入装置。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机程序。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机程序被处理器执行时以实现一种词性感知嵌套命名实体识别方法。该计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏,该计算机设备的输入装置可以是显示屏上覆盖的触摸层,也可以是计算机设备外壳上设置的按键、轨迹球或触控板,还可以是外接的键盘、触控板或鼠标等。
本领域普通技术人员可以理解,图11中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有同的部件布置。
在一个实施例中,提供了一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,处理器执行计算机程序时实现上述方法的步骤。
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the above method.
In summary, embodiments of the present invention provide a part-of-speech-aware nested named entity recognition method, system, computer device and storage medium. After obtaining the text word data of the text to be recognized, the method uses a BiLSTM model to extract text word deep features from the text word data, initializes each text word of the text to be recognized as a corresponding graph node according to the text word deep features, constructs the text heterogeneous graph of the text to be recognized according to the preset part-of-speech paths, and updates the text word data of the graph nodes through an attention mechanism. The BiLSTM model is then applied to all graph nodes of the text heterogeneous graph to obtain word vector representations to be decoded, which are decoded and labeled by a conditional random field to obtain the nested named entity recognition result. By applying heterogeneous graph representation learning to nested entity recognition, introducing part-of-speech knowledge to initialize the text features, combining the designed part-of-speech-path-based dilated random walk algorithm that samples more neighbor node information, and relying on the DGL (Deep Graph Library) framework, the technical solution effectively recognizes both flat and nested entities, improving the accuracy and learning efficiency of nested named entity recognition while further enhancing the performance advantages of the nested named entity recognition model.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, its description is relatively brief, and reference may be made to the relevant description of the method embodiment. It should be noted that the technical features of the above embodiments can be combined arbitrarily; for brevity of description, not all possible combinations of the technical features in the above embodiments have been described. However, as long as such combinations are not contradictory, they should be regarded as falling within the scope of this specification.
The above embodiments only express several preferred implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art may make several improvements and substitutions without departing from the technical principles of the present invention, and these improvements and substitutions should also be regarded as falling within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the protection scope of the appended claims.

Claims (10)

  1. A part-of-speech-aware nested named entity recognition method, characterized in that the method comprises the following steps:
    obtaining text word data of a text to be recognized, the text word data comprising a text sequence ID, a part-of-speech category, a word frequency and a word vector representation;
    performing feature extraction on the text word data by using a BiLSTM model to obtain corresponding text word deep features, and initializing each text word of the text to be recognized as a corresponding graph node according to the text word deep features;
    constructing a text heterogeneous graph of the text to be recognized according to transfer relationships between the graph nodes;
    updating the text word deep features of the graph nodes in the text heterogeneous graph through an attention mechanism according to the text heterogeneous graph and preset part-of-speech paths;
    performing feature extraction on all graph nodes in the updated text heterogeneous graph by using the BiLSTM model to obtain word vector representations to be decoded; and
    decoding and labeling the word vector representations to be decoded to obtain a nested named entity recognition result.
  2. The part-of-speech-aware nested named entity recognition method according to claim 1, wherein the step of obtaining text word data of a text to be recognized comprises:
    setting a corresponding text sequence ID for each text word according to the positional order of the text words in the text to be recognized;
    performing part-of-speech tagging on the text to be recognized, and performing part-of-speech classification and word frequency statistics on each text word in the text to be recognized according to the part-of-speech tagging result to obtain the corresponding part-of-speech category and word frequency; and
    generating the word vector representation of each text word in the text to be recognized through a BERT model.
  3. The part-of-speech-aware nested named entity recognition method according to claim 1, wherein the step of performing feature extraction on the text word data by using a BiLSTM model to obtain corresponding text word deep features comprises:
    concatenating and integrating the text sequence ID, part of speech, word frequency and word vector representation of each piece of text word data to obtain initial text word features; and
    performing feature extraction on the initial text word features by using the BiLSTM model to obtain the text word deep features, the text word deep features being expressed as:
    h(x_i) = BiLSTM(F(x_i))
    where
    F(x_i) = [id_i ; pos_i ; tf_i ; vec_i]
    in which x_i, F(x_i) and h(x_i) respectively denote the text word data, the initial text word feature and the text word deep feature of the i-th text word, [ ; ] denotes concatenation, and id_i, pos_i, tf_i and vec_i
    respectively denote the text sequence ID, part-of-speech category, word frequency and word vector representation in the i-th piece of text word data.
  4. The part-of-speech-aware nested named entity recognition method according to claim 1, wherein the text heterogeneous graph of the text to be recognized is expressed as:
    G = (V, E, Ov, Path)
    where V denotes the set of nodes formed by text words of different parts of speech, the value of each node being its text word deep feature; E denotes the set of edges formed between nodes; Ov denotes the set of part-of-speech types of the nodes; and Path denotes the preset part-of-speech paths, comprising a verb and noun path, a noun modifier and noun path, a conjunction and noun path, and a verb modifier and verb path.
  5. The part-of-speech-aware nested named entity recognition method according to claim 1, wherein the step of updating the text word deep features of the graph nodes in the text heterogeneous graph through an attention mechanism according to the text heterogeneous graph and preset part-of-speech paths comprises:
    performing depth-first traversal of the text heterogeneous graph according to each preset part-of-speech path to obtain a corresponding graph node sequence;
    performing neighbor node sampling for each graph node in each preset part-of-speech path according to the graph node sequence to obtain a corresponding neighbor node set;
    integrating, through the attention mechanism, node information of the neighbor node set of each graph node in each preset part-of-speech path to obtain a corresponding graph node representation, the graph node representation being expressed as:
    h_v^k = σ((1/k) · Σ_{m=1}^{k} Σ_{v_j ∈ N_v^{Path_i}} α_{vj}^{Path_i} · h(v_j))
    where
    α_{vj}^{Path_i} = exp(LeakyReLU(u^T · [h(v) ; h(v_j)])) / Σ_{v_{j'} ∈ N_v^{Path_i}} exp(LeakyReLU(u^T · [h(v) ; h(v_{j'})]))
    in which v denotes a graph node in the i-th preset part-of-speech path, its value being the corresponding text word deep feature; N_v^{Path_i} denotes the neighbor node set of the graph node in the i-th preset part-of-speech path Path_i; v_j denotes the j-th neighbor node of graph node v in the part-of-speech path Path_i; k denotes the number of attention heads; α_{vj}^{Path_i} denotes the weight coefficient of the j-th neighbor node of graph node v in the part-of-speech path Path_i; h_v^k denotes the graph node representation of graph node v obtained through attention computation over the k attention heads; exp(·) denotes the exponential function with base e; LeakyReLU(·) denotes the activation function; and u^T is the edge weight matrix; and
    updating the word frequency and word vector representation of the corresponding graph node in the text heterogeneous graph according to the graph node representation.
  6. The part-of-speech-aware nested named entity recognition method according to claim 5, wherein the step of performing neighbor node sampling for each graph node in each preset part-of-speech path according to the graph node sequence to obtain a corresponding neighbor node set comprises:
    obtaining the number of nodes of the preset part-of-speech path, and determining a basic sampling interval according to the number of nodes; and
    randomly obtaining, according to a preset sampling probability and a sampling stop condition, several neighbor nodes corresponding to each graph node in the preset part-of-speech path from the graph node sequence, with integer multiples of the basic sampling interval as the moving step, to obtain the corresponding neighbor node set, the sampling stop condition being that the total number of sampled neighbor nodes meets a preset number requirement and that the number of nodes corresponding to each part-of-speech category among the neighbor nodes meets a preset proportion requirement.
  7. The part-of-speech-aware nested named entity recognition method according to claim 1, wherein the step of decoding and labeling the word vector representations to be decoded to obtain a nested named entity recognition result comprises:
    decoding and labeling the word vector representations to be decoded by using a conditional random field to obtain a named entity recognition result and corresponding first labeled word vector representations;
    performing boundary detection on the first labeled word vector representations by using an improved LSTM unit to determine whether any entity boundary words are present in the first labeled word vector representations, the improved LSTM unit being obtained by adding a multilayer perceptron (MLP) on the output hidden layer of an LSTM unit;
    if entity boundary words are present in the first labeled word vector representations, merging the first labeled word vector representations between adjacent entity boundary words to obtain second labeled word vector representations, performing decoding, labeling and boundary detection on the second labeled word vector representations, and starting the next entity recognition iteration; otherwise, stopping the iteration and taking the named entity recognition result as a forward named entity recognition result;
    performing reverse filling according to the word vector representations corresponding to the forward entity recognition result to obtain third labeled word vector representations, and merging the third labeled word vector representations with the word vector representations corresponding to the previous entity recognition iteration to obtain fourth labeled word vector representations; and
    decoding and labeling the fourth labeled word vector representations by using the conditional random field to obtain the nested named entity recognition result.
  8. A part-of-speech-aware nested named entity recognition system, characterized in that the system comprises:
    a preprocessing module, configured to obtain text word data of a text to be recognized, the text word data comprising a text sequence ID, a part-of-speech category, a word frequency and a word vector representation;
    a node initialization module, configured to perform feature extraction on the text word data by using a BiLSTM model to obtain corresponding text word deep features, and to initialize each text word of the text to be recognized as a corresponding graph node according to the text word deep features;
    a graph construction module, configured to construct a text heterogeneous graph of the text to be recognized according to transfer relationships between the graph nodes;
    a node update module, configured to update the text word deep features of the graph nodes in the text heterogeneous graph through an attention mechanism according to the text heterogeneous graph and preset part-of-speech paths;
    a feature optimization module, configured to perform feature extraction on all graph nodes in the updated text heterogeneous graph by using the BiLSTM model to obtain word vector representations to be decoded; and
    a result generation module, configured to decode and label the word vector representations to be decoded to obtain a nested named entity recognition result.
  9. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
  10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
PCT/CN2022/133113 2021-12-13 2022-11-21 Part-of-speech-aware nested named entity recognition method, system, device and storage medium WO2023109436A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/520,629 US20240111956A1 (en) 2021-12-13 2023-11-28 Nested named entity recognition method based on part-of-speech awareness, device and storage medium therefor

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111518808.5A 2021-12-13 2021-12-13 Part-of-speech-aware nested named entity recognition method, system, device and storage medium
CN202111518808.5 2021-12-13

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/520,629 Continuation-In-Part US20240111956A1 (en) 2021-12-13 2023-11-28 Nested named entity recognition method based on part-of-speech awareness, device and storage medium therefor

Publications (1)

Publication Number Publication Date
WO2023109436A1 (zh)

Family

ID=81050859

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/133113 2021-12-13 2022-11-21 Part-of-speech-aware nested named entity recognition method, system, device and storage medium

Country Status (3)

Country Link
US (1) US20240111956A1 (zh)
CN (1) CN114330343B (zh)
WO (1) WO2023109436A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236333A (zh) * 2023-10-17 2023-12-15 哈尔滨工业大学(威海) Complex named entity recognition method based on threat intelligence
CN117316372A (zh) * 2023-11-30 2023-12-29 天津大学 Deep-learning-based method for parsing electronic medical records of ear diseases

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114330343B (zh) * 2021-12-13 2023-07-25 广州大学 Part-of-speech-aware nested named entity recognition method, system, device and storage medium
CN116384339B (zh) * 2023-04-04 2024-07-02 华润数字科技有限公司 Decoding method, apparatus, device and medium in the field of text generation


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108536679B (zh) * 2018-04-13 2022-05-20 Named entity recognition method, apparatus, device and computer-readable storage medium
CN109871545B (zh) * 2019-04-22 2022-08-05 京东方科技集团股份有限公司 Named entity recognition method and apparatus
CN112101031B (zh) * 2020-08-25 2022-03-18 厦门渊亭信息科技有限公司 Entity recognition method, terminal device and storage medium
CN113688631B (zh) * 2021-07-05 2023-06-09 广州大学 Nested named entity recognition method, system, computer and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180260474A1 (en) * 2017-03-13 2018-09-13 Arizona Board Of Regents On Behalf Of The University Of Arizona Methods for extracting and assessing information from literature documents
CN109902303A (zh) * 2019-03-01 2019-06-18 腾讯科技(深圳)有限公司 Entity recognition method and related device
CN110866401A (zh) * 2019-11-18 2020-03-06 山东健康医疗大数据有限公司 Attention-mechanism-based named entity recognition method and system for Chinese electronic medical records
CN111046671A (zh) * 2019-12-12 2020-04-21 中国科学院自动化研究所 Chinese named entity recognition method based on a graph network incorporating a lexicon
CN113723103A (zh) * 2021-08-26 2021-11-30 北京理工大学 Joint learning method for Chinese medical named entities and parts of speech fusing multi-source knowledge
CN114330343A (zh) * 2021-12-13 2022-04-12 广州大学 Part-of-speech-aware nested named entity recognition method, system, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FU YAO, TAN CHUANQI, CHEN MOSHA, HUANG SONGFANG, HUANG FEI: "Nested Named Entity Recognition with Partially-Observed TreeCRFs", PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, vol. 35, no. 14, 1 January 2021 (2021-01-01), pages 12839 - 12847, XP093071499, ISSN: 2159-5399, DOI: 10.1609/aaai.v35i14.17519 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236333A (zh) * 2023-10-17 2023-12-15 哈尔滨工业大学(威海) Complex named entity recognition method based on threat intelligence
CN117316372A (zh) * 2023-11-30 2023-12-29 天津大学 Deep-learning-based method for parsing electronic medical records of ear diseases
CN117316372B (zh) * 2023-11-30 2024-04-09 天津大学 Deep-learning-based method for parsing electronic medical records of ear diseases

Also Published As

Publication number Publication date
CN114330343A (zh) 2022-04-12
CN114330343B (zh) 2023-07-25
US20240111956A1 (en) 2024-04-04

Similar Documents

Publication Publication Date Title
CN112214995B (zh) Hierarchical multi-task term embedding learning for synonym prediction
Li et al. A survey on deep learning for named entity recognition
US10963794B2 (en) Concept analysis operations utilizing accelerators
Lyu et al. Long short-term memory RNN for biomedical named entity recognition
CN109241524B (zh) Semantic parsing method and apparatus, computer-readable storage medium, and electronic device
WO2023109436A1 (zh) Part-of-speech-aware nested named entity recognition method, system, device and storage medium
US9318027B2 (en) Caching natural language questions and results in a question and answer system
US9436918B2 (en) Smart selection of text spans
CN103154936B (zh) Method and system for automated text correction
CN117076653B (zh) Knowledge base question answering method enhancing in-context learning with chain-of-thought and visualization
CN110162771B (zh) Event trigger word recognition method and apparatus, and electronic device
CN113688631B (zh) Nested named entity recognition method, system, computer and storage medium
CN113392209B (zh) Artificial-intelligence-based text clustering method, related device and storage medium
US9760626B2 (en) Optimizing parsing outcomes of documents
Sarwar et al. CAG: Stylometric authorship attribution of multi-author documents using a co-authorship graph
Sun et al. Feature-frequency–adaptive on-line training for fast and accurate natural language processing
Ma et al. DC-CNN: Dual-channel Convolutional Neural Networks with attention-pooling for fake news detection
WO2023060633A1 (zh) Semantically enhanced relation extraction method and apparatus, computer device, and storage medium
Zhen et al. Frequent words and syntactic context integrated biomedical discontinuous named entity recognition method
US10169074B2 (en) Model driven optimization of annotator execution in question answering system
Zhang et al. Bg-efrl: Chinese named entity recognition method and application based on enhanced feature representation
Xin et al. Text feature-based copyright recognition method for comics
US20240303496A1 (en) Exploiting domain-specific language characteristics for language model pretraining
Zong et al. Emotion-cause pair extraction via knowledge-driven multi-classification and graph-based position embedding
Sharma et al. A Comprehensive Study on Natural Language Processing, It’s Techniques and Advancements in Nepali Language

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22906188

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE