CN114358011A - Named entity extraction method and device and electronic equipment - Google Patents

Named entity extraction method and device and electronic equipment

Info

Publication number
CN114358011A
Authority
CN
China
Prior art keywords
label
type
token
sequence
named entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210013663.1A
Other languages
Chinese (zh)
Inventor
郭延明
刘盼
雷军
魏迎梅
谢毓湘
王翔汉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210013663.1A priority Critical patent/CN114358011A/en
Publication of CN114358011A publication Critical patent/CN114358011A/en
Pending legal-status Critical Current

Abstract

The application provides a named entity extraction method and apparatus, and an electronic device. The method comprises the following steps: first, tokens of a first type that belong to a named entity and tokens of a second type that do not belong to a named entity are identified in a natural language text through a named entity recognition model, so that the natural language text is converted into a token sequence, and the token sequence is labeled in front-to-back order to obtain a target label sequence; then each named entity is extracted from the token sequence according to the target label sequence. The target label sequence labels the boundary characters before and after a named entity distinctively, which significantly improves the discrimination of named entities and makes the named entity recognition results more accurate.

Description

Named entity extraction method and device and electronic equipment
Technical Field
The present application relates to the field of data recognition technologies, and in particular, to a named entity extraction method and apparatus, and an electronic device.
Background
Named entity recognition aims to identify the boundaries and types of entities with specific meanings in natural language text, mainly including person names, place names, organization names, proper nouns, and the like. A token is the basic unit of text, usually a word in English or a character in Chinese. In machine learning, a text is regarded as a token sequence, and the named entity recognition task is converted into a sequence labeling task, i.e., each token in the sequence is tagged with a label. The labeling scheme has a large impact on the performance of named entity recognition. The labeling schemes in related named entity recognition schemes treat all non-entity labels as the same and ignore the differences between these labels.
Disclosure of Invention
In view of the above, an object of the present application is to provide a named entity extraction method and apparatus, and an electronic device.
Based on the above purpose, the application provides
A named entity extraction method is characterized by comprising the following steps:
identifying a first type of token belonging to a named entity and a second type of token not belonging to the named entity in a natural language text through a named entity identification model so as to convert the natural language text into a token sequence, and labeling the token sequence according to the sequence from front to back to obtain a target tag sequence;
extracting each named entity from the token sequence according to the target tag sequence,
wherein labeling the token sequence to obtain the target tag sequence comprises:
applying, in front-to-back order, first-type labels and second-type labels to the first-type tokens and the second-type tokens in the token sequence, respectively, to obtain an initial label sequence;
for each of the second type tags in the initial tag sequence, performing the following operations to obtain the target tag sequence:
in response to determining that a previous label of the second type of label is absent or does not belong to the first type of label and a subsequent label of the second type of label belongs to the first type of label, changing the second type of label to a first type of boundary marker;
in response to determining that a previous label of the second type of label belongs to the first type of label and a subsequent label of the second type of label is absent or does not belong to the first type of label, changing the second type of label to a second type of boundary marker.
Further, the operations further comprise:
in response to determining that both a previous label and a subsequent label of the second type of label belong to the first type of label, changing the second type of label to the second type of boundary marker.
Further, the operations further comprise:
in response to determining that both a previous label and a subsequent label of the second type of label belong to the first type of label, changing the second type of label to the first type of boundary marker.
Further, the changing the second type label to the first type boundary mark includes:
acquiring a first entity type identifier contained in a subsequent label of the second type label;
changing the second type label to said first type boundary marker comprising said first entity type identifier,
the changing the second type label into the second type boundary mark comprises:
acquiring a second entity type identifier contained in a previous label of the second type label;
the second type label is changed to the second type boundary marker containing the second entity type identifier.
Further, in response to determining that both the previous label and the next label of the second type of label belong to the first type of label, changing the second type of label to the second type of boundary marker includes:
acquiring a third entity type identifier contained in a previous label of the second type label;
changing the second type label to the second type boundary marker containing the third entity type identifier.
Further, in response to determining that both the previous label and the next label of the second type of label belong to the first type of label, changing the second type of label to the first type of boundary marker includes:
acquiring a fourth entity type identifier contained in a subsequent label of the second type label;
the second type label is changed to the first type boundary marker containing the fourth entity type identifier.
Further, the named entity recognition model comprises a pre-training language model and a label decoder;
identifying the first class of tokens and the second class of tokens in the natural language text to convert the natural language text into the token sequence comprises: converting the natural language text into token representation through the pre-training language model, and identifying the first class token and the second class token in the token representation to obtain the token sequence;
labeling the token sequence in a front-to-back order comprises: labeling, by the tag decoder, the token sequence in a front-to-back order.
Further, the tag decoder includes a conditional random field CRF model.
Based on the same concept, the application also provides a named entity extraction device, which comprises:
the labeling module is used for identifying a first type of token belonging to a named entity and a second type of token not belonging to the named entity in the natural language text through a named entity identification model so as to convert the natural language text into a token sequence, and labeling the token sequence according to the sequence from front to back to obtain a target label sequence;
an extraction module for extracting each named entity from the token sequence according to the target tag sequence,
wherein the annotation module is configured to:
applying, in front-to-back order, first-type labels and second-type labels to the first-type tokens and the second-type tokens in the token sequence, respectively, to obtain an initial label sequence;
for each of the second type tags in the initial tag sequence, performing the following operations to obtain the target tag sequence:
in response to determining that a previous label of the second type of label is absent or does not belong to the first type of label and a subsequent label of the second type of label belongs to the first type of label, changing the second type of label to a first type of boundary marker;
in response to determining that a previous label of the second type of label belongs to the first type of label and a subsequent label of the second type of label is absent or does not belong to the first type of label, changing the second type of label to a second type of boundary marker.
Based on the same concept, the present application also provides an electronic device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the method according to any one of the above.
From the above description, in the named entity extraction method and apparatus and the electronic device provided by the application, the first-type tokens belonging to named entities and the second-type tokens not belonging to named entities in the natural language text are first identified through the named entity recognition model, so that the natural language text is converted into a token sequence, and the token sequence is labeled in front-to-back order to obtain the target tag sequence. Each named entity is then extracted from the token sequence according to the target tag sequence. The target tag sequence labels the boundary characters before and after a named entity distinctively, which significantly improves the discrimination of named entities and makes the named entity recognition results more accurate.
Drawings
In order to more clearly illustrate the technical solutions in the present application or the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart illustrating a named entity extraction method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a named entity extraction apparatus according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings in combination with specific embodiments.
It should be noted that, unless otherwise defined, technical or scientific terms used in the embodiments of the present application have the ordinary meaning understood by those skilled in the art to which the present application belongs. The use of "first", "second", and similar terms in the embodiments of the present application does not denote any order, quantity, or importance; these terms merely distinguish one element from another. The word "comprising" or "comprises" means that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. Terms such as "connected" or "coupled" are not restricted to physical or mechanical connections but may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right", and the like indicate only relative positional relationships, which may change accordingly when the absolute position of the described object changes.
As described in the Background section, the labeling schemes in related named entity recognition consider only the internal labels of entities and ignore the information carried by their external labels. All related labeling schemes treat non-entity labels as the same and ignore the differences between such labels. The applicants have discovered, in the course of implementing the present application, that labels near an entity may represent the boundaries of the entity and should be treated differently from other labels.
In view of this, one or more embodiments of the present specification provide a named entity extraction scheme, and with reference to fig. 1, a named entity identification method of an embodiment of the present specification includes the following steps:
step S101, identifying a first type of token belonging to a named entity and a second type of token not belonging to the named entity in a natural language text through a named entity identification model so as to convert the natural language text into a token sequence, and labeling the token sequence according to the sequence from front to back to obtain a target label sequence;
in this step, labeling the token sequence to obtain the target tag sequence includes:
applying, in front-to-back order, first-type labels and second-type labels to the first-type tokens and the second-type tokens in the token sequence, respectively, to obtain an initial label sequence;
for each of the second type tags in the initial tag sequence, performing the following operations to obtain the target tag sequence:
in response to determining that a previous label of the second type of label is absent or does not belong to the first type of label and a subsequent label of the second type of label belongs to the first type of label, changing the second type of label to a first type of boundary marker;
in response to determining that a previous label of the second type of label belongs to the first type of label and a subsequent label of the second type of label is absent or does not belong to the first type of label, changing the second type of label to a second type of boundary marker.
In this embodiment, the natural language text includes named entities and non-named entities. Named entities are names of people, organizations, places, and all other entities identified by a name; in a broader sense they also include numbers, dates, currencies, addresses, and the like. Non-named entities are the remaining parts. A token is the basic unit of text, usually a word in English or a character in Chinese; tokens of the first type correspond to named entities and tokens of the second type correspond to non-named entities. A first-type token is first marked with a first-type label, indicating that the character or word corresponding to the token belongs to a named entity; a second-type token is first marked with a second-type label, indicating that the corresponding character or word is a non-named entity. Then the labels before and after each second-type label are analyzed, and the second-type labels corresponding to the non-named entities on the two sides of a named entity are changed into boundary markers, so as to highlight the front and rear boundaries of the named entity.
As a specific example, in the phrase "早产儿肾功能尚未成熟" ("the renal function of premature infants is not yet mature"), "肾" (kidney) is a named entity: the character "儿" immediately before "肾" is given the first-type boundary marker, and the character "功" immediately after "肾" is given the second-type boundary marker.
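Expressed procedurally, the two operations above amount to a single pass over the initial label sequence. The following is a minimal sketch, not code from the patent; the function name and the untyped marker strings "O-"/"O+" are ours (the typed variants are introduced in later embodiments):

```python
def add_boundary_marks(tags):
    """Turn an initial label sequence (entity labels plus the non-entity
    label "O") into one with boundary markers: an "O" directly before an
    entity becomes "O-", an "O" directly after an entity becomes "O+"."""
    def is_entity(tag):
        # any label other than the non-entity label "O" counts as first-type
        return tag is not None and tag != "O"

    out = list(tags)
    for i, tag in enumerate(tags):
        if tag != "O":
            continue
        prev = tags[i - 1] if i > 0 else None           # absent -> None
        nxt = tags[i + 1] if i + 1 < len(tags) else None
        if not is_entity(prev) and is_entity(nxt):
            out[i] = "O-"                               # first-type boundary marker
        elif is_entity(prev) and not is_entity(nxt):
            out[i] = "O+"                               # second-type boundary marker
    return out

# "早 产 儿 肾 功 能" with the single-character entity 肾 labeled S:
print(add_boundary_marks(["O", "O", "O", "S", "O", "O"]))
# → ['O', 'O', 'O-', 'S', 'O+', 'O']
```

Note that the pass reads only the original sequence and writes to a copy, so an already-changed neighbor never affects a later decision.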
And S102, extracting each named entity from the token sequence according to the target label sequence.
In this step, the target tag sequence corresponding to the token sequence carries both the named entity tags and the named entity boundary information; during extraction, the named entities only need to be picked out according to their corresponding tags.
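The extraction in this step can be sketched as a linear scan over the token/tag pairs. This is an illustrative helper of ours, using the S and B/I/E entity labels of the BIO+ES scheme described below; boundary markers and plain "O" are simply skipped:

```python
def extract_entities(tokens, tags):
    """Collect (entity_type, text) spans from a target tag sequence that
    uses S- for single-token entities and B-/I-/E- for multi-token
    entities; markers such as "O-bod"/"O+bod" and "O" carry no entity
    text and are ignored by the scan."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("S-"):
            entities.append((tag[2:], token))
        elif tag.startswith("B-"):
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)
        elif tag.startswith("E-"):
            current.append(token)
            entities.append((current_type, "".join(current)))
            current, current_type = [], None
    return entities

tokens = list("早产儿肾功能尚未成熟，葡萄糖肾阈较低，易出现糖尿。")
tags = ["O", "O", "O-bod", "S-bod", "O+bod", "O", "O", "O", "O", "O",
        "O-bod", "B-bod", "I-bod", "E-bod", "S-bod", "O+bod", "O", "O", "O",
        "O", "O", "O-dis", "B-dis", "E-dis", "O+dis"]
print(extract_entities(tokens, tags))
# → [('bod', '肾'), ('bod', '葡萄糖'), ('bod', '肾'), ('dis', '糖尿')]
```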
In this way, the target label sequence labels the boundary characters before and after a named entity distinctively, which significantly improves the discrimination of named entities and makes the named entity recognition results more accurate.
In some other embodiments, the operations described for the previous embodiments may further include:
in response to determining that both a previous label and a subsequent label of the second type of label belong to the first type of label, changing the second type of label to the second type of boundary marker.
In the present embodiment, there is a case where the characters or words of a non-named entity occupy only the length of one token, so that the non-named entity has only a single corresponding second-type label. If named entities lie both before and after this non-named entity, its second-type label needs to be changed into the second-type boundary marker, serving as the rear-boundary label of the preceding named entity.
As a specific example, suppose the phrase "the kidney and liver function of premature infants is not yet mature" is labeled, where "kidney" (肾) and "liver" (肝) are named entities separated by a single non-entity character. The character before "肾" is given the first-type boundary marker of "肾", the single character between "肾" and "肝" is given the second-type boundary marker of "肾", and the character after "肝" is given the second-type boundary marker of "肝".
In some other embodiments, the operations described for the previous embodiments may further include:
in response to determining that both a previous label and a subsequent label of the second type of label belong to the first type of label, changing the second type of label to the first type of boundary marker.
In the present embodiment, there is a case where the characters or words of a non-named entity occupy only the length of one token, so that the non-named entity has only a single corresponding second-type label. If named entities lie both before and after this non-named entity, its second-type label needs to be changed into the first-type boundary marker, serving as the front-boundary label of the following named entity.
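The two alternative embodiments above differ only in which neighboring entity the lone second-type label is attached to. A small sketch of the choice (the function, the flag name, and the use of typed markers from the following embodiments are ours):

```python
def mark_between_entities(prev_type, next_type, as_front_boundary=False):
    """For a single non-entity token squeezed between two named entities:
    by default produce the second-type boundary marker (rear boundary of
    the preceding entity); with as_front_boundary=True produce the
    first-type boundary marker (front boundary of the following entity)."""
    if as_front_boundary:
        return "O-" + next_type   # first-type boundary marker, e.g. "O-dis"
    return "O+" + prev_type       # second-type boundary marker, e.g. "O+bod"

print(mark_between_entities("bod", "dis"))        # O+bod
print(mark_between_entities("bod", "dis", True))  # O-dis
```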
In some other embodiments, the changing the second type label to the first type boundary mark in the foregoing embodiments includes:
acquiring a first entity type identifier contained in a subsequent label of the second type label;
changing the second type label to said first type boundary marker comprising said first entity type identifier,
the changing the second type label into the second type boundary mark comprises:
acquiring a second entity type identifier contained in a previous label of the second type label;
the second type label is changed to the second type boundary marker containing the second entity type identifier.
In this embodiment, a tag can represent two aspects of information about a character or word and comprises two parts: the first-type or second-type label, indicating whether the corresponding character or word is a named entity; and the entity type identifier, indicating which type the corresponding character or word belongs to, such as a person name, an organization name, or a place name. In this embodiment, the first-type and second-type boundary markers used to represent the boundary information of a named entity are also given the entity type identifier of the corresponding boundary, which further improves the discrimination of named entities and makes the named entity recognition results more accurate.
In some other embodiments, for the foregoing embodiment, in response to determining that both the previous tag and the next tag of the second class of tags belong to the first class of tags, changing the second class of tags to the second class of boundary markers includes:
acquiring a third entity type identifier contained in a previous label of the second type label;
changing the second type label to the second type boundary marker containing the third entity type identifier.
As a specific example, referring to Table 1, a named entity may be a body organ or a disease. The labeling scheme in this embodiment is called BIO+ES: the first-type boundary marker is set to O-, the second-type boundary marker to O+, the first-type label of a single-character named entity to S, the beginning, middle, and end symbols of a multi-character named entity to B, I, and E, the second-type label to O, the entity type identifier of body organs to bod, and the entity type identifier of diseases to dis. The sentence "早产儿肾功能尚未成熟，葡萄糖肾阈较低，易出现糖尿。" ("the renal function of premature infants is not yet mature, the glucose renal threshold is low, and sugar readily appears in the urine") is annotated as follows.
TABLE 1
token | label
早    | O
产    | O
儿    | O-bod
肾    | S-bod
功    | O+bod
能    | O
尚    | O
未    | O
成    | O
熟    | O
，    | O-bod
葡    | B-bod
萄    | I-bod
糖    | E-bod
肾    | S-bod
阈    | O+bod
较    | O
低    | O
，    | O
易    | O
出    | O
现    | O-dis
糖    | B-dis
尿    | E-dis
。    | O+dis
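The target sequence of Table 1 can be derived mechanically from a plain BIOES sequence. The sketch below is ours, following the variant in which a lone "O" between two entities becomes the second-type boundary marker, and copying the entity type identifier from the neighboring entity label:

```python
def bioes_to_bio_es(tags):
    """Rewrite a BIOES label sequence into BIO+ES: an "O" adjacent to an
    entity becomes a typed boundary marker (O- before, O+ after), taking
    the entity type identifier (e.g. "bod", "dis") from the neighbor."""
    def is_entity(tag):
        return tag is not None and tag != "O"

    def entity_type(tag):
        return tag.split("-", 1)[1]          # "S-bod" -> "bod"

    out = []
    for i, tag in enumerate(tags):
        prev = tags[i - 1] if i > 0 else None
        nxt = tags[i + 1] if i + 1 < len(tags) else None
        if tag != "O":
            out.append(tag)                          # entity labels are kept
        elif is_entity(prev):
            out.append("O+" + entity_type(prev))     # rear boundary marker
        elif is_entity(nxt):
            out.append("O-" + entity_type(nxt))      # front boundary marker
        else:
            out.append("O")
    return out

# BIOES labels of "早产儿肾功能尚未成熟，葡萄糖肾阈较低，易出现糖尿。"
bioes = ["O", "O", "O", "S-bod", "O", "O", "O", "O", "O", "O",
         "O", "B-bod", "I-bod", "E-bod", "S-bod", "O", "O", "O", "O",
         "O", "O", "O", "B-dis", "E-dis", "O"]
print(bioes_to_bio_es(bioes))
```

Running this reproduces the BIO+ES row of Table 1, with the checks for the previous and next label mirroring the in-response-to-determining clauses of the claims.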
In some other embodiments, for the foregoing embodiment, in response to determining that both the previous tag and the next tag of the second type of tag belong to the first type of tag, changing the second type of tag to the first type of boundary marker includes:
acquiring a fourth entity type identifier contained in a label behind the second type label;
the second type label is changed to the first type boundary marker containing the fourth entity type identifier.
In still other embodiments, the named entity recognition model for the method described in the previous embodiments includes a pre-trained language model and a tag decoder;
identifying the first class of tokens and the second class of tokens in the natural language text to convert the natural language text into the token sequence comprises: converting the natural language text into token representation through the pre-training language model, and identifying the first class token and the second class token in the token representation to obtain the token sequence;
labeling the token sequence in a front-to-back order comprises: labeling, by the tag decoder, the token sequence in a front-to-back order.
In the present embodiment, the character representation includes embeddings of Chinese characters and other effective representations. Chinese characters are the basic units of Chinese text; in natural language processing each Chinese character is usually treated like an English word, i.e., as one token. The function of the character representation is to map tokens into a continuous space to facilitate subsequent computation. In general, the input must be vectorized before any machine learning algorithm can be applied. One-hot encoding represents each token with a very long vector whose length equals the size of the dictionary; in the one-hot vector space any two different characters have orthogonal representations, which cannot reflect the semantic relationship between tokens. A distributed representation overcomes this disadvantage. Its basic idea is to map each token, through training, to a short vector of fixed length. All these vectors form a vector space in which each dimension represents a latent feature, and each token can be regarded as a point in that space. Distributed representations are learned automatically from text and can capture the semantic and syntactic properties of tokens. Therefore, in named entity recognition, the input characters are converted into a distributed representation for learning and training.
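The orthogonality drawback of one-hot encoding mentioned above can be seen in a few lines. The toy vocabulary and the hand-made embedding values below are ours, purely for illustration:

```python
vocab = ["肾", "糖", "尿"]

def one_hot(token):
    """One-hot vector: as long as the dictionary, a single 1.0 at the
    token's index, zeros elsewhere."""
    vec = [0.0] * len(vocab)
    vec[vocab.index(token)] = 1.0
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Any two distinct one-hot vectors are orthogonal, so their dot product
# carries no information about semantic relatedness:
print(dot(one_hot("肾"), one_hot("糖")))   # 0.0

# A distributed representation maps tokens to short dense vectors whose
# geometry can encode similarity (values invented for the example):
embedding = {"肾": [0.9, 0.1], "糖": [0.2, 0.8], "尿": [0.3, 0.7]}
sim_sugar_urine = dot(embedding["糖"], embedding["尿"])   # ≈ 0.62
sim_kidney_urine = dot(embedding["肾"], embedding["尿"])  # ≈ 0.34
```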
The pre-trained language model is also called dynamic context-dependent embedding. In this embodiment, the pre-trained language model may be ELMo, GPT, BERT, ERNIE, ALBERT, NEZHA, etc.; the character representations generated by these models are context-dependent and vary with context. For a given character, BERT uses the sum of its character position vector, sentence position vector, and character vector as input, and pre-trains a deep bidirectional representation of the input with a Masked Language Model objective, so as to obtain a robust context-dependent character representation.
In some embodiments, the pre-trained language model cannot capture the contextual correlation between the natural language text and additionally introduced information. A context encoder can further analyze the position of each character in the sentence and the dependency relationships between different characters, encode the input representation, and obtain a representation of each character in the current sentence.
In this embodiment, the context encoder may be a recurrence-based neural network model or a Transformer encoder. A recurrent neural network (RNN) is a class of neural networks that takes sequence data as input, recurses along the evolution direction of the sequence, and chains all its nodes (recurrent units) together. The recurrent neural network has a parameter-sharing recurrent structure that does not change with the length or position of the sequence data; the state of each recurrent unit at the current time step is determined by the input at that time step and the state of the previous time step. Its weight coefficients are shared, i.e., in one iteration the recurrent node processes all time steps with the same weights. Compared with feed-forward neural networks, weight sharing reduces the total number of parameters of the RNN. Weight sharing also means that an RNN can extract time-varying features in the sequence, and can therefore generalize when training and test sequences have different lengths. Due to the vanishing-gradient problem, the original RNN gives more weight to the nearest nodes and cannot learn long-distance dependencies. Gated recurrent networks were therefore created; they give the RNN the ability to control its internal information accumulation through gating units, both grasping long-range dependencies and selectively forgetting information to prevent overload during learning. The mainstream gated recurrent networks are the gated recurrent unit (GRU) and the long short-term memory network (LSTM).
Recurrence-based neural network models typically compute input and output sequences sequentially, producing the hidden state at each position from the hidden state and input of the previous step. This inherently sequential nature precludes parallelization within a training sample, which becomes critical at longer sequence lengths, since limited memory restricts the batch size of samples.
In this embodiment, context encoders can be stacked. Models such as BERT in fact already carry contextual encoding information, yet a context encoder with an LSTM structure is still usually added after them; in that case the improvement brought by the extra context encoder is not obvious. Additionally introduced information, however, carries no context-dependent information of its own, so an encoder is required to capture its context.
In other embodiments, the tag decoder may be a conditional random field CRF model.
In this embodiment, the tag decoder is the final stage of the named entity recognition model. It takes the context-dependent representation as input and generates the tag sequence corresponding to the input natural language text, i.e., the target tag sequence. There are currently two main implementations. (1) MLP + Softmax: the sequence labeling task is converted into a multi-class classification problem; after the representation of each token is obtained, a linear layer directly produces the score of each label for that token. In this structure the label of each token is predicted independently from its context-dependent representation, without regard to the surrounding tokens. (2) Conditional random field (CRF): it models the dependencies within the tag sequence; for example, in the BIOES labeling scheme B can only be followed by I or E, never by S. A conditional random field gradually learns this inter-label dependence during training, thereby avoiding some errors. In this application, a CRF is used as the decoder.
A conditional random field is a Markov random field of a random variable Y given a random variable X. The sequence labeling problem mainly uses the linear-chain conditional random field. In this case, in the conditional random field model P(Y|X), Y is the output variable representing the label sequence and X is the input variable representing the observation sequence to be labeled. During learning, a conditional probability model P'(Y|X) is obtained from the training data by maximum likelihood estimation or regularized maximum likelihood estimation; during prediction, for a given input sequence x, the output y with the maximum conditional probability P'(y|x) is found. The probability computation problem of conditional random fields is, given a conditional random field P(Y|X), an input sequence x, and an output sequence y, to compute the conditional probabilities P(Y_i = y_i | x) and P(Y_{i-1} = y_{i-1}, Y_i = y_i | x) and the corresponding mathematical expectations. The prediction problem of conditional random fields is, given the conditional random field P(Y|X) and the input sequence (observation sequence) x, to solve for the output sequence (label sequence) y with the highest conditional probability, i.e., to label the observation sequence. The prediction algorithm of conditional random fields is the well-known Viterbi algorithm, which uses dynamic programming to efficiently solve for an optimal path, namely the target label sequence with the maximum probability.
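As a compact illustration of the prediction problem, the Viterbi recursion over log-space scores can be sketched as follows. The emission and transition scores are made-up toy values of ours, not learned CRF parameters:

```python
NEG_INF = float("-inf")

def viterbi(emissions, transitions, tags):
    """Dynamic-programming search for the highest-scoring tag path.
    emissions: one dict (tag -> score) per token; transitions: dict
    mapping (previous_tag, current_tag) -> score, missing = forbidden."""
    # initialisation: best path of length 1 ending in each tag
    best = {t: (emissions[0].get(t, NEG_INF), [t]) for t in tags}
    for emission in emissions[1:]:
        new_best = {}
        for cur in tags:
            # extend every previous best path by `cur`, keep the maximum
            score, path = max(
                (best[prev][0]
                 + transitions.get((prev, cur), NEG_INF)
                 + emission.get(cur, NEG_INF),
                 best[prev][1] + [cur])
                for prev in tags)
            new_best[cur] = (score, path)
        best = new_best
    return max(best.values())[1]   # path with the maximum total score

tags = ["B", "I", "O"]
transitions = {("B", "I"): 0.0, ("I", "O"): 0.0,
               ("O", "O"): 0.0, ("O", "B"): 0.0}   # e.g. B -> O is forbidden
emissions = [{"B": 1.0, "I": -2.0, "O": 0.5},
             {"B": -2.0, "I": 1.0, "O": 0.8}]
print(viterbi(emissions, transitions, tags))  # ['B', 'I']  (2.0 beats "O O" at 1.3)
```

A CRF layer in a real tagger works the same way at decode time, only with learned scores and log-sum-exp training on top.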
It can be seen that in the present embodiment a distributed representation of the token sequence is obtained through the pre-trained language model; the context correlation between characters is then captured by the context encoder, using a recurrent neural network, a Transformer model, or another network; and the tag decoder further predicts the label of each token using the encoded context information to obtain the target label sequence.
A specific application scenario of the embodiment of the present application is given below. Table 2 shows the differences between the BIO+ES labeling scheme of the present application and other related labeling schemes, again using the sentence "早产儿肾功能尚未成熟，葡萄糖肾阈较低，易出现糖尿。" as an example.
TABLE 2
token  IO      IOB     BIO     IOE1    IOE2    BIOE1   BIOE2   BIOS    BIOES   BIO+ES
早     O       O       O       O       O       O       O       O       O       O
产     O       O       O       O       O       O       O       O       O       O
儿     O       O       O       O       O       O       O       O       O       O-bod
肾     I-bod   I-bod   B-bod   I-bod   E-bod   B-bod   E-bod   S-bod   S-bod   S-bod
功     O       O       O       O       O       O       O       O       O       O+bod
能     O       O       O       O       O       O       O       O       O       O
尚     O       O       O       O       O       O       O       O       O       O
不     O       O       O       O       O       O       O       O       O       O
成     O       O       O       O       O       O       O       O       O       O
熟     O       O       O       O       O       O       O       O       O       O
，     O       O       O       O       O       O       O       O       O       O-bod
葡     I-bod   I-bod   B-bod   I-bod   I-bod   B-bod   B-bod   B-bod   B-bod   B-bod
萄     I-bod   I-bod   I-bod   I-bod   I-bod   I-bod   I-bod   I-bod   I-bod   I-bod
糖     I-bod   I-bod   I-bod   E-bod   E-bod   E-bod   E-bod   I-bod   E-bod   E-bod
肾     I-bod   B-bod   B-bod   I-bod   E-bod   B-bod   E-bod   S-bod   S-bod   S-bod
阈     O       O       O       O       O       O       O       O       O       O+bod
较     O       O       O       O       O       O       O       O       O       O
低     O       O       O       O       O       O       O       O       O       O
，     O       O       O       O       O       O       O       O       O       O
易     O       O       O       O       O       O       O       O       O       O
出     O       O       O       O       O       O       O       O       O       O
现     O       O       O       O       O       O       O       O       O       O-dis
糖     I-dis   I-dis   B-dis   I-dis   I-dis   B-dis   B-dis   B-dis   B-dis   B-dis
尿     I-dis   I-dis   I-dis   I-dis   E-dis   E-dis   E-dis   I-dis   E-dis   E-dis
。     O       O       O       O       O       O       O       O       O       O+dis
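For illustration, the target label sequence in the last column of Table 2 can be decoded back into named entities. The following is a minimal, assumed sketch of such a decoding step (the function name and handling details are the author's assumptions, not the patent's reference implementation):

```python
def extract_entities(tokens, tags):
    """Decode (entity_text, entity_type) pairs from a token sequence and its
    target label sequence in the BIO+ES scheme. O, O-<type> and O+<type>
    labels all lie outside entities, so only the first letter of a tag
    determines whether the token belongs to an entity."""
    entities, buf, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        kind = tag[0]
        if kind == "S":                      # single-token entity
            entities.append((tok, tag[2:]))
            buf, etype = [], None
        elif kind == "B":                    # entity begins
            buf, etype = [tok], tag[2:]
        elif kind == "I" and buf:            # entity continues
            buf.append(tok)
        elif kind == "E" and buf:            # entity ends
            buf.append(tok)
            entities.append(("".join(buf), etype))
            buf, etype = [], None
        else:                                # O / O-<type> / O+<type>
            buf, etype = [], None
    return entities
```

On the tail of the Table 2 example, `extract_entities(["现", "糖", "尿", "。"], ["O-dis", "B-dis", "E-dis", "O+dis"])` recovers the single disease entity "糖尿".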
In this embodiment, three benchmark Chinese named entity recognition datasets were analyzed: MSRA, WEIBO and RESUME. Table 3 shows, for the target texts in these datasets, the recognition effect of the named entity recognition method using BIO+ES and of the method using BIOES, with BERT and ERNIE as the two pre-trained language models and CRF as the tag decoder. The recognition effect is reported as the F1 score converted to a percentage. As a multi-class task, the sequence labeling task into which named entity recognition is transformed generally adopts the macro-averaged and micro-averaged F1 scores of exactly matched spans as evaluation indexes. The F score is an index used in statistics to measure the accuracy of a classification model. For an unbalanced sample distribution, accuracy alone is not sufficient to measure the quality of a model; the F score therefore combines the precision and recall of the classification model into their weighted harmonic mean. The F score lies between 0 and 1, with higher scores indicating better performance.
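The micro-averaged F1 score described above can be computed from aggregate counts of exactly matched entity spans; the following is an illustrative sketch (the function name and counting convention are assumptions):

```python
def f1_score(tp, fp, fn):
    """Micro-averaged F1 from aggregate true-positive, false-positive and
    false-negative counts of exactly matched entity spans. F1 is the
    harmonic mean of precision and recall and lies between 0 and 1."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With tp=8, fp=2, fn=2, both precision and recall are 0.8, giving F1 = 0.8; multiplied by 100, this matches the percentage reporting used in Table 3.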
TABLE 3
                     MSRA    WEIBO   RESUME
BERT+CRF (BIOES)     95.72   72.28   95.78
BERT+CRF (BIO+ES)    95.93   72.46   95.99
ERNIE+CRF (BIOES)    95.93   71.83   96.77
ERNIE+CRF (BIO+ES)   96.01   72.97   96.92
As can be seen from the table, the named entity extraction method of the embodiment of the present application achieves a higher F score than the related methods, so the BIO+ES labeling scheme proposed in the embodiment of the present application performs better than the other schemes in most cases. This indicates that the named entity extraction method using the BIO+ES annotation scheme of the present application is the more expressive scheme.
It should be noted that the method of the embodiments of the present application may be executed by a single device, such as a computer or a server. The method of the embodiments may also be applied in a distributed scenario and completed by multiple devices cooperating with one another. In such a distributed scenario, each of the multiple devices may perform only one or more steps of the method, and the devices interact with one another to complete the method.
It should be noted that the above describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same inventive concept, corresponding to the method of any embodiment, the application also provides a named entity extraction device.
Referring to fig. 2, the named entity extracting apparatus includes:
the labeling module 201 is configured to identify a first class token belonging to a named entity and a second class token not belonging to the named entity in a natural language text through a named entity identification model, convert the natural language text into a token sequence, and label the token sequence in a front-to-back order to obtain a target label sequence;
an extracting module 202, configured to extract each named entity from the token sequence according to the target tag sequence,
wherein the annotation module is configured to:
assigning, in a front-to-back order, a first type label to each first type token and a second type label to each second type token in the token sequence, respectively, to obtain an initial label sequence;
for each of the second type tags in the initial tag sequence, performing the following operations to obtain the target tag sequence:
in response to determining that a previous label of the second type of label is absent or does not belong to the first type of label and a subsequent label of the second type of label belongs to the first type of label, changing the second type of label to a first type of boundary marker;
in response to determining that a previous label of the second type of label belongs to the first type of label and a subsequent label of the second type of label is absent or does not belong to the first type of label, changing the second type of label to a second type of boundary marker.
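The two relabeling operations described above, together with the additional case of claim 2 in which an O label sits between two entities, can be sketched as follows. This is an illustrative reading of the scheme, not the patent's reference implementation; the function name and the preference for the claim 2 variant are assumptions:

```python
def to_bio_plus_es(labels):
    """Turn an initial label sequence (O plus B/I/E/S-<type> tags) into the
    BIO+ES target label sequence: an O directly before an entity becomes the
    first-type boundary marker O-<type>, and an O directly after an entity
    becomes the second-type boundary marker O+<type>. For an O between two
    entities, the claim 2 variant (second-type marker) is applied."""
    def entity_type(tag):
        # Entity type identifier of a non-O tag such as "B-bod"; None for O.
        return None if tag == "O" else tag.split("-", 1)[1]

    out = list(labels)
    for i, tag in enumerate(labels):
        if tag != "O":
            continue
        prev_t = entity_type(labels[i - 1]) if i > 0 else None
        next_t = entity_type(labels[i + 1]) if i + 1 < len(labels) else None
        if prev_t is None and next_t is not None:
            out[i] = "O-" + next_t   # boundary character before an entity
        elif prev_t is not None:
            out[i] = "O+" + prev_t   # boundary character after an entity
    return out
```

For example, the initial sequence `["O", "B-dis", "E-dis", "O"]` becomes `["O-dis", "B-dis", "E-dis", "O+dis"]`, matching the discriminative boundary labeling of the BIO+ES column in Table 2.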
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functionality of the various modules may be implemented in the same one or more software and/or hardware implementations as the present application.
The apparatus in the foregoing embodiment is used to implement the corresponding named entity identification method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above embodiments, the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, and when the processor executes the program, the named entity extraction method according to any of the above embodiments is implemented.
Fig. 3 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present specification.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the foregoing embodiment is used to implement the corresponding named entity extraction method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above-described embodiment methods, the present application further provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the named entity extraction method according to any of the above embodiments.
Computer-readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
The computer instructions stored in the storage medium of the foregoing embodiment are used to enable the computer to execute the named entity extraction method according to any embodiment, and have the beneficial effects of the corresponding method embodiment, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the context of the present application, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that the embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, the discussed embodiments may be used with other memory architectures (e.g., dynamic RAM (DRAM)).
The present embodiments are intended to embrace all such alternatives, modifications and variations as fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made within the spirit and principles of the embodiments of the present application are intended to be included within the scope of the present application.

Claims (10)

1. A named entity extraction method is characterized by comprising the following steps:
identifying a first type of token belonging to a named entity and a second type of token not belonging to the named entity in a natural language text through a named entity identification model so as to convert the natural language text into a token sequence, and labeling the token sequence according to the sequence from front to back to obtain a target tag sequence;
extracting each named entity from the token sequence according to the target tag sequence,
wherein labeling the token sequence to obtain the target tag sequence comprises:
assigning, in a front-to-back order, a first type label to each first type token and a second type label to each second type token in the token sequence, respectively, to obtain an initial label sequence;
for each of the second type tags in the initial tag sequence, performing the following operations to obtain the target tag sequence:
in response to determining that a previous label of the second type of label is absent or does not belong to the first type of label and a subsequent label of the second type of label belongs to the first type of label, changing the second type of label to a first type of boundary marker;
in response to determining that a previous label of the second type of label belongs to the first type of label and a subsequent label of the second type of label is absent or does not belong to the first type of label, changing the second type of label to a second type of boundary marker.
2. The method of claim 1, wherein the operations further comprise:
in response to determining that both a previous label and a subsequent label of the second type of label belong to the first type of label, changing the second type of label to the second type of boundary marker.
3. The method of claim 1, wherein the operations further comprise:
in response to determining that both a previous label and a subsequent label of the second type of label belong to the first type of label, changing the second type of label to the first type of boundary marker.
4. The method of claim 1,
the changing the second type label into the first type boundary mark comprises:
acquiring a first entity type identifier contained in a subsequent label of the second type label;
changing the second type label to said first type boundary marker comprising said first entity type identifier,
the changing the second type label into the second type boundary mark comprises:
acquiring a second entity type identifier contained in a previous label of the second type label;
the second type label is changed to the second type boundary marker containing the second entity type identifier.
5. The method of claim 2, wherein modifying the second type label to the second type boundary marker in response to determining that both a previous label and a subsequent label of the second type label belong to the first type label comprises:
acquiring a third entity type identifier contained in a previous label of the second type label;
changing the second type label to the second type boundary marker containing the third entity type identifier.
6. The method of claim 3, wherein modifying the second type label to the first type boundary marker in response to determining that both a previous label and a subsequent label of the second type label belong to the first type label comprises:
acquiring a fourth entity type identifier contained in a label behind the second type label;
the second type label is changed to the first type boundary marker containing the fourth entity type identifier.
7. The method according to any one of claims 1 to 6,
the named entity recognition model comprises a pre-training language model and a label decoder;
identifying the first class of tokens and the second class of tokens in the natural language text to convert the natural language text into the token sequence comprises: converting the natural language text into token representation through the pre-training language model, and identifying the first class token and the second class token in the token representation to obtain the token sequence;
labeling the token sequence in a front-to-back order comprises: labeling, by the tag decoder, the token sequence in a front-to-back order.
8. The method of claim 7, wherein the tag decoder comprises a conditional random field CRF model.
9. A named entity extraction apparatus, comprising:
the labeling module is used for identifying a first type of token belonging to a named entity and a second type of token not belonging to the named entity in the natural language text through a named entity identification model so as to convert the natural language text into a token sequence, and labeling the token sequence according to the sequence from front to back to obtain a target label sequence;
an extraction module for extracting each named entity from the token sequence according to the target tag sequence,
wherein the annotation module is configured to:
assigning, in a front-to-back order, a first type label to each first type token and a second type label to each second type token in the token sequence, respectively, to obtain an initial label sequence;
for each of the second type tags in the initial tag sequence, performing the following operations to obtain the target tag sequence:
in response to determining that a previous label of the second type of label is absent or does not belong to the first type of label and a subsequent label of the second type of label belongs to the first type of label, changing the second type of label to a first type of boundary marker;
in response to determining that a previous label of the second type of label belongs to the first type of label and a subsequent label of the second type of label is absent or does not belong to the first type of label, changing the second type of label to a second type of boundary marker.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable by the processor, characterized in that the processor implements the method according to any of claims 1 to 8 when executing the computer program.
CN202210013663.1A 2022-01-06 2022-01-06 Named entity extraction method and device and electronic equipment Pending CN114358011A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210013663.1A CN114358011A (en) 2022-01-06 2022-01-06 Named entity extraction method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210013663.1A CN114358011A (en) 2022-01-06 2022-01-06 Named entity extraction method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN114358011A true CN114358011A (en) 2022-04-15

Family

ID=81106847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210013663.1A Pending CN114358011A (en) 2022-01-06 2022-01-06 Named entity extraction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114358011A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024044014A1 (en) * 2022-08-21 2024-02-29 Nec Laboratories America, Inc. Concept-conditioned and pretrained language models based on time series to free-form text description generation


Similar Documents

Publication Publication Date Title
CN107085581B (en) Short text classification method and device
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
CN109978060B (en) Training method and device of natural language element extraction model
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN110363049B (en) Method and device for detecting, identifying and determining categories of graphic elements
CN112632225B (en) Semantic searching method and device based on case and event knowledge graph and electronic equipment
CN111858843B (en) Text classification method and device
CN108664512B (en) Text object classification method and device
CN110941951B (en) Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment
CN111475622A (en) Text classification method, device, terminal and storage medium
CN112632226B (en) Semantic search method and device based on legal knowledge graph and electronic equipment
WO2012158572A2 (en) Exploiting query click logs for domain detection in spoken language understanding
CN112667782A (en) Text classification method, device, equipment and storage medium
CN110399547B (en) Method, apparatus, device and storage medium for updating model parameters
Cholewa et al. Estimation of the number of states for gesture recognition with Hidden Markov Models based on the number of critical points in time sequence
CN112182217A (en) Method, device, equipment and storage medium for identifying multi-label text categories
CN116304307A (en) Graph-text cross-modal retrieval network training method, application method and electronic equipment
CN113435531B (en) Zero sample image classification method and system, electronic equipment and storage medium
CN114358011A (en) Named entity extraction method and device and electronic equipment
CN113723077A (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN116109907B (en) Target detection method, target detection device, electronic equipment and storage medium
CN110889290B (en) Text encoding method and apparatus, text encoding validity checking method and apparatus
CN110929499B (en) Text similarity obtaining method, device, medium and electronic equipment
CN107533672A (en) Pattern recognition device, mode identification method and program
CN109657710B (en) Data screening method and device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination