CN113901823A - Named entity identification method, device, storage medium and terminal equipment - Google Patents

Named entity identification method, device, storage medium and terminal equipment

Info

Publication number
CN113901823A
Authority
CN
China
Prior art keywords
text data
named entity
labeled
entity recognition
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111233302.XA
Other languages
Chinese (zh)
Inventor
司世景
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202111233302.XA priority Critical patent/CN113901823A/en
Publication of CN113901823A publication Critical patent/CN113901823A/en
Priority to PCT/CN2022/089993 priority patent/WO2023065635A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of artificial intelligence, and particularly relates to a named entity recognition method, a named entity recognition device, a computer-readable storage medium and a terminal device. The method comprises the following steps: pre-training an encoder in a twin network, using the pre-trained encoder as the encoder of a named entity recognition model, and training the model with labeled text data to obtain a label-trained model; predicting unlabeled text data with the label-trained model to obtain first-class text data (labeled by the model) and second-class text data (to be labeled manually); acquiring the manually labeled second-class text data, and taking the model-labeled first-class text data and the manually labeled second-class text data as newly added labeled text data; adjusting the label-trained model with the newly added labeled text data to obtain an adjusted model; and acquiring target text data to be recognized and processing the target text data with the adjusted model to obtain the entity category of each named entity in the target text data.

Description

Named entity identification method, device, storage medium and terminal equipment
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a named entity identification method, a named entity identification device, a computer readable storage medium and terminal equipment.
Background
Named Entity Recognition (NER), also known as entity identification, entity chunking or entity extraction, is a subtask of information extraction that aims to locate and classify named entities in unstructured text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values and percentages. Existing methods generally adopt an IOB tagging scheme to transform the task into a sequence labeling problem, with the combination of a bidirectional long short-term memory network and a conditional random field being the typical model; these methods have achieved great success by benefiting from large amounts of correctly and manually labeled data. However, in a practical named entity recognition scenario, manually labeling even thousands or tens of thousands of training samples consumes a huge amount of time and money, and obtaining higher recognition accuracy requires even more labeled data, which causes even larger cost.
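For illustration, the sketch below shows how a sentence might be represented under an IOB-style tagging scheme; the sentence, tag set and labels are hypothetical examples chosen for this description, not data from the patent.

```python
# Hypothetical IOB (Inside-Outside-Beginning) labels for one sentence.
# "B-X" marks the first token of an entity of type X, "I-X" a continuation,
# and "O" marks tokens that are not part of any named entity.
tokens = ["Xiaoming", "goes", "to", "school", "at", "8", "a.m."]
tags   = ["B-PER",    "O",    "O",  "B-LOC",  "O",  "B-TIME", "I-TIME"]

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```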
Disclosure of Invention
In view of this, embodiments of the present invention provide a named entity identification method, an apparatus, a computer-readable storage medium, and a terminal device, so as to solve the problem that existing named entity recognition methods consume a relatively large amount of time and money.
A first aspect of an embodiment of the present invention provides a method for identifying a named entity, which may include:
pre-training an encoder in a preset twin network by using text data in a preset text data set to obtain a pre-trained encoder;
taking the pre-trained encoder as an encoder of a preset named entity recognition model, and training the named entity recognition model by using the labeled text data in the text data set to obtain a named entity recognition model after label training;
predicting the non-labeled text data in the text data set by using the labeled trained named entity recognition model to obtain first type text data labeled by the model and second type text data to be labeled manually;
acquiring the manually labeled second type text data, and taking the model-labeled first type text data and the manually labeled second type text data as newly added labeled text data;
adjusting the named entity recognition model after the tagging training by using the newly added tagging text data to obtain an adjusted named entity recognition model;
and acquiring target text data to be recognized, and processing the target text data by using the adjusted named entity recognition model to obtain the entity category of each named entity in the target text data.
In a specific implementation manner of the first aspect, the pre-training an encoder in a preset twin network by using text data in a preset text data set to obtain a pre-trained encoder may include:
performing data enhancement on the text data in the text data set to obtain a preset number of enhanced text data pairs; any one of the enhanced text data pairs comprises two different enhanced text data obtained by performing data enhancement on the same text data;
processing the enhanced text data pair by using the twin network to respectively obtain a first feature vector and a second feature vector;
calculating a first loss function according to the first eigenvector and the second eigenvector;
and pre-training the encoder in the twin network by taking the minimized first loss function as a target to obtain a pre-trained encoder.
In a specific implementation manner of the first aspect, the calculating a first loss function according to the first feature vector and the second feature vector may include:
calculating the first loss function according to:
first loss function = -(p_1 · z_2) / (||p_1||_2 · ||z_2||_2)
wherein p_1 is the first feature vector, z_2 is the second feature vector, ||p_1||_2 is the modulus of the first feature vector, and ||z_2||_2 is the modulus of the second feature vector.
In a specific implementation manner of the first aspect, the training the named entity recognition model by using labeled text data in the text dataset to obtain a named entity recognition model after label training may include:
encoding the labeled text data in the text data set by using an encoder of the named entity recognition model to obtain an encoded feature vector;
processing the coded feature vector by using a multilayer perceptron of the named entity recognition model to obtain the probability distribution of entity categories;
calculating a second loss function according to the probability distribution;
and training the named entity recognition model by taking the minimized second loss function as a target to obtain the named entity recognition model after label training.
In a specific implementation manner of the first aspect, the processing, by using the multi-layer perceptron of the named entity recognition model, the encoded feature vector to obtain a probability distribution of an entity class may include:
the probability distribution of the entity classes is calculated according to the following formula:
p_i = Softmax(U tanh(V h_i))
wherein i is the serial number of the labeled text data in the text data set, h_i is the encoded feature vector corresponding to the i-th labeled text data in the text data set, U and V are both preset model parameters, Softmax is a preset activation function, and p_i is the probability distribution of the entity classes corresponding to the i-th labeled text data in the text data set.
In a specific implementation manner of the first aspect, the calculating a second loss function according to the probability distribution may include:
calculating the second loss function according to:
loss = -Σ_{i=1}^{n} z_i · log(p_i)
wherein n is the number of labeled text data in the text data set, z_i is the entity class label corresponding to the i-th labeled text data in the text data set, and loss is the second loss function.
In a specific implementation manner of the first aspect, the predicting, by using the named entity recognition model after the label training, non-labeled text data in the text data set to obtain first type text data after the model label and second type text data to be manually labeled may include:
predicting the non-labeled text data in the text data set by using the labeled trained named entity recognition model, and calculating the confidence of the prediction result;
using the non-labeled text data with the confidence coefficient of the prediction result being greater than or equal to a preset confidence coefficient threshold value as the first type of text data;
and taking the non-labeled text data with the confidence coefficient of the prediction result smaller than the confidence coefficient threshold value as the second type of text data.
A second aspect of an embodiment of the present invention provides a named entity identifying device, which may include:
the encoder pre-training module is used for pre-training an encoder in a preset twin network by using the text data in the preset text data set to obtain a pre-trained encoder;
the model training module is used for taking the pre-trained encoder as a preset encoder of the named entity recognition model, and training the named entity recognition model by using the labeled text data in the text data set to obtain a labeled and trained named entity recognition model;
the model prediction module is used for predicting the non-labeled text data in the text data set by using the labeled trained named entity recognition model to obtain first type text data labeled by the model and second type text data to be labeled manually;
the newly added labeled text data module is used for acquiring manually labeled second-type text data and taking the model labeled first-type text data and the manually labeled second-type text data as newly added labeled text data;
the model adjusting module is used for adjusting the named entity recognition model after the tagging training by using the newly added tagging text data to obtain an adjusted named entity recognition model;
and the named entity recognition module is used for acquiring target text data to be recognized and processing the target text data by using the adjusted named entity recognition model to obtain the entity category of each named entity in the target text data.
In a specific implementation manner of the second aspect, the encoder pre-training module may include:
the data enhancement unit is used for performing data enhancement on the text data in the text data set to obtain a preset number of enhanced text data pairs; any one of the enhanced text data pairs comprises two different enhanced text data obtained by performing data enhancement on the same text data;
the twin network processing unit is used for processing the enhanced text data pair by using the twin network to respectively obtain a first feature vector and a second feature vector;
a first loss function calculation unit configured to calculate a first loss function from the first eigenvector and the second eigenvector;
and the pre-training unit is used for pre-training the encoder in the twin network by taking the minimized first loss function as a target to obtain the pre-trained encoder.
In a specific implementation manner of the second aspect, the first loss function calculating unit may be specifically configured to calculate the first loss function according to the following formula:
first loss function = -(p_1 · z_2) / (||p_1||_2 · ||z_2||_2)
wherein p_1 is the first feature vector, z_2 is the second feature vector, ||p_1||_2 is the modulus of the first feature vector, and ||z_2||_2 is the modulus of the second feature vector.
In a specific implementation manner of the second aspect, the model training module may include:
the coding unit is used for coding the labeled text data in the text data set by using the coder of the named entity recognition model to obtain coded feature vectors;
the multilayer perception unit is used for processing the coded feature vectors by using a multilayer perceptron of the named entity recognition model to obtain the probability distribution of entity categories;
a second loss function calculation unit for calculating a second loss function according to the probability distribution;
and the model training unit is used for training the named entity recognition model by taking the minimized second loss function as a target to obtain the named entity recognition model after label training.
In a specific implementation of the second aspect, the multi-layered sensing unit may be specifically configured to calculate the probability distribution of the entity class according to the following formula:
p_i = Softmax(U tanh(V h_i))
wherein i is the serial number of the labeled text data in the text data set, h_i is the encoded feature vector corresponding to the i-th labeled text data in the text data set, U and V are both preset model parameters, Softmax is a preset activation function, and p_i is the probability distribution of the entity classes corresponding to the i-th labeled text data in the text data set.
In a specific implementation manner of the second aspect, the second loss function calculating unit may be specifically configured to calculate the second loss function according to the following formula:
loss = -Σ_{i=1}^{n} z_i · log(p_i)
wherein n is the number of labeled text data in the text data set, z_i is the entity class label corresponding to the i-th labeled text data in the text data set, and loss is the second loss function.
In a specific implementation manner of the second aspect, the model prediction module may include:
the prediction unit is used for predicting the non-labeled text data in the text data set by using the labeled trained named entity recognition model and calculating the confidence coefficient of a prediction result;
the first-class text data determining unit is used for taking the non-labeled text data of which the confidence coefficient of the prediction result is greater than or equal to a preset confidence coefficient threshold value as the first-class text data;
and the second-class text data determining unit is used for taking the non-labeled text data of which the confidence coefficient of the prediction result is smaller than the confidence coefficient threshold value as the second-class text data.
A third aspect of embodiments of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any one of the named entity identifying methods described above.
A fourth aspect of the embodiments of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements any of the above steps of the named entity identifying method when executing the computer program.
A fifth aspect of embodiments of the present invention provides a computer program product, which, when running on a terminal device, causes the terminal device to perform any of the above-mentioned steps of the named entity recognition method.
Compared with the prior art, the embodiments of the invention have the following beneficial effects: an encoder in a preset twin network is pre-trained using the text data in a preset text data set to obtain a pre-trained encoder; the pre-trained encoder is used as the encoder of a preset named entity recognition model, and the named entity recognition model is trained using the labeled text data in the text data set to obtain a label-trained named entity recognition model; the unlabeled text data in the text data set is predicted using the label-trained named entity recognition model to obtain first type text data labeled by the model and second type text data to be labeled manually; the manually labeled second type text data is acquired, and the model-labeled first type text data and the manually labeled second type text data are taken as newly added labeled text data; the label-trained named entity recognition model is adjusted using the newly added labeled text data to obtain an adjusted named entity recognition model; and target text data to be recognized is acquired and processed using the adjusted named entity recognition model to obtain the entity category of each named entity in the target text data. The embodiments of the invention fuse active learning and contrastive learning: the encoder is first pre-trained with a contrastive learning method and then used to train the named entity recognition model, and in the subsequent active learning process the model continuously performs iterative training and optimization of itself through the feedback it receives. Only a small amount of manual labeling is needed in the whole process, which effectively reduces the consumption of time and money.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a flowchart of an embodiment of a named entity recognition method according to the present invention;
FIG. 2 is a schematic flow diagram of pre-training encoders in a pre-defined twin network using text data in a pre-defined set of text data;
FIG. 3 is a schematic flow diagram of training a named entity recognition model using tagged text data in a text dataset;
FIG. 4 is a schematic diagram of a named entity recognition model;
FIG. 5 is a block diagram of an embodiment of a named entity recognition apparatus according to the present invention;
FIG. 6 is a schematic block diagram of a terminal device in an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiments of the invention can acquire and process the related data based on artificial intelligence technology. Artificial Intelligence (AI) refers to the theories, methods, techniques and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The execution subject of the embodiment of the present invention may be an artificial intelligence-based terminal device, and is used to execute the named entity identification method in the embodiment of the present invention.
Referring to fig. 1, an embodiment of a method for identifying a named entity according to an embodiment of the present invention may include:
and S101, pre-training an encoder in a preset twin network by using text data in a preset text data set to obtain the pre-trained encoder.
As shown in fig. 2, step S101 may specifically include the following processes:
step S1011, performing data enhancement on the text data in the text data set to obtain a preset number of enhanced text data pairs.
Wherein, the text data set can include marked text data and non-marked text data. Any one of the enhanced text data pairs comprises two different enhanced text data obtained by performing data enhancement on the same text data.
In the embodiment of the invention, a series of self-similar text data pairs can be obtained as positive examples by performing data enhancement on the same sample. Specifically, what kind of data enhancement method is adopted can be set according to actual situations, for example, in one implementation, a dropout method can be used for data enhancement, that is, a dropout mask (dropout mask) is randomly sampled to perform a dropout operation on text data, and two consecutive dropout operations are performed on the same text data, so that two different enhanced text data can be obtained respectively, and thus an enhanced text data pair is formed.
It should be noted that the above is only an example and is not a limitation on the data enhancement method, and in practical applications, other data enhancement methods in the prior art may be adopted according to specific situations, and this is not specifically limited in the embodiment of the present invention.
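As an illustration of the dropout-based augmentation described above, the following is a minimal PyTorch-style sketch; the dropout rate and the use of token embeddings as input are assumptions made for this example rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class DropoutAugmenter(nn.Module):
    """Produces two different enhanced views of the same text representation
    by sampling two independent dropout masks."""

    def __init__(self, dropout_rate: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, text_embedding: torch.Tensor):
        # Two passes through the same dropout layer sample two different
        # masks, yielding an enhanced text data pair from one sample.
        view_1 = self.dropout(text_embedding)
        view_2 = self.dropout(text_embedding)
        return view_1, view_2

# Usage: augment a hypothetical batch of token embeddings (batch, seq, hidden).
augmenter = DropoutAugmenter()
augmenter.train()  # dropout is only active in training mode
view_1, view_2 = augmenter(torch.randn(4, 16, 768))
```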
Step S1012, processing the enhanced text data pair using the twin network to obtain a first feature vector and a second feature vector, respectively.
The twin network may be a SimSiam network. The SimSiam network includes two processing branches (denoted branch 1 and branch 2) that respectively process the two enhanced text data in the enhanced text data pair (denoted text data 1 and text data 2): the encoder in branch 1 encodes text data 1 to obtain the first feature vector; the encoder in branch 2 encodes text data 2, and the encoding result is subjected to a nonlinear transformation by a predictor to obtain the second feature vector. It should be noted that the encoder in branch 1 and the encoder in branch 2 share the same parameters and can be considered the same encoder.
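A minimal sketch of this two-branch processing follows; the encoder is left abstract, and the predictor layer sizes are illustrative assumptions. Stop-gradient and other details of the original SimSiam recipe are omitted here.

```python
import torch
import torch.nn as nn

class TwinBranches(nn.Module):
    """Two-branch twin network: a shared encoder plus a predictor head.
    Branch 1 encodes view 1 into the first feature vector p1; branch 2
    encodes view 2 and passes the result through the predictor to obtain
    the second feature vector z2."""

    def __init__(self, encoder: nn.Module, feat_dim: int = 768, pred_dim: int = 256):
        super().__init__()
        self.encoder = encoder                # shared by both branches
        self.predictor = nn.Sequential(       # small MLP predictor (assumed sizes)
            nn.Linear(feat_dim, pred_dim),
            nn.ReLU(inplace=True),
            nn.Linear(pred_dim, feat_dim),
        )

    def forward(self, view_1: torch.Tensor, view_2: torch.Tensor):
        p1 = self.encoder(view_1)                   # first feature vector
        z2 = self.predictor(self.encoder(view_2))   # second feature vector
        return p1, z2
```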
Step S1013, a first loss function is calculated according to the first feature vector and the second feature vector.
In particular, the first loss function may be calculated according to:
first loss function = -(p_1 · z_2) / (||p_1||_2 · ||z_2||_2)
wherein p_1 is the first feature vector, z_2 is the second feature vector, ||p_1||_2 is the modulus of the first feature vector, and ||z_2||_2 is the modulus of the second feature vector.
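A sketch of this loss, assuming it is the negative cosine similarity between the two feature vectors as described by the terms above:

```python
import torch
import torch.nn.functional as F

def first_loss(p1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity -(p1 · z2) / (||p1||_2 ||z2||_2),
    averaged over the batch."""
    p1 = F.normalize(p1, dim=-1)   # divide by ||p1||_2
    z2 = F.normalize(z2, dim=-1)   # divide by ||z2||_2
    return -(p1 * z2).sum(dim=-1).mean()
```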
And step S1014, pre-training the encoder in the twin network by taking the minimized first loss function as a target to obtain a pre-trained encoder.
After the first loss function is calculated, the model parameters of the twin network may be adjusted according to it. In the embodiment of the present invention, assuming that the current model parameters of the twin network are W1, the first loss function is back-propagated to modify W1, obtaining modified model parameters W2. After the parameters are modified, the next training pass is performed: the first loss function is recalculated and back-propagated to modify the model parameters W2, obtaining modified model parameters W3, and so on. The above process is repeated, and the model parameters of the twin network are modified in each training pass, until a preset training condition is met. The training condition may be that the number of training passes reaches a preset threshold, which may be set according to the actual situation, for example to several thousand, tens of thousands, hundreds of thousands or an even larger value; the training condition may also be convergence of the twin network. Since it may happen that the number of training passes has not yet reached the threshold but the twin network has already converged (so that further training would repeat unnecessary work), or that the twin network never converges (which would result in an infinite loop that cannot end the training process), the training condition may also be that the number of training passes reaches the threshold or the twin network converges, whichever occurs first. When the training condition is met, the pre-trained twin network is obtained, and the encoder at this point is the pre-trained encoder.
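A sketch of such a pre-training loop, reusing the augmenter, twin branches and first_loss sketched above; the optimizer, learning rate, step budget and convergence tolerance are illustrative assumptions.

```python
import torch

def pretrain_encoder(model, augmenter, dataloader,
                     max_steps: int = 10_000, tol: float = 1e-4):
    """Pre-trains the twin network by minimizing the first loss function,
    stopping when the step budget is reached or the loss stops improving."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    prev_loss, step = float("inf"), 0
    while step < max_steps:
        for embeddings in dataloader:
            view_1, view_2 = augmenter(embeddings)
            p1, z2 = model(view_1, view_2)
            loss = first_loss(p1, z2)
            optimizer.zero_grad()
            loss.backward()        # back-propagate the first loss function
            optimizer.step()       # modify the twin-network parameters
            step += 1
            converged = abs(prev_loss - loss.item()) < tol
            prev_loss = loss.item()
            if step >= max_steps or converged:   # training condition met
                return model.encoder             # the pre-trained encoder
    return model.encoder
```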
Through the contrastive learning in the embodiment of the invention, the parameters of the encoder are optimized and can be used in the subsequent named entity recognition process.
And S102, taking the pre-trained encoder as a preset named entity recognition model encoder, and training the named entity recognition model by using the labeled text data in the text data set to obtain the named entity recognition model after label training.
As shown in fig. 3, step S102 may specifically include the following processes:
and S1021, encoding the labeled text data in the text data set by using an encoder of the named entity recognition model to obtain an encoded feature vector.
Fig. 4 is a schematic diagram of the named entity recognition model, which may include an encoder and a multi-layer perceptron (MLP). Let i be the serial number of the labeled text data in the text data set and x_i be the i-th labeled text data in the text data set, with 1 ≤ i ≤ n, where n is the number of labeled text data in the text data set. The encoder of the named entity recognition model encodes x_i to obtain the encoded feature vector corresponding to x_i, denoted h_i.
Step S1022, the coded feature vector is processed by using the multilayer perceptron of the named entity recognition model, so as to obtain the probability distribution of the entity category.
Specifically, the probability distribution of the entity classes can be calculated according to the following formula:
p_i = Softmax(U tanh(V h_i))
wherein U and V are both preset model parameters, Softmax is a preset activation function, and p_i is the probability distribution of the entity classes corresponding to x_i.
And step S1023, calculating a second loss function according to the probability distribution.
Specifically, the second loss function may be calculated according to the following equation:
loss = -Σ_{i=1}^{n} z_i · log(p_i)
wherein z_i is the entity class label corresponding to x_i, i.e. the entity class labeled manually in advance, and loss is the second loss function.
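A sketch of the classification head and training objective described above; the hidden size and number of entity classes are illustrative, and the cross-entropy form of the second loss is an assumption consistent with the terms defined above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NERHead(nn.Module):
    """Multi-layer perceptron head computing p_i = Softmax(U tanh(V h_i))."""

    def __init__(self, hidden_size: int = 768, num_classes: int = 9):
        super().__init__()
        self.V = nn.Linear(hidden_size, hidden_size, bias=False)
        self.U = nn.Linear(hidden_size, num_classes, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return F.softmax(self.U(torch.tanh(self.V(h))), dim=-1)

def second_loss(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between predicted distributions p and one-hot labels z,
    summed over the labeled text data."""
    return -(z * torch.log(p + 1e-12)).sum()
```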
And step S1024, training the named entity recognition model by taking the minimized second loss function as a target to obtain the named entity recognition model after label training.
After the second loss function is calculated, the model parameters of the named entity recognition model may be adjusted according to it. In the embodiment of the present invention, assuming that the current model parameters of the named entity recognition model are V1, the second loss function is back-propagated to modify V1, obtaining modified model parameters V2. After the parameters are modified, the next training pass is performed: the second loss function is recalculated and back-propagated to modify the model parameters V2, obtaining modified model parameters V3, and so on. The above process is repeated, and the model parameters of the named entity recognition model are modified in each training pass, until a preset training condition is met. The training condition may be that the number of training passes reaches a preset threshold, which may be set according to the actual situation, for example to several thousand, hundreds of thousands or an even larger value; the training condition may also be convergence of the named entity recognition model. Since it may happen that the number of training passes has not yet reached the threshold but the named entity recognition model has already converged (so that further training would repeat unnecessary work), or that the named entity recognition model never converges (which would result in an infinite loop that cannot end the training process), the training condition may also be that the number of training passes reaches the threshold or the named entity recognition model converges, whichever occurs first. When the training condition is met, the named entity recognition model after label training is obtained.
Step S103, predicting the non-labeled text data in the text data set by using the labeled trained named entity recognition model to obtain first type text data labeled by the model and second type text data to be labeled manually.
Specifically, the label-trained named entity recognition model may be used to predict the unlabeled text data in the text data set and to calculate the confidence of each prediction result. The unlabeled text data whose prediction confidence is greater than or equal to a preset confidence threshold is taken as the first type of text data, i.e. text data that the model can label easily; the unlabeled text data whose prediction confidence is smaller than the confidence threshold is taken as the second type of text data, i.e. text data that is difficult for the model to label. For example, when labeling "Mary", if the probabilities the model predicts for the entity categories "PER" and "LOC" are relatively close, the final labeling result cannot be determined with confidence, so the corresponding text data is treated as second-type text data and is screened out for manual labeling. The specific value of the confidence threshold may be set according to the actual situation, which is not specifically limited in the embodiment of the present application.
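A sketch of this confidence-based screening; taking the per-sample confidence as the minimum over tokens of the highest predicted class probability, and the threshold value of 0.9, are illustrative assumptions.

```python
import torch

@torch.no_grad()
def split_by_confidence(encoder, head, unlabeled_data, threshold: float = 0.9):
    """Predicts unlabeled text data and splits it into first-type data
    (high confidence, kept with the model's labels) and second-type data
    (low confidence, screened out for manual labeling)."""
    first_type, second_type = [], []
    for text, embedding in unlabeled_data:        # one sample at a time
        probs = head(encoder(embedding))          # (seq_len, num_classes)
        token_conf, predicted_tags = probs.max(dim=-1)
        confidence = token_conf.min().item()      # weakest token decides
        if confidence >= threshold:
            first_type.append((text, predicted_tags.tolist()))  # model-labeled
        else:
            second_type.append(text)              # to be labeled manually
    return first_type, second_type
```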
And step S104, acquiring the second type of text data after the manual marking, and taking the first type of text data after the model marking and the second type of text data after the manual marking as the newly added marking text data.
In the embodiment of the invention, experts can be arranged to manually label the screened second type of text data, and through the active learning mode, text data which is easy to label through the model is directly labeled by using the model, and only the screened text data which is difficult to label through the model is manually labeled, so that the overall efficiency of the model is greatly improved.
And S105, adjusting the named entity recognition model after the tagging training by using the newly added tagging text data to obtain the adjusted named entity recognition model.
Through the process, part of the unmarked text data is converted into marked text data, the number of the marked text data is continuously increased in the process, and the named entity recognition model can be continuously adjusted by using the newly added marked text data to obtain the adjusted named entity recognition model.
It should be noted that the model adjustment is a continuous iterative process, i.e. steps S103 to S105 are repeated continuously, the unlabeled text data is continuously converted into labeled text data, and a new round of model adjustment is performed by using the newly added labeled text data until the finally obtained named entity recognition model reaches the predetermined recognition accuracy. In the subsequent named entity recognition task, the model can be used for carrying out named entity recognition.
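The sketch below outlines the iteration of steps S103 to S105 under the assumptions above; the expert labeling step is represented by an oracle callback, and the fine-tuning routine, batch size and accuracy target are placeholders for illustration.

```python
def active_learning_loop(encoder, head, unlabeled_pool, oracle, fine_tune,
                         evaluate, batch_size: int = 100,
                         target_accuracy: float = 0.95, max_rounds: int = 20):
    """Repeats: predict a batch of unlabeled data, split it by confidence,
    have an expert label the hard cases, and fine-tune on the newly added
    labeled data, until the desired recognition accuracy is reached."""
    for _ in range(max_rounds):
        if not unlabeled_pool:
            break
        batch, unlabeled_pool = unlabeled_pool[:batch_size], unlabeled_pool[batch_size:]
        first_type, second_type = split_by_confidence(encoder, head, batch)
        manually_labeled = [(text, oracle(text)) for text in second_type]
        newly_added = first_type + manually_labeled   # newly added labeled text data
        fine_tune(encoder, head, newly_added)          # adjust the model
        if evaluate(encoder, head) >= target_accuracy:
            break
    return encoder, head
```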
And S106, acquiring target text data to be recognized, and processing the target text data by using the adjusted named entity recognition model to obtain the entity category of each named entity in the target text data.
The target text data may be pre-stored in the terminal device, may be sent to the terminal device by another device through a preset communication channel, or may be input to the terminal device by a user through a preset human-computer interaction interface. When named entity recognition is needed, the terminal device can process the target text data with the adjusted named entity recognition model to obtain the entity category of each named entity in the target text data. For example, if the target text data is "Xiaoming goes to class at school at 8 o'clock in the morning.", then after processing by the adjusted named entity recognition model, the final recognition result is: (named entity: Xiaoming, entity category: person name); (named entity: 8 o'clock in the morning, entity category: time); (named entity: school, entity category: location).
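A sketch of this inference step under the assumptions above; the tag set, the token embedding function and the grouping of IOB tags into entities are illustrative.

```python
import torch

TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-TIME", "I-TIME"]  # assumed tag set

@torch.no_grad()
def recognize(encoder, head, tokens, embed):
    """Tags each token of the target text and groups IOB tags into
    (named entity, entity category) pairs."""
    probs = head(encoder(embed(tokens)))      # (seq_len, num_tags)
    tag_ids = probs.argmax(dim=-1).tolist()
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, (TAGS[t] for t in tag_ids)):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities  # e.g. [("Xiaoming", "PER"), ("school", "LOC"), ...]
```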
In summary, in the embodiments of the present invention, an encoder in a preset twin network is pre-trained using the text data in a preset text data set to obtain a pre-trained encoder; the pre-trained encoder is used as the encoder of a preset named entity recognition model, and the named entity recognition model is trained using the labeled text data in the text data set to obtain a label-trained named entity recognition model; the unlabeled text data in the text data set is predicted using the label-trained named entity recognition model to obtain first type text data labeled by the model and second type text data to be labeled manually; the manually labeled second type text data is acquired, and the model-labeled first type text data and the manually labeled second type text data are taken as newly added labeled text data; the label-trained named entity recognition model is adjusted using the newly added labeled text data to obtain an adjusted named entity recognition model; and target text data to be recognized is acquired and processed using the adjusted named entity recognition model to obtain the entity category of each named entity in the target text data. The embodiments of the invention fuse active learning and contrastive learning: the encoder is first pre-trained with a contrastive learning method and then used to train the named entity recognition model, and in the subsequent active learning process the model continuously performs iterative training and optimization of itself through the feedback it receives. Only a small amount of manual labeling is needed in the whole process, which effectively reduces the consumption of time and money.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Fig. 5 is a structural diagram of an embodiment of a named entity recognition apparatus according to an embodiment of the present invention, which corresponds to the named entity recognition method described in the foregoing embodiment.
In this embodiment, a named entity recognition apparatus may include:
the encoder pre-training module 501 is configured to pre-train an encoder in a preset twin network by using text data in a preset text data set to obtain a pre-trained encoder;
a model training module 502, configured to use the pre-trained encoder as an encoder of a preset named entity recognition model, and train the named entity recognition model using labeled text data in the text data set to obtain a named entity recognition model after label training;
the model prediction module 503 is configured to predict non-labeled text data in the text data set by using the labeled trained named entity recognition model, so as to obtain first type text data labeled by the model and second type text data to be labeled manually;
a newly added labeled text data module 504, configured to obtain second-type text data after manual labeling, and use the first-type text data after model labeling and the second-type text data after manual labeling as newly added labeled text data;
a model adjusting module 505, configured to adjust the named entity recognition model after the tagging training by using the newly added tagged text data, to obtain an adjusted named entity recognition model;
and a named entity recognition module 506, configured to obtain target text data to be recognized, and process the target text data using the adjusted named entity recognition model to obtain an entity category of each named entity in the target text data.
In a specific implementation manner of the embodiment of the present invention, the encoder pre-training module may include:
the data enhancement unit is used for performing data enhancement on the text data in the text data set to obtain a preset number of enhanced text data pairs; any one of the enhanced text data pairs comprises two different enhanced text data obtained by performing data enhancement on the same text data;
the twin network processing unit is used for processing the enhanced text data pair by using the twin network to respectively obtain a first feature vector and a second feature vector;
a first loss function calculation unit configured to calculate a first loss function from the first eigenvector and the second eigenvector;
and the pre-training unit is used for pre-training the encoder in the twin network by taking the minimized first loss function as a target to obtain the pre-trained encoder.
In a specific implementation manner of the embodiment of the present invention, the first loss function calculating unit may be specifically configured to calculate the first loss function according to the following formula:
first loss function = -(p_1 · z_2) / (||p_1||_2 · ||z_2||_2)
wherein p_1 is the first feature vector, z_2 is the second feature vector, ||p_1||_2 is the modulus of the first feature vector, and ||z_2||_2 is the modulus of the second feature vector.
In a specific implementation manner of the embodiment of the present invention, the model training module may include:
the coding unit is used for coding the labeled text data in the text data set by using the coder of the named entity recognition model to obtain coded feature vectors;
the multilayer perception unit is used for processing the coded feature vectors by using a multilayer perceptron of the named entity recognition model to obtain the probability distribution of entity categories;
a second loss function calculation unit for calculating a second loss function according to the probability distribution;
and the model training unit is used for training the named entity recognition model by taking the minimized second loss function as a target to obtain the named entity recognition model after label training.
In a specific implementation manner of the embodiment of the present invention, the multi-layer sensing unit may be specifically configured to calculate a probability distribution of the entity class according to the following formula:
p_i = Softmax(U tanh(V h_i))
wherein i is the serial number of the labeled text data in the text data set, h_i is the encoded feature vector corresponding to the i-th labeled text data in the text data set, U and V are both preset model parameters, Softmax is a preset activation function, and p_i is the probability distribution of the entity classes corresponding to the i-th labeled text data in the text data set.
In a specific implementation manner of the embodiment of the present invention, the second loss function calculating unit may be specifically configured to calculate the second loss function according to the following formula:
loss = -Σ_{i=1}^{n} z_i · log(p_i)
wherein n is the number of labeled text data in the text data set, z_i is the entity class label corresponding to the i-th labeled text data in the text data set, and loss is the second loss function.
In a specific implementation manner of the embodiment of the present invention, the model prediction module may include:
the prediction unit is used for predicting the non-labeled text data in the text data set by using the labeled trained named entity recognition model and calculating the confidence coefficient of a prediction result;
the first-class text data determining unit is used for taking the non-labeled text data of which the confidence coefficient of the prediction result is greater than or equal to a preset confidence coefficient threshold value as the first-class text data;
and the second-class text data determining unit is used for taking the non-labeled text data of which the confidence coefficient of the prediction result is smaller than the confidence coefficient threshold value as the second-class text data.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, modules and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Fig. 6 shows a schematic block diagram of a terminal device according to an embodiment of the present invention, and for convenience of description, only the relevant parts related to the embodiment of the present invention are shown.
In this embodiment, the terminal device 6 may be a desktop computer, a notebook, a palm computer, or other computing devices. The terminal device 6 may include: a processor 60, a memory 61, and computer readable instructions 62 stored in the memory 61 and executable on the processor 60, such as computer readable instructions to perform the named entity identification method described above. The processor 60, when executing the computer readable instructions 62, implements the steps in the various named entity identification method embodiments described above, such as steps S101-S106 shown in fig. 1. Alternatively, the processor 60, when executing the computer readable instructions 62, implements the functions of the modules/units in the above-described device embodiments, such as the functions of the modules 501 to 506 shown in fig. 5.
Illustratively, the computer readable instructions 62 may be partitioned into one or more modules/units that are stored in the memory 61 and executed by the processor 60 to implement the present invention. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, which are used to describe the execution process of the computer-readable instructions 62 in the terminal device 6.
The Processor 60 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or a memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 6. Further, the memory 61 may also include both an internal storage unit and an external storage device of the terminal device 6. The memory 61 is used for storing the computer readable instructions and other instructions and text data required by the terminal device 6. The memory 61 may also be used to temporarily store text data that has been output or is to be output.
Each functional unit in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes a plurality of computer readable instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like, which can store computer readable instructions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A named entity recognition method, comprising:
pre-training an encoder in a preset twin network by using text data in a preset text data set to obtain a pre-trained encoder;
taking the pre-trained encoder as an encoder of a preset named entity recognition model, and training the named entity recognition model by using the labeled text data in the text data set to obtain a named entity recognition model after label training;
predicting the non-labeled text data in the text data set by using the labeled trained named entity recognition model to obtain first type text data labeled by the model and second type text data to be labeled manually;
acquiring the manually labeled second type text data, and taking the model-labeled first type text data and the manually labeled second type text data as newly added labeled text data;
adjusting the named entity recognition model after the tagging training by using the newly added tagging text data to obtain an adjusted named entity recognition model;
and acquiring target text data to be recognized, and processing the target text data by using the adjusted named entity recognition model to obtain the entity category of each named entity in the target text data.
2. The named entity recognition method of claim 1, wherein the pre-training of the encoder in the pre-set twin network using the text data in the pre-set of text data to obtain a pre-trained encoder comprises:
performing data enhancement on the text data in the text data set to obtain a preset number of enhanced text data pairs; any one of the enhanced text data pairs comprises two different enhanced text data obtained by performing data enhancement on the same text data;
processing the enhanced text data pair by using the twin network to respectively obtain a first feature vector and a second feature vector;
calculating a first loss function according to the first eigenvector and the second eigenvector;
and pre-training the encoder in the twin network by taking the minimized first loss function as a target to obtain a pre-trained encoder.
3. The named entity recognition method of claim 2, wherein said computing a first penalty function from said first feature vector and said second feature vector comprises:
calculating the first loss function according to:
first loss function = -(p_1 · z_2) / (||p_1||_2 · ||z_2||_2)
wherein p_1 is the first feature vector, z_2 is the second feature vector, ||p_1||_2 is the modulus of the first feature vector, and ||z_2||_2 is the modulus of the second feature vector.
4. The method according to claim 1, wherein the training of the named entity recognition model using the labeled text data in the text dataset to obtain a labeled trained named entity recognition model comprises:
encoding the labeled text data in the text data set by using an encoder of the named entity recognition model to obtain an encoded feature vector;
processing the coded feature vector by using a multilayer perceptron of the named entity recognition model to obtain the probability distribution of entity categories;
calculating a second loss function according to the probability distribution;
and training the named entity recognition model by taking the minimized second loss function as a target to obtain the named entity recognition model after label training.
5. The method according to claim 4, wherein the processing the encoded feature vector by the multi-layered perceptron of the named entity recognition model to obtain a probability distribution of entity classes comprises:
the probability distribution of the entity classes is calculated according to the following formula:
p_i = Softmax(U tanh(V h_i))
wherein i is the serial number of the labeled text data in the text data set, h_i is the encoded feature vector corresponding to the i-th labeled text data in the text data set, U and V are both preset model parameters, Softmax is a preset activation function, and p_i is the probability distribution of the entity classes corresponding to the i-th labeled text data in the text data set.
6. The named entity recognition method of claim 5, wherein said computing a second loss function from said probability distribution comprises:
calculating the second loss function according to:
loss = -Σ_{i=1}^{n} z_i · log(p_i)
wherein n is the number of labeled text data in the text data set, z_i is the entity class label corresponding to the i-th labeled text data in the text data set, and loss is the second loss function.
7. The method according to any one of claims 1 to 6, wherein the predicting, using the label-trained named entity recognition model, the unlabeled text data in the text data set to obtain model-labeled first-type text data and second-type text data to be labeled manually comprises:
predicting the non-labeled text data in the text data set by using the labeled trained named entity recognition model, and calculating the confidence of the prediction result;
using the non-labeled text data with the confidence coefficient of the prediction result being greater than or equal to a preset confidence coefficient threshold value as the first type of text data;
and taking the non-labeled text data with the confidence coefficient of the prediction result smaller than the confidence coefficient threshold value as the second type of text data.
8. A named entity recognition apparatus, comprising:
the encoder pre-training module is used for pre-training an encoder in a preset twin network by using the text data in the preset text data set to obtain a pre-trained encoder;
the model training module is used for taking the pre-trained encoder as a preset encoder of the named entity recognition model, and training the named entity recognition model by using the labeled text data in the text data set to obtain a labeled and trained named entity recognition model;
the model prediction module is used for predicting the non-labeled text data in the text data set by using the labeled trained named entity recognition model to obtain first type text data labeled by the model and second type text data to be labeled manually;
the newly added labeled text data module is used for acquiring manually labeled second-type text data and taking the model labeled first-type text data and the manually labeled second-type text data as newly added labeled text data;
the model adjusting module is used for adjusting the named entity recognition model after the tagging training by using the newly added tagging text data to obtain an adjusted named entity recognition model;
and the named entity recognition module is used for acquiring target text data to be recognized and processing the target text data by using the adjusted named entity recognition model to obtain the entity category of each named entity in the target text data.
9. A computer readable storage medium storing computer readable instructions, characterized in that the computer readable instructions, when executed by a processor, implement the steps of the named entity recognition method according to any one of claims 1 to 7.
10. A terminal device comprising a memory, a processor and computer readable instructions stored in the memory and executable on the processor, characterized in that the processor, when executing the computer readable instructions, implements the steps of the named entity recognition method according to any one of claims 1 to 7.
CN202111233302.XA 2021-10-22 2021-10-22 Named entity identification method, device, storage medium and terminal equipment Pending CN113901823A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111233302.XA CN113901823A (en) 2021-10-22 2021-10-22 Named entity identification method, device, storage medium and terminal equipment
PCT/CN2022/089993 WO2023065635A1 (en) 2021-10-22 2022-04-28 Named entity recognition method and apparatus, storage medium and terminal device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111233302.XA CN113901823A (en) 2021-10-22 2021-10-22 Named entity identification method, device, storage medium and terminal equipment

Publications (1)

Publication Number Publication Date
CN113901823A true CN113901823A (en) 2022-01-07

Family

ID=79025932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111233302.XA Pending CN113901823A (en) 2021-10-22 2021-10-22 Named entity identification method, device, storage medium and terminal equipment

Country Status (2)

Country Link
CN (1) CN113901823A (en)
WO (1) WO2023065635A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023065635A1 (en) * 2021-10-22 2023-04-27 平安科技(深圳)有限公司 Named entity recognition method and apparatus, storage medium and terminal device
CN116776154A (en) * 2023-07-06 2023-09-19 华中师范大学 AI man-machine cooperation data labeling method and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108959252B (en) * 2018-06-28 2022-02-08 中国人民解放军国防科技大学 Semi-supervised Chinese named entity recognition method based on deep learning
CN109543009B (en) * 2018-10-17 2019-10-25 龙马智芯(珠海横琴)科技有限公司 Text similarity assessment system and text similarity appraisal procedure
US11568143B2 (en) * 2019-11-15 2023-01-31 Intuit Inc. Pre-trained contextual embedding models for named entity recognition and confidence prediction
CN112818691A (en) * 2021-02-01 2021-05-18 北京金山数字娱乐科技有限公司 Named entity recognition model training method and device
CN113901823A (en) * 2021-10-22 2022-01-07 平安科技(深圳)有限公司 Named entity identification method, device, storage medium and terminal equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023065635A1 (en) * 2021-10-22 2023-04-27 平安科技(深圳)有限公司 Named entity recognition method and apparatus, storage medium and terminal device
CN116776154A (en) * 2023-07-06 2023-09-19 华中师范大学 AI man-machine cooperation data labeling method and system
CN116776154B (en) * 2023-07-06 2024-04-09 华中师范大学 AI man-machine cooperation data labeling method and system

Also Published As

Publication number Publication date
WO2023065635A1 (en) 2023-04-27

Similar Documents

Publication Publication Date Title
US10824949B2 (en) Method and system for extracting information from graphs
US20220114343A1 (en) Method of training model, device, and storage medium
CN112732911B (en) Semantic recognition-based speaking recommendation method, device, equipment and storage medium
CN110210024B (en) Information processing method, device and storage medium
CN116415654A (en) Data processing method and related equipment
CN113553864A (en) Translation model training method and device, electronic equipment and storage medium
CN110874625B (en) Data processing method and device
CN113901823A (en) Named entity identification method, device, storage medium and terminal equipment
CN111324738B (en) Method and system for determining text label
CN113889076B (en) Speech recognition and coding/decoding method, device, electronic equipment and storage medium
CN111428757B (en) Model training method, abnormal data detection method and device and electronic equipment
US20210374517A1 (en) Continuous Time Self Attention for Improved Computational Predictions
CN110543561A (en) Method and device for emotion analysis of text
CN112633002A (en) Sample labeling method, model training method, named entity recognition method and device
Xu et al. Stacked deep learning structure with bidirectional long-short term memory for stock market prediction
US20220067579A1 (en) Dynamic ontology classification system
CN113239702A (en) Intention recognition method and device and electronic equipment
CN115687934A (en) Intention recognition method and device, computer equipment and storage medium
Cottrell et al. Neural networks for complex data
CN116820711B (en) Task driven autonomous agent method
CN116308219B (en) Generated RPA flow recommendation method and system based on Tranformer
CN112966140A (en) Field identification method, field identification device, electronic device, storage medium, and program product
CN111723186A (en) Knowledge graph generation method based on artificial intelligence for dialog system and electronic equipment
CN114897183B (en) Question data processing method, training method and device of deep learning model
CN114491030A (en) Skill label extraction and candidate phrase classification model training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination