WO2021218024A1 - Training method, apparatus, and computer device for a named entity recognition model - Google Patents

Training method, apparatus, and computer device for a named entity recognition model

Info

Publication number
WO2021218024A1
WO2021218024A1 (PCT/CN2020/118523, CN2020118523W)
Authority
WO
WIPO (PCT)
Prior art keywords
training
model
named entity
models
entity recognition
Prior art date
Application number
PCT/CN2020/118523
Other languages
English (en)
French (fr)
Inventor
陈桢博
金戈
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021218024A1 publication Critical patent/WO2021218024A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2155 Generating training patterns characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • This application relates to the technical field of artificial intelligence, and in particular to a training method, device, computer equipment and storage medium for a named entity recognition model.
  • NER: Named Entity Recognition
  • BiLSTM-CRF: bidirectional LSTM with a conditional random field (CRF) output layer
  • The main purpose of this application is to provide a training method, device, computer device, and storage medium for a named entity recognition model, aiming to overcome the low accuracy of named entity recognition models and the small amount of labeled data available when training such models.
  • this application provides a method for training a named entity recognition model, which includes the following steps:
  • a BiLSTM-CRF model is trained based on preset training samples to obtain three training models, where the preset training samples comprise two groups of labeled data sets: a public data set and a designated-domain named entity training set;
  • if the predicted labels output by the two selected training models are the same, the predicted label is attached to the unlabeled target data, which is then added to the training samples of the unselected training model in order to train that model; the unlabeled target data is also put back into the unlabeled data set; iterative training stops once the samples in the unlabeled data set are no longer updated, and the three trained models are all used as the final named entity recognition model, which performs named entity recognition on target text in the designated field.
  • This application also provides a training device for a named entity recognition model, including:
  • a request obtaining unit, configured to, when a request to train the named entity recognition model is received, obtain the designated field of the target text to be recognized by the model, and obtain the designated-domain named entity training set according to that field;
  • a first training unit, used to train the BiLSTM-CRF model based on preset training samples to obtain three training models, where the preset training samples comprise two groups of labeled data sets: a public data set and the designated-domain named entity training set;
  • a first prediction unit, used to iteratively select any two training models at random from the three training models, and to select one unlabeled target data item at a time from the unlabeled data set and input it into the two selected training models for prediction, obtaining the labels predicted by the two training models;
  • a second training unit, configured to, if the labels predicted by the two training models are the same, attach the predicted label to the unlabeled target data and add it to the training samples of the unselected training model in order to train that model, and to put the unlabeled target data back into the unlabeled data set; iterative training stops once the samples in the unlabeled data set are no longer updated, and the three trained models are all used as the final named entity recognition model, which performs named entity recognition on target text in the designated field.
  • The present application also provides a computer device, including a memory and a processor; the memory stores a computer program, and when the processor executes the computer program, a method for training a named entity recognition model is implemented, including the following steps:
  • a BiLSTM-CRF model is trained based on preset training samples to obtain three training models, where the preset training samples comprise two groups of labeled data sets: a public data set and a designated-domain named entity training set;
  • if the predicted labels output by the two selected training models are the same, the predicted label is attached to the unlabeled target data, which is then added to the training samples of the unselected training model in order to train that model; the unlabeled target data is also put back into the unlabeled data set; iterative training stops once the samples in the unlabeled data set are no longer updated, and the three trained models are all used as the final named entity recognition model, which performs named entity recognition on target text in the designated field.
  • This application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, a method for training a named entity recognition model is implemented, including the following steps:
  • a BiLSTM-CRF model is trained based on preset training samples to obtain three training models, where the preset training samples comprise two groups of labeled data sets: a public data set and a designated-domain named entity training set;
  • if the predicted labels output by the two selected training models are the same, the predicted label is attached to the unlabeled target data, which is then added to the training samples of the unselected training model in order to train that model; the unlabeled target data is also put back into the unlabeled data set; iterative training stops once the samples in the unlabeled data set are no longer updated, and the three trained models are all used as the final named entity recognition model, which performs named entity recognition on target text in the designated field.
  • The training method, device, computer device, and storage medium for a named entity recognition model provided by this application work as follows: a BiLSTM-CRF model is trained based on preset training samples to obtain three training models; any two of the three training models are iteratively selected at random, and one unlabeled target data item at a time is selected from the unlabeled data set and input into the two selected training models for prediction, yielding the labels predicted by the two models; if the two predicted labels are the same, the predicted label is attached to the unlabeled target data, which is added to the training samples of the unselected training model in order to train it.
  • This semi-supervised method replaces training with labeled data only, makes full use of the original data, and overcomes the shortage of labeled data; the voting-consistency principle across the three learning models expresses confidence implicitly and reduces the time required for frequent cross-validation.
  • It thereby increases the reliability of the model, so that training is more effective, named entity recognition on resume text performs better, and generalization ability is improved.
  • FIG. 1 is a schematic diagram of the steps of a training method for a named entity recognition model in an embodiment of the present application
  • FIG. 2 is a structural block diagram of a training device for a named entity recognition model in an embodiment of the present application
  • FIG. 3 is a schematic block diagram of the structure of a computer device according to an embodiment of the application.
  • an embodiment of the present application provides a method for training a named entity recognition model, which includes the following steps:
  • Step S01: when a request to train a named entity recognition model is received, obtain the designated field of the target text to be recognized by the model, and obtain the designated-domain named entity training set according to that field;
  • Step S1: train the BiLSTM-CRF model based on preset training samples to obtain three training models, where the preset training samples comprise two groups of labeled data sets: a public data set and the designated-domain named entity training set;
  • Step S2: iteratively select any two training models at random from the three training models, and select one unlabeled target data item at a time from the unlabeled data set and input it into the two selected training models for prediction, obtaining the labels predicted by the two training models;
  • Step S3: if the labels predicted by the two training models are the same, attach the predicted label to the unlabeled target data and add it to the training samples of the unselected training model in order to train that model, and put the unlabeled target data back into the unlabeled data set; stop iterative training once the samples in the unlabeled data set are no longer updated, and use the three trained models together as the final named entity recognition model, which performs named entity recognition on target text in the designated field.
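The iterative loop of steps S2 and S3 can be sketched in a few lines. This is a minimal pure-Python sketch, not the patent's implementation: in the patent each model is a BiLSTM-CRF, while the `train_fn` and `predict_fn` helpers here are illustrative stand-ins supplied by the caller.

```python
import random

def tri_training(models, labeled_sets, unlabeled, train_fn, predict_fn,
                 max_rounds=10):
    """Sketch of steps S2-S3: two randomly chosen models vote on each
    unlabeled sample; on agreement the pseudo-labeled sample is added to
    the third model's training set and that model is retrained."""
    for _ in range(max_rounds):
        updated = False
        for sample in list(unlabeled):
            # Step S2: randomly pick two of the three models.
            i, j = random.sample(range(3), 2)
            k = ({0, 1, 2} - {i, j}).pop()
            # Step S3: identical predictions imply high confidence.
            if predict_fn(models[i], sample) == predict_fn(models[j], sample):
                pseudo = (sample, predict_fn(models[i], sample))
                if pseudo not in labeled_sets[k]:
                    labeled_sets[k].append(pseudo)       # update training set
                    models[k] = train_fn(labeled_sets[k])  # retrain third model
                    updated = True
            # The sample is always put back into the unlabeled pool.
        if not updated:  # training sets no longer change: stop iterating
            break
    return models
```

All three returned models are then used together as the final recognizer, as the steps above describe.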
  • The named entity recognition model trained by the above method is used to automatically identify named entities such as school names and place names from resume text (resumes being the designated field here) in order to generate structured data, as is commonly required in resume content recognition.
  • The above named entity recognition model is usually a BiLSTM-CRF model, and its training samples usually draw on the large Chinese named entity data sets publicly available on the web (the public data set above, which is an existing resource with a large amount of data); this public data set is labeled.
  • The application scenario of this model is named entity recognition on resume text, so the model in this embodiment must be trained on data from this scenario before it can be used for the task. Directly using only the designated-domain named entity training set would cause over-fitting because of its small amount of data.
  • Before named entity recognition can be performed on target text in a designated field, the corresponding named entity recognition model must be trained. The user can therefore trigger a training request instruction, which the terminal receives.
  • When a request to train a named entity recognition model is received, in order to train it well and improve its recognition accuracy, it is necessary to determine which field the target text to be recognized belongs to, so that the training set of that field can be obtained for training.
  • Specifically, when a training request is received, the designated field of the target text to be recognized is obtained; if the designated field is resume text, the entity training set for resume text is obtained accordingly for use in the subsequent training process.
  • This model must be trained on a named entity data set from the resume text field before it can be used for the task, and directly using only that data set would cause over-fitting due to its small size. This solution therefore first pre-trains the BiLSTM-CRF model on the public data set to obtain a pre-trained model M0 that initializes the neural network parameters, and then trains with the designated-domain named entity data set. This approach effectively improves the robustness of the algorithm. The designated-domain named entity training set used in the training samples is small, but it consists of domain-specific vocabulary and is therefore highly targeted.
  • The designated-domain named entity training set refers to the training corpus of the designated field, and it too is a labeled data set.
  • The designated-domain named entity training set is divided into three training data sets, and the pre-trained model M0 is trained separately on each of them, yielding one training model per data set. All three training models are derived from the pre-trained model M0; because the training data sets differ, the resulting training models differ as well.
  • Model training with labeled data, as described above, is a supervised approach; labeling is very time-consuming, so the amount of labeled data is usually very limited. To make full use of existing data, this embodiment further adopts a semi-supervised training method (tri-training): in addition to the labeled data sets above, an unlabeled data set is also used, which both increases the amount of training data and increases the reliability of the model.
  • Target data are labeled by prediction: two randomly chosen models predict the label for the same unlabeled target data. If the labels predicted by the two models agree, the confidence of the two models can be considered high; otherwise it is low. Note that the predicted label is not a single label but a set of labels corresponding to the unlabeled data; the number of labels in the set depends on the number of characters in the unlabeled data.
  • The unlabeled data set is labeled using the BIOES scheme.
  • The label assigned to the same character differs by context. For example, a character that begins a place name is labeled B of the place name, and one that ends a place name is labeled E: in "北京" (Beijing), "北" is labeled B and "京" is labeled E. In another context, such as the personal name "顾北" (Gubei), the same character "北" may instead be labeled E of a person name. The same character thus receives different labels in different contexts.
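The BIOES scheme described above can be illustrated with a small helper that tags a sentence given entity spans. This is a simplified sketch for illustration only; the function name, span format, and entity-type strings are assumptions, not taken from the patent.

```python
def bioes_tags(entities, length):
    """Produce BIOES tags for a sentence of `length` characters, given
    (start, end, type) entity spans with inclusive character indices."""
    tags = ["O"] * length                    # O: outside any entity
    for start, end, etype in entities:
        if start == end:
            tags[start] = f"S-{etype}"       # S: single-character entity
        else:
            tags[start] = f"B-{etype}"       # B: beginning of the entity
            for i in range(start + 1, end):
                tags[i] = f"I-{etype}"       # I: inside the entity
            tags[end] = f"E-{etype}"         # E: end of the entity
    return tags
```

For "北京" as a two-character place name, `bioes_tags([(0, 1, "LOC")], 2)` yields `["B-LOC", "E-LOC"]`, matching the example in the text; with "北" ending a person name instead, its tag would be `E-PER`, showing how context changes the label of the same character.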
  • In step S3, if the two training models predict the same label for the same unlabeled target data, the predicted label can be attached to the unlabeled target data, which is added to the training samples of the unselected model so that the remaining model is trained iteratively; at the same time, the unlabeled target data is put back into the unlabeled data set. In this embodiment, the model prediction results determine whether unlabeled data is added to the training samples. In the prior art, when unlabeled data is used to train a model, the decision is usually based on whether the model's predicted probability for the unlabeled data reaches a threshold.
  • This embodiment thus differs clearly from deciding by predicted probability: this application uses a multi-model voting-consistency principle to express confidence implicitly, which increases the reliability of the model, improves the training effect, and makes recognition more accurate.
  • Step S3 is then repeated in turn to complete the retraining of the three training models.
  • After the training data set of the corresponding model is updated, the next cycle begins, and the above steps are repeated until the training sets of all models are no longer updated.
  • In this way, unlabeled data can be used effectively to increase the volume of training samples and improve the generalization of the model.
  • In this solution, vocabulary features of the specific field are first added so that word segmentation is more accurate in the professional domain, thereby improving the accuracy of named entity recognition; neural network algorithms are combined in a semi-supervised setting.
  • Tri-training is applied to CRF and BiLSTM-CRF models to complete the NER task.
  • The semi-supervised method replaces training with labeled data only, makes full use of the original data, and overcomes the current shortage of labeled data.
  • The voting-consistency principle across the three learning models expresses confidence implicitly, reduces the time required for frequent cross-validation, and increases the reliability of the model, so that training is more effective, named entity recognition on resume text performs better, and generalization ability is improved.
  • In practical applications, such as a resume recognition scenario, the model obtained by training in this embodiment can be iteratively trained on specific resume text so that it updates automatically.
  • In the construction of a smart city, to enhance the efficient transmission and expression of information, the above solution can also be applied in smart office scenarios, promoting smart city construction.
  • the step S1 of training the BiLSTM-CRF model based on preset training samples to obtain three training models includes:
  • Step S11: train the BiLSTM-CRF model on the public data set to obtain a pre-trained model;
  • Step S12: perform sampling with replacement on the designated-domain named entity training set to obtain three training data sets;
  • Step S13: train the pre-trained model separately on the three training data sets to obtain three training models.
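Step S12's sampling with replacement (a bootstrap) can be sketched as follows. The helper name and the fixed seed are illustrative assumptions; the patent does not specify an implementation.

```python
import random

def bootstrap_splits(train_set, n_models=3, seed=42):
    """Step S12 sketch: draw three training sets by sampling with
    replacement, each the size of the original set, so that the three
    models see different views of the small in-domain data."""
    rng = random.Random(seed)
    return [rng.choices(train_set, k=len(train_set)) for _ in range(n_models)]
```

Because the draws are with replacement, each split may repeat some samples and omit others, which is what makes the three models trained in step S13 differ from one another.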
  • The step S13 of separately training the pre-trained model on the three training data sets to obtain three training models includes:
  • training the pre-trained model on each of the three training data sets to obtain three training models.
  • The difference between the two labeled data sets is that the designated-domain named entity training set contains named entities labeled specifically for the current task, whereas the public data set comes from large Chinese labeled named entity data sets published on the Internet; the designated field is the domain of the current named entity recognition task.
  • The designated field may be the resume text field.
  • The BiLSTM-CRF model is first trained on the public data set to obtain the pre-trained model M0, which initializes the neural network parameters of the model; then, as described in step S13 above, sampling with replacement is applied to the designated-domain named entity training set to obtain three training data sets, and training proceeds from the pre-trained model M0.
  • Because the training samples are the public data set together with the designated-domain named entity training set, the three models both maintain a high conventional named entity recognition rate and ensure good recognition in the designated domain.
  • The large public data set is used only to better initialize some of the parameters of the model's neural network (namely, the parameters of the BiLSTM part) and to improve the robustness of the model; the CRF parameters need not be initialized from it.
  • Since the model is ultimately used for resume named entity recognition, it must be trained on the labeled data set of that field, and the CRF layer is retrained; the CRF layer therefore needs to be re-initialized.
  • The initialization process retains only the pre-trained parameters of the BiLSTM part and resets the CRF parameters; sampling with replacement then yields the three training data sets used to train the models separately, producing the three training models M1, M2, and M3.
  • This training scheme enables the model to achieve higher generalization ability.
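The initialization described above (keep the pre-trained BiLSTM parameters, reset the CRF parameters) might be sketched over a flat parameter dictionary as follows; the `bilstm.`/`crf.` key prefixes and the zero reset value are illustrative assumptions, since the patent does not name its parameter layout.

```python
def init_from_pretrained(pretrained_params, crf_init=0.0):
    """Build initial parameters for M1/M2/M3 from pre-trained model M0:
    retain everything except the CRF layer, which is re-initialized."""
    params = {}
    for name, value in pretrained_params.items():
        if name.startswith("crf."):
            params[name] = crf_init   # reset the CRF transition parameters
        else:
            params[name] = value      # retain pre-trained BiLSTM parameters
    return params
```

Each of the three bootstrap training sets would then start from this same initialization before its own fine-tuning, yielding M1, M2, and M3.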
  • the method includes:
  • Step S4: when a named entity recognition instruction for text to be recognized is received, input the text into any one of the named entity recognition models for prediction, and obtain the named entity recognition result of the text; the result consists of the labels of the characters in the text. The three named entity recognition models obtained by the training above can all be used to recognize named entities in the text.
  • Step S5: add the text to be recognized to the unlabeled data set, and, after attaching the named entity recognition result to it, add it to the designated-domain named entity training set.
  • the step S1 of training the BiLSTM-CRF model based on preset training samples to obtain three training models includes:
  • Step S1a: randomly select target public data from the public data set. In this embodiment, because the amount of data in the designated-domain named entity training set is limited, some high-quality data can be selected from the public data set for training in order to improve the recognition accuracy of the named entity recognition model; hence the target public data is selected at random from the public data set.
  • An agent model can be used for this selection. When the agent model selects data, it automatically optimizes the selection according to the output of the final model, so that the quality of the selected data improves over time.
  • Step S1b: divide the designated-domain named entity training set into a designated training set and a designated test set;
  • Step S1c: combine the target public data and the designated training set into a model training set, and input the model training set into the BiLSTM-CRF model for training to obtain a pre-trained model. In this embodiment, training on the designated training set alone would yield the highest accuracy, but the amount of data is small and the generalization ability poor; the quality of the target public data is lower than that of the designated training set.
  • Training jointly on the target public data and the designated training set affects the accuracy of the model, but the better the quality of the target public data, the smaller the effect. The quality of the pre-trained model is therefore tied to the quality of the target public data.
  • Step S1d: input the designated test set into the trained pre-trained model for testing, and obtain the probability that the labels predicted for the designated test set are correct;
  • Step S1e: judge whether this correct probability is greater than a preset probability, and if so, combine the target public data and the designated-domain named entity training set into a target training set. In this embodiment, the designated test set is used to test the pre-trained model.
  • If the probability that the predicted labels of the designated test set are correct exceeds the preset probability, the predictive ability of the pre-trained model has been affected only slightly, that is, the quality of the target public data is high.
  • The target public data can then also be included in the target training set for the subsequent training of the pre-trained model.
  • If the correct probability is below the preset probability, the predictive ability of the pre-trained model has been affected substantially, that is, the quality of the target public data is low; in that case another batch of target public data must be randomly selected from the public data set.
  • Step S1f: perform sampling with replacement on the target training set to obtain three training data sets;
  • Step S1g: train the pre-trained model separately on the three training data sets to obtain three training models.
  • The specific implementations of steps S1f and S1g are the same as those of steps S12 and S13 above and are not repeated here.
  • an embodiment of the present application also provides a training device for a named entity recognition model, including:
  • the request obtaining unit 100, configured to, when a request to train a named entity recognition model is received, obtain the designated field of the target text to be recognized by the model, and obtain the designated-domain named entity training set according to that field;
  • the first training unit 10, used to train the BiLSTM-CRF model based on preset training samples to obtain three training models, where the preset training samples comprise two groups of labeled data sets: a public data set and the designated-domain named entity training set;
  • the first prediction unit 20, configured to iteratively select any two training models at random from the three training models, and to select one unlabeled target data item at a time from the unlabeled data set and input it into the two selected training models for prediction, obtaining the labels predicted by the two training models;
  • the second training unit 30, configured to, if the labels predicted by the two training models are the same, attach the predicted label to the unlabeled target data and add it to the training samples of the unselected training model in order to train that model, and to put the unlabeled target data back into the unlabeled data set; iterative training stops once the samples in the unlabeled data set are no longer updated, and the three trained models are all used as the final named entity recognition model, which performs named entity recognition on target text in the designated field.
  • the first training unit 10 includes:
  • a first training subunit, used to train the BiLSTM-CRF model on the public data set to obtain a pre-trained model;
  • a first sampling subunit, used to perform sampling with replacement on the designated-domain named entity training set to obtain three training data sets;
  • a second training subunit, used to train the pre-trained model separately on the three training data sets to obtain three training models.
  • The second training subunit is specifically used for:
  • training the pre-trained model on each of the three training data sets to obtain three training models.
  • the training device for the named entity recognition model further includes:
  • a second prediction unit, configured to, when a named entity recognition instruction for text to be recognized is received, input the text into any one of the named entity recognition models for prediction and obtain the named entity recognition result of the text, where the result consists of the labels of the characters in the text;
  • an adding unit, configured to add the text to be recognized to the unlabeled data set and, after attaching the named entity recognition result to it, add it to the designated-domain named entity training set.
  • the first training unit 10 includes:
  • a selection subunit, used to randomly select target public data from the public data set;
  • a classification subunit, used to divide the designated-domain named entity training set into a designated training set and a designated test set;
  • a third training subunit, used to combine the target public data and the designated training set into a model training set, and to input the model training set into the BiLSTM-CRF model for training to obtain a pre-trained model;
  • a test subunit, used to input the designated test set into the trained pre-trained model for testing and to obtain the probability that the labels predicted for the designated test set are correct;
  • a judging unit, configured to judge whether the correct probability is greater than a preset probability and, if so, to combine the target public data and the designated-domain named entity training set into a target training set;
  • a second sampling subunit, used to perform sampling with replacement on the target training set to obtain three training data sets;
  • a fourth training subunit, used to train the pre-trained model separately on the three training data sets to obtain three training models.
  • an embodiment of the present application also provides a computer device.
  • The computer device may be a server; its internal structure may be as shown in FIG. 3.
  • The computer device includes a processor, a memory, a network interface, and a database connected through a system bus, where the processor of the computer device provides computation and control capabilities.
  • The memory of the computer device includes a non-volatile storage medium and an internal memory.
  • The non-volatile storage medium stores an operating system, a computer program, and a database.
  • The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium.
  • The database of the computer device is used to store training data and the like.
  • The network interface of the computer device is used to communicate with an external terminal through a network connection.
  • when the computer program is executed by the processor, a training method for a named entity recognition model is implemented, including the following steps:
  • upon receiving a request to train the named entity recognition model, obtaining the designated domain of the target text that the named entity recognition model is to recognize, and obtaining a designated-domain named entity training set according to the designated domain;
  • training a BiLSTM-CRF model based on preset training samples to obtain three training models, where the preset training samples include two labeled data sets: a public data set and the designated-domain named entity training set;
  • iteratively selecting any two of the three training models at random, and in turn selecting one unlabeled target datum from an unlabeled data set and inputting it into the two selected training models for prediction, to obtain the predicted labels output by the two training models;
  • if the predicted labels of the two training models are the same, adding the predicted label to the unlabeled target datum and updating it into the training samples of the unselected training model so as to train the unselected training model; and putting the unlabeled target datum back into the unlabeled data set; when the samples in the unlabeled data set are no longer updated, stopping the iterative training and taking the three trained models as the final named entity recognition models, where the named entity recognition models are used to perform named entity recognition on the target text in the designated domain.
  • FIG. 3 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed by a processor, a training method for a named entity recognition model is implemented, including the following steps:
  • upon receiving a request to train the named entity recognition model, obtaining the designated domain of the target text that the named entity recognition model is to recognize, and obtaining a designated-domain named entity training set according to the designated domain;
  • training a BiLSTM-CRF model based on preset training samples to obtain three training models, where the preset training samples include two labeled data sets: a public data set and the designated-domain named entity training set;
  • iteratively selecting any two of the three training models at random, and in turn selecting one unlabeled target datum from an unlabeled data set and inputting it into the two selected training models for prediction, to obtain the predicted labels output by the two training models;
  • if the predicted labels of the two training models are the same, adding the predicted label to the unlabeled target datum and updating it into the training samples of the unselected training model so as to train the unselected training model; and putting the unlabeled target datum back into the unlabeled data set; when the samples in the unlabeled data set are no longer updated, stopping the iterative training and taking the three trained models as the final named entity recognition models, where the named entity recognition models are used to perform named entity recognition on the target text in the designated domain.
  • the computer-readable storage medium in this embodiment may be a volatile readable storage medium or a non-volatile readable storage medium.
  • the training method, apparatus, computer device, and storage medium for the named entity recognition model include: training the BiLSTM-CRF model based on preset training samples to obtain three training models, where
  • the preset training samples include two labeled data sets: a public data set and the designated-domain named entity training set; iteratively selecting any two of the three training models at random, and in turn selecting one unlabeled target datum from the unlabeled data set and inputting it into the two selected training models for prediction, to obtain the predicted labels output by the two training models; if the predicted labels of the two training models are the same, adding the predicted label to the unlabeled target datum and updating it into the training samples of the unselected training model to train that model; and putting the unlabeled target datum back into the unlabeled data set; when the samples in the unlabeled data set are no longer updated, stopping the iterative training and obtaining the three trained models as the final named entity recognition models.
  • the semi-supervised method replaces the original approach of training only with labeled data, making full use of the existing data and overcoming the shortage of labeled data; the voting-consistency principle across the three learning models expresses confidence implicitly, reduces the time required for frequent cross-validation, increases the reliability of the models, improves the training effect, yields better named entity recognition on resume text, and improves generalization.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)

Abstract

This application relates to the field of artificial intelligence and provides a training method, apparatus, computer device, and storage medium for a named entity recognition model, including: training a BiLSTM-CRF model based on preset training samples to obtain three training models; iteratively selecting any two of the three training models at random, and in turn selecting one unlabeled target datum from an unlabeled data set and inputting it into the two selected training models for prediction, to obtain the predicted labels output by the two training models; if the predicted labels of the two training models are the same, adding the predicted label to the unlabeled target datum and updating it into the training samples of the unselected training model, so as to train the unselected training model. A semi-supervised method is adopted to make full use of the existing data and overcome the shortage of labeled data; the voting-consistency principle across the three learning models expresses confidence implicitly and increases the reliability of the models.

Description

Training method, apparatus, and computer device for a named entity recognition model
This application claims priority to Chinese patent application No. 202010357577.3, filed with the China National Intellectual Property Administration on April 29, 2020 and entitled "Training method, apparatus, and computer device for a named entity recognition model", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of artificial intelligence, and in particular to a training method, apparatus, computer device, and storage medium for a named entity recognition model.
Background
In scenarios where electronic resume text is recognized, named entities such as school names and place names usually need to be identified in the resume text. The named entity recognition (NER) task mainly identifies and classifies proper names such as person names, place names, and organization names appearing in a text; it underlies many natural language processing tasks such as information extraction, information retrieval, and question answering. The named entity recognition model currently in use is typically a BiLSTM-CRF model.
Currently, higher accuracy is demanded for recognizing named entities in resume text, yet the inventors realized that current BiLSTM-CRF models are usually based on general-purpose corpora, so their recognition accuracy on resume text is limited; moreover, training is mostly supervised, and labeling is time-consuming with only a limited amount of data available.
Technical Problem
The main purpose of this application is to provide a training method, apparatus, computer device, and storage medium for a named entity recognition model, aiming to overcome the low accuracy of named entity recognition models and the shortage of labeled data during model training.
Technical Solution
To achieve the above purpose, this application provides a training method for a named entity recognition model, including the following steps:
upon receiving a request to train the named entity recognition model, obtaining the designated domain of the target text that the named entity recognition model is to recognize, and obtaining a designated-domain named entity training set according to the designated domain;
training a BiLSTM-CRF model based on preset training samples to obtain three training models, where the preset training samples include two labeled data sets: a public data set and the designated-domain named entity training set;
iteratively selecting any two of the three training models at random, and in turn selecting one unlabeled target datum from an unlabeled data set and inputting it into the two selected training models for prediction, to obtain the predicted labels output by the two training models;
if the predicted labels of the two training models are the same, adding the predicted label to the unlabeled target datum and updating it into the training samples of the unselected training model so as to train the unselected training model; and putting the unlabeled target datum back into the unlabeled data set; when the samples in the unlabeled data set are no longer updated, stopping the iterative training and taking the three trained models as the final named entity recognition models, where the named entity recognition models are used to perform named entity recognition on the target text in the designated domain.
This application also provides a training apparatus for a named entity recognition model, including:
a request obtaining unit, configured to, upon receiving a request to train the named entity recognition model, obtain the designated domain of the target text that the named entity recognition model is to recognize, and obtain a designated-domain named entity training set according to the designated domain;
a first training unit, configured to train a BiLSTM-CRF model based on preset training samples to obtain three training models, where the preset training samples include two labeled data sets: a public data set and the designated-domain named entity training set;
a first prediction unit, configured to iteratively select any two of the three training models at random, and in turn select one unlabeled target datum from an unlabeled data set and input it into the two selected training models for prediction, to obtain the predicted labels output by the two training models;
a second training unit, configured to, if the predicted labels of the two training models are the same, add the predicted label to the unlabeled target datum and update it into the training samples of the unselected training model so as to train the unselected training model; and put the unlabeled target datum back into the unlabeled data set; when the samples in the unlabeled data set are no longer updated, stop the iterative training and take the three trained models as the final named entity recognition models, where the named entity recognition models are used to perform named entity recognition on the target text in the designated domain.
This application also provides a computer device, including a memory and a processor, the memory storing a computer program, where the processor, when executing the computer program, implements a training method for a named entity recognition model, including the following steps:
upon receiving a request to train the named entity recognition model, obtaining the designated domain of the target text that the named entity recognition model is to recognize, and obtaining a designated-domain named entity training set according to the designated domain;
training a BiLSTM-CRF model based on preset training samples to obtain three training models, where the preset training samples include two labeled data sets: a public data set and the designated-domain named entity training set;
iteratively selecting any two of the three training models at random, and in turn selecting one unlabeled target datum from an unlabeled data set and inputting it into the two selected training models for prediction, to obtain the predicted labels output by the two training models;
if the predicted labels of the two training models are the same, adding the predicted label to the unlabeled target datum and updating it into the training samples of the unselected training model so as to train the unselected training model; and putting the unlabeled target datum back into the unlabeled data set; when the samples in the unlabeled data set are no longer updated, stopping the iterative training and taking the three trained models as the final named entity recognition models, where the named entity recognition models are used to perform named entity recognition on the target text in the designated domain.
This application also provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements a training method for a named entity recognition model, including the following steps:
upon receiving a request to train the named entity recognition model, obtaining the designated domain of the target text that the named entity recognition model is to recognize, and obtaining a designated-domain named entity training set according to the designated domain;
training a BiLSTM-CRF model based on preset training samples to obtain three training models, where the preset training samples include two labeled data sets: a public data set and the designated-domain named entity training set;
iteratively selecting any two of the three training models at random, and in turn selecting one unlabeled target datum from an unlabeled data set and inputting it into the two selected training models for prediction, to obtain the predicted labels output by the two training models;
if the predicted labels of the two training models are the same, adding the predicted label to the unlabeled target datum and updating it into the training samples of the unselected training model so as to train the unselected training model; and putting the unlabeled target datum back into the unlabeled data set; when the samples in the unlabeled data set are no longer updated, stopping the iterative training and taking the three trained models as the final named entity recognition models, where the named entity recognition models are used to perform named entity recognition on the target text in the designated domain.
Beneficial Effects
The training method, apparatus, computer device, and storage medium for a named entity recognition model provided by this application include: training a BiLSTM-CRF model based on preset training samples to obtain three training models; iteratively selecting any two of the three training models at random, and in turn selecting one unlabeled target datum from an unlabeled data set and inputting it into the two selected training models for prediction, to obtain the predicted labels output by the two training models; if the predicted labels of the two training models are the same, adding the predicted label to the unlabeled target datum and updating it into the training samples of the unselected training model so as to train the unselected training model. The semi-supervised method replaces the original approach of training only with labeled data, making full use of the existing data and overcoming the shortage of labeled data; the voting-consistency principle across the three learning models expresses confidence implicitly, reduces the time required for frequent cross-validation, increases the reliability of the models, improves the training effect, yields better named entity recognition on resume text, and improves generalization.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the steps of a training method for a named entity recognition model in an embodiment of this application;
FIG. 2 is a structural block diagram of a training apparatus for a named entity recognition model in an embodiment of this application;
FIG. 3 is a schematic structural block diagram of a computer device in an embodiment of this application.
Best Mode for Carrying Out the Invention
Referring to FIG. 1, an embodiment of this application provides a training method for a named entity recognition model, including the following steps:
Step S01: upon receiving a request to train the named entity recognition model, obtain the designated domain of the target text that the named entity recognition model is to recognize, and obtain a designated-domain named entity training set according to the designated domain.
Step S1: train a BiLSTM-CRF model based on preset training samples to obtain three training models, where the preset training samples include two labeled data sets: a public data set and the designated-domain named entity training set.
Step S2: iteratively select any two of the three training models at random, and in turn select one unlabeled target datum from an unlabeled data set and input it into the two selected training models for prediction, to obtain the predicted labels output by the two training models.
Step S3: if the predicted labels of the two training models are the same, add the predicted label to the unlabeled target datum and update it into the training samples of the unselected training model so as to train the unselected training model; and put the unlabeled target datum back into the unlabeled data set; when the samples in the unlabeled data set are no longer updated, stop the iterative training and take the three trained models as the final named entity recognition models, where the named entity recognition models are used to perform named entity recognition on the target text in the designated domain.
In this embodiment, the named entity recognition model trained by the above method is used to automatically recognize, in batch, named entities such as school names and place names from resume text (i.e., the designated domain above) so as to generate structured data. During resume content recognition, school names, place names, and the like in the resume text usually need to be recognized, which requires named entity recognition (NER) technology; the above named entity recognition model is built to meet this need.
Currently, the named entity recognition model is usually a BiLSTM-CRF model, and its training samples are usually drawn from the large Chinese named entity data sets publicly available on the internet (i.e., the public data set above, an existing resource with a large amount of data); the public data set is a labeled data set.
In this embodiment, the application scenario of the model is named entity recognition on resume text, so the model must be trained on a data set from this scenario before it can be used for the task. If the designated-domain named entity training set were used directly, its small size would cause overfitting.
Therefore, as described in step S01, before performing named entity recognition on target text in the designated domain, the corresponding named entity recognition model must be trained. The user may trigger a request instruction to train the model, and when the terminal receives the request to train the named entity recognition model, in order to train the model better and improve its recognition accuracy, it needs to determine which domain's target text the model will be used to recognize, so that a training set for that domain can be obtained for training. Specifically, when the training request is received, the designated domain of the target texts to be recognized is obtained; for example, if the designated domain is the resume text domain, the entity training set of the corresponding resume texts is obtained according to the resume text domain for the subsequent training process.
As described in step S1, since the application scenario of this model is named entity recognition on resume text, the model must be trained on a named entity data set from the resume text domain before it can be used for the task. Using that named entity data set directly would cause overfitting because of its small size. Therefore, this solution first pre-trains the BiLSTM-CRF model on the public data set to obtain a pre-trained model M0, initializing its neural network parameters, and then trains it on the designated-domain named entity data set; this method effectively improves the robustness of the algorithm. The training samples use the designated-domain named entity training set (small, but consisting of domain-specific vocabulary and therefore highly targeted); the designated-domain named entity training set refers to the training corpus in the designated domain, and it is also a labeled data set. Specifically, in one embodiment, after the pre-trained model M0 is obtained on the public data set, the designated-domain named entity training set is divided into three training data sets, and the pre-trained model M0 is trained separately on each set, yielding one training model per set; the three training models are all derived from the pre-trained model M0 and differ only in the training data sets used, so the final training models also differ.
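The division of the designated-domain training set into three training data sets by sampling with replacement, as described above, can be sketched as follows; the corpus entries and the seed are illustrative, not taken from the patent's data sets:

```python
import random

def bootstrap_splits(train_set, n_models=3, seed=42):
    """Draw n_models samples with replacement, one per model; each sample
    has the same size as the original designated-domain training set."""
    rng = random.Random(seed)
    return [[rng.choice(train_set) for _ in train_set] for _ in range(n_models)]

# Hypothetical labeled sentences: (token sequence, BIOES tag sequence) pairs.
corpus = [("张三 毕业 于 北京 大学", "S-PER O O B-ORG E-ORG"),
          ("李四 住 在 上海", "S-PER O O S-LOC"),
          ("王五 就职 于 平安 科技", "S-PER O O B-ORG E-ORG")]

splits = bootstrap_splits(corpus)
assert len(splits) == 3 and all(len(s) == len(corpus) for s in splits)
```

Because each draw is with replacement, the three resulting sets overlap but are not identical, which is what makes the three models trained from M0 diverge.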
As described in step S2, training with labeled data as above is a supervised method, which is very time-consuming, and the amount of data is usually quite limited. Therefore, to make full use of the existing data, this embodiment further adopts a semi-supervised training method (tri-training): in addition to the labeled data sets above, an unlabeled data set is also used, which not only increases the amount of training data but also increases the reliability of the models.
Specifically, based on the three models trained in step S1, in each round of semi-supervised training, any two of the three models are selected at random, and one unlabeled target datum is selected in turn from the unlabeled data set for label prediction; that is, the two randomly selected models predict the labels of the same unlabeled target datum. If the labels predicted by the two models are the same, the confidence of the two models can be considered high; otherwise, the confidence is low. It should be understood that the predicted label is not a single label but a group of labels corresponding to the unlabeled datum; the number of labels in the group depends on the number of tokens in that datum.
The unlabeled data set is annotated with the BIOES tagging scheme, and in different application scenarios the tags for the same token differ. For example, in some scenarios a token is the beginning of a place name and is tagged with the place-name B tag, while the end of the place name is tagged with the place-name E tag; in "北京" (Beijing), "北" is tagged B and "京" is tagged E. In other scenarios, "北" as part of the personal name "顾北" may be tagged with the person-name E tag; that is, the same token receives different tags in different contexts.
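As a small illustration of the BIOES convention described above (the `-LOC` and `-PER` type suffixes are a common convention assumed here, not spelled out in the text): B marks the beginning of a multi-character entity, I its inside, E its end, S a single-character entity, and O a non-entity character.

```python
# Character-level BIOES tags for the two examples discussed above.
# "北京" as a place name: 北 -> B-LOC, 京 -> E-LOC.
place = list(zip("北京", ["B-LOC", "E-LOC"]))
# The same character "北" inside the personal name "顾北": 北 -> E-PER.
name = list(zip("顾北", ["B-PER", "E-PER"]))

assert place[0] == ("北", "B-LOC")
assert name[1] == ("北", "E-PER")
```

The point of the example is that the label of a character is context-dependent, which is why a sequence model (BiLSTM-CRF) is used rather than a per-character dictionary.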
Therefore, as described in step S3, if the two training models predict the same labels for the same unlabeled target datum, the predicted labels can be added to the unlabeled target datum, which is then updated into the training samples of the unselected training model to iteratively train that remaining model. Meanwhile, the unlabeled target datum is put back into the unlabeled data set. In this embodiment, whether unlabeled data are added to the training samples is determined from the models' prediction results. In the prior art, when unlabeled data are used to train a model, whether an unlabeled datum is added to the training samples is usually decided by whether the model's predicted probability for it reaches a threshold. This embodiment differs clearly from such probability-threshold approaches: this application uses the voting-consistency principle across multiple models to express confidence implicitly, which increases the reliability of the models, improves the training effect, and yields more accurate recognition.
If the labels predicted by the two selected training models differ, the confidence of the two selected training models is low and further training is needed, so the unlabeled target datum cannot be given the predicted labels and added to the training samples.
Repeating step S3 in turn completes the retraining of the three training models.
Repeating steps S2 and S3 in turn, i.e., changing the two selected models, until the samples recognized from the unlabeled data set no longer change, the iterative training stops, indicating that model training is complete and the final named entity recognition models are obtained.
In this embodiment, after all unlabeled data have been predicted in each round, the training data sets of the corresponding models are updated, and the next round begins; the above steps repeat until the training sets of all models are no longer updated. In this way, unlabeled data can be used effectively to increase the amount of training data and improve the generalization of the models.
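The loop of steps S2-S3 can be sketched as below. This is a minimal sketch, not the patent's implementation: `_EchoModel` is a hypothetical stand-in for a trained BiLSTM-CRF with `predict`/`train` methods, and the stopping condition mirrors the "training sets no longer updated" criterion above.

```python
import random

class _EchoModel:
    """Hypothetical stand-in for a trained BiLSTM-CRF (illustration only)."""
    def predict(self, sample):
        return ["O"] * len(sample)      # a real model would tag each character
    def train(self, labeled_set):
        pass                            # a real model would run gradient updates

def tri_train(models, labeled_sets, unlabeled, max_rounds=50):
    """When two randomly chosen models agree on an unlabeled sample, the
    agreed labels extend the third model's training set; the sample stays
    in the pool; stop once a full round changes no training set."""
    for _ in range(max_rounds):
        updated = False
        for sample in unlabeled:                   # sample stays in the pool
            i, j = random.sample(range(3), 2)      # two models vote
            k = 3 - i - j                          # the model left out
            lab = models[i].predict(sample)
            if lab == models[j].predict(sample):   # voting consistency
                if (sample, tuple(lab)) not in labeled_sets[k]:
                    labeled_sets[k].add((sample, tuple(lab)))
                    models[k].train(labeled_sets[k])
                    updated = True
        if not updated:                            # no set updated: done
            return models
    return models

models = [_EchoModel() for _ in range(3)]
labeled_sets = [set(), set(), set()]
tri_train(models, labeled_sets, ["北京大学", "平安科技"])
assert any(labeled_sets)   # at least one training set gained pseudo-labels
```

A real implementation would retrain from the model's current parameters rather than from scratch, and would compare whole label sequences as here rather than single tags.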
In this embodiment, on the basis of the original BiLSTM-CRF, domain-specific vocabulary features are first added, which makes word segmentation more accurate in the professional domain and thus improves the accuracy of named entity recognition. The neural network algorithm is combined with the semi-supervised training method, i.e., tri-training is applied to CRF and BiLSTM-CRF to complete the NER task. The semi-supervised method replaces the original approach of training only with labeled data, making full use of the existing data and overcoming the current shortage of labeled data; the voting-consistency principle across the three learning models expresses confidence implicitly, reduces the time required for frequent cross-validation, increases the reliability of the models, improves the training effect, yields better named entity recognition on resume text, and improves generalization. Meanwhile, in practical applications such as resume recognition, the trained model can be iteratively trained on concrete resume texts to update the model automatically.
In this embodiment, in the construction of smart cities, in order to strengthen the efficient transmission and expression of information, the above solution can also be applied to smart office scenarios, promoting the construction of smart cities.
In one embodiment, step S1 of training the BiLSTM-CRF model based on preset training samples to obtain three training models includes:
Step S11: train the BiLSTM-CRF model on the public data set to obtain a pre-trained model;
Step S12: perform sampling with replacement on the designated-domain named entity training set to obtain three training data sets;
Step S13: train the pre-trained model separately on each of the three training data sets to obtain three training models.
Specifically, step S13 of training the pre-trained model separately on the three training data sets to obtain three training models includes:
retaining the model parameters of the BiLSTM in the pre-trained model, and initializing the model parameters of the CRF layer in the pre-trained model, to obtain a preprocessed training model;
training the preprocessed training model separately on each of the three training data sets, to obtain the three training models.
In this embodiment, the difference between the two labeled data sets is that the designated-domain named entity training set consists of named entities annotated specifically for the current task, while the public data set comes from large publicly available Chinese annotated named entity data sets; the designated domain is the domain of the current named entity recognition task. For example, in this embodiment the designated domain may be the resume text domain.
As described in step S11, the BiLSTM-CRF model is trained on the public data set to obtain the pre-trained model M0, initializing the neural network parameters in the model; then, as described in step S13, the three training data sets obtained by sampling the designated-domain named entity training set with replacement are used for training on top of the pre-trained model M0. In this embodiment, the training samples consist of the public data set and the designated-domain named entity training set, which ensures both a high conventional named entity recognition rate for the three trained models and good named entity recognition in the specific domain.
In this embodiment, the BiLSTM-CRF model is first trained on the public data set to obtain M0; after pre-training, the M0 model has optimized parameters and a certain prediction capability. On top of this M0 model, the CRF layer is replaced (i.e., initialized) and the model is retrained on the designated-domain named entity training set, further optimizing the parameters so that the trained model can be used for the current task. In this embodiment, the large public data set is used only to better initialize the neural network part of the model (i.e., the BiLSTM parameters) and improve robustness; the CRF parameters need not be carried over. The model is ultimately used for resume named entity recognition, so it must be trained on the annotated data set of that domain; the CRF layer will be retrained and therefore must be initialized. The initialization process keeps only the pre-trained BiLSTM parameters and resets the CRF parameters; the three training data sets obtained by sampling with replacement are then used to train the model separately, yielding the three training models M1, M2, and M3. Compared with using only the data set of the corresponding task or only the public data set, the training scheme of this solution gives the models higher generalization capability.
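The initialization described above, keeping the pre-trained BiLSTM parameters while resetting the CRF layer, can be sketched framework-agnostically on a parameter dictionary; the key prefixes `bilstm.` and `crf.` and the scalar values are illustrative assumptions:

```python
def reset_crf(pretrained_params, crf_init=0.0):
    """Keep BiLSTM parameters from the pre-trained model M0 and
    re-initialize every CRF-layer parameter before domain training."""
    return {name: (value if name.startswith("bilstm.") else crf_init)
            for name, value in pretrained_params.items()}

m0 = {"bilstm.weight_ih": 0.37, "bilstm.weight_hh": -0.12, "crf.transitions": 0.85}
m_init = reset_crf(m0)
assert m_init["bilstm.weight_ih"] == 0.37   # BiLSTM parameters kept
assert m_init["crf.transitions"] == 0.0     # CRF parameters reset
```

In a deep learning framework the same effect would be obtained by loading only the BiLSTM submodule's state from M0 and letting the CRF layer keep its fresh random initialization.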
In one embodiment, after step S3 of stopping the iterative training when the samples in the unlabeled data set are no longer updated and taking the three trained models as the final named entity recognition models, the method includes:
Step S4: upon receiving a named entity recognition instruction for a to-be-recognized text, input the to-be-recognized text into any one of the named entity recognition models for prediction, to obtain a named entity recognition result for the to-be-recognized text, where the named entity recognition result consists of the labels of the characters in the to-be-recognized text; the three named entity recognition models obtained by the above training can all be used to recognize named entities in the to-be-recognized text.
Step S5: add the to-be-recognized text to the unlabeled data set, and, after appending the named entity recognition result to the to-be-recognized text, update it into the designated-domain named entity training set. To keep updating the named entity recognition models, i.e., to train them iteratively, the to-be-recognized texts can continue to be used as training samples to keep optimizing the models; this process requires no manual labeling, reduces workload, and keeps increasing the amount of training data.
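Steps S4-S5 above, predicting on a new text and then feeding it back into both the unlabeled pool and the labeled domain set, might look like the sketch below; the `_ConstModel` stand-in and its `predict` interface are assumptions for illustration:

```python
class _ConstModel:
    """Hypothetical trained NER model: tags every character "O" (illustration)."""
    def predict(self, text):
        return ["O"] * len(text)

def recognize_and_update(model, text, unlabeled_pool, domain_train_set):
    """Predict per-character labels for a to-be-recognized text, then reuse
    the text as new training data with no manual labeling: the raw text
    joins the unlabeled pool, and the (text, labels) pair joins the
    designated-domain named entity training set."""
    labels = model.predict(text)
    unlabeled_pool.append(text)
    domain_train_set.append((text, labels))
    return labels

pool, train_set = [], []
out = recognize_and_update(_ConstModel(), "顾北", pool, train_set)
assert pool == ["顾北"] and train_set == [("顾北", ["O", "O"])]
```

Each recognition request thus grows both sample pools, which is what makes the iterative retraining described above possible without extra annotation work.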
In yet another embodiment, step S1 of training the BiLSTM-CRF model based on preset training samples to obtain three training models includes:
Step S1a: randomly select target public data from the public data set. In this embodiment, since the amount of data in the designated-domain named entity training set is limited, some high-quality data can be selected from the public data set for training, to improve the recognition accuracy of the named entity recognition model. Therefore, target public data are randomly selected from the public data set; an agent model may be used for this selection, and when selecting data, the agent model automatically optimizes its choices according to the final model's output, so that the quality of the selected data keeps improving.
Step S1b: divide the designated-domain named entity training set into a designated training set and a designated test set.
Step S1c: combine the target public data and the designated training set into a model training set, and input the model training set into the BiLSTM-CRF model for training, to obtain a pre-trained model. In this embodiment, if only the designated training set were used to train the BiLSTM-CRF model, its accuracy would be highest, but the amount of data would be small and generalization poor; the quality of the target public data is lower than that of the designated training set, so training on both together affects the model's accuracy, but the better the quality of the target public data, the smaller the impact. The quality of the pre-trained model is therefore related to the quality of the target public data.
Step S1d: input the designated test set into the trained pre-trained model for testing, to obtain the probability that the predicted labels of the designated test set are the correct labels.
Step S1e: judge whether this probability is greater than a preset probability; if so, combine the target public data and the designated-domain named entity training set into a target training set. In this embodiment, the designated test set is used to test the pre-trained model. If the probability that the predicted labels of the designated test set are correct is greater than the preset probability, the prediction capability of the pre-trained model is little affected, i.e., the quality of the target public data is high, so the target public data can also be included in the target training set used to train the pre-trained model subsequently. If the probability is less than the preset probability, the prediction capability of the pre-trained model is significantly affected, i.e., the quality of the target public data is low, and another portion of target public data must be randomly selected from the public data set.
Step S1f: perform sampling with replacement on the target training set to obtain three training data sets.
Step S1g: train the pre-trained model separately on each of the three training data sets to obtain three training models. The specific implementations of steps S1f and S1g are the same as those of steps S12 and S13 and are not repeated here.
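The quality gate of steps S1d-S1e, accepting the sampled public data only when the test-set label accuracy exceeds a preset probability, can be sketched as follows; the `_Oracle` stand-in, its `predict` interface, and the 0.9 threshold are illustrative assumptions:

```python
def passes_quality_gate(model, test_set, preset_probability=0.9):
    """Compute the fraction of designated-test-set samples whose predicted
    labels equal the correct labels; the sampled target public data are
    kept only when this fraction exceeds the preset probability."""
    correct = sum(model.predict(text) == gold for text, gold in test_set)
    return correct / len(test_set) > preset_probability

class _Oracle:
    """Hypothetical pre-trained model that memorized some label sequences."""
    def __init__(self, answers):
        self.answers = answers
    def predict(self, text):
        return self.answers.get(text, [])

test_set = [("北京", ["B-LOC", "E-LOC"]), ("上海", ["B-LOC", "E-LOC"])]
good = _Oracle(dict(test_set))
assert passes_quality_gate(good, test_set) is True    # data kept
bad = _Oracle({})
assert passes_quality_gate(bad, test_set) is False    # resample public data
```

When the gate fails, the caller would loop back to step S1a and draw a different batch of target public data, as described above.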
Referring to FIG. 2, an embodiment of this application also provides a training apparatus for a named entity recognition model, including:
a request obtaining unit 100, configured to, upon receiving a request to train the named entity recognition model, obtain the designated domain of the target text that the named entity recognition model is to recognize, and obtain a designated-domain named entity training set according to the designated domain;
a first training unit 10, configured to train a BiLSTM-CRF model based on preset training samples to obtain three training models, where the preset training samples include two labeled data sets: a public data set and the designated-domain named entity training set;
a first prediction unit 20, configured to iteratively select any two of the three training models at random, and in turn select one unlabeled target datum from an unlabeled data set and input it into the two selected training models for prediction, to obtain the predicted labels output by the two training models;
a second training unit 30, configured to, if the predicted labels of the two training models are the same, add the predicted label to the unlabeled target datum and update it into the training samples of the unselected training model so as to train the unselected training model; and put the unlabeled target datum back into the unlabeled data set; when the samples in the unlabeled data set are no longer updated, stop the iterative training and take the three trained models as the final named entity recognition models, where the named entity recognition models are used to perform named entity recognition on the target text in the designated domain.
In one embodiment, the first training unit 10 includes:
a first training subunit, configured to train the BiLSTM-CRF model on the public data set to obtain a pre-trained model;
a first sampling subunit, configured to perform sampling with replacement on the designated-domain named entity training set to obtain three training data sets;
a second training subunit, configured to train the pre-trained model separately on each of the three training data sets to obtain three training models.
In one embodiment, the second training subunit is specifically configured to:
retain the model parameters of the BiLSTM in the pre-trained model, and initialize the model parameters of the CRF layer in the pre-trained model, to obtain a preprocessed training model;
train the preprocessed training model separately on each of the three training data sets, to obtain the three training models.
In one embodiment, the training apparatus for the named entity recognition model further includes:
a second prediction unit, configured to, upon receiving a named entity recognition instruction for a to-be-recognized text, input the to-be-recognized text into any one of the named entity recognition models for prediction, to obtain a named entity recognition result for the to-be-recognized text, where the named entity recognition result consists of the labels of the characters in the to-be-recognized text;
an adding unit, configured to add the to-be-recognized text to the unlabeled data set, and, after appending the named entity recognition result to the to-be-recognized text, update it into the designated-domain named entity training set.
In one embodiment, the first training unit 10 includes:
a selection subunit, configured to randomly select target public data from the public data set;
a classification subunit, configured to divide the designated-domain named entity training set into a designated training set and a designated test set;
a third training subunit, configured to combine the target public data and the designated training set into a model training set, and input the model training set into the BiLSTM-CRF model for training, to obtain a pre-trained model;
a test subunit, configured to input the designated test set into the trained pre-trained model for testing, to obtain the probability that the predicted labels of the designated test set are the correct labels;
a judging unit, configured to judge whether this probability is greater than a preset probability and, if so, combine the target public data and the designated-domain named entity training set into a target training set;
a second sampling subunit, configured to perform sampling with replacement on the target training set to obtain three training data sets;
a fourth training subunit, configured to train the pre-trained model separately on each of the three training data sets to obtain three training models.
In this embodiment, for the specific implementation of each unit and subunit in the above apparatus embodiment, refer to the specific implementation in the above method embodiment, which is not repeated here.
Referring to FIG. 3, an embodiment of this application also provides a computer device, which may be a server whose internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus, where the processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store training data and the like. The network interface of the computer device is used to communicate with an external terminal through a network connection. When executed by the processor, the computer program implements a training method for a named entity recognition model, including the following steps:
upon receiving a request to train the named entity recognition model, obtaining the designated domain of the target text that the named entity recognition model is to recognize, and obtaining a designated-domain named entity training set according to the designated domain;
training a BiLSTM-CRF model based on preset training samples to obtain three training models, where the preset training samples include two labeled data sets: a public data set and the designated-domain named entity training set;
iteratively selecting any two of the three training models at random, and in turn selecting one unlabeled target datum from an unlabeled data set and inputting it into the two selected training models for prediction, to obtain the predicted labels output by the two training models;
if the predicted labels of the two training models are the same, adding the predicted label to the unlabeled target datum and updating it into the training samples of the unselected training model so as to train the unselected training model; and putting the unlabeled target datum back into the unlabeled data set; when the samples in the unlabeled data set are no longer updated, stopping the iterative training and taking the three trained models as the final named entity recognition models, where the named entity recognition models are used to perform named entity recognition on the target text in the designated domain.
Those skilled in the art will understand that the structure shown in FIG. 3 is only a block diagram of part of the structure related to the solution of this application and does not limit the computer device to which the solution of this application is applied.
An embodiment of this application also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements a training method for a named entity recognition model, including the following steps:
upon receiving a request to train the named entity recognition model, obtaining the designated domain of the target text that the named entity recognition model is to recognize, and obtaining a designated-domain named entity training set according to the designated domain;
training a BiLSTM-CRF model based on preset training samples to obtain three training models, where the preset training samples include two labeled data sets: a public data set and the designated-domain named entity training set;
iteratively selecting any two of the three training models at random, and in turn selecting one unlabeled target datum from an unlabeled data set and inputting it into the two selected training models for prediction, to obtain the predicted labels output by the two training models;
if the predicted labels of the two training models are the same, adding the predicted label to the unlabeled target datum and updating it into the training samples of the unselected training model so as to train the unselected training model; and putting the unlabeled target datum back into the unlabeled data set; when the samples in the unlabeled data set are no longer updated, stopping the iterative training and taking the three trained models as the final named entity recognition models, where the named entity recognition models are used to perform named entity recognition on the target text in the designated domain.
It should be understood that the computer-readable storage medium in this embodiment may be a volatile readable storage medium or a non-volatile readable storage medium.
In summary, the training method, apparatus, computer device, and storage medium for a named entity recognition model provided in the embodiments of this application include: training a BiLSTM-CRF model based on preset training samples to obtain three training models, where the preset training samples include two labeled data sets: a public data set and the designated-domain named entity training set; iteratively selecting any two of the three training models at random, and in turn selecting one unlabeled target datum from an unlabeled data set and inputting it into the two selected training models for prediction, to obtain the predicted labels output by the two training models; if the predicted labels of the two training models are the same, adding the predicted label to the unlabeled target datum and updating it into the training samples of the unselected training model so as to train the unselected training model; and putting the unlabeled target datum back into the unlabeled data set; when the samples in the unlabeled data set are no longer updated, stopping the iterative training and taking the three trained models as the final named entity recognition models. The semi-supervised method replaces the original approach of training only with labeled data, making full use of the existing data and overcoming the shortage of labeled data; the voting-consistency principle across the three learning models expresses confidence implicitly, reduces the time required for frequent cross-validation, increases the reliability of the models, improves the training effect, yields better named entity recognition on resume text, and improves generalization capability.
The above are only preferred embodiments of this application and do not thereby limit the patent scope of this application; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

  1. A training method for a named entity recognition model, comprising the following steps:
    upon receiving a request to train the named entity recognition model, obtaining the designated domain of the target text that the named entity recognition model is to recognize, and obtaining a designated-domain named entity training set according to the designated domain;
    training a BiLSTM-CRF model based on preset training samples to obtain three training models, wherein the preset training samples comprise two labeled data sets: a public data set and the designated-domain named entity training set;
    iteratively selecting any two of the three training models at random, and in turn selecting one unlabeled target datum from an unlabeled data set and inputting it into the two selected training models for prediction, to obtain the predicted labels output by the two training models;
    if the predicted labels of the two training models are the same, adding the predicted label to the unlabeled target datum and updating it into the training samples of the unselected training model so as to train the unselected training model; and putting the unlabeled target datum back into the unlabeled data set; when the samples in the unlabeled data set are no longer updated, stopping the iterative training and taking the three trained models as the final named entity recognition models, wherein the named entity recognition models are used to perform named entity recognition on the target text in the designated domain.
  2. The training method for a named entity recognition model according to claim 1, wherein the step of training a BiLSTM-CRF model based on preset training samples to obtain three training models comprises:
    training the BiLSTM-CRF model on the public data set to obtain a pre-trained model;
    performing sampling with replacement on the designated-domain named entity training set to obtain three training data sets;
    training the pre-trained model separately on each of the three training data sets to obtain three training models.
  3. The training method for a named entity recognition model according to claim 2, wherein the step of training the pre-trained model separately on each of the three training data sets to obtain three training models comprises:
    retaining the model parameters of the BiLSTM in the pre-trained model, and initializing the model parameters of the CRF layer in the pre-trained model, to obtain a preprocessed training model;
    training the preprocessed training model separately on each of the three training data sets, to obtain the three training models.
  4. The training method for a named entity recognition model according to claim 1, wherein, after the step of stopping the iterative training when the samples in the unlabeled data set are no longer updated and taking the three trained models as the final named entity recognition models, the method comprises:
    upon receiving a named entity recognition instruction for a to-be-recognized text, inputting the to-be-recognized text into any one of the named entity recognition models for prediction, to obtain a named entity recognition result for the to-be-recognized text, wherein the named entity recognition result consists of the labels of the characters in the to-be-recognized text;
    adding the to-be-recognized text to the unlabeled data set, and, after appending the named entity recognition result to the to-be-recognized text, updating it into the designated-domain named entity training set.
  5. The training method for a named entity recognition model according to claim 1, wherein the step of training a BiLSTM-CRF model based on preset training samples to obtain three training models comprises:
    randomly selecting target public data from the public data set;
    dividing the designated-domain named entity training set into a designated training set and a designated test set;
    combining the target public data and the designated training set into a model training set, and inputting the model training set into the BiLSTM-CRF model for training, to obtain a pre-trained model;
    inputting the designated test set into the trained pre-trained model for testing, to obtain the probability that the predicted labels of the designated test set are the correct labels;
    judging whether this probability is greater than a preset probability and, if so, combining the target public data and the designated-domain named entity training set into a target training set;
    performing sampling with replacement on the target training set to obtain three training data sets;
    training the pre-trained model separately on each of the three training data sets to obtain three training models.
  6. A training apparatus for a named entity recognition model, comprising:
    a request obtaining unit, configured to, upon receiving a request to train the named entity recognition model, obtain the designated domain of the target text that the named entity recognition model is to recognize, and obtain a designated-domain named entity training set according to the designated domain;
    a first training unit, configured to train a BiLSTM-CRF model based on preset training samples to obtain three training models, wherein the preset training samples comprise two labeled data sets: a public data set and the designated-domain named entity training set;
    a first prediction unit, configured to iteratively select any two of the three training models at random, and in turn select one unlabeled target datum from an unlabeled data set and input it into the two selected training models for prediction, to obtain the predicted labels output by the two training models;
    a second training unit, configured to, if the predicted labels of the two training models are the same, add the predicted label to the unlabeled target datum and update it into the training samples of the unselected training model so as to train the unselected training model; and put the unlabeled target datum back into the unlabeled data set; when the samples in the unlabeled data set are no longer updated, stop the iterative training and take the three trained models as the final named entity recognition models, wherein the named entity recognition models are used to perform named entity recognition on the target text in the designated domain.
  7. The training apparatus for a named entity recognition model according to claim 6, wherein the first training unit comprises:
    a first training subunit, configured to train the BiLSTM-CRF model on the public data set to obtain a pre-trained model;
    a first sampling subunit, configured to perform sampling with replacement on the designated-domain named entity training set to obtain three training data sets;
    a second training subunit, configured to train the pre-trained model separately on each of the three training data sets to obtain three training models.
  8. The training apparatus for a named entity recognition model according to claim 7, wherein the second training subunit is specifically configured to:
    retain the model parameters of the BiLSTM in the pre-trained model, and initialize the model parameters of the CRF layer in the pre-trained model, to obtain a preprocessed training model;
    train the preprocessed training model separately on each of the three training data sets, to obtain the three training models.
  9. The training apparatus for a named entity recognition model according to claim 6, further comprising:
    a second prediction unit, configured to, upon receiving a named entity recognition instruction for a to-be-recognized text, input the to-be-recognized text into any one of the named entity recognition models for prediction, to obtain a named entity recognition result for the to-be-recognized text, wherein the named entity recognition result consists of the labels of the characters in the to-be-recognized text;
    an adding unit, configured to add the to-be-recognized text to the unlabeled data set, and, after appending the named entity recognition result to the to-be-recognized text, update it into the designated-domain named entity training set.
  10. The training apparatus for a named entity recognition model according to claim 6, wherein the first training unit comprises:
    a selection subunit, configured to randomly select target public data from the public data set;
    a classification subunit, configured to divide the designated-domain named entity training set into a designated training set and a designated test set;
    a third training subunit, configured to combine the target public data and the designated training set into a model training set, and input the model training set into the BiLSTM-CRF model for training, to obtain a pre-trained model;
    a test subunit, configured to input the designated test set into the trained pre-trained model for testing, to obtain the probability that the predicted labels of the designated test set are the correct labels;
    a judging unit, configured to judge whether this probability is greater than a preset probability and, if so, combine the target public data and the designated-domain named entity training set into a target training set;
    a second sampling subunit, configured to perform sampling with replacement on the target training set to obtain three training data sets;
    a fourth training subunit, configured to train the pre-trained model separately on each of the three training data sets to obtain three training models.
  11. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements a training method for a named entity recognition model, comprising the following steps:
    upon receiving a request to train the named entity recognition model, obtaining the designated domain of the target text that the named entity recognition model is to recognize, and obtaining a designated-domain named entity training set according to the designated domain;
    training a BiLSTM-CRF model based on preset training samples to obtain three training models, wherein the preset training samples comprise two labeled data sets: a public data set and the designated-domain named entity training set;
    iteratively selecting any two of the three training models at random, and in turn selecting one unlabeled target datum from an unlabeled data set and inputting it into the two selected training models for prediction, to obtain the predicted labels output by the two training models;
    if the predicted labels of the two training models are the same, adding the predicted label to the unlabeled target datum and updating it into the training samples of the unselected training model so as to train the unselected training model; and putting the unlabeled target datum back into the unlabeled data set; when the samples in the unlabeled data set are no longer updated, stopping the iterative training and taking the three trained models as the final named entity recognition models, wherein the named entity recognition models are used to perform named entity recognition on the target text in the designated domain.
  12. The computer device according to claim 11, wherein the step of training a BiLSTM-CRF model based on preset training samples to obtain three training models comprises:
    training the BiLSTM-CRF model on the public data set to obtain a pre-trained model;
    performing sampling with replacement on the designated-domain named entity training set to obtain three training data sets;
    training the pre-trained model separately on each of the three training data sets to obtain three training models.
  13. The computer device according to claim 12, wherein the step of training the pre-trained model separately on each of the three training data sets to obtain three training models comprises:
    retaining the model parameters of the BiLSTM in the pre-trained model, and initializing the model parameters of the CRF layer in the pre-trained model, to obtain a preprocessed training model;
    training the preprocessed training model separately on each of the three training data sets, to obtain the three training models.
  14. The computer device according to claim 11, wherein, after the step of stopping the iterative training when the samples in the unlabeled data set are no longer updated and taking the three trained models as the final named entity recognition models, the method comprises:
    upon receiving a named entity recognition instruction for a to-be-recognized text, inputting the to-be-recognized text into any one of the named entity recognition models for prediction, to obtain a named entity recognition result for the to-be-recognized text, wherein the named entity recognition result consists of the labels of the characters in the to-be-recognized text;
    adding the to-be-recognized text to the unlabeled data set, and, after appending the named entity recognition result to the to-be-recognized text, updating it into the designated-domain named entity training set.
  15. The computer device according to claim 11, wherein the step of training a BiLSTM-CRF model based on preset training samples to obtain three training models comprises:
    randomly selecting target public data from the public data set;
    dividing the designated-domain named entity training set into a designated training set and a designated test set;
    combining the target public data and the designated training set into a model training set, and inputting the model training set into the BiLSTM-CRF model for training, to obtain a pre-trained model;
    inputting the designated test set into the trained pre-trained model for testing, to obtain the probability that the predicted labels of the designated test set are the correct labels;
    judging whether this probability is greater than a preset probability and, if so, combining the target public data and the designated-domain named entity training set into a target training set;
    performing sampling with replacement on the target training set to obtain three training data sets;
    training the pre-trained model separately on each of the three training data sets to obtain three training models.
  16. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements a training method for a named entity recognition model, comprising the following steps:
    upon receiving a request to train the named entity recognition model, obtaining the designated domain of the target text that the named entity recognition model is to recognize, and obtaining a designated-domain named entity training set according to the designated domain;
    training a BiLSTM-CRF model based on preset training samples to obtain three training models, wherein the preset training samples comprise two labeled data sets: a public data set and the designated-domain named entity training set;
    iteratively selecting any two of the three training models at random, and in turn selecting one unlabeled target datum from an unlabeled data set and inputting it into the two selected training models for prediction, to obtain the predicted labels output by the two training models;
    if the predicted labels of the two training models are the same, adding the predicted label to the unlabeled target datum and updating it into the training samples of the unselected training model so as to train the unselected training model; and putting the unlabeled target datum back into the unlabeled data set; when the samples in the unlabeled data set are no longer updated, stopping the iterative training and taking the three trained models as the final named entity recognition models, wherein the named entity recognition models are used to perform named entity recognition on the target text in the designated domain.
  17. The computer-readable storage medium according to claim 16, wherein the step of training a BiLSTM-CRF model based on preset training samples to obtain three training models comprises:
    training the BiLSTM-CRF model on the public data set to obtain a pre-trained model;
    performing sampling with replacement on the designated-domain named entity training set to obtain three training data sets;
    training the pre-trained model separately on each of the three training data sets to obtain three training models.
  18. The computer-readable storage medium according to claim 17, wherein the step of training the pre-trained model separately on each of the three training data sets to obtain three training models comprises:
    retaining the model parameters of the BiLSTM in the pre-trained model, and initializing the model parameters of the CRF layer in the pre-trained model, to obtain a preprocessed training model;
    training the preprocessed training model separately on each of the three training data sets, to obtain the three training models.
  19. The computer-readable storage medium according to claim 16, wherein, after the step of stopping the iterative training when the samples in the unlabeled data set are no longer updated and taking the three trained models as the final named entity recognition models, the method comprises:
    upon receiving a named entity recognition instruction for a to-be-recognized text, inputting the to-be-recognized text into any one of the named entity recognition models for prediction, to obtain a named entity recognition result for the to-be-recognized text, wherein the named entity recognition result consists of the labels of the characters in the to-be-recognized text;
    adding the to-be-recognized text to the unlabeled data set, and, after appending the named entity recognition result to the to-be-recognized text, updating it into the designated-domain named entity training set.
  20. The computer-readable storage medium according to claim 16, wherein the step of training a BiLSTM-CRF model based on preset training samples to obtain three training models comprises:
    randomly selecting target public data from the public data set;
    dividing the designated-domain named entity training set into a designated training set and a designated test set;
    combining the target public data and the designated training set into a model training set, and inputting the model training set into the BiLSTM-CRF model for training, to obtain a pre-trained model;
    inputting the designated test set into the trained pre-trained model for testing, to obtain the probability that the predicted labels of the designated test set are the correct labels;
    judging whether this probability is greater than a preset probability and, if so, combining the target public data and the designated-domain named entity training set into a target training set;
    performing sampling with replacement on the target training set to obtain three training data sets;
    training the pre-trained model separately on each of the three training data sets to obtain three training models.
PCT/CN2020/118523 2020-04-29 2020-09-28 命名实体识别模型的训练方法、装置、计算机设备 WO2021218024A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010357577.3A CN111553164A (zh) 2020-04-29 2020-04-29 命名实体识别模型的训练方法、装置、计算机设备
CN202010357577.3 2020-04-29

Publications (1)

Publication Number Publication Date
WO2021218024A1 true WO2021218024A1 (zh) 2021-11-04

Family

ID=72006261

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118523 WO2021218024A1 (zh) 2020-04-29 2020-09-28 命名实体识别模型的训练方法、装置、计算机设备

Country Status (2)

Country Link
CN (1) CN111553164A (zh)
WO (1) WO2021218024A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114169338A (zh) * 2022-02-10 2022-03-11 北京智源人工智能研究院 一种医疗命名实体识别方法、装置和电子设备
CN114218951A (zh) * 2021-12-16 2022-03-22 北京百度网讯科技有限公司 实体识别模型的训练方法、实体识别方法及装置
CN114266253A (zh) * 2021-12-21 2022-04-01 武汉百智诚远科技有限公司 一种未标注数据的半监督命名实体识别的方法
CN114882472A (zh) * 2022-05-17 2022-08-09 安徽蔚来智驾科技有限公司 一种车位检测方法、计算机可读存储介质及车辆
CN115186670A (zh) * 2022-09-08 2022-10-14 北京航空航天大学 一种基于主动学习的领域命名实体识别方法及系统
CN116204610A (zh) * 2023-04-28 2023-06-02 深圳市前海数据服务有限公司 一种基于可研报告命名实体识别的数据挖掘方法及装置
CN116545779A (zh) * 2023-07-06 2023-08-04 鹏城实验室 网络安全命名实体识别方法、装置、设备和存储介质
US11886820B2 (en) * 2020-10-06 2024-01-30 Genpact Luxembourg S.à r.l. II System and method for machine-learning based extraction of information from documents

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553164A (zh) * 2020-04-29 2020-08-18 平安科技(深圳)有限公司 命名实体识别模型的训练方法、装置、计算机设备
CN111985240B (zh) * 2020-08-19 2024-02-27 腾讯云计算(长沙)有限责任公司 命名实体识别模型的训练方法、命名实体识别方法及装置
CN114548103B (zh) * 2020-11-25 2024-03-29 马上消费金融股份有限公司 一种命名实体识别模型的训练方法和命名实体的识别方法
CN112613312B (zh) * 2020-12-18 2022-03-18 平安科技(深圳)有限公司 实体命名识别模型的训练方法、装置、设备及存储介质
CN112633002A (zh) * 2020-12-29 2021-04-09 上海明略人工智能(集团)有限公司 样本标注、模型训练、命名实体识别方法和装置
CN112766485B (zh) * 2020-12-31 2023-10-24 平安科技(深圳)有限公司 命名实体模型的训练方法、装置、设备及介质
CN112733911B (zh) * 2020-12-31 2023-05-30 平安科技(深圳)有限公司 实体识别模型的训练方法、装置、设备和存储介质
CN112765985B (zh) * 2021-01-13 2023-10-27 中国科学技术信息研究所 一种面向特定领域专利实施例的命名实体识别方法
CN113240125B (zh) * 2021-01-13 2024-05-28 深延科技(北京)有限公司 模型训练方法及装置、标注方法及装置、设备及存储介质
CN113158675B (zh) * 2021-04-23 2024-04-02 平安科技(深圳)有限公司 基于人工智能的实体抽取方法、装置、设备及介质
CN113919355B (zh) * 2021-10-19 2023-11-07 四川大学 一种适用于少训练语料场景的半监督命名实体识别方法
CN114548109B (zh) * 2022-04-24 2022-09-23 阿里巴巴达摩院(杭州)科技有限公司 命名实体识别模型训练方法及命名实体识别方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110223737A (zh) * 2019-06-13 2019-09-10 电子科技大学 一种中药化学成分命名实体识别方法与装置
CN110705293A (zh) * 2019-08-23 2020-01-17 中国科学院苏州生物医学工程技术研究所 基于预训练语言模型的电子病历文本命名实体识别方法
US20200050662A1 (en) * 2018-08-09 2020-02-13 Oracle International Corporation System And Method To Generate A Labeled Dataset For Training An Entity Detection System
CN111553164A (zh) * 2020-04-29 2020-08-18 平安科技(深圳)有限公司 命名实体识别模型的训练方法、装置、计算机设备

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200050662A1 (en) * 2018-08-09 2020-02-13 Oracle International Corporation System And Method To Generate A Labeled Dataset For Training An Entity Detection System
CN110223737A (zh) * 2019-06-13 2019-09-10 电子科技大学 一种中药化学成分命名实体识别方法与装置
CN110705293A (zh) * 2019-08-23 2020-01-17 中国科学院苏州生物医学工程技术研究所 基于预训练语言模型的电子病历文本命名实体识别方法
CN111553164A (zh) * 2020-04-29 2020-08-18 平安科技(深圳)有限公司 命名实体识别模型的训练方法、装置、计算机设备

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11886820B2 (en) * 2020-10-06 2024-01-30 Genpact Luxembourg S.à r.l. II System and method for machine-learning based extraction of information from documents
CN114218951A (zh) * 2021-12-16 2022-03-22 北京百度网讯科技有限公司 实体识别模型的训练方法、实体识别方法及装置
CN114266253A (zh) * 2021-12-21 2022-04-01 武汉百智诚远科技有限公司 一种未标注数据的半监督命名实体识别的方法
CN114266253B (zh) * 2021-12-21 2024-01-23 武汉百智诚远科技有限公司 一种未标注数据的半监督命名实体识别的方法
CN114169338A (zh) * 2022-02-10 2022-03-11 北京智源人工智能研究院 一种医疗命名实体识别方法、装置和电子设备
CN114882472A (zh) * 2022-05-17 2022-08-09 安徽蔚来智驾科技有限公司 一种车位检测方法、计算机可读存储介质及车辆
CN115186670A (zh) * 2022-09-08 2022-10-14 北京航空航天大学 一种基于主动学习的领域命名实体识别方法及系统
CN116204610A (zh) * 2023-04-28 2023-06-02 深圳市前海数据服务有限公司 一种基于可研报告命名实体识别的数据挖掘方法及装置
CN116545779A (zh) * 2023-07-06 2023-08-04 鹏城实验室 网络安全命名实体识别方法、装置、设备和存储介质
CN116545779B (zh) * 2023-07-06 2023-10-03 鹏城实验室 网络安全命名实体识别方法、装置、设备和存储介质

Also Published As

Publication number Publication date
CN111553164A (zh) 2020-08-18

Similar Documents

Publication Publication Date Title
WO2021218024A1 (zh) 命名实体识别模型的训练方法、装置、计算机设备
CN110457675B (zh) 预测模型训练方法、装置、存储介质及计算机设备
US11948058B2 (en) Utilizing recurrent neural networks to recognize and extract open intent from text inputs
CN111967266B (zh) 中文命名实体识别系统、模型构建方法和应用及相关设备
US10510336B2 (en) Method, apparatus, and system for conflict detection and resolution for competing intent classifiers in modular conversation system
WO2020114429A1 (zh) 关键词提取模型训练方法、关键词提取方法及计算机设备
US20190354810A1 (en) Active learning to reduce noise in labels
US20210019599A1 (en) Adaptive neural architecture search
CN109614625B (zh) 标题正文相关度的确定方法、装置、设备及存储介质
JP5901001B1 (ja) 音響言語モデルトレーニングのための方法およびデバイス
CN110704576B (zh) 一种基于文本的实体关系抽取方法及装置
US11551437B2 (en) Collaborative information extraction
CN110929114A (zh) 利用动态记忆网络来跟踪数字对话状态并生成响应
WO2021139257A1 (zh) 标注数据的选择方法、装置、计算机设备和存储介质
WO2023207096A1 (zh) 一种实体链接方法、装置、设备及非易失性可读存储介质
CN111737432A (zh) 一种基于联合训练模型的自动对话方法和系统
WO2023137911A1 (zh) 基于小样本语料的意图分类方法、装置及计算机设备
CN109858004B (zh) 文本改写方法、装置及电子设备
RU2712101C2 (ru) Предсказание вероятности появления строки с использованием последовательности векторов
WO2021001517A1 (en) Question answering systems
CN113434683A (zh) 文本分类方法、装置、介质及电子设备
CN112214595A (zh) 类别确定方法、装置、设备及介质
CN113688955B (zh) 文本识别方法、装置、设备及介质
US20230205994A1 (en) Performing machine learning tasks using instruction-tuned neural networks
CN114528387A (zh) 基于对话流自举的深度学习对话策略模型构建方法和系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20933682

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20933682

Country of ref document: EP

Kind code of ref document: A1