WO2021139257A1 - Method and apparatus for selecting annotation data, computer device, and storage medium - Google Patents

Method and apparatus for selecting annotation data, computer device, and storage medium

Info

Publication number
WO2021139257A1
WO2021139257A1 (PCT/CN2020/118533)
Authority
WO
WIPO (PCT)
Prior art keywords
dictionary
preset
model
target
data
Prior art date
Application number
PCT/CN2020/118533
Other languages
English (en)
French (fr)
Inventor
梁欣
顾婷婷
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021139257A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries

Definitions

  • This application relates to the field of blockchain technology, and in particular to a method, device, computer equipment, and storage medium for selecting labeled data.
  • Entity recognition is the first step in natural language processing tasks, and also a critical one. Especially in vertical fields such as finance, e-commerce, and medical care, entity recognition is key to natural language processing, because downstream tasks such as entity linking, relation extraction, and relation classification propagate, layer by layer, the errors introduced by upstream tasks.
  • the main purpose of this application is to provide a method, device, computer equipment, and storage medium for selecting annotated data, aiming to overcome the current defects that annotated data is incomplete and that high-quality annotated data cannot be selected.
  • this application provides a method for selecting annotated data, which includes the following steps:
  • based on the knowledge graph, a target entity is constructed and expanded into a preset dictionary, and the expanded dictionary is used as the target dictionary; wherein all entries in the target dictionary are labeled data, and the target entity has an association relationship with the entities in the preset dictionary;
  • This application also provides a device for selecting annotated data, including:
  • the construction unit is used for constructing a target entity based on the knowledge graph and expanding it into a preset dictionary, so as to obtain the expanded dictionary as a target dictionary; wherein all entries in the target dictionary are labeled data, and the target entity has an association relationship with the entities in the preset dictionary;
  • the selection unit is used to select dictionary annotation data from the target dictionary based on the agent model;
  • the classification unit is used to divide the preset manually labeled data into a manual training set and a manual test set;
  • a training unit configured to form a model training set from the dictionary labeled data and the manual training set, and input the model training set into a preset entity recognition model for training;
  • the test unit is configured to input the manual test set into the trained entity recognition model for testing, and obtain the correct probability that the prediction of the manual test set is labeled as the correct label;
  • the judging unit is used to calculate the difference between the correct probability and the preset probability, and determine whether the difference is less than a threshold; if not, it selects optimized dictionary-labeled data from the target dictionary based on the agent model, and re-executes the forming of a model training set from the dictionary-labeled data and the manual training set.
  • the present application also provides a computer device, including a memory and a processor, the memory stores a computer program, and when the processor executes the computer program, the method for selecting the above-mentioned annotation data is realized, including the following steps:
  • based on the knowledge graph, a target entity is constructed and expanded into a preset dictionary, and the expanded dictionary is used as the target dictionary; wherein all entries in the target dictionary are labeled data, and the target entity has an association relationship with the entities in the preset dictionary;
  • This application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the method for selecting the above-mentioned annotation data is realized, including the following steps:
  • based on the knowledge graph, a target entity is constructed and expanded into a preset dictionary, and the expanded dictionary is used as the target dictionary; wherein all entries in the target dictionary are labeled data, and the target entity has an association relationship with the entities in the preset dictionary;
  • the method, device, computer equipment, and storage medium for selecting labeled data provided in this application construct a target entity based on the knowledge graph and add it to the preset dictionary to obtain the expanded dictionary as the target dictionary, making the dictionary-labeled data in the target dictionary more complete. Meanwhile, the entity recognition model is jointly trained on manually labeled data and dictionary-labeled data to determine whether the quality of the selected dictionary-labeled data meets the requirements; if not, optimized dictionary-labeled data is selected from the target dictionary, i.e., dictionary-labeled data of higher quality can be selected.
  • FIG. 1 is a schematic diagram of the steps of a method for selecting annotated data in an embodiment of the present application
  • Fig. 2 is a structural block diagram of a device for selecting annotated data in an embodiment of the present application
  • FIG. 3 is a schematic block diagram of the structure of a computer device according to an embodiment of the application.
  • an embodiment of the present application provides a method for selecting annotation data, which includes the following steps:
  • Step S1 based on the knowledge graph, construct a target entity to be expanded into a preset dictionary to obtain the expanded dictionary as a target dictionary; wherein all entries in the target dictionary are labeled data, and the target entity has an association relationship with the entities in the preset dictionary;
  • Step S2 selecting dictionary annotation data from the target dictionary based on the agent model
  • Step S3 dividing the preset manual annotation data into a manual training set and a manual test set;
  • Step S4 forming a model training set by the dictionary labeling data and the manual training set, and inputting the model training set into a preset entity recognition model for training;
  • Step S5 input the manual test set to the trained entity recognition model for testing, and obtain the correct probability that the prediction of the manual test set is labeled as the correct label;
  • Step S6 Calculate the difference between the correct probability and the preset probability, and determine whether the difference is less than a threshold, and if not, select optimized dictionary annotation data from the target dictionary based on the agent model, And re-enter the step of forming the model training set with the dictionary labeling data and the manual training set.
  • the above method is applied to filter the annotation data required for training in the process of training an entity recognition model, and the entity recognition model is used to identify entities in the medical text field.
  • the solution in this embodiment can also be applied to the smart medical field of a smart city, so as to promote the construction of a smart city.
  • in business scenarios of the smart medical field, there is little high-quality labeled data for training entity recognition models, and high-quality labeled data usually comes from manual annotation. Therefore, in this embodiment, a small amount of high-quality manually labeled data is combined with dictionaries of similar fields to obtain training samples, which effectively increases the amount of data, gives the model a larger training set, and improves its generalization.
  • the preset dictionary contains labeled data obtained by using the entity dictionary of the vertical domain to label sentences.
  • based on the knowledge graph, a target entity having an association relationship with an entity in the preset dictionary is constructed and added to the preset dictionary to expand it.
  • the above-mentioned association relationship refers to: constructing corresponding aliases for the disease and symptom entities in the preset dictionary, e.g., expanding "慢性支气管炎" (chronic bronchitis) with its abbreviation "慢支"; constructing target entities highly similar to the entities in the preset dictionary, where the similarity can be computed from features such as the shortest string edit distance, pinyin, and character radicals, individually or in combination; and, for some attribute descriptions of entities in the preset dictionary, substituting similar words or antonyms, e.g., expanding "acute asthma" with "chronic asthma", and "diabetes with high blood pressure" with "diabetes without high blood pressure".
  • the agent model is obtained by reinforcement learning training. It is used to select correctly labeled dictionary data from the data labeled by the target dictionary; each selection is directed, so the labeling quality becomes higher and higher, and the selected data is used to train the entity recognition model. Because dictionary-labeled data may be incomplete or sometimes incorrect, the agent model must continuously select more accurate data, i.e., optimize the dictionary-labeled data used to train the entity recognition model.
  • the above-mentioned manually annotated data is obtained by manual annotation and is of high quality. Since model training involves a training phase and a testing phase, the manually annotated data needs to be divided into a manual training set and a manual test set.
  • the amount of data in the manual training set is relatively small; therefore, it is combined with the dictionary-labeled data selected from the target dictionary to form the training data, obtaining the model training set and increasing the amount of training data. The model training set is then input into a preset entity recognition model for training, improving the generalization of the entity recognition model.
  • the aforementioned entity recognition model includes the BiLSTM-CRF model.
  • after the entity recognition model is trained on the model training set, its training data includes not only high-quality manually labeled data but possibly also some incomplete or inaccurate dictionary-labeled data. Understandably, if the dictionary-labeled data is incomplete or inaccurate, the labeling accuracy obtained when the trained entity recognition model is tested on the manual test set will drop.
  • the accuracy when the manual test set is used for testing should normally be 1, and this value of 1 can be used as the preset probability.
  • the manual test set is input into the trained entity recognition model for testing to obtain the correct probability that its predicted labels are correct; the difference between the correct probability and the preset probability is then calculated, and whether the difference is less than the threshold is determined. If the correct probability is close to the preset probability (i.e., the difference is small), the dictionary-labeled data is of good quality; if not (i.e., the difference is large), the dictionary-labeled data is of poor quality and must contain many incomplete or inaccurate labels, which hurts the recognition accuracy of the entity recognition model.
  • the agent model can be triggered to re-select more optimized dictionary annotation data from the target dictionary, and then re-enter the step of forming the model training set by the dictionary annotation data and the manual training set. Since the above agent model is based on reinforcement learning training, the iteratively selected dictionary labeled data are all more accurate labeled data selected based on the test results.
  • the selected labeled data continues to be input into the above-mentioned entity recognition model for training, iterating in turn until the test results stabilize, at which point training is complete.
  • in this embodiment, a small amount of data is first manually annotated; then the entity dictionary of the vertical domain is used to label sentences, yielding dictionary-labeled data that augments the data and generates a large data set, so the model obtains a larger training set and generalizes better.
  • then, through reinforcement learning, the incomplete and noisy data generated by remote supervision is screened, and training is guided by the prior knowledge of the small manually labeled data set, so the model is trained on both the manually labeled data and the dictionary-labeled data, reducing the time cost of manual labeling and improving the recall of the model.
  • the step S4 of inputting the model training set into a preset entity recognition model for training includes:
  • Step S401 separately construct the character vector and the word vector corresponding to each piece of text data in the model training set, and concatenate the character vector and word vector of the same text data to obtain a concatenated vector;
  • Step S402 input the concatenated vector into a preset entity recognition model, and output a first feature vector;
  • Step S403 combine the first feature vector with the concatenated vector, input the result into the preset entity recognition model, and output a second feature vector;
  • Step S404 input the second feature vector into the classification layer of the preset entity recognition model, and train to optimize the network parameters of the classification layer.
  • in this embodiment, when training the preset entity recognition model, to strengthen the word-level and character-level feature expression of each piece of text data in the training set, the character vector and word vector of each text item are constructed separately, and the character vector and word vector of the same text data are concatenated to obtain a concatenated vector. The concatenated vector is then input into the preset entity recognition model, which outputs a first feature vector. To further improve the model's feature representation of the text data and increase the depth of feature extraction, the first feature vector is combined with the concatenated vector and re-input into the preset entity recognition model, which outputs a second feature vector; this second feature vector serves as the feature vector of the text data. Finally, it is input into the classification layer for iterative training, optimizing the network parameters to obtain the trained entity recognition model.
  • in an embodiment, before the step S4 of inputting the model training set into a preset entity recognition model for training, the method includes: obtaining a public data set;
  • training an initial long short-term memory model based on the public data set to obtain the preset entity recognition model.
  • the public data set may be used to train an initial long short-term memory model to initialize its neural network parameters, obtaining the aforementioned preset entity recognition model; the model training set is then used for training. This approach can effectively improve the robustness of the model.
  • in an embodiment, before the step S1 of constructing, based on the knowledge graph, an entity having an association relationship with the entities in the preset dictionary to obtain the expanded dictionary as the target dictionary, the method further includes:
  • Step S1a receiving a model training instruction input by a user, wherein the model training instruction carries information about the application domain of the model to be trained;
  • Step S1b Obtain a preset dictionary of the corresponding field according to the application field information.
  • the model training should be performed using labeled data of the corresponding field.
  • when a user sends a request to train a model, a corresponding model training instruction can be input, and the model training instruction can carry the application-domain information of the model to be trained.
  • according to the application-domain information, labeled data of the corresponding field can be obtained and used to better train the above model.
  • the resulting entity recognition model then performs better at recognizing text in the corresponding field.
  • Step S7 iteratively train a preset entity recognition model until the difference between the correct probability and the preset probability is less than the threshold, and a target entity recognition model is obtained;
  • Step S8 receiving the target text input by the user, and receiving an entity recognition request instruction in the target text
  • Step S9 based on the request instruction, identifying the domain information of the target text
  • Step S10 judging whether the domain information of the target text is the same as the application domain information of the target entity recognition model
  • Step S11 if they are the same, perform named entity recognition on the target text based on the target entity recognition model; if they are not the same, obtain training data corresponding to the domain information of the target text to retrain the target entity recognition model.
  • when using the above-mentioned target entity recognition model to perform entity recognition on target text, the target text may not be text in the medical field. Therefore, to improve recognition accuracy and avoid recognition errors, the domain information of the target text must first be identified. If the domain information of the target text is the same as the application-domain information of the target entity recognition model, using the target entity recognition model for named entity recognition can significantly improve accuracy. If they differ, training data corresponding to the domain information of the target text must be obtained to retrain the target entity recognition model.
  • the aforementioned preset dictionary, target dictionary, agent model, manual annotation data, and preset entity recognition model are stored in a blockchain; blockchain is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms.
  • essentially a decentralized database, a blockchain is a series of data blocks linked by cryptographic methods; each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • an embodiment of the present application also provides a device for selecting annotated data, including:
  • the construction unit is used for constructing a target entity based on the knowledge graph and expanding it into a preset dictionary, so as to obtain the expanded dictionary as a target dictionary; wherein all entries in the target dictionary are labeled data, and the target entity has an association relationship with the entities in the preset dictionary;
  • the selection unit is used to select dictionary annotation data from the target dictionary based on the agent model
  • the classification unit is used to divide the preset manually labeled data into a manual training set and a manual test set;
  • a training unit configured to form a model training set from the dictionary labeled data and the manual training set, and input the model training set into a preset entity recognition model for training;
  • the test unit is configured to input the manual test set into the trained entity recognition model for testing, and obtain the correct probability that the prediction of the manual test set is labeled as the correct label;
  • the judging unit is used to calculate the difference between the correct probability and the preset probability, and determine whether the difference is less than a threshold; if not, it selects optimized dictionary-labeled data from the target dictionary based on the agent model, and re-executes the forming of a model training set from the dictionary-labeled data and the manual training set.
  • the training unit includes:
  • a constructing subunit, used to separately construct the character vector and the word vector corresponding to each piece of text data in the model training set, and concatenate the character vector and word vector of the same text data to obtain a concatenated vector;
  • a first output subunit, used to input the concatenated vector into a preset entity recognition model and output a first feature vector;
  • a second output subunit, used to combine the first feature vector with the concatenated vector, input the result into the preset entity recognition model, and output a second feature vector;
  • a training subunit, used to input the second feature vector into the classification layer of the preset entity recognition model and train to optimize the network parameters of the classification layer.
  • it further includes:
  • the first obtaining unit is used to obtain a public data set;
  • the initial training unit is used to train an initial long short-term memory model based on the public data set to obtain a preset entity recognition model.
  • it further includes:
  • the first receiving unit is configured to receive a model training instruction input by a user, wherein the model training instruction carries application field information of the model to be trained;
  • the second obtaining unit is used to obtain a preset dictionary of the corresponding field according to the application field information.
  • it further includes:
  • An iterative unit configured to iteratively train a preset entity recognition model until the difference between the correct probability and the preset probability is less than the threshold to obtain a target entity recognition model
  • the second receiving unit is configured to receive the target text input by the user, and receive an entity recognition request instruction in the target text;
  • a recognition unit configured to recognize domain information of the target text based on the request instruction
  • a domain judgment unit configured to judge whether the domain information of the target text is the same as the application domain information of the target entity recognition model
  • the processing unit is configured to perform named entity recognition on the target text based on the target entity recognition model if they are the same; if they are not the same, obtain training data corresponding to the domain information of the target text to retrain the target entity recognition model.
  • the device further includes:
  • the storage unit is used to store the target dictionary, agent model, manual annotation data, and preset entity recognition model in the blockchain.
  • an embodiment of the present application also provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 3.
  • the computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide computation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer equipment is used to store annotation data, models, etc.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the selection method of the above-mentioned annotation data includes the following steps:
  • based on the knowledge graph, a target entity is constructed and expanded into a preset dictionary, and the expanded dictionary is used as the target dictionary; wherein all entries in the target dictionary are labeled data, and the target entity has an association relationship with the entities in the preset dictionary;
  • FIG. 3 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, a method for selecting annotated data is implemented.
  • the selection method of the above-mentioned annotation data includes the following steps:
  • based on the knowledge graph, a target entity is constructed and expanded into a preset dictionary, and the expanded dictionary is used as the target dictionary; wherein all entries in the target dictionary are labeled data, and the target entity has an association relationship with the entities in the preset dictionary;
  • the computer-readable storage medium in this embodiment may be a volatile readable storage medium or a non-volatile readable storage medium.
  • the method, device, computer equipment, and storage medium for selecting labeled data construct a target entity and add it to the preset dictionary based on the knowledge graph, obtaining the expanded dictionary as the target dictionary, which makes the dictionary-labeled data more complete.
  • meanwhile, the entity recognition model is jointly trained on the manually labeled data and the dictionary-labeled data to determine whether the quality of the selected dictionary-labeled data meets the requirements; if not, optimized dictionary-labeled data is selected from the target dictionary, realizing the selection of higher-quality dictionary-labeled data.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

A method, apparatus, computer device, and storage medium for selecting annotation data, relating to the field of blockchain technology, comprising: selecting dictionary annotation data from a target dictionary based on an agent model stored in a blockchain (S2); dividing preset manually annotated data into a manual training set and a manual test set (S3); forming a model training set from the dictionary annotation data and the manual training set, and inputting it into a preset entity recognition model for training (S4); inputting the manual test set into the trained entity recognition model for testing to obtain the correct probability that the predicted labels of the manual test set are correct (S5); and calculating the difference between the correct probability and a preset probability and determining whether the difference is less than a threshold; if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model (S6). The method, apparatus, computer device, and storage medium can select high-quality annotation data, and can also be applied in the smart healthcare field of smart cities, thereby promoting the construction of smart cities.

Description

Method and apparatus for selecting annotation data, computer device, and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on June 24, 2020, with application number 202010592331.4 and titled "Method and apparatus for selecting annotation data, computer device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of blockchain technology, and in particular to a method and apparatus for selecting annotation data, a computer device, and a storage medium.
Background
Entity recognition is the first step in natural language processing tasks, and also a very critical one. Especially in vertical fields such as finance, e-commerce, and medical care, entity recognition is key to natural language processing, because downstream tasks such as entity linking, relation extraction, and relation classification propagate, layer by layer, the errors introduced by upstream tasks.
With the development of deep learning, neural network methods combined with the traditional conditional random field (CRF) can achieve very good results on entity recognition tasks. However, the inventors realized that in business scenarios, applying deep learning also brings some problems. For example, although neural networks are powerful at learning features autonomously, they usually require a large amount of training data that matches the true distribution; yet for an entity recognition task in a new domain, producing high-quality annotated data costs a great deal of annotation time and manual labeling effort. In vertical domains, a dictionary of a related field can be used to label data by remote supervision, but this may introduce noisy data or incompletely labeled entities, which strongly affects the entity recognition task. For example, in descriptions of diseases in the medical domain, "diabetes with ketosis" may be labeled only as "diabetes", and "allergic asthma" may be labeled as the separate entities "allergy" and "asthma", i.e., cases of incomplete entities. Medically, however, the descriptions and treatments of these different entities are not the same. Using dictionary labeling alone prevents the model from learning the characteristics of such combined conditions, leading to unsatisfactory entity labeling and, through error propagation, poor performance on subsequent downstream tasks.
Technical Problem
The main purpose of this application is to provide a method and apparatus for selecting annotation data, a computer device, and a storage medium, aiming to overcome the current defects that annotation data is incomplete and that high-quality annotation data cannot be selected.
Technical Solution
To achieve the above purpose, this application provides a method for selecting annotation data, comprising the following steps:
based on a knowledge graph, constructing a target entity and expanding it into a preset dictionary to obtain the expanded dictionary as a target dictionary, wherein all entries in the target dictionary are annotation data, and the target entity has an association relationship with the entities in the preset dictionary;
selecting dictionary annotation data from the target dictionary based on an agent model;
dividing preset manually annotated data into a manual training set and a manual test set;
forming a model training set from the dictionary annotation data and the manual training set, and inputting the model training set into a preset entity recognition model for training;
inputting the manual test set into the trained entity recognition model for testing to obtain the correct probability that the predicted labels of the manual test set are correct;
calculating the difference between the correct probability and a preset probability, and determining whether the difference is less than a threshold; if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model, and re-entering the step of forming the model training set from the dictionary annotation data and the manual training set.
This application further provides an apparatus for selecting annotation data, comprising:
a construction unit, configured to construct, based on a knowledge graph, a target entity and expand it into a preset dictionary to obtain the expanded dictionary as a target dictionary, wherein all entries in the target dictionary are annotation data, and the target entity has an association relationship with the entities in the preset dictionary;
a selection unit, configured to select dictionary annotation data from the target dictionary based on an agent model;
a classification unit, configured to divide preset manually annotated data into a manual training set and a manual test set;
a training unit, configured to form a model training set from the dictionary annotation data and the manual training set, and input the model training set into a preset entity recognition model for training;
a test unit, configured to input the manual test set into the trained entity recognition model for testing to obtain the correct probability that the predicted labels of the manual test set are correct;
a judgment unit, configured to calculate the difference between the correct probability and a preset probability and determine whether the difference is less than a threshold; if not, select optimized dictionary annotation data from the target dictionary based on the agent model, and re-execute the forming of the model training set from the dictionary annotation data and the manual training set.
This application further provides a computer device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the above method for selecting annotation data, comprising the following steps:
based on a knowledge graph, constructing a target entity and expanding it into a preset dictionary to obtain the expanded dictionary as a target dictionary, wherein all entries in the target dictionary are annotation data, and the target entity has an association relationship with the entities in the preset dictionary;
selecting dictionary annotation data from the target dictionary based on an agent model;
dividing preset manually annotated data into a manual training set and a manual test set;
forming a model training set from the dictionary annotation data and the manual training set, and inputting the model training set into a preset entity recognition model for training;
inputting the manual test set into the trained entity recognition model for testing to obtain the correct probability that the predicted labels of the manual test set are correct;
calculating the difference between the correct probability and a preset probability, and determining whether the difference is less than a threshold; if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model, and re-entering the step of forming the model training set from the dictionary annotation data and the manual training set.
This application further provides a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements the above method for selecting annotation data, comprising the following steps:
based on a knowledge graph, constructing a target entity and expanding it into a preset dictionary to obtain the expanded dictionary as a target dictionary, wherein all entries in the target dictionary are annotation data, and the target entity has an association relationship with the entities in the preset dictionary;
selecting dictionary annotation data from the target dictionary based on an agent model;
dividing preset manually annotated data into a manual training set and a manual test set;
forming a model training set from the dictionary annotation data and the manual training set, and inputting the model training set into a preset entity recognition model for training;
inputting the manual test set into the trained entity recognition model for testing to obtain the correct probability that the predicted labels of the manual test set are correct;
calculating the difference between the correct probability and a preset probability, and determining whether the difference is less than a threshold; if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model, and re-entering the step of forming the model training set from the dictionary annotation data and the manual training set.
Beneficial Effects
The method and apparatus for selecting annotation data, computer device, and storage medium provided by this application construct a target entity based on a knowledge graph and add it to a preset dictionary, obtaining the expanded dictionary as a target dictionary, which makes the dictionary annotation data in the target dictionary more complete. Meanwhile, an entity recognition model is jointly trained on manually annotated data and dictionary annotation data to judge whether the quality of the selected dictionary annotation data meets the requirements; if not, optimized dictionary annotation data is selected from the target dictionary, thereby realizing the selection of dictionary annotation data of higher quality.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the steps of a method for selecting annotation data in an embodiment of this application;
Fig. 2 is a structural block diagram of an apparatus for selecting annotation data in an embodiment of this application;
Fig. 3 is a schematic structural block diagram of a computer device according to an embodiment of this application.
Best Mode for Carrying Out the Invention
Referring to Fig. 1, an embodiment of this application provides a method for selecting annotation data, comprising the following steps:
Step S1: based on a knowledge graph, construct a target entity and expand it into a preset dictionary to obtain the expanded dictionary as a target dictionary, wherein all entries in the target dictionary are annotation data, and the target entity has an association relationship with the entities in the preset dictionary;
Step S2: select dictionary annotation data from the target dictionary based on an agent model;
Step S3: divide preset manually annotated data into a manual training set and a manual test set;
Step S4: form a model training set from the dictionary annotation data and the manual training set, and input the model training set into a preset entity recognition model for training;
Step S5: input the manual test set into the trained entity recognition model for testing to obtain the correct probability that the predicted labels of the manual test set are correct;
Step S6: calculate the difference between the correct probability and a preset probability, and determine whether the difference is less than a threshold; if not, select optimized dictionary annotation data from the target dictionary based on the agent model, and re-enter the step of forming the model training set from the dictionary annotation data and the manual training set.
In this embodiment, the above method is applied to screening the annotation data needed for training during the training of an entity recognition model, where the entity recognition model is used to recognize entities in medical texts. The solution in this embodiment can also be applied in the smart healthcare field of smart cities, thereby promoting the construction of smart cities. In business scenarios of smart healthcare, there is little high-quality annotation data for training entity recognition models, and high-quality annotation data is usually produced by manual annotation. Therefore, this embodiment combines a small amount of high-quality manually annotated data with dictionaries of related fields to obtain training samples, which effectively increases the amount of data, gives the model a larger training set, and improves its generalization.
Specifically, as described in step S1 above, the preset dictionary contains annotation data obtained by labeling sentences with an entity dictionary of the vertical domain. To further enhance the completeness and accuracy of the annotation data in the dictionary, target entities that have an association relationship with the entities in the preset dictionary are constructed based on the knowledge graph and added to the preset dictionary to expand it. The association relationship refers to: constructing corresponding aliases for the disease and symptom entities in the preset dictionary, e.g., expanding "慢性支气管炎" (chronic bronchitis) with its abbreviation "慢支"; constructing target entities that are highly similar to entities in the preset dictionary, where the similarity can be computed from features such as the shortest string edit distance, pinyin, and character radicals, used individually or in combination; and, for some attribute descriptions of entities in the preset dictionary, substituting similar words or antonyms, e.g., expanding "acute asthma" with "chronic asthma", and "diabetes with hypertension" with "diabetes without hypertension". After this expansion, not only is the amount of annotation data in the preset dictionary increased, but the entity descriptions for the medical domain are also more complete and accurate.
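One of the similarity features named above, the shortest string edit distance, can be sketched as follows. This is a minimal illustration in Python, not the patented implementation; the normalization into a [0, 1] similarity and the 0.6 threshold are assumptions for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein (shortest edit) distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def string_similarity(a: str, b: str) -> float:
    """Map edit distance into [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

# Candidate entities whose similarity to a dictionary entry exceeds a
# threshold would be added to the expanded (target) dictionary.
candidates = ["acute asthma", "chronic asthma", "diabetes"]
expanded = [c for c in candidates
            if string_similarity("chronic asthma", c) >= 0.6]
```

In practice the description suggests combining this with pinyin- and radical-based features, which would each contribute their own similarity score.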
如上述步骤S2所述的,上述agent模型(智能体模型)基于强化学习训练得到,用于从目标字典标注的标注数据中挑选出标注正确的字典标注数据。其每次挑选出的数据具有导向性,使得标注质量越来越高,被挑选出来的数据再用于训练实体识别模型;由于字典标注的数据会存在不完整或不正确的情况,因此需要由agent模型不断挑选出更加准确的数据,即优化用于训练实体识别模型的字典标注数据。
如上述步骤S3所述的,上述人工标注数据为人工标注所得,其为高质量标注数据,在训练模型时需要经历训练阶段以及测试阶段,因此,需要将上述人工标注数据分成人工训练集以及人工测试集。
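步骤S3的数据划分可用以下一段示意性代码表示;其中 test_ratio 为假设的划分比例,实际可按需求调整:

```python
# 一段示意性代码,对应步骤S3:将预设的人工标注数据分成人工训练集与人工测试集;
# test_ratio 为假设的划分比例。
def split_manual_data(manual_data, test_ratio=0.2):
    n_test = max(1, int(len(manual_data) * test_ratio))
    return manual_data[n_test:], manual_data[:n_test]  # (人工训练集, 人工测试集)
```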
如上述步骤S4所述的,上述人工训练集的数据量较小,因此,需要将其与上述目标字典中选择出的字典标注数据共同组合成训练数据,得到模型训练集,增加训练数据的数据量;将所述模型训练集输入至预设的实体识别模型中进行训练,以提升实体识别模型的泛化性。上述实体识别模型包括BiLSTM-CRF模型。
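步骤S4中将字典标注数据与人工训练集组合成模型训练集的过程,可用以下一段示意性代码表示;字段名 sample/source 为假设,标记来源便于后续分析字典标注数据中的噪声:

```python
# 一段示意性代码,对应步骤S4中"将字典标注数据与人工训练集构成模型训练集"的组合过程;
# 字段名 sample/source 为假设。
def build_training_set(dict_data, manual_train):
    merged = [{"sample": s, "source": "dict"} for s in dict_data]
    merged += [{"sample": s, "source": "manual"} for s in manual_train]
    return merged
```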
在使用上述模型训练集训练上述实体识别模型之后,其训练数据中不仅包括高质量的人工标注数据,还可能包括一些不完整、不准确的字典标注数据。可以理解的是,若上述字典标注数据不完整、不准确,采用上述人工测试集对训练后的实体识别模型进行测试时,得到的标注准确率将会下降;而正常情况下采用上述人工测试集进行测试时的准确率应当为1,该值可以作为一个预设概率。
因此,如上述步骤S5-S6所述的,将上述人工测试集输入至训练后的实体识别模型中进行测试,得到所述人工测试集的预测标注为正确标注的正确概率,进而计算所述正确概率与预设概率的差值,并判断所述差值是否小于阈值。若上述正确概率接近于上述预设概率(即差值较小),则表明上述字典标注数据质量较好;若上述正确概率不接近于上述预设概率(即差值较大),则表明上述字典标注数据质量不好,其中必定有较多不完整、不准确的标注数据,影响了上述实体识别模型的识别准确率。此时,可以触发上述agent模型重新从所述目标字典中选择出更加优化的字典标注数据,进而重新进入将所述字典标注数据以及所述人工训练集构成模型训练集的步骤。由于上述agent模型基于强化学习训练,其迭代挑选出的字典标注数据均是根据测试结果定向选择出的更加准确的标注数据;选择出的标注数据继续输入至上述实体识别模型中,如此迭代训练,直至测试结果趋于稳定,则完成训练。
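上述"挑选-训练-测试-反馈"的迭代流程可概括为以下一段示意性代码;其中 select_annotations 与 train_and_test 均为占位实现,实际分别对应强化学习agent的挑选策略,以及实体识别模型的训练与人工测试集评估,阈值与预设概率取值均为假设:

```python
# 一段示意性代码,概括步骤S2、S4-S6的迭代筛选流程;占位实现仅用于演示控制流。
PRESET_PROB = 1.0   # 预设概率:人工测试集上理想的正确概率
THRESHOLD = 0.05    # 差值阈值,为假设取值

def select_annotations(agent_state, target_dict):
    # 占位:按agent当前策略从目标字典中挑选字典标注数据
    k = max(1, int(len(target_dict) * agent_state["keep_ratio"]))
    return target_dict[:k]

def train_and_test(dict_data, manual_train, manual_test):
    # 占位:返回人工测试集上"预测标注为正确标注"的正确概率
    noisy = sum(1 for d in dict_data if d.get("noisy"))
    return max(0.0, 1.0 - noisy / max(1, len(dict_data)))

def iterate_selection(target_dict, manual_train, manual_test, agent_state):
    while True:
        chosen = select_annotations(agent_state, target_dict)
        prob = train_and_test(chosen, manual_train, manual_test)
        if PRESET_PROB - prob < THRESHOLD:   # 差值小于阈值:字典标注数据质量达标
            return chosen, prob
        # 差值不小于阈值:反馈给agent,收紧挑选策略后重新构成模型训练集
        agent_state["keep_ratio"] *= 0.5
        target_dict = sorted(target_dict, key=lambda d: d.get("noisy", False))
```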
在本实施例中,首先通过人工标注少量的标注数据,利用垂直领域的实体字典,用字典标注句子得到字典标注数据,增强数据,生成大量的数据集,使得模型得到较大的训练集,提高模型泛化性。再通过强化学习的方法,对由远程监督生成的不完整和带噪音的数据进行筛选,在人工标注小数据集这一先验知识的指导下进行训练,使得模型同时在人工标注的数据以及字典标注的数据上训练,减少人工标注的时间成本,提高模型的召回率。
在一实施例中,所述将所述模型训练集输入至预设的实体识别模型中进行训练的步骤S4,包括:
步骤S401,分别构建所述模型训练集中每一个文本数据对应的字向量以及词向量,将同一个文本数据对应的字向量以及词向量进行拼接得到拼接向量;
步骤S402,将所述拼接向量输入至预设的实体识别模型中,输出得到第一特征向量;
步骤S403,将所述第一特征向量与所述拼接向量进行组合,并输入至预设的实体识别模型中,输出得到第二特征向量;
步骤S404,将所述第二特征向量输入至预设的实体识别模型的分类层中,进行训练以优化所述分类层的网络参数。
在本实施例中,训练上述预设的实体识别模型时,为了加强上述训练集中每一个文本数据的词与字的特性表达,分别构建所述模型训练集中每一个文本数据对应的字向量以及词向量,并将同一个文本数据对应的字向量以及词向量进行拼接得到拼接向量;然后将所述拼接向量输入至预设的实体识别模型中,输出得到第一特征向量。为了进一步提升上述实体识别模型对上述文本数据的特征表达能力、提升特征提取深度,将上述第一特征向量与所述拼接向量进行组合之后,再次输入至预设的实体识别模型中,输出得到第二特征向量,该第二特征向量作为上述文本数据对应的特征向量。最后,将其输入至分类层中进行迭代训练,优化网络参数得到训练完成的实体识别模型。
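步骤S401-S403中向量拼接与两次特征提取的组合方式,可用以下一段示意性代码表示;此处用单层线性变换加tanh代替实际的BiLSTM编码器,各维度取值均为假设:

```python
# 一段示意性代码,说明步骤S401-S403的向量拼接与两次特征提取;
# encode 为占位编码器,实际为实体识别模型(如BiLSTM)的一次前向计算。
import numpy as np

rng = np.random.default_rng(0)
CHAR_DIM, WORD_DIM, HID = 4, 4, 8   # 假设的向量维度

char_vec = rng.normal(size=CHAR_DIM)            # 字向量
word_vec = rng.normal(size=WORD_DIM)            # 词向量
concat = np.concatenate([char_vec, word_vec])   # 拼接向量

def encode(x, w):
    return np.tanh(w @ x)

w1 = rng.normal(size=(HID, CHAR_DIM + WORD_DIM))
feat1 = encode(concat, w1)                      # 第一特征向量

# 将第一特征向量与拼接向量组合后再次输入,得到第二特征向量,送入分类层
w2 = rng.normal(size=(HID, HID + CHAR_DIM + WORD_DIM))
feat2 = encode(np.concatenate([feat1, concat]), w2)
```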
在一实施例中,所述将所述模型训练集输入至预设的实体识别模型中进行训练的步骤S4之前,包括:
获取公开数据集;
基于所述公开数据集,训练初始长短记忆模型,以得到预设的实体识别模型。
在本实施例中,在采用模型训练集训练模型之前,需要首先训练得到上述预设的实体识别模型。在本实施例中,可以采用公开数据集训练初始长短记忆模型,以初始化其中神经网络参数,得到上述预设的实体识别模型。随后再采用模型训练集进行训练,这种方法能够有效提升模型的鲁棒性。
在一实施例中,所述基于知识图谱,构建与预设字典中的实体具备关联关系的实体添加至所述预设字典中,以得到扩充后的字典作为目标字典的步骤S1之前,还包括:
步骤S1a,接收用户输入的模型训练指令,其中所述模型训练指令中携带有所要训练的模型的应用领域信息;
步骤S1b,根据所述应用领域信息,获取对应领域的预设字典。
在本实施例中,为了使训练得到的实体识别模型具有更好的识别效果,应当采用对应领域的标注数据进行模型训练。用户在发出训练模型的需求时,可以输入相应的模型训练指令,该模型训练指令中可以携带有所要训练的模型的应用领域信息。根据该应用领域信息,便可以获取到对应领域的标注数据;采用对应领域的标注数据,便于更好地训练上述模型,得到的实体识别模型在识别对应领域的文本时效果更佳。
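步骤S1a-S1b中根据应用领域信息获取对应领域预设字典的过程,可用以下一段示意性代码表示;其中 DOMAIN_DICTS 为假设的领域-字典映射:

```python
# 一段示意性代码,对应步骤S1a-S1b:根据模型训练指令中携带的应用领域信息,
# 获取对应领域的预设字典;DOMAIN_DICTS 为假设的映射。
DOMAIN_DICTS = {
    "医疗": {"慢性支气管炎", "糖尿病"},
    "金融": {"信用贷款", "保险理赔"},
}

def get_preset_dictionary(instruction):
    domain = instruction.get("domain")
    if domain not in DOMAIN_DICTS:
        raise ValueError("未配置该领域的预设字典: %s" % domain)
    return DOMAIN_DICTS[domain]
```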
在一实施例中,所述计算所述正确概率与预设概率的差值,并判断所述差值是否小于阈值,若不小于,则重新基于所述agent模型从所述目标字典中选择出优化的字典标注数据的步骤S6之后,包括:
步骤S7,迭代训练预设的实体识别模型,直至所述正确概率与预设概率的差值小于所述阈值,得到目标实体识别模型;
步骤S8,接收用户输入的目标文本,以及接收对所述目标文本中的实体识别请求指令;
步骤S9,基于所述请求指令,识别所述目标文本的领域信息;
步骤S10,判断所述目标文本的领域信息与所述目标实体识别模型的应用领域信息是否相同;
步骤S11,若相同,则基于所述目标实体识别模型对所述目标文本进行命名实体识别;若不相同,则获取对应所述目标文本的领域信息的训练数据重新训练所述目标实体识别模型。
在本实施例中,在利用上述目标实体识别模型进行目标文本中的实体识别时,上述目标文本可能不是医疗领域的文本,因此,为了提高识别的准确率,避免识别错误,需要首先识别所述目标文本的领域信息,若该目标文本的领域信息与上述目标实体识别模型的应用领域信息相同,则利用目标实体识别模型进行命名实体识别时,可以显著提升准确率。若目标文本的领域信息与上述目标实体识别模型的应用领域信息不相同,则需要获取对应所述目标文本的领域信息的训练数据重新训练所述目标实体识别模型。
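步骤S8-S11的处理流程可概括为以下一段示意性代码;其中 detect_domain、recognize、retrain 均为假设的占位回调,分别对应领域识别、命名实体识别与重新训练:

```python
# 一段示意性代码,概括步骤S8-S11:领域相同则进行命名实体识别,
# 否则获取对应领域的训练数据重新训练;各回调均为假设的占位函数。
def handle_recognition_request(target_text, model_domain, detect_domain, recognize, retrain):
    text_domain = detect_domain(target_text)
    if text_domain == model_domain:
        return ("ner", recognize(target_text))   # 领域相同:直接进行命名实体识别
    return ("retrain", retrain(text_domain))     # 领域不同:重新训练目标实体识别模型
```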
在一实施例中,上述预设字典、目标字典、agent模型、人工标注数据、预设的实体识别模型,存储于区块链中,区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层。
参照图2,本申请一实施例中还提供了一种标注数据的选择装置,包括:
构建单元,用于基于知识图谱,构建目标实体扩充至预设字典中,以得到扩充后的字典作为目标字典;其中,所述目标字典中均为标注数据;所述目标实体与所述预设字典中的实体具备关联关系;
选择单元,用于基于agent模型从所述目标字典中选择出字典标注数据;
分类单元,用于将预设的人工标注数据分成人工训练集以及人工测试集;
训练单元,用于将所述字典标注数据以及所述人工训练集构成模型训练集,并将所述模型训练集输入至预设的实体识别模型中进行训练;
测试单元,用于将所述人工测试集输入至训练后的实体识别模型中进行测试,得到所述人工测试集的预测标注为正确标注的正确概率;
判断单元,用于计算所述正确概率与预设概率的差值,并判断所述差值是否小于阈值,若不小于,则基于所述agent模型从所述目标字典中选择出优化的字典标注数据,并重新执行将所述字典标注数据以及所述人工训练集构成模型训练集。
在一实施例中,所述训练单元,包括:
构建子单元,用于分别构建所述模型训练集中每一个文本数据对应的字向量以及词向量,将同一个文本数据对应的字向量以及词向量进行拼接得到拼接向量;
第一输出子单元,用于将所述拼接向量输入至预设的实体识别模型中,输出得到第一特征向量;
第二输出子单元,用于将所述第一特征向量与所述拼接向量进行组合,并输入至预设的实体识别模型中,输出得到第二特征向量;
训练子单元,用于将所述第二特征向量输入至预设的实体识别模型的分类层中,进行训练以优化所述分类层的网络参数。
在一实施例中,还包括:
第一获取单元,用于获取公开数据集;
初始训练单元,用于基于所述公开数据集,训练初始长短记忆模型,以得到预设的实体识别模型。
在一实施例中,还包括:
第一接收单元,用于接收用户输入的模型训练指令,其中所述模型训练指令中携带有所要训练的模型的应用领域信息;
第二获取单元,用于根据所述应用领域信息,获取对应领域的预设字典。
在一实施例中,还包括:
迭代单元,用于迭代训练预设的实体识别模型,直至所述正确概率与预设概率的差值小于所述阈值,得到目标实体识别模型;
第二接收单元,用于接收用户输入的目标文本,以及接收对所述目标文本中的实体识别请求指令;
识别单元,用于基于所述请求指令,识别所述目标文本的领域信息;
领域判断单元,用于判断所述目标文本的领域信息与所述目标实体识别模型的应用领域信息是否相同;
处理单元,用于若相同,则基于所述目标实体识别模型对所述目标文本进行命名实体识别;若不相同,则获取对应所述目标文本的领域信息的训练数据重新训练所述目标实体识别模型。
在一实施例中,所述装置还包括:
存储单元,用于将所述目标字典、agent模型、人工标注数据、预设的实体识别模型存储于区块链中。
在本实施例中,上述单元、子单元的具体实现请参照上述方法实施例中所述,在此不再进行赘述。
参照图3,本申请实施例中还提供一种计算机设备,该计算机设备可以是服务器,其内部结构可以如图3所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机程序和数据库。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的数据库用于存储标注数据、模型等。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机程序被处理器执行时以实现一种标注数据的选择方法:
上述标注数据的选择方法,包括以下步骤:
基于知识图谱,构建目标实体扩充至预设字典中,以得到扩充后的字典作为目标字典;其中,所述目标字典中均为标注数据;所述目标实体与所述预设字典中的实体具备关联关系;
基于agent模型从所述目标字典中选择出字典标注数据;
将预设的人工标注数据分成人工训练集以及人工测试集;
将所述字典标注数据以及所述人工训练集构成模型训练集,并将所述模型训练集输入至预设的实体识别模型中进行训练;
将所述人工测试集输入至训练后的实体识别模型中进行测试,得到所述人工测试集的预测标注为正确标注的正确概率;
计算所述正确概率与预设概率的差值,并判断所述差值是否小于阈值,若不小于,则基于所述agent模型从所述目标字典中选择出优化的字典标注数据,并重新进入将所述字典标注数据以及所述人工训练集构成模型训练集的步骤。
本领域技术人员可以理解,图3中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定。
本申请一实施例还提供一种计算机可读存储介质,其上存储有计算机程序,计算机程序被处理器执行时实现一种标注数据的选择方法。
上述标注数据的选择方法,包括以下步骤:
基于知识图谱,构建目标实体扩充至预设字典中,以得到扩充后的字典作为目标字典;其中,所述目标字典中均为标注数据;所述目标实体与所述预设字典中的实体具备关联关系;
基于agent模型从所述目标字典中选择出字典标注数据;
将预设的人工标注数据分成人工训练集以及人工测试集;
将所述字典标注数据以及所述人工训练集构成模型训练集,并将所述模型训练集输入至预设的实体识别模型中进行训练;
将所述人工测试集输入至训练后的实体识别模型中进行测试,得到所述人工测试集的预测标注为正确标注的正确概率;
计算所述正确概率与预设概率的差值,并判断所述差值是否小于阈值,若不小于,则基于所述agent模型从所述目标字典中选择出优化的字典标注数据,并重新进入将所述字典标注数据以及所述人工训练集构成模型训练集的步骤。
可以理解的是,本实施例中的计算机可读存储介质可以是易失性可读存储介质,也可以为非易失性可读存储介质。
综上所述,本申请实施例中提供的标注数据的选择方法、装置、计算机设备和存储介质,基于知识图谱,构建目标实体添加至预设字典中,以得到扩充后的字典作为目标字典,使得目标字典中的字典标注数据更完整;同时,基于人工标注数据与字典标注数据共同训练实体识别模型,判断选择出的字典标注数据的质量是否符合要求,若不符合,则从目标字典中选择出优化的字典标注数据,即实现了选择出质量更高的字典标注数据。
以上所述仅为本申请的优选实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其它相关的技术领域,均同理包括在本申请的专利保护范围内。

Claims (20)

  1. 一种标注数据的选择方法,其中,包括以下步骤:
    基于知识图谱,构建目标实体扩充至预设字典中,以得到扩充后的字典作为目标字典;其中,所述目标字典中均为标注数据;所述目标实体与所述预设字典中的实体具备关联关系;
    基于agent模型从所述目标字典中选择出字典标注数据;
    将预设的人工标注数据分成人工训练集以及人工测试集;
    将所述字典标注数据以及所述人工训练集构成模型训练集,并将所述模型训练集输入至预设的实体识别模型中进行训练;
    将所述人工测试集输入至训练后的实体识别模型中进行测试,得到所述人工测试集的预测标注为正确标注的正确概率;
    计算所述正确概率与预设概率的差值,并判断所述差值是否小于阈值,若不小于,则基于所述agent模型从所述目标字典中选择出优化的字典标注数据,并重新进入将所述字典标注数据以及所述人工训练集构成模型训练集的步骤。
  2. 根据权利要求1所述的标注数据的选择方法,其中,所述将所述模型训练集输入至预设的实体识别模型中进行训练的步骤,包括:
    分别构建所述模型训练集中每一个文本数据对应的字向量以及词向量,将同一个文本数据对应的字向量以及词向量进行拼接得到拼接向量;
    将所述拼接向量输入至预设的实体识别模型中,输出得到第一特征向量;
    将所述第一特征向量与所述拼接向量进行组合,并输入至预设的实体识别模型中,输出得到第二特征向量;
    将所述第二特征向量输入至预设的实体识别模型的分类层中,进行训练以优化所述分类层的网络参数。
  3. 根据权利要求1所述的标注数据的选择方法,其中,所述将所述模型训练集输入至预设的实体识别模型中进行训练的步骤之前,包括:
    获取公开数据集;
    基于所述公开数据集,训练初始长短记忆模型,以得到预设的实体识别模型。
  4. 根据权利要求1所述的标注数据的选择方法,其中,所述基于知识图谱,构建与预设字典中的实体具备关联关系的实体添加至所述预设字典中,以得到扩充后的字典作为目标字典的步骤之前,还包括:
    接收用户输入的模型训练指令,其中所述模型训练指令中携带有所要训练的模型的应用领域信息;
    根据所述应用领域信息,获取对应领域的预设字典。
  5. 根据权利要求4所述的标注数据的选择方法,其中,所述计算所述正确概率与预设概率的差值,并判断所述差值是否小于阈值,若不小于,则重新基于所述agent模型从所述目标字典中选择出优化的字典标注数据的步骤之后,包括:
    迭代训练预设的实体识别模型,直至所述正确概率与预设概率的差值小于所述阈值,得到目标实体识别模型;
    接收用户输入的目标文本,以及接收对所述目标文本中的实体识别请求指令;
    基于所述请求指令,识别所述目标文本的领域信息;
    判断所述目标文本的领域信息与所述目标实体识别模型的应用领域信息是否相同;
    若相同,则基于所述目标实体识别模型对所述目标文本进行命名实体识别;若不相同,则获取对应所述目标文本的领域信息的训练数据重新训练所述目标实体识别模型。
  6. 根据权利要求1所述的标注数据的选择方法,其中,还包括:
    将所述目标字典、agent模型、人工标注数据、预设的实体识别模型存储于区块链中。
  7. 一种标注数据的选择装置,其中,包括:
    构建单元,用于基于知识图谱,构建目标实体扩充至预设字典中,以得到扩充后的字典作为目标字典;其中,所述目标字典中均为标注数据;所述目标实体与所述预设字典中的实体具备关联关系;
    选择单元,用于基于agent模型从所述目标字典中选择出字典标注数据;
    分类单元,用于将预设的人工标注数据分成人工训练集以及人工测试集;
    训练单元,用于将所述字典标注数据以及所述人工训练集构成模型训练集,并将所述模型训练集输入至预设的实体识别模型中进行训练;
    测试单元,用于将所述人工测试集输入至训练后的实体识别模型中进行测试,得到所述人工测试集的预测标注为正确标注的正确概率;
    判断单元,用于计算所述正确概率与预设概率的差值,并判断所述差值是否小于阈值,若不小于,则重新基于所述agent模型从所述目标字典中选择出优化的字典标注数据,并重新执行将所述字典标注数据以及所述人工训练集构成模型训练集。
  8. 根据权利要求7所述的标注数据的选择装置,其特征在于,所述训练单元,包括:
    构建子单元,用于分别构建所述模型训练集中每一个文本数据对应的字向量以及词向量,将同一个文本数据对应的字向量以及词向量进行拼接得到拼接向量;
    第一输出子单元,用于将所述拼接向量输入至预设的实体识别模型中,输出得到第一特征向量;
    第二输出子单元,用于将所述第一特征向量与所述拼接向量进行组合,并输入至预设的实体识别模型中,输出得到第二特征向量;
    训练子单元,用于将所述第二特征向量输入至预设的实体识别模型的分类层中,进行训练以优化所述分类层的网络参数。
  9. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机程序,其中,所述处理器执行所述计算机程序时实现一种标注数据的选择方法的步骤:
    基于知识图谱,构建目标实体扩充至预设字典中,以得到扩充后的字典作为目标字典;其中,所述目标字典中均为标注数据;所述目标实体与所述预设字典中的实体具备关联关系;
    基于agent模型从所述目标字典中选择出字典标注数据;
    将预设的人工标注数据分成人工训练集以及人工测试集;
    将所述字典标注数据以及所述人工训练集构成模型训练集,并将所述模型训练集输入至预设的实体识别模型中进行训练;
    将所述人工测试集输入至训练后的实体识别模型中进行测试,得到所述人工测试集的预测标注为正确标注的正确概率;
    计算所述正确概率与预设概率的差值,并判断所述差值是否小于阈值,若不小于,则基于所述agent模型从所述目标字典中选择出优化的字典标注数据,并重新进入将所述字典标注数据以及所述人工训练集构成模型训练集的步骤。
  10. 根据权利要求9所述的计算机设备,其中,所述将所述模型训练集输入至预设的实体识别模型中进行训练的步骤,包括:
    分别构建所述模型训练集中每一个文本数据对应的字向量以及词向量,将同一个文本数据对应的字向量以及词向量进行拼接得到拼接向量;
    将所述拼接向量输入至预设的实体识别模型中,输出得到第一特征向量;
    将所述第一特征向量与所述拼接向量进行组合,并输入至预设的实体识别模型中,输出得到第二特征向量;
    将所述第二特征向量输入至预设的实体识别模型的分类层中,进行训练以优化所述分类层的网络参数。
  11. 根据权利要求9所述的计算机设备,其中,所述将所述模型训练集输入至预设的实体识别模型中进行训练的步骤之前,包括:
    获取公开数据集;
    基于所述公开数据集,训练初始长短记忆模型,以得到预设的实体识别模型。
  12. 根据权利要求9所述的计算机设备,其中,所述基于知识图谱,构建与预设字典中的实体具备关联关系的实体添加至所述预设字典中,以得到扩充后的字典作为目标字典的步骤之前,还包括:
    接收用户输入的模型训练指令,其中所述模型训练指令中携带有所要训练的模型的应用领域信息;
    根据所述应用领域信息,获取对应领域的预设字典。
  13. 根据权利要求12所述的计算机设备,其中,所述计算所述正确概率与预设概率的差值,并判断所述差值是否小于阈值,若不小于,则重新基于所述agent模型从所述目标字典中选择出优化的字典标注数据的步骤之后,包括:
    迭代训练预设的实体识别模型,直至所述正确概率与预设概率的差值小于所述阈值,得到目标实体识别模型;
    接收用户输入的目标文本,以及接收对所述目标文本中的实体识别请求指令;
    基于所述请求指令,识别所述目标文本的领域信息;
    判断所述目标文本的领域信息与所述目标实体识别模型的应用领域信息是否相同;
    若相同,则基于所述目标实体识别模型对所述目标文本进行命名实体识别;若不相同,则获取对应所述目标文本的领域信息的训练数据重新训练所述目标实体识别模型。
  14. 根据权利要求9所述的计算机设备,其中,还包括:
    将所述目标字典、agent模型、人工标注数据、预设的实体识别模型存储于区块链中。
  15. 一种计算机可读存储介质,其上存储有计算机程序,其中,所述计算机程序被处理器执行时实现一种标注数据的选择方法的步骤:
    基于知识图谱,构建目标实体扩充至预设字典中,以得到扩充后的字典作为目标字典;其中,所述目标字典中均为标注数据;所述目标实体与所述预设字典中的实体具备关联关系;
    基于agent模型从所述目标字典中选择出字典标注数据;
    将预设的人工标注数据分成人工训练集以及人工测试集;
    将所述字典标注数据以及所述人工训练集构成模型训练集,并将所述模型训练集输入至预设的实体识别模型中进行训练;
    将所述人工测试集输入至训练后的实体识别模型中进行测试,得到所述人工测试集的预测标注为正确标注的正确概率;
    计算所述正确概率与预设概率的差值,并判断所述差值是否小于阈值,若不小于,则基于所述agent模型从所述目标字典中选择出优化的字典标注数据,并重新进入将所述字典标注数据以及所述人工训练集构成模型训练集的步骤。
  16. 根据权利要求15所述的计算机可读存储介质,其中,所述将所述模型训练集输入至预设的实体识别模型中进行训练的步骤,包括:
    分别构建所述模型训练集中每一个文本数据对应的字向量以及词向量,将同一个文本数据对应的字向量以及词向量进行拼接得到拼接向量;
    将所述拼接向量输入至预设的实体识别模型中,输出得到第一特征向量;
    将所述第一特征向量与所述拼接向量进行组合,并输入至预设的实体识别模型中,输出得到第二特征向量;
    将所述第二特征向量输入至预设的实体识别模型的分类层中,进行训练以优化所述分类层的网络参数。
  17. 根据权利要求15所述的计算机可读存储介质,其中,所述将所述模型训练集输入至预设的实体识别模型中进行训练的步骤之前,包括:
    获取公开数据集;
    基于所述公开数据集,训练初始长短记忆模型,以得到预设的实体识别模型。
  18. 根据权利要求15所述的计算机可读存储介质,其中,所述基于知识图谱,构建与预设字典中的实体具备关联关系的实体添加至所述预设字典中,以得到扩充后的字典作为目标字典的步骤之前,还包括:
    接收用户输入的模型训练指令,其中所述模型训练指令中携带有所要训练的模型的应用领域信息;
    根据所述应用领域信息,获取对应领域的预设字典。
  19. 根据权利要求18所述的计算机可读存储介质,其中,所述计算所述正确概率与预设概率的差值,并判断所述差值是否小于阈值,若不小于,则重新基于所述agent模型从所述目标字典中选择出优化的字典标注数据的步骤之后,包括:
    迭代训练预设的实体识别模型,直至所述正确概率与预设概率的差值小于所述阈值,得到目标实体识别模型;
    接收用户输入的目标文本,以及接收对所述目标文本中的实体识别请求指令;
    基于所述请求指令,识别所述目标文本的领域信息;
    判断所述目标文本的领域信息与所述目标实体识别模型的应用领域信息是否相同;
    若相同,则基于所述目标实体识别模型对所述目标文本进行命名实体识别;若不相同,则获取对应所述目标文本的领域信息的训练数据重新训练所述目标实体识别模型。
  20. 根据权利要求15所述的计算机可读存储介质,其中,还包括:
    将所述目标字典、agent模型、人工标注数据、预设的实体识别模型存储于区块链中。
PCT/CN2020/118533 2020-06-24 2020-09-28 标注数据的选择方法、装置、计算机设备和存储介质 WO2021139257A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010592331.4 2020-06-24
CN202010592331.4A CN111832294B (zh) 2020-06-24 2020-06-24 标注数据的选择方法、装置、计算机设备和存储介质

Publications (1)

Publication Number Publication Date
WO2021139257A1 true WO2021139257A1 (zh) 2021-07-15

Family

ID=72898915

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118533 WO2021139257A1 (zh) 2020-06-24 2020-09-28 标注数据的选择方法、装置、计算机设备和存储介质

Country Status (2)

Country Link
CN (1) CN111832294B (zh)
WO (1) WO2021139257A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115757784A (zh) * 2022-11-21 2023-03-07 中科世通亨奇(北京)科技有限公司 基于标注模型和标签模板筛选的语料标注方法及装置

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807097A (zh) * 2020-10-30 2021-12-17 北京中科凡语科技有限公司 命名实体识别模型建立方法及命名实体识别方法
CN113158652B (zh) * 2021-04-19 2024-03-19 平安科技(深圳)有限公司 基于深度学习模型的数据增强方法、装置、设备及介质
CN112926697B (zh) * 2021-04-21 2021-10-12 北京科技大学 一种基于语义分割的磨粒图像分类方法及装置
CN113268593A (zh) * 2021-05-18 2021-08-17 Oppo广东移动通信有限公司 意图分类和模型的训练方法、装置、终端及存储介质
CN113378570B (zh) * 2021-06-01 2023-12-12 车智互联(北京)科技有限公司 一种实体识别模型的生成方法、计算设备及可读存储介质
CN113434491B (zh) * 2021-06-18 2022-09-02 深圳市曙光信息技术有限公司 面向深度学习ocr识别的字模数据清洗方法、系统及介质
CN113591467B (zh) * 2021-08-06 2023-11-03 北京金堤征信服务有限公司 事件主体识别方法及装置、电子设备、介质
CN114004233B (zh) * 2021-12-30 2022-05-06 之江实验室 一种基于半训练和句子选择的远程监督命名实体识别方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108874878A (zh) * 2018-05-03 2018-11-23 众安信息技术服务有限公司 一种知识图谱的构建系统及方法
CN110008473A (zh) * 2019-04-01 2019-07-12 云知声(上海)智能科技有限公司 一种基于迭代方法的医疗文本命名实体识别标注方法
CN110020438A (zh) * 2019-04-15 2019-07-16 上海冰鉴信息科技有限公司 基于序列识别的企业或组织中文名称实体消歧方法和装置
CN110287481A (zh) * 2019-05-29 2019-09-27 西南电子技术研究所(中国电子科技集团公司第十研究所) 命名实体语料标注训练系统
CN110335676A (zh) * 2019-07-09 2019-10-15 泰康保险集团股份有限公司 数据处理方法、装置、介质及电子设备
US20190347571A1 (en) * 2017-02-03 2019-11-14 Koninklijke Philips N.V. Classifier training

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908085B (zh) * 2010-06-28 2012-09-05 北京航空航天大学 一种基于多Agent的分布式推演仿真系统与方法
CN107808124B (zh) * 2017-10-09 2019-03-26 平安科技(深圳)有限公司 电子装置、医疗文本实体命名的识别方法及存储介质
CN109697289B (zh) * 2018-12-28 2023-01-13 北京工业大学 一种改进的用于命名实体识别的主动学习方法
CN110134969B (zh) * 2019-05-27 2023-07-14 北京奇艺世纪科技有限公司 一种实体识别方法和装置
CN110717040A (zh) * 2019-09-18 2020-01-21 平安科技(深圳)有限公司 词典扩充方法及装置、电子设备、存储介质
CN111178045A (zh) * 2019-10-14 2020-05-19 深圳软通动力信息技术有限公司 基于领域的非监督式中文语义概念词典的自动构建方法、电子设备及存储介质
CN111259134B (zh) * 2020-01-19 2023-08-08 出门问问信息科技有限公司 一种实体识别方法、设备及计算机可读存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190347571A1 (en) * 2017-02-03 2019-11-14 Koninklijke Philips N.V. Classifier training
CN108874878A (zh) * 2018-05-03 2018-11-23 众安信息技术服务有限公司 一种知识图谱的构建系统及方法
CN110008473A (zh) * 2019-04-01 2019-07-12 云知声(上海)智能科技有限公司 一种基于迭代方法的医疗文本命名实体识别标注方法
CN110020438A (zh) * 2019-04-15 2019-07-16 上海冰鉴信息科技有限公司 基于序列识别的企业或组织中文名称实体消歧方法和装置
CN110287481A (zh) * 2019-05-29 2019-09-27 西南电子技术研究所(中国电子科技集团公司第十研究所) 命名实体语料标注训练系统
CN110335676A (zh) * 2019-07-09 2019-10-15 泰康保险集团股份有限公司 数据处理方法、装置、介质及电子设备

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115757784A (zh) * 2022-11-21 2023-03-07 中科世通亨奇(北京)科技有限公司 基于标注模型和标签模板筛选的语料标注方法及装置
CN115757784B (zh) * 2022-11-21 2023-07-07 中科世通亨奇(北京)科技有限公司 基于标注模型和标签模板筛选的语料标注方法及装置

Also Published As

Publication number Publication date
CN111832294A (zh) 2020-10-27
CN111832294B (zh) 2022-08-16

Similar Documents

Publication Publication Date Title
WO2021139257A1 (zh) 标注数据的选择方法、装置、计算机设备和存储介质
WO2021135910A1 (zh) 基于机器阅读理解的信息抽取方法、及其相关设备
WO2021218024A1 (zh) 命名实体识别模型的训练方法、装置、计算机设备
US10755048B2 (en) Artificial intelligence based method and apparatus for segmenting sentence
WO2021189971A1 (zh) 基于知识图谱表征学习的医疗方案推荐系统及方法
JP7143456B2 (ja) 医学的事実の検証方法及び検証装置、電子機器、コンピュータ可読記憶媒体並びにコンピュータプログラム
WO2021151353A1 (zh) 医学实体关系抽取方法、装置、计算机设备及可读存储介质
WO2021139247A1 (zh) 医学领域知识图谱的构建方法、装置、设备及存储介质
WO2021179693A1 (zh) 医疗文本翻译方法、装置及存储介质
CN109857846B (zh) 用户问句与知识点的匹配方法和装置
WO2019232893A1 (zh) 文本的情感分析方法、装置、计算机设备和存储介质
CN113140254B (zh) 元学习药物-靶点相互作用预测系统及预测方法
CN110162675B (zh) 应答语句的生成方法、装置、计算机可读介质及电子设备
WO2023207096A1 (zh) 一种实体链接方法、装置、设备及非易失性可读存储介质
CN111159770A (zh) 文本数据脱敏方法、装置、介质及电子设备
CN113707299A (zh) 基于问诊会话的辅助诊断方法、装置及计算机设备
CN115798661A (zh) 临床医学领域的知识挖掘方法和装置
CN114357195A (zh) 基于知识图谱的问答对生成方法、装置、设备及介质
CN111723870B (zh) 基于人工智能的数据集获取方法、装置、设备和介质
CN113705207A (zh) 语法错误识别方法及装置
CN115081452B (zh) 一种实体关系的抽取方法
WO2022271369A1 (en) Training of an object linking model
CN110147556B (zh) 一种多向神经网络翻译系统的构建方法
CN113539520A (zh) 实现问诊会话的方法、装置、计算机设备及存储介质
WO2022141855A1 (zh) 文本正则方法、装置、电子设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20911851

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20911851

Country of ref document: EP

Kind code of ref document: A1