CN111832294B - Method and device for selecting annotation data, computer device and storage medium - Google Patents

Method and device for selecting annotation data, computer device and storage medium

Info

Publication number
CN111832294B
Authority
CN
China
Prior art keywords
dictionary
model
data
target
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010592331.4A
Other languages
Chinese (zh)
Other versions
CN111832294A (en)
Inventor
梁欣
顾婷婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202010592331.4A (granted as CN111832294B)
Priority to PCT/CN2020/118533 (published as WO2021139257A1)
Publication of CN111832294A
Application granted
Publication of CN111832294B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the field of blockchain technology, and provides a method and a device for selecting annotation data, a computer device and a storage medium. The method comprises the following steps: selecting dictionary annotation data from a target dictionary based on an agent model stored in a blockchain; dividing preset manual annotation data into a manual training set and a manual test set; forming a model training set from the dictionary annotation data and the manual training set, and inputting the model training set into a preset entity recognition model for training; inputting the manual test set into the trained entity recognition model for testing, to obtain the correct probability that the predicted labels of the manual test set are the correct labels; and calculating the difference between the correct probability and a preset probability, judging whether the difference is smaller than a threshold, and if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model. With the method and the device, high-quality annotation data can be selected. The scheme of the application can also be used in the smart healthcare field of smart cities, thereby promoting the construction of smart cities.

Description

Method and device for selecting annotation data, computer device and storage medium
Technical Field
The present application relates to the field of blockchain technology, and in particular, to a method and an apparatus for selecting annotation data, a computer device, and a storage medium.
Background
Entity recognition is the first, and a very critical, step in natural language processing tasks. Particularly in vertical fields such as finance, e-commerce and healthcare, entity recognition is key to natural language processing, because errors introduced by this upstream task propagate layer by layer into downstream tasks such as entity linking, relation extraction between entities, and relation classification.
With the development of deep learning, neural network methods combined with the traditional conditional random field (CRF) can achieve very good results on entity recognition tasks. However, in business scenarios the application of deep learning also brings problems. Although a neural network has a strong capability of learning features autonomously, it usually requires a large amount of training data that conforms to the real distribution, and for an entity recognition task in a new field, producing high-quality annotation data consumes a large amount of annotation time and manual labeling cost. In vertical fields, although a related domain dictionary can be used to label data through remote supervision, this introduces noisy data and incompletely labeled entities, which greatly affects the entity recognition task. For example, in the medical field, a disease expression such as "diabetes with ketosis" may be labeled only as the incomplete entity "diabetes", and "allergic asthma" only as "asthma". In medicine, however, the description and treatment of these different entities are not identical. Using dictionary labeling alone would leave the model unable to learn the features of such combined symptoms, so the final entity labeling performs poorly, and subsequent downstream tasks also degrade through error propagation.
Disclosure of Invention
The present application mainly aims to provide a method, an apparatus, a computer device and a storage medium for selecting annotation data, with the aim of overcoming the defects that current annotation data are incomplete and that high-quality annotation data cannot be selected.
In order to achieve the above object, the present application provides a method for selecting annotation data, comprising the following steps:
constructing a target entity based on a knowledge graph and adding it to a preset dictionary to obtain an expanded dictionary as a target dictionary, wherein all entries in the target dictionary are annotation data, and the target entity has an association relation with an entity in the preset dictionary;
selecting dictionary annotation data from the target dictionary based on an agent model;
dividing preset manual annotation data into a manual training set and a manual test set;
forming a model training set from the dictionary annotation data and the manual training set, and inputting the model training set into a preset entity recognition model for training;
inputting the manual test set into the trained entity recognition model for testing, to obtain the correct probability that the predicted labels of the manual test set are the correct labels;
and calculating the difference between the correct probability and a preset probability, judging whether the difference is smaller than a threshold, and if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model and re-entering the step of forming a model training set from the dictionary annotation data and the manual training set.
Further, the step of inputting the model training set into a preset entity recognition model for training includes:
constructing a character vector and a word vector corresponding to each piece of text data in the model training set, and concatenating the character vector and the word vector corresponding to the same text data to obtain a concatenated vector;
inputting the concatenated vector into the preset entity recognition model, and outputting a first feature vector;
combining the first feature vector with the concatenated vector, inputting the combination into the preset entity recognition model, and outputting a second feature vector;
and inputting the second feature vector into a classification layer of the preset entity recognition model, and training to optimize network parameters of the classification layer.
Further, before the step of inputting the model training set into a preset entity recognition model for training, the method includes:
obtaining a public data set;
and training an initial long short-term memory (LSTM) model based on the public data set to obtain the preset entity recognition model.
Further, before the step of constructing, based on the knowledge graph, an entity having an association relation with an entity in the preset dictionary and adding it to the preset dictionary to obtain an expanded dictionary as the target dictionary, the method further includes:
receiving a model training instruction input by a user, wherein the model training instruction carries application field information of the model to be trained;
and acquiring a preset dictionary of the corresponding field according to the application field information.
Further, after the step of calculating the difference between the correct probability and the preset probability, judging whether the difference is smaller than the threshold and, if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model, the method includes:
iteratively training the preset entity recognition model until the difference between the correct probability and the preset probability is smaller than the threshold, to obtain a target entity recognition model;
receiving a target text input by a user, and receiving an instruction requesting entity recognition in the target text;
identifying the field information of the target text based on the request instruction;
judging whether the field information of the target text is the same as the application field information of the target entity recognition model;
if they are the same, performing named entity recognition on the target text based on the target entity recognition model; and if not, acquiring training data of the field information corresponding to the target text to retrain the target entity recognition model.
Further, the method further comprises:
and storing the target dictionary, the agent model, the manual annotation data and the preset entity recognition model in a blockchain.
The present application further provides a device for selecting annotation data, including:
a construction unit, configured to construct a target entity based on a knowledge graph and add it to a preset dictionary to obtain an expanded dictionary as a target dictionary, wherein all entries in the target dictionary are annotation data, and the target entity has an association relation with an entity in the preset dictionary;
a selecting unit, configured to select dictionary annotation data from the target dictionary based on an agent model;
a classification unit, configured to divide preset manual annotation data into a manual training set and a manual test set;
a training unit, configured to form a model training set from the dictionary annotation data and the manual training set, and input the model training set into a preset entity recognition model for training;
a testing unit, configured to input the manual test set into the trained entity recognition model for testing, to obtain the correct probability that the predicted labels of the manual test set are the correct labels;
and a judging unit, configured to calculate the difference between the correct probability and a preset probability, judge whether the difference is smaller than a threshold, and if not, select optimized dictionary annotation data from the target dictionary based on the agent model and re-execute the step of forming a model training set from the dictionary annotation data and the manual training set.
Further, the training unit includes:
a building subunit, configured to construct a character vector and a word vector corresponding to each piece of text data in the model training set, and concatenate the character vector and the word vector corresponding to the same text data to obtain a concatenated vector;
a first output subunit, configured to input the concatenated vector into the preset entity recognition model and output a first feature vector;
a second output subunit, configured to combine the first feature vector with the concatenated vector, input the combination into the preset entity recognition model, and output a second feature vector;
and a training subunit, configured to input the second feature vector into a classification layer of the preset entity recognition model and train to optimize network parameters of the classification layer.
Further, the device also includes:
a first acquisition unit, configured to acquire a public data set;
and an initial training unit, configured to train an initial long short-term memory (LSTM) model based on the public data set to obtain the preset entity recognition model.
Further, the device also includes:
a first receiving unit, configured to receive a model training instruction input by a user, wherein the model training instruction carries application field information of the model to be trained;
and a second acquisition unit, configured to acquire a preset dictionary of the corresponding field according to the application field information.
Further, the device also includes:
an iteration unit, configured to iteratively train the preset entity recognition model until the difference between the correct probability and the preset probability is smaller than the threshold, to obtain a target entity recognition model;
a second receiving unit, configured to receive a target text input by a user and receive an instruction requesting entity recognition in the target text;
an identification unit, configured to identify the field information of the target text based on the request instruction;
a field judging unit, configured to judge whether the field information of the target text is the same as the application field information of the target entity recognition model;
and a processing unit, configured to perform named entity recognition on the target text based on the target entity recognition model if they are the same, and if not, to acquire training data of the field information corresponding to the target text to retrain the target entity recognition model.
The present application further provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of any one of the above methods when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the above.
According to the method and apparatus for selecting annotation data, the computer device and the storage medium provided herein, a target entity is constructed based on a knowledge graph and added to a preset dictionary to obtain an expanded dictionary as a target dictionary, so that the dictionary annotation data in the target dictionary is more complete; meanwhile, an entity recognition model is trained on the manual annotation data and the dictionary annotation data together, and whether the quality of the selected dictionary annotation data meets the requirement is judged; if not, optimized dictionary annotation data is selected from the target dictionary, so that dictionary annotation data of higher quality is obtained.
Drawings
FIG. 1 is a schematic diagram illustrating a method for selecting annotation data according to an embodiment of the present application;
FIG. 2 is a block diagram illustrating an apparatus for selecting annotation data according to an embodiment of the present application;
fig. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, an embodiment of the present application provides a method for selecting annotation data, including the following steps:
step S1, constructing a target entity based on a knowledge graph and adding it to a preset dictionary to obtain an expanded dictionary as a target dictionary, wherein all entries in the target dictionary are annotation data, and the target entity has an association relation with an entity in the preset dictionary;
step S2, selecting dictionary annotation data from the target dictionary based on an agent model;
step S3, dividing preset manual annotation data into a manual training set and a manual test set;
step S4, forming a model training set from the dictionary annotation data and the manual training set, and inputting the model training set into a preset entity recognition model for training;
step S5, inputting the manual test set into the trained entity recognition model for testing, to obtain the correct probability that the predicted labels of the manual test set are the correct labels;
step S6, calculating the difference between the correct probability and a preset probability, judging whether the difference is smaller than a threshold, and if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model and re-entering the step of forming a model training set from the dictionary annotation data and the manual training set.
In this embodiment, the method is applied in the process of training an entity recognition model for recognizing entities in medical literature, in order to filter the annotation data required for training. The scheme in this embodiment can also be applied in the smart healthcare field of smart cities, thereby promoting the construction of smart cities. In business scenarios in the smart healthcare field, high-quality annotation data for training an entity recognition model is scarce and usually produced by manual labeling. Therefore, in this embodiment, a small amount of high-quality manual annotation data is combined with dictionaries of similar fields to obtain training samples, which effectively increases the data volume, gives the model a large training set, and improves the generalization of the model.
Specifically, as described in step S1, the preset dictionary is annotation data obtained by labeling sentences with a vertical-domain entity dictionary. To further enhance the completeness and accuracy of the annotation data in the dictionary, a target entity having an association relation with an entity in the preset dictionary is constructed based on a knowledge graph and added to the preset dictionary to expand it. The association relation covers the following cases: constructing corresponding aliases for disease and symptom entities in the preset dictionary, for example expanding "chronic bronchitis" with its colloquial alias "slow bronchitis"; constructing target entities with high similarity to an entity in the preset dictionary, where the similarity can be computed from features such as the shortest string edit distance, pinyin and radicals, used individually or in combination; and replacing parts of an entity's textual description with synonyms or antonyms, for example expanding "acute asthma" to "chronic asthma" and "diabetes with hypertension" to "diabetes without hypertension". After the expansion, the amount of annotation data in the preset dictionary is increased, and the description of medical entities is more complete and accurate.
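By way of illustration only, the expansion step can be sketched in Python as below. This is a minimal sketch under assumptions not taken from the patent: the knowledge graph is exposed as a plain mapping from entity names to alias and related-entity lists, and difflib's match ratio with a 0.8 threshold stands in for the edit-distance, pinyin and radical similarity features described above.

    from difflib import SequenceMatcher

    def string_similarity(a: str, b: str) -> float:
        # Stand-in for the combined edit-distance/pinyin/radical similarity.
        return SequenceMatcher(None, a, b).ratio()

    def expand_dictionary(preset_dict, knowledge_graph, threshold=0.8):
        # preset_dict: set of entity strings; knowledge_graph: dict mapping
        # an entity to {"aliases": [...], "related": [...]} (assumed schema).
        target_dict = set(preset_dict)
        for entity in preset_dict:
            node = knowledge_graph.get(entity, {})
            # Aliases, e.g. "chronic bronchitis" -> "slow bronchitis".
            target_dict.update(node.get("aliases", []))
            # Related candidates kept only when sufficiently similar.
            for cand in node.get("related", []):
                if string_similarity(entity, cand) >= threshold:
                    target_dict.add(cand)
        return target_dict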
As described in step S2, the agent model is obtained by reinforcement learning training and is used to select correctly labeled dictionary annotation data from the target dictionary. The data selected in each round is guided so that the annotation quality becomes higher and higher, and the selected data is used to train the entity recognition model. Because dictionary-labeled data can be incomplete or incorrect, the agent model must continuously select more accurate data, that is, it optimizes the dictionary annotation data used for training the entity recognition model.
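One way to realize such an agent is a stochastic keep/drop policy updated with a REINFORCE-style gradient. The sketch below rests on illustrative assumptions the patent does not fix: each dictionary-labeled sentence is summarized as a fixed-length feature vector, and the entity recognition model's test score serves as the reward.

    import numpy as np

    class SelectionAgent:
        """Keep/drop policy over dictionary-labeled sentences (sketch)."""

        def __init__(self, n_features: int, lr: float = 0.01, seed: int = 0):
            self.w = np.zeros(n_features)
            self.lr = lr
            self.rng = np.random.default_rng(seed)

        def _keep_probs(self, features: np.ndarray) -> np.ndarray:
            return 1.0 / (1.0 + np.exp(-features @ self.w))

        def select(self, features: np.ndarray) -> np.ndarray:
            # Sample one keep/drop action per sentence.
            return self.rng.random(len(features)) < self._keep_probs(features)

        def update(self, features, actions, reward: float) -> None:
            # REINFORCE: gradient of the log-likelihood of the sampled
            # actions, scaled by the reward (e.g. test-set correct probability).
            probs = self._keep_probs(features)
            grad = ((actions.astype(float) - probs)[:, None] * features).mean(axis=0)
            self.w += self.lr * reward * grad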
As described in step S3, the manual annotation data is obtained by manual labeling and is of high quality. Since model training requires both a training phase and a testing phase, the manual annotation data is divided into a manual training set and a manual test set.
As described in step S4, the data volume of the manual training set is small; it is therefore combined with the dictionary annotation data selected from the target dictionary to form the training data, which yields the model training set and increases the amount of training data. The model training set is then input into a preset entity recognition model for training, so as to improve the generalization of the entity recognition model. The entity recognition model includes a BiLSTM-CRF model.
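A minimal sketch of steps S3 and S4 follows, assuming sentences are stored as (tokens, tags) pairs; the 9:1 split ratio and the helper names are assumptions made for illustration.

    import random

    def build_model_training_set(manual_data, dictionary_data,
                                 test_ratio=0.1, seed=42):
        rng = random.Random(seed)
        data = list(manual_data)
        rng.shuffle(data)
        n_test = int(len(data) * test_ratio)
        manual_test, manual_train = data[:n_test], data[n_test:]
        # Pool the small manual training set with the agent-selected
        # dictionary annotation data to enlarge the training data.
        model_training_set = manual_train + list(dictionary_data)
        rng.shuffle(model_training_set)
        return model_training_set, manual_train, manual_test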
After the entity recognition model is trained with the model training set, the training data may contain some incomplete and inaccurate dictionary annotation data alongside the high-quality manual annotation data. It can be understood that if the dictionary annotation data is incomplete or inaccurate, the labeling accuracy obtained when the trained entity recognition model is tested on the manual test set will decrease. Ideally, the accuracy when testing on the manual test set would be 1, so 1 can be used as the preset probability.
Therefore, as described in steps S5-S6, the manual test set is input into the trained entity recognition model for testing, to obtain the correct probability that the predicted labels of the manual test set are the correct labels; the difference between the correct probability and the preset probability is then calculated, and whether the difference is smaller than the threshold is judged. If the correct probability is close to the preset probability (that is, the difference is small), the quality of the dictionary annotation data is good. If the correct probability is not close to the preset probability (that is, the difference is large), the quality of the dictionary annotation data is poor: there must be more incomplete and inaccurate annotation data, which affects the recognition accuracy of the entity recognition model. In that case the agent model is triggered to select better dictionary annotation data from the target dictionary, and the step of forming a model training set from the dictionary annotation data and the manual training set is re-entered. Because the agent model is trained by reinforcement learning, the dictionary annotation data selected in each iteration is more accurate data chosen in a directed way according to the test result. The selected annotation data is continuously input into the entity recognition model for training, and iterative training proceeds until the test result stabilizes, at which point training is finished.
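Putting the pieces together, the selection loop of steps S2-S6 can be sketched as below; train_model and evaluate are assumed interfaces (for example, fitting and scoring the BiLSTM-CRF), and the 0.05 threshold and round limit are illustrative values, not taken from the patent.

    PRESET_PROBABILITY = 1.0  # ideal test accuracy, per steps S5-S6

    def select_annotation_data(agent, dict_sentences, dict_features,
                               manual_train, manual_test,
                               train_model, evaluate,
                               threshold=0.05, max_rounds=20):
        model, selected = None, []
        for _ in range(max_rounds):
            keep = agent.select(dict_features)                  # step S2
            selected = [s for s, k in zip(dict_sentences, keep) if k]
            model = train_model(manual_train + selected)        # step S4
            correct_prob = evaluate(model, manual_test)         # step S5
            if PRESET_PROBABILITY - correct_prob < threshold:   # step S6
                break
            # Reward the agent with the test score and re-select.
            agent.update(dict_features, keep, reward=correct_prob)
        return model, selected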
In this embodiment, a small amount of data is labeled manually, and a vertical-domain entity dictionary is used to label sentences to obtain dictionary annotation data; this data enhancement generates a large data set, so that the model obtains a large training set and its generalization improves. The incomplete and noisy data generated by remote supervision is then screened by reinforcement learning, under the guidance of the prior knowledge in the small manually labeled data set, so that the model is trained on the manual annotation data and the dictionary annotation data at the same time. This reduces the time cost of manual labeling and improves the recall of the model.
In an embodiment, step S4 of inputting the model training set into a preset entity recognition model for training includes:
step S401, constructing a character vector and a word vector corresponding to each piece of text data in the model training set, and concatenating the character vector and the word vector corresponding to the same text data to obtain a concatenated vector;
step S402, inputting the concatenated vector into the preset entity recognition model, and outputting a first feature vector;
step S403, combining the first feature vector with the concatenated vector, inputting the combination into the preset entity recognition model, and outputting a second feature vector;
step S404, inputting the second feature vector into a classification layer of the preset entity recognition model, and training to optimize network parameters of the classification layer.
In this embodiment, when the preset entity recognition model is trained, in order to enhance the expression of the word-level and character-level features of each piece of text data in the training set, a character vector and a word vector corresponding to each piece of text data in the model training set are constructed, and the character vector and the word vector corresponding to the same text data are concatenated to obtain a concatenated vector. The concatenated vector is then input into the preset entity recognition model, which outputs a first feature vector. To further improve the model's feature expression of the text data, the feature extraction depth is increased: the first feature vector is combined with the concatenated vector and input into the preset entity recognition model again, and a second feature vector is output as the feature vector corresponding to the text data. Finally, this vector is input into the classification layer for iterative training, and the network parameters are optimized to obtain the trained entity recognition model.
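A minimal PyTorch sketch of steps S401-S404, under stated assumptions: word embeddings are aligned per character so both ID sequences share one length, all dimensions are illustrative, and a linear tag classifier stands in for the CRF layer of the BiLSTM-CRF named above.

    import torch
    import torch.nn as nn

    class TwoPassRecognizer(nn.Module):
        def __init__(self, char_vocab=5000, word_vocab=20000,
                     emb_dim=100, hidden_dim=128, n_tags=9):
            super().__init__()
            self.char_emb = nn.Embedding(char_vocab, emb_dim)
            self.word_emb = nn.Embedding(word_vocab, emb_dim)
            self.lstm1 = nn.LSTM(2 * emb_dim, hidden_dim,
                                 bidirectional=True, batch_first=True)
            # Second pass re-reads the first features together with the
            # original concatenated vector (step S403).
            self.lstm2 = nn.LSTM(2 * hidden_dim + 2 * emb_dim, hidden_dim,
                                 bidirectional=True, batch_first=True)
            self.classifier = nn.Linear(2 * hidden_dim, n_tags)

        def forward(self, char_ids, word_ids):
            concat = torch.cat([self.char_emb(char_ids),
                                self.word_emb(word_ids)], dim=-1)       # S401
            first, _ = self.lstm1(concat)                               # S402
            second, _ = self.lstm2(torch.cat([first, concat], dim=-1))  # S403
            return self.classifier(second)   # per-token tag scores, S404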
In an embodiment, before step S4 of inputting the model training set into a preset entity recognition model for training, the method includes:
obtaining a public data set;
and training an initial long short-term memory (LSTM) model based on the public data set to obtain the preset entity recognition model.
In this embodiment, the entity recognition model needs to be trained before the model training set is used. Here, the initial long short-term memory model may be trained with a public data set to initialize its neural network parameters, yielding the preset entity recognition model, which is then trained with the model training set. This approach can effectively improve the robustness of the model.
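A sketch of this pre-training step, assuming the TwoPassRecognizer above and a data loader yielding (char_ids, word_ids, tags) batches with -100 marking padded positions; the optimizer choice and epoch count are assumptions for illustration.

    import torch
    import torch.nn as nn

    def pretrain_on_public_data(model, public_loader, epochs=3, lr=1e-3):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
        model.train()
        for _ in range(epochs):
            for char_ids, word_ids, tags in public_loader:
                opt.zero_grad()
                logits = model(char_ids, word_ids)
                # Flatten (batch, seq, tags) for token-level cross-entropy.
                loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                               tags.reshape(-1))
                loss.backward()
                opt.step()
        return model  # parameters now initialized, per this embodiment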
In an embodiment, before step S1 of constructing, based on the knowledge graph, an entity having an association relation with an entity in the preset dictionary and adding it to the preset dictionary to obtain an expanded dictionary as the target dictionary, the method further includes:
step S1a, receiving a model training instruction input by a user, wherein the model training instruction carries application field information of the model to be trained;
step S1b, acquiring a preset dictionary of the corresponding field according to the application field information.
In this embodiment, for the trained entity recognition model to achieve a better recognition effect, the model should be trained with annotation data of the corresponding field. When a user wants to train a model, the user can input a corresponding model training instruction carrying the application field information of the model to be trained. According to this application field information, annotation data of the corresponding field can be obtained; training the model with such data yields an entity recognition model that performs better when recognizing text of the corresponding field.
In an embodiment, after step S6 of calculating the difference between the correct probability and the preset probability, judging whether the difference is smaller than the threshold and, if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model, the method includes:
step S7, iteratively training the preset entity recognition model until the difference between the correct probability and the preset probability is smaller than the threshold, to obtain a target entity recognition model;
step S8, receiving a target text input by a user, and receiving an instruction requesting entity recognition in the target text;
step S9, identifying the field information of the target text based on the request instruction;
step S10, judging whether the field information of the target text is the same as the application field information of the target entity recognition model;
step S11, if they are the same, performing named entity recognition on the target text based on the target entity recognition model; and if not, acquiring training data of the field information corresponding to the target text to retrain the target entity recognition model.
In this embodiment, when the target entity recognition model is used for entity recognition in a target text, the target text may not be a text in the medical field. To improve recognition accuracy and avoid recognition errors, the field information of the target text is identified first; if it is the same as the application field information of the target entity recognition model, accuracy is significantly improved when the model performs named entity recognition. If the field information of the target text differs from the application field information of the target entity recognition model, training data corresponding to the field information of the target text is acquired to retrain the target entity recognition model. The domain-gating logic of steps S8-S11 is sketched below.
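This is a minimal sketch of that gating step; the domain detector, NER call, data-fetching and retraining hooks are assumed interfaces, not APIs defined by the patent.

    def recognize_with_domain_check(target_text, model, model_domain,
                                    detect_domain, run_ner,
                                    fetch_training_data, retrain):
        text_domain = detect_domain(target_text)           # step S9
        if text_domain != model_domain:                    # step S10
            # Domains differ: retrain on data from the text's field (S11).
            model = retrain(model, fetch_training_data(text_domain))
        return run_ner(model, target_text)                 # step S11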
In an embodiment, the preset dictionary, the target dictionary, the agent model, the manual annotation data and the preset entity recognition model are stored in a blockchain. A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. It is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer and an application service layer.
Referring to fig. 2, in an embodiment of the present application, a device for selecting annotation data is further provided, including:
a construction unit, configured to construct a target entity based on a knowledge graph and add it to a preset dictionary to obtain an expanded dictionary as a target dictionary, wherein all entries in the target dictionary are annotation data, and the target entity has an association relation with an entity in the preset dictionary;
a selecting unit, configured to select dictionary annotation data from the target dictionary based on an agent model;
a classification unit, configured to divide preset manual annotation data into a manual training set and a manual test set;
a training unit, configured to form a model training set from the dictionary annotation data and the manual training set, and input the model training set into a preset entity recognition model for training;
a testing unit, configured to input the manual test set into the trained entity recognition model for testing, to obtain the correct probability that the predicted labels of the manual test set are the correct labels;
and a judging unit, configured to calculate the difference between the correct probability and a preset probability, judge whether the difference is smaller than a threshold, and if not, select optimized dictionary annotation data from the target dictionary based on the agent model and re-execute the step of forming a model training set from the dictionary annotation data and the manual training set.
In one embodiment, the training unit includes:
a building subunit, configured to construct a character vector and a word vector corresponding to each piece of text data in the model training set, and concatenate the character vector and the word vector corresponding to the same text data to obtain a concatenated vector;
a first output subunit, configured to input the concatenated vector into the preset entity recognition model and output a first feature vector;
a second output subunit, configured to combine the first feature vector with the concatenated vector, input the combination into the preset entity recognition model, and output a second feature vector;
and a training subunit, configured to input the second feature vector into a classification layer of the preset entity recognition model and train to optimize network parameters of the classification layer.
In one embodiment, the apparatus further includes:
a first acquisition unit, configured to acquire a public data set;
and an initial training unit, configured to train an initial long short-term memory (LSTM) model based on the public data set to obtain the preset entity recognition model.
In one embodiment, the apparatus further includes:
a first receiving unit, configured to receive a model training instruction input by a user, wherein the model training instruction carries application field information of the model to be trained;
and a second acquisition unit, configured to acquire a preset dictionary of the corresponding field according to the application field information.
In one embodiment, the apparatus further includes:
an iteration unit, configured to iteratively train the preset entity recognition model until the difference between the correct probability and the preset probability is smaller than the threshold, to obtain a target entity recognition model;
a second receiving unit, configured to receive a target text input by a user and receive an instruction requesting entity recognition in the target text;
an identification unit, configured to identify the field information of the target text based on the request instruction;
a field judging unit, configured to judge whether the field information of the target text is the same as the application field information of the target entity recognition model;
and a processing unit, configured to perform named entity recognition on the target text based on the target entity recognition model if they are the same, and if not, to acquire training data of the field information corresponding to the target text to retrain the target entity recognition model.
In one embodiment, the apparatus further comprises:
and a storage unit, configured to store the target dictionary, the agent model, the manual annotation data and the preset entity recognition model in a blockchain.
In this embodiment, please refer to the above method embodiments for specific implementation of the units and sub-units, which will not be described herein again.
Referring to fig. 3, an embodiment of the present application also provides a computer device, which may be a server and whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing the annotation data, the models and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the method for selecting annotation data.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is only a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects may be applied.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; the computer program, when executed by a processor, implements the method for selecting annotation data. It is to be understood that the computer-readable storage medium in this embodiment may be a volatile or a non-volatile readable storage medium.
In summary, according to the method and apparatus for selecting annotation data, the computer device and the storage medium provided in the embodiments of the present application, a target entity is constructed based on a knowledge graph and added to a preset dictionary to obtain an expanded dictionary as a target dictionary, so that the dictionary annotation data in the target dictionary is more complete; meanwhile, an entity recognition model is trained on the manual annotation data and the dictionary annotation data together, and whether the quality of the selected dictionary annotation data meets the requirement is judged; if not, optimized dictionary annotation data is selected from the target dictionary, so that dictionary annotation data of higher quality is obtained.
It will be understood by those skilled in the art that all or part of the processes of the above method embodiments can be implemented by instructing the relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, database or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, apparatus, article or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article or method. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other like elements in a process, apparatus, article or method that includes the element.
The above description is only for the preferred embodiment of the present application and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.

Claims (10)

1. A method for selecting annotation data, comprising the steps of:
constructing a target entity based on a knowledge graph and adding it to a preset dictionary to obtain an expanded dictionary as a target dictionary, wherein all entries in the target dictionary are annotation data, and the target entity has an association relation with an entity in the preset dictionary;
selecting dictionary annotation data from the target dictionary based on an agent model;
dividing preset manual annotation data into a manual training set and a manual test set;
forming a model training set from the dictionary annotation data and the manual training set, and inputting the model training set into a preset entity recognition model for training;
inputting the manual test set into the trained entity recognition model for testing, to obtain the correct probability that the predicted labels of the manual test set are the correct labels;
and calculating the difference between the correct probability and a preset probability, judging whether the difference is smaller than a threshold, and if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model and re-entering the step of forming a model training set from the dictionary annotation data and the manual training set.
2. The method for selecting annotation data according to claim 1, wherein the step of inputting the model training set into a preset entity recognition model for training comprises:
constructing a character vector and a word vector corresponding to each piece of text data in the model training set, and concatenating the character vector and the word vector corresponding to the same text data to obtain a concatenated vector;
inputting the concatenated vector into the preset entity recognition model, and outputting a first feature vector;
combining the first feature vector with the concatenated vector, inputting the combination into the preset entity recognition model, and outputting a second feature vector;
and inputting the second feature vector into a classification layer of the preset entity recognition model, and training to optimize network parameters of the classification layer.
3. The method for selecting annotation data according to claim 1, wherein before the step of inputting the model training set into a preset entity recognition model for training, the method comprises:
obtaining a public data set;
and training an initial long short-term memory (LSTM) model based on the public data set to obtain the preset entity recognition model.
4. The method for selecting annotation data according to claim 1, wherein before the step of constructing, based on the knowledge graph, an entity having an association relation with an entity in the preset dictionary and adding it to the preset dictionary to obtain an expanded dictionary as the target dictionary, the method further comprises:
receiving a model training instruction input by a user, wherein the model training instruction carries application field information of the model to be trained;
and acquiring a preset dictionary of the corresponding field according to the application field information.
5. The method for selecting annotation data according to claim 4, wherein after the step of calculating the difference between the correct probability and the preset probability, judging whether the difference is smaller than the threshold and, if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model, the method comprises:
iteratively training the preset entity recognition model until the difference between the correct probability and the preset probability is smaller than the threshold, to obtain a target entity recognition model;
receiving a target text input by a user, and receiving an instruction requesting entity recognition in the target text;
identifying the field information of the target text based on the request instruction;
judging whether the field information of the target text is the same as the application field information of the target entity recognition model;
and if they are the same, performing named entity recognition on the target text based on the target entity recognition model; if not, acquiring training data of the field information corresponding to the target text to retrain the target entity recognition model.
6. The method for selecting annotation data according to claim 1, further comprising:
storing the target dictionary, the agent model, the manual annotation data and the preset entity recognition model in a blockchain.
7. An apparatus for selecting annotation data, comprising:
a construction unit, configured to construct a target entity based on a knowledge graph and add it to a preset dictionary to obtain an expanded dictionary as a target dictionary, wherein all entries in the target dictionary are annotation data, and the target entity has an association relation with an entity in the preset dictionary;
a selecting unit, configured to select dictionary annotation data from the target dictionary based on an agent model;
a classification unit, configured to divide preset manual annotation data into a manual training set and a manual test set;
a training unit, configured to form a model training set from the dictionary annotation data and the manual training set, and input the model training set into a preset entity recognition model for training;
a testing unit, configured to input the manual test set into the trained entity recognition model for testing, to obtain the correct probability that the predicted labels of the manual test set are the correct labels;
and a judging unit, configured to calculate the difference between the correct probability and a preset probability, judge whether the difference is smaller than a threshold, and if not, select optimized dictionary annotation data from the target dictionary based on the agent model and re-execute the step of forming a model training set from the dictionary annotation data and the manual training set.
8. The apparatus for selecting annotation data according to claim 7, wherein the training unit comprises:
a building subunit, configured to construct a character vector and a word vector corresponding to each piece of text data in the model training set, and concatenate the character vector and the word vector corresponding to the same text data to obtain a concatenated vector;
a first output subunit, configured to input the concatenated vector into the preset entity recognition model and output a first feature vector;
a second output subunit, configured to combine the first feature vector with the concatenated vector, input the combination into the preset entity recognition model, and output a second feature vector;
and a training subunit, configured to input the second feature vector into a classification layer of the preset entity recognition model and train to optimize network parameters of the classification layer.
9. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any of claims 1 to 6.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202010592331.4A 2020-06-24 2020-06-24 Method and device for selecting annotation data, computer device and storage medium Active CN111832294B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010592331.4A CN111832294B (en) 2020-06-24 2020-06-24 Method and device for selecting annotation data, computer device and storage medium
PCT/CN2020/118533 WO2021139257A1 (en) 2020-06-24 2020-09-28 Method and apparatus for selecting annotated data, and computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010592331.4A CN111832294B (en) 2020-06-24 2020-06-24 Method and device for selecting annotation data, computer device and storage medium

Publications (2)

Publication Number Publication Date
CN111832294A CN111832294A (en) 2020-10-27
CN111832294B (en) 2022-08-16

Family

ID=72898915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010592331.4A Active CN111832294B (en) 2020-06-24 2020-06-24 Method and device for selecting marking data, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111832294B (en)
WO (1) WO2021139257A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807097B (en) * 2020-10-30 2024-07-26 北京中科凡语科技有限公司 Named entity recognition model building method and named entity recognition method
CN113158652B (en) * 2021-04-19 2024-03-19 平安科技(深圳)有限公司 Data enhancement method, device, equipment and medium based on deep learning model
CN112926697B (en) * 2021-04-21 2021-10-12 北京科技大学 Abrasive particle image classification method and device based on semantic segmentation
CN113268593A (en) * 2021-05-18 2021-08-17 Oppo广东移动通信有限公司 Intention classification and model training method and device, terminal and storage medium
CN113378570B (en) * 2021-06-01 2023-12-12 车智互联(北京)科技有限公司 Entity identification model generation method, computing device and readable storage medium
CN113434491B (en) * 2021-06-18 2022-09-02 深圳市曙光信息技术有限公司 Character model data cleaning method, system and medium for deep learning OCR recognition
CN113591467B (en) * 2021-08-06 2023-11-03 北京金堤征信服务有限公司 Event main body recognition method and device, electronic equipment and medium
CN114004233B (en) * 2021-12-30 2022-05-06 之江实验室 Remote supervision named entity recognition method based on semi-training and sentence selection
CN115757784B (en) * 2022-11-21 2023-07-07 中科世通亨奇(北京)科技有限公司 Corpus labeling method and device based on labeling model and label template screening
CN118035444A (en) * 2024-02-20 2024-05-14 安徽彼亿网络科技有限公司 Information extraction method and device based on big data
CN118332136B (en) * 2024-06-12 2024-08-16 电子科技大学长三角研究院(衢州) Chinese radical embedding method based on knowledge graph

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908085A (en) * 2010-06-28 2010-12-08 北京航空航天大学 Multi-Agent-based distributive deduction simulation system and method
WO2019071661A1 (en) * 2017-10-09 2019-04-18 平安科技(深圳)有限公司 Electronic apparatus, medical text entity name identification method, system, and storage medium
CN109697289A (en) * 2018-12-28 2019-04-30 北京工业大学 It is a kind of improved for naming the Active Learning Method of Entity recognition
CN110134969A (en) * 2019-05-27 2019-08-16 北京奇艺世纪科技有限公司 A kind of entity recognition method and device
CN110717040A (en) * 2019-09-18 2020-01-21 平安科技(深圳)有限公司 Dictionary expansion method and device, electronic equipment and storage medium
CN111178045A (en) * 2019-10-14 2020-05-19 深圳软通动力信息技术有限公司 Automatic construction method of non-supervised Chinese semantic concept dictionary based on field, electronic equipment and storage medium
CN111259134A (en) * 2020-01-19 2020-06-09 出门问问信息科技有限公司 Entity identification method, equipment and computer readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110249341A (en) * 2017-02-03 2019-09-17 皇家飞利浦有限公司 Classifier training
CN108874878B (en) * 2018-05-03 2021-02-26 众安信息技术服务有限公司 Knowledge graph construction system and method
CN110008473B (en) * 2019-04-01 2022-11-25 云知声(上海)智能科技有限公司 Medical text named entity identification and labeling method based on iteration method
CN110020438B (en) * 2019-04-15 2020-12-08 上海冰鉴信息科技有限公司 Sequence identification based enterprise or organization Chinese name entity disambiguation method and device
CN110287481B (en) * 2019-05-29 2022-06-14 西南电子技术研究所(中国电子科技集团公司第十研究所) Named entity corpus labeling training system
CN110335676A (en) * 2019-07-09 2019-10-15 泰康保险集团股份有限公司 Data processing method, device, medium and electronic equipment

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908085A (en) * 2010-06-28 2010-12-08 北京航空航天大学 Multi-Agent-based distributive deduction simulation system and method
WO2019071661A1 (en) * 2017-10-09 2019-04-18 平安科技(深圳)有限公司 Electronic apparatus, medical text entity name identification method, system, and storage medium
CN109697289A (en) * 2018-12-28 2019-04-30 北京工业大学 It is a kind of improved for naming the Active Learning Method of Entity recognition
CN110134969A (en) * 2019-05-27 2019-08-16 北京奇艺世纪科技有限公司 A kind of entity recognition method and device
CN110717040A (en) * 2019-09-18 2020-01-21 平安科技(深圳)有限公司 Dictionary expansion method and device, electronic equipment and storage medium
CN111178045A (en) * 2019-10-14 2020-05-19 深圳软通动力信息技术有限公司 Automatic construction method of non-supervised Chinese semantic concept dictionary based on field, electronic equipment and storage medium
CN111259134A (en) * 2020-01-19 2020-06-09 出门问问信息科技有限公司 Entity identification method, equipment and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Case Restoration Approach to Named Entity Tagging in Degraded Documents; Rohini K. Srihari et al.; Proceedings of the Seventh International Conference on Document Analysis and Recognition; 2003-12-31; pp. 1-6 *
An Automatic Annotation Method for Emergency-Oriented Text Corpora; Liu Wei et al.; Journal of Chinese Information Processing; 2017-03-31; pp. 76-85 *

Also Published As

Publication number Publication date
CN111832294A (en) 2020-10-27
WO2021139257A1 (en) 2021-07-15

Similar Documents

Publication Publication Date Title
CN111832294B (en) Method and device for selecting annotation data, computer device and storage medium
CN111553164A (en) Training method and device for named entity recognition model and computer equipment
CN110704588A (en) Multi-round dialogue semantic analysis method and system based on long-term and short-term memory network
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN110569500A (en) Text semantic recognition method and device, computer equipment and storage medium
CN110688853B (en) Sequence labeling method and device, computer equipment and storage medium
CN110162681B (en) Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium
CN111767400A (en) Training method and device of text classification model, computer equipment and storage medium
CN112580346B (en) Event extraction method and device, computer equipment and storage medium
CN109033427B (en) Stock screening method and device, computer equipment and readable storage medium
US20230075290A1 (en) Method for linking a cve with at least one synthetic cpe
CN113254613B (en) Dialogue question-answering method, device, equipment and storage medium
CN112395857B (en) Speech text processing method, device, equipment and medium based on dialogue system
CN111783460A (en) Enterprise abbreviation extraction method and device, computer equipment and storage medium
US20220215292A1 (en) Method to identify incorrect account numbers
CN113688215A (en) Information extraction method, information extraction device, model training method, model training device, computer equipment and storage medium
CN112380837A (en) Translation model-based similar sentence matching method, device, equipment and medium
CN111723870A (en) Data set acquisition method, device, equipment and medium based on artificial intelligence
CN113255343A (en) Semantic identification method and device for label data, computer equipment and storage medium
CN112652295A (en) Language model training method, device, equipment and medium, and video subtitle checking method, device and medium
CN115409111A (en) Training method of named entity recognition model and named entity recognition method
CN110413994B (en) Hot topic generation method and device, computer equipment and storage medium
CN114547087A (en) Method, device, equipment and medium for automatically identifying proposal and generating report
CN114462423A (en) Method and device for training intention recognition model, model and electronic equipment
CN113177405A (en) Method, device and equipment for correcting data errors based on BERT and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant