CN111832294B - Method and device for selecting annotation data, computer device and storage medium - Google Patents

Method and device for selecting annotation data, computer device and storage medium

Info

Publication number
CN111832294B
Authority
CN
China
Prior art keywords
dictionary
model
data
target
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010592331.4A
Other languages
Chinese (zh)
Other versions
CN111832294A (en)
Inventor
梁欣
顾婷婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202010592331.4A (granted as CN111832294B)
Priority to PCT/CN2020/118533 (published as WO2021139257A1)
Publication of CN111832294A
Application granted
Publication of CN111832294B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the field of blockchain technology, and provides a method and a device for selecting annotation data, a computer device and a storage medium. The method comprises the following steps: selecting dictionary annotation data from a target dictionary based on an agent model stored in a blockchain; dividing preset manual annotation data into a manual training set and a manual test set; forming a model training set from the dictionary annotation data and the manual training set, and inputting the model training set into a preset entity recognition model for training; inputting the manual test set into the trained entity recognition model for testing, to obtain the correct probability that the predicted labels of the manual test set are the correct labels; and calculating the difference between the correct probability and a preset probability, judging whether the difference is smaller than a threshold, and if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model. With the method and the device, high-quality annotation data can be selected. The scheme of the application can also be used in the smart healthcare field of smart cities, thereby promoting the construction of smart cities.

Description

Method and device for selecting annotation data, computer device and storage medium
Technical Field
The present application relates to the field of blockchain technology, and in particular, to a method and an apparatus for selecting annotation data, a computer device, and a storage medium.
Background
Entity recognition is the first, and a very critical, step in natural language processing tasks. Particularly in vertical fields such as finance, e-commerce and healthcare, entity recognition is key to natural language processing, because errors introduced by this upstream task propagate layer by layer into downstream tasks such as entity linking, relation extraction between entities, and relation classification.
With the development of deep learning, neural network methods combined with the traditional conditional random field (CRF) can achieve very good results on entity recognition tasks. However, in business scenarios the application of deep learning also brings problems. Although a neural network has a strong capability of learning features autonomously, it usually requires a large amount of training data that conforms to the real distribution, and for an entity recognition task in a new field, producing high-quality annotation data consumes a large amount of annotation time and manual labeling cost. In vertical fields, although a related domain dictionary can be used to label data through remote supervision, this introduces noisy data and incompletely labeled entities, which greatly affects the entity recognition task. For example, in the medical field, a disease expression such as "diabetes with ketosis" may be labeled only as the incomplete entity "diabetes", and "allergic asthma" only as "asthma". In medicine, however, the description and treatment of these different entities are not identical. Using dictionary labeling alone would leave the model unable to learn the features of such combined symptoms, so the final entity labeling performs poorly, and subsequent downstream tasks also degrade through error propagation.
Disclosure of Invention
The present application mainly aims to provide a method, an apparatus, a computer device and a storage medium for selecting annotation data, with the aim of overcoming the defects that current annotation data are incomplete and that high-quality annotation data cannot be selected.
In order to achieve the above object, the present application provides a method for selecting annotation data, comprising the following steps:
constructing a target entity based on a knowledge graph and adding it to a preset dictionary to obtain an expanded dictionary as a target dictionary, wherein all entries in the target dictionary are annotation data, and the target entity has an association relation with an entity in the preset dictionary;
selecting dictionary annotation data from the target dictionary based on an agent model;
dividing preset manual annotation data into a manual training set and a manual test set;
forming a model training set from the dictionary annotation data and the manual training set, and inputting the model training set into a preset entity recognition model for training;
inputting the manual test set into the trained entity recognition model for testing, to obtain the correct probability that the predicted labels of the manual test set are the correct labels;
and calculating the difference between the correct probability and a preset probability, judging whether the difference is smaller than a threshold, and if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model and re-entering the step of forming a model training set from the dictionary annotation data and the manual training set.
Further, the step of inputting the model training set into a preset entity recognition model for training includes:
constructing a character vector and a word vector corresponding to each piece of text data in the model training set, and concatenating the character vector and the word vector corresponding to the same text data to obtain a concatenated vector;
inputting the concatenated vector into the preset entity recognition model, and outputting a first feature vector;
combining the first feature vector with the concatenated vector, inputting the combination into the preset entity recognition model, and outputting a second feature vector;
and inputting the second feature vector into a classification layer of the preset entity recognition model, and training to optimize network parameters of the classification layer.
Further, before the step of inputting the model training set into a preset entity recognition model for training, the method includes:
obtaining a public data set;
and training an initial long short-term memory (LSTM) model based on the public data set to obtain the preset entity recognition model.
Further, before the step of constructing, based on the knowledge graph, an entity having an association relation with an entity in the preset dictionary and adding it to the preset dictionary to obtain an expanded dictionary as the target dictionary, the method further includes:
receiving a model training instruction input by a user, wherein the model training instruction carries application field information of the model to be trained;
and acquiring a preset dictionary of the corresponding field according to the application field information.
Further, after the step of calculating the difference between the correct probability and the preset probability, judging whether the difference is smaller than the threshold and, if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model, the method includes:
iteratively training the preset entity recognition model until the difference between the correct probability and the preset probability is smaller than the threshold, to obtain a target entity recognition model;
receiving a target text input by a user, and receiving an instruction requesting entity recognition in the target text;
identifying the field information of the target text based on the request instruction;
judging whether the field information of the target text is the same as the application field information of the target entity recognition model;
if they are the same, performing named entity recognition on the target text based on the target entity recognition model; and if not, acquiring training data of the field information corresponding to the target text to retrain the target entity recognition model.
Further, the method further comprises:
and storing the target dictionary, the agent model, the manual annotation data and the preset entity recognition model in a blockchain.
The present application further provides a device for selecting annotation data, including:
a construction unit, configured to construct a target entity based on a knowledge graph and add it to a preset dictionary to obtain an expanded dictionary as a target dictionary, wherein all entries in the target dictionary are annotation data, and the target entity has an association relation with an entity in the preset dictionary;
a selecting unit, configured to select dictionary annotation data from the target dictionary based on an agent model;
a classification unit, configured to divide preset manual annotation data into a manual training set and a manual test set;
a training unit, configured to form a model training set from the dictionary annotation data and the manual training set, and input the model training set into a preset entity recognition model for training;
a testing unit, configured to input the manual test set into the trained entity recognition model for testing, to obtain the correct probability that the predicted labels of the manual test set are the correct labels;
and a judging unit, configured to calculate the difference between the correct probability and a preset probability, judge whether the difference is smaller than a threshold, and if not, select optimized dictionary annotation data from the target dictionary based on the agent model and re-execute the step of forming a model training set from the dictionary annotation data and the manual training set.
Further, the training unit includes:
a building subunit, configured to construct a character vector and a word vector corresponding to each piece of text data in the model training set, and concatenate the character vector and the word vector corresponding to the same text data to obtain a concatenated vector;
a first output subunit, configured to input the concatenated vector into the preset entity recognition model and output a first feature vector;
a second output subunit, configured to combine the first feature vector with the concatenated vector, input the combination into the preset entity recognition model, and output a second feature vector;
and a training subunit, configured to input the second feature vector into a classification layer of the preset entity recognition model and train to optimize network parameters of the classification layer.
Further, the device also includes:
a first acquisition unit, configured to acquire a public data set;
and an initial training unit, configured to train an initial long short-term memory (LSTM) model based on the public data set to obtain the preset entity recognition model.
Further, the device also includes:
a first receiving unit, configured to receive a model training instruction input by a user, wherein the model training instruction carries application field information of the model to be trained;
and a second acquisition unit, configured to acquire a preset dictionary of the corresponding field according to the application field information.
Further, the device also includes:
an iteration unit, configured to iteratively train the preset entity recognition model until the difference between the correct probability and the preset probability is smaller than the threshold, to obtain a target entity recognition model;
a second receiving unit, configured to receive a target text input by a user and receive an instruction requesting entity recognition in the target text;
an identification unit, configured to identify the field information of the target text based on the request instruction;
a field judging unit, configured to judge whether the field information of the target text is the same as the application field information of the target entity recognition model;
and a processing unit, configured to perform named entity recognition on the target text based on the target entity recognition model if they are the same, and if not, to acquire training data of the field information corresponding to the target text to retrain the target entity recognition model.
The present application further provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of any one of the above methods when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the above.
According to the method and apparatus for selecting annotation data, the computer device and the storage medium provided herein, a target entity is constructed based on a knowledge graph and added to a preset dictionary to obtain an expanded dictionary as a target dictionary, so that the dictionary annotation data in the target dictionary is more complete; meanwhile, an entity recognition model is trained on the manual annotation data and the dictionary annotation data together, and whether the quality of the selected dictionary annotation data meets the requirement is judged; if not, optimized dictionary annotation data is selected from the target dictionary, so that dictionary annotation data of higher quality is obtained.
Drawings
FIG. 1 is a schematic diagram illustrating a method for selecting annotation data according to an embodiment of the present application;
FIG. 2 is a block diagram illustrating an apparatus for selecting annotation data according to an embodiment of the present application;
fig. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, an embodiment of the present application provides a method for selecting annotation data, including the following steps:
step S1, constructing a target entity based on a knowledge graph and adding it to a preset dictionary to obtain an expanded dictionary as a target dictionary, wherein all entries in the target dictionary are annotation data, and the target entity has an association relation with an entity in the preset dictionary;
step S2, selecting dictionary annotation data from the target dictionary based on an agent model;
step S3, dividing preset manual annotation data into a manual training set and a manual test set;
step S4, forming a model training set from the dictionary annotation data and the manual training set, and inputting the model training set into a preset entity recognition model for training;
step S5, inputting the manual test set into the trained entity recognition model for testing, to obtain the correct probability that the predicted labels of the manual test set are the correct labels;
step S6, calculating the difference between the correct probability and a preset probability, judging whether the difference is smaller than a threshold, and if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model and re-entering the step of forming a model training set from the dictionary annotation data and the manual training set.
In this embodiment, the method is applied in the process of training an entity recognition model for recognizing entities in medical literature, in order to filter the annotation data required for training. The scheme in this embodiment can also be applied in the smart healthcare field of smart cities, thereby promoting the construction of smart cities. In business scenarios in the smart healthcare field, high-quality annotation data for training an entity recognition model is scarce and usually produced by manual labeling. Therefore, in this embodiment, a small amount of high-quality manual annotation data is combined with dictionaries of similar fields to obtain training samples, which effectively increases the data volume, gives the model a large training set, and improves the generalization of the model.
Specifically, as described in step S1, the preset dictionary is annotation data obtained by labeling sentences with a vertical-domain entity dictionary. To further enhance the completeness and accuracy of the annotation data in the dictionary, a target entity having an association relation with an entity in the preset dictionary is constructed based on a knowledge graph and added to the preset dictionary to expand it. The association relation covers the following cases: constructing corresponding aliases for disease and symptom entities in the preset dictionary, for example expanding "chronic bronchitis" with its colloquial alias "slow bronchitis"; constructing target entities with high similarity to an entity in the preset dictionary, where the similarity can be computed from features such as the shortest string edit distance, pinyin and radicals, used individually or in combination; and replacing parts of an entity's textual description with synonyms or antonyms, for example expanding "acute asthma" to "chronic asthma" and "diabetes with hypertension" to "diabetes without hypertension". After the expansion, the amount of annotation data in the preset dictionary is increased, and the description of medical entities is more complete and accurate.
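By way of illustration only, the expansion step can be sketched in Python as below. This is a minimal sketch under assumptions not taken from the patent: the knowledge graph is exposed as a plain mapping from entity names to alias and related-entity lists, and difflib's match ratio with a 0.8 threshold stands in for the edit-distance, pinyin and radical similarity features described above.

    from difflib import SequenceMatcher

    def string_similarity(a: str, b: str) -> float:
        # Stand-in for the combined edit-distance/pinyin/radical similarity.
        return SequenceMatcher(None, a, b).ratio()

    def expand_dictionary(preset_dict, knowledge_graph, threshold=0.8):
        # preset_dict: set of entity strings; knowledge_graph: dict mapping
        # an entity to {"aliases": [...], "related": [...]} (assumed schema).
        target_dict = set(preset_dict)
        for entity in preset_dict:
            node = knowledge_graph.get(entity, {})
            # Aliases, e.g. "chronic bronchitis" -> "slow bronchitis".
            target_dict.update(node.get("aliases", []))
            # Related candidates kept only when sufficiently similar.
            for cand in node.get("related", []):
                if string_similarity(entity, cand) >= threshold:
                    target_dict.add(cand)
        return target_dict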
As described in step S2, the agent model is obtained by reinforcement learning training and is used to select correctly labeled dictionary annotation data from the target dictionary. The data selected in each round is guided so that the annotation quality becomes higher and higher, and the selected data is used to train the entity recognition model. Because dictionary-labeled data can be incomplete or incorrect, the agent model must continuously select more accurate data, that is, it optimizes the dictionary annotation data used for training the entity recognition model.
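One way to realize such an agent is a stochastic keep/drop policy updated with a REINFORCE-style gradient. The sketch below rests on illustrative assumptions the patent does not fix: each dictionary-labeled sentence is summarized as a fixed-length feature vector, and the entity recognition model's test score serves as the reward.

    import numpy as np

    class SelectionAgent:
        """Keep/drop policy over dictionary-labeled sentences (sketch)."""

        def __init__(self, n_features: int, lr: float = 0.01, seed: int = 0):
            self.w = np.zeros(n_features)
            self.lr = lr
            self.rng = np.random.default_rng(seed)

        def _keep_probs(self, features: np.ndarray) -> np.ndarray:
            return 1.0 / (1.0 + np.exp(-features @ self.w))

        def select(self, features: np.ndarray) -> np.ndarray:
            # Sample one keep/drop action per sentence.
            return self.rng.random(len(features)) < self._keep_probs(features)

        def update(self, features, actions, reward: float) -> None:
            # REINFORCE: gradient of the log-likelihood of the sampled
            # actions, scaled by the reward (e.g. test-set correct probability).
            probs = self._keep_probs(features)
            grad = ((actions.astype(float) - probs)[:, None] * features).mean(axis=0)
            self.w += self.lr * reward * grad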
As described in step S3, the manual annotation data is obtained by manual labeling and is of high quality. Since model training requires both a training phase and a testing phase, the manual annotation data is divided into a manual training set and a manual test set.
As described in step S4, the data volume of the manual training set is small; it is therefore combined with the dictionary annotation data selected from the target dictionary to form the training data, which yields the model training set and increases the amount of training data. The model training set is then input into a preset entity recognition model for training, so as to improve the generalization of the entity recognition model. The entity recognition model includes a BiLSTM-CRF model.
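A minimal sketch of steps S3 and S4 follows, assuming sentences are stored as (tokens, tags) pairs; the 9:1 split ratio and the helper names are assumptions made for illustration.

    import random

    def build_model_training_set(manual_data, dictionary_data,
                                 test_ratio=0.1, seed=42):
        rng = random.Random(seed)
        data = list(manual_data)
        rng.shuffle(data)
        n_test = int(len(data) * test_ratio)
        manual_test, manual_train = data[:n_test], data[n_test:]
        # Pool the small manual training set with the agent-selected
        # dictionary annotation data to enlarge the training data.
        model_training_set = manual_train + list(dictionary_data)
        rng.shuffle(model_training_set)
        return model_training_set, manual_train, manual_test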
After the entity recognition model is trained with the model training set, the training data may contain some incomplete and inaccurate dictionary annotation data alongside the high-quality manual annotation data. It can be understood that if the dictionary annotation data is incomplete or inaccurate, the labeling accuracy obtained when the trained entity recognition model is tested on the manual test set will decrease. Ideally, the accuracy when testing on the manual test set would be 1, so 1 can be used as the preset probability.
Therefore, as described in steps S5-S6, the manual test set is input into the trained entity recognition model for testing, to obtain the correct probability that the predicted labels of the manual test set are the correct labels; the difference between the correct probability and the preset probability is then calculated, and whether the difference is smaller than the threshold is judged. If the correct probability is close to the preset probability (that is, the difference is small), the quality of the dictionary annotation data is good. If the correct probability is not close to the preset probability (that is, the difference is large), the quality of the dictionary annotation data is poor: there must be more incomplete and inaccurate annotation data, which affects the recognition accuracy of the entity recognition model. In that case the agent model is triggered to select better dictionary annotation data from the target dictionary, and the step of forming a model training set from the dictionary annotation data and the manual training set is re-entered. Because the agent model is trained by reinforcement learning, the dictionary annotation data selected in each iteration is more accurate data chosen in a directed way according to the test result. The selected annotation data is continuously input into the entity recognition model for training, and iterative training proceeds until the test result stabilizes, at which point training is finished.
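Putting the pieces together, the selection loop of steps S2-S6 can be sketched as below; train_model and evaluate are assumed interfaces (for example, fitting and scoring the BiLSTM-CRF), and the 0.05 threshold and round limit are illustrative values, not taken from the patent.

    PRESET_PROBABILITY = 1.0  # ideal test accuracy, per steps S5-S6

    def select_annotation_data(agent, dict_sentences, dict_features,
                               manual_train, manual_test,
                               train_model, evaluate,
                               threshold=0.05, max_rounds=20):
        model, selected = None, []
        for _ in range(max_rounds):
            keep = agent.select(dict_features)                  # step S2
            selected = [s for s, k in zip(dict_sentences, keep) if k]
            model = train_model(manual_train + selected)        # step S4
            correct_prob = evaluate(model, manual_test)         # step S5
            if PRESET_PROBABILITY - correct_prob < threshold:   # step S6
                break
            # Reward the agent with the test score and re-select.
            agent.update(dict_features, keep, reward=correct_prob)
        return model, selected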
In this embodiment, a small amount of data is labeled manually, and a vertical-domain entity dictionary is used to label sentences to obtain dictionary annotation data; this data enhancement generates a large data set, so that the model obtains a large training set and its generalization improves. The incomplete and noisy data generated by remote supervision is then screened by reinforcement learning, under the guidance of the prior knowledge in the small manually labeled data set, so that the model is trained on the manual annotation data and the dictionary annotation data at the same time. This reduces the time cost of manual labeling and improves the recall of the model.
In an embodiment, step S4 of inputting the model training set into a preset entity recognition model for training includes:
step S401, constructing a character vector and a word vector corresponding to each piece of text data in the model training set, and concatenating the character vector and the word vector corresponding to the same text data to obtain a concatenated vector;
step S402, inputting the concatenated vector into the preset entity recognition model, and outputting a first feature vector;
step S403, combining the first feature vector with the concatenated vector, inputting the combination into the preset entity recognition model, and outputting a second feature vector;
step S404, inputting the second feature vector into a classification layer of the preset entity recognition model, and training to optimize network parameters of the classification layer.
In this embodiment, when the preset entity recognition model is trained, in order to enhance the expression of the word-level and character-level features of each piece of text data in the training set, a character vector and a word vector corresponding to each piece of text data in the model training set are constructed, and the character vector and the word vector corresponding to the same text data are concatenated to obtain a concatenated vector. The concatenated vector is then input into the preset entity recognition model, which outputs a first feature vector. To further improve the model's feature expression of the text data, the feature extraction depth is increased: the first feature vector is combined with the concatenated vector and input into the preset entity recognition model again, and a second feature vector is output as the feature vector corresponding to the text data. Finally, this vector is input into the classification layer for iterative training, and the network parameters are optimized to obtain the trained entity recognition model.
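A minimal PyTorch sketch of steps S401-S404, under stated assumptions: word embeddings are aligned per character so both ID sequences share one length, all dimensions are illustrative, and a linear tag classifier stands in for the CRF layer of the BiLSTM-CRF named above.

    import torch
    import torch.nn as nn

    class TwoPassRecognizer(nn.Module):
        def __init__(self, char_vocab=5000, word_vocab=20000,
                     emb_dim=100, hidden_dim=128, n_tags=9):
            super().__init__()
            self.char_emb = nn.Embedding(char_vocab, emb_dim)
            self.word_emb = nn.Embedding(word_vocab, emb_dim)
            self.lstm1 = nn.LSTM(2 * emb_dim, hidden_dim,
                                 bidirectional=True, batch_first=True)
            # Second pass re-reads the first features together with the
            # original concatenated vector (step S403).
            self.lstm2 = nn.LSTM(2 * hidden_dim + 2 * emb_dim, hidden_dim,
                                 bidirectional=True, batch_first=True)
            self.classifier = nn.Linear(2 * hidden_dim, n_tags)

        def forward(self, char_ids, word_ids):
            concat = torch.cat([self.char_emb(char_ids),
                                self.word_emb(word_ids)], dim=-1)       # S401
            first, _ = self.lstm1(concat)                               # S402
            second, _ = self.lstm2(torch.cat([first, concat], dim=-1))  # S403
            return self.classifier(second)   # per-token tag scores, S404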
In an embodiment, before step S4 of inputting the model training set into a preset entity recognition model for training, the method includes:
obtaining a public data set;
and training an initial long short-term memory (LSTM) model based on the public data set to obtain the preset entity recognition model.
In this embodiment, the entity recognition model needs to be trained before the model training set is used. Here, the initial long short-term memory model may be trained with a public data set to initialize its neural network parameters, yielding the preset entity recognition model, which is then trained with the model training set. This approach can effectively improve the robustness of the model.
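A sketch of this pre-training step, assuming the TwoPassRecognizer above and a data loader yielding (char_ids, word_ids, tags) batches with -100 marking padded positions; the optimizer choice and epoch count are assumptions for illustration.

    import torch
    import torch.nn as nn

    def pretrain_on_public_data(model, public_loader, epochs=3, lr=1e-3):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
        model.train()
        for _ in range(epochs):
            for char_ids, word_ids, tags in public_loader:
                opt.zero_grad()
                logits = model(char_ids, word_ids)
                # Flatten (batch, seq, tags) for token-level cross-entropy.
                loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                               tags.reshape(-1))
                loss.backward()
                opt.step()
        return model  # parameters now initialized, per this embodiment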
In an embodiment, before step S1 of constructing, based on the knowledge graph, an entity having an association relation with an entity in the preset dictionary and adding it to the preset dictionary to obtain an expanded dictionary as the target dictionary, the method further includes:
step S1a, receiving a model training instruction input by a user, wherein the model training instruction carries application field information of the model to be trained;
step S1b, acquiring a preset dictionary of the corresponding field according to the application field information.
In this embodiment, for the trained entity recognition model to achieve a better recognition effect, the model should be trained with annotation data of the corresponding field. When a user wants to train a model, the user can input a corresponding model training instruction carrying the application field information of the model to be trained. According to this application field information, annotation data of the corresponding field can be obtained; training the model with such data yields an entity recognition model that performs better when recognizing text of the corresponding field.
In an embodiment, after step S6 of calculating the difference between the correct probability and the preset probability, judging whether the difference is smaller than the threshold and, if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model, the method includes:
step S7, iteratively training the preset entity recognition model until the difference between the correct probability and the preset probability is smaller than the threshold, to obtain a target entity recognition model;
step S8, receiving a target text input by a user, and receiving an instruction requesting entity recognition in the target text;
step S9, identifying the field information of the target text based on the request instruction;
step S10, judging whether the field information of the target text is the same as the application field information of the target entity recognition model;
step S11, if they are the same, performing named entity recognition on the target text based on the target entity recognition model; and if not, acquiring training data of the field information corresponding to the target text to retrain the target entity recognition model.
In this embodiment, when the target entity recognition model is used for entity recognition in a target text, the target text may not be a text in the medical field. To improve recognition accuracy and avoid recognition errors, the field information of the target text is identified first; if it is the same as the application field information of the target entity recognition model, accuracy is significantly improved when the model performs named entity recognition. If the field information of the target text differs from the application field information of the target entity recognition model, training data corresponding to the field information of the target text is acquired to retrain the target entity recognition model. The domain-gating logic of steps S8-S11 is sketched below.
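This is a minimal sketch of that gating step; the domain detector, NER call, data-fetching and retraining hooks are assumed interfaces, not APIs defined by the patent.

    def recognize_with_domain_check(target_text, model, model_domain,
                                    detect_domain, run_ner,
                                    fetch_training_data, retrain):
        text_domain = detect_domain(target_text)           # step S9
        if text_domain != model_domain:                    # step S10
            # Domains differ: retrain on data from the text's field (S11).
            model = retrain(model, fetch_training_data(text_domain))
        return run_ner(model, target_text)                 # step S11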
In an embodiment, the preset dictionary, the target dictionary, the agent model, the manual annotation data and the preset entity recognition model are stored in a blockchain. A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. It is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer and an application service layer.
Referring to fig. 2, in an embodiment of the present application, a device for selecting annotation data is further provided, including:
a construction unit, configured to construct a target entity based on a knowledge graph and add it to a preset dictionary to obtain an expanded dictionary as a target dictionary, wherein all entries in the target dictionary are annotation data, and the target entity has an association relation with an entity in the preset dictionary;
a selecting unit, configured to select dictionary annotation data from the target dictionary based on an agent model;
a classification unit, configured to divide preset manual annotation data into a manual training set and a manual test set;
a training unit, configured to form a model training set from the dictionary annotation data and the manual training set, and input the model training set into a preset entity recognition model for training;
a testing unit, configured to input the manual test set into the trained entity recognition model for testing, to obtain the correct probability that the predicted labels of the manual test set are the correct labels;
and a judging unit, configured to calculate the difference between the correct probability and a preset probability, judge whether the difference is smaller than a threshold, and if not, select optimized dictionary annotation data from the target dictionary based on the agent model and re-execute the step of forming a model training set from the dictionary annotation data and the manual training set.
In one embodiment, the training unit includes:
a building subunit, configured to construct a character vector and a word vector corresponding to each piece of text data in the model training set, and concatenate the character vector and the word vector corresponding to the same text data to obtain a concatenated vector;
a first output subunit, configured to input the concatenated vector into the preset entity recognition model and output a first feature vector;
a second output subunit, configured to combine the first feature vector with the concatenated vector, input the combination into the preset entity recognition model, and output a second feature vector;
and a training subunit, configured to input the second feature vector into a classification layer of the preset entity recognition model and train to optimize network parameters of the classification layer.
In one embodiment, the apparatus further includes:
a first acquisition unit, configured to acquire a public data set;
and an initial training unit, configured to train an initial long short-term memory (LSTM) model based on the public data set to obtain the preset entity recognition model.
In one embodiment, the apparatus further includes:
a first receiving unit, configured to receive a model training instruction input by a user, wherein the model training instruction carries application field information of the model to be trained;
and a second acquisition unit, configured to acquire a preset dictionary of the corresponding field according to the application field information.
In one embodiment, the apparatus further includes:
an iteration unit, configured to iteratively train the preset entity recognition model until the difference between the correct probability and the preset probability is smaller than the threshold, to obtain a target entity recognition model;
a second receiving unit, configured to receive a target text input by a user and receive an instruction requesting entity recognition in the target text;
an identification unit, configured to identify the field information of the target text based on the request instruction;
a field judging unit, configured to judge whether the field information of the target text is the same as the application field information of the target entity recognition model;
and a processing unit, configured to perform named entity recognition on the target text based on the target entity recognition model if they are the same, and if not, to acquire training data of the field information corresponding to the target text to retrain the target entity recognition model.
In one embodiment, the apparatus further comprises:
and a storage unit, configured to store the target dictionary, the agent model, the manual annotation data and the preset entity recognition model in a blockchain.
In this embodiment, please refer to the above method embodiments for specific implementation of the units and sub-units, which will not be described herein again.
Referring to fig. 3, an embodiment of the present application also provides a computer device, which may be a server and whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing the annotation data, the models and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the method for selecting annotation data.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is only a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects may be applied.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; the computer program, when executed by a processor, implements the method for selecting annotation data. It is to be understood that the computer-readable storage medium in this embodiment may be a volatile or a non-volatile readable storage medium.
In summary, according to the method and apparatus for selecting annotation data, the computer device and the storage medium provided in the embodiments of the present application, a target entity is constructed based on a knowledge graph and added to a preset dictionary to obtain an expanded dictionary as a target dictionary, so that the dictionary annotation data in the target dictionary is more complete; meanwhile, an entity recognition model is trained on the manual annotation data and the dictionary annotation data together, and whether the quality of the selected dictionary annotation data meets the requirement is judged; if not, optimized dictionary annotation data is selected from the target dictionary, so that dictionary annotation data of higher quality is obtained.
It will be understood by those skilled in the art that all or part of the processes of the above method embodiments can be implemented by instructing the relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, database or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, apparatus, article or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article or method. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other like elements in a process, apparatus, article or method that includes the element.
The above description is only for the preferred embodiment of the present application and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.

Claims (10)

1. A method for selecting annotation data, comprising the steps of:
constructing a target entity based on a knowledge graph and adding it to a preset dictionary to obtain an expanded dictionary as a target dictionary, wherein all entries in the target dictionary are annotation data, and the target entity has an association relation with an entity in the preset dictionary;
selecting dictionary annotation data from the target dictionary based on an agent model;
dividing preset manual annotation data into a manual training set and a manual test set;
forming a model training set from the dictionary annotation data and the manual training set, and inputting the model training set into a preset entity recognition model for training;
inputting the manual test set into the trained entity recognition model for testing, to obtain the correct probability that the predicted labels of the manual test set are the correct labels;
and calculating the difference between the correct probability and a preset probability, judging whether the difference is smaller than a threshold, and if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model and re-entering the step of forming a model training set from the dictionary annotation data and the manual training set.
2. The method for selecting annotation data according to claim 1, wherein the step of inputting the model training set into a preset entity recognition model for training comprises:
constructing a character vector and a word vector corresponding to each piece of text data in the model training set, and concatenating the character vector and the word vector corresponding to the same text data to obtain a concatenated vector;
inputting the concatenated vector into the preset entity recognition model, and outputting a first feature vector;
combining the first feature vector with the concatenated vector, inputting the combination into the preset entity recognition model, and outputting a second feature vector;
and inputting the second feature vector into a classification layer of the preset entity recognition model, and training to optimize network parameters of the classification layer.
3. The method for selecting annotation data according to claim 1, wherein before the step of inputting the model training set into a preset entity recognition model for training, the method comprises:
obtaining a public data set;
and training an initial long short-term memory (LSTM) model based on the public data set to obtain the preset entity recognition model.
4. The method for selecting annotation data according to claim 1, wherein before the step of constructing, based on the knowledge graph, an entity having an association relation with an entity in the preset dictionary and adding it to the preset dictionary to obtain an expanded dictionary as the target dictionary, the method further comprises:
receiving a model training instruction input by a user, wherein the model training instruction carries application field information of the model to be trained;
and acquiring a preset dictionary of the corresponding field according to the application field information.
5. The method for selecting annotation data according to claim 4, wherein after the step of calculating the difference between the correct probability and the preset probability, judging whether the difference is smaller than the threshold and, if not, selecting optimized dictionary annotation data from the target dictionary based on the agent model, the method comprises:
iteratively training the preset entity recognition model until the difference between the correct probability and the preset probability is smaller than the threshold, to obtain a target entity recognition model;
receiving a target text input by a user, and receiving an instruction requesting entity recognition in the target text;
identifying the field information of the target text based on the request instruction;
judging whether the field information of the target text is the same as the application field information of the target entity recognition model;
and if they are the same, performing named entity recognition on the target text based on the target entity recognition model; if not, acquiring training data of the field information corresponding to the target text to retrain the target entity recognition model.
6. The method for selecting annotation data according to claim 1, further comprising:
storing the target dictionary, the agent model, the manual annotation data and the preset entity recognition model in a blockchain.
7. An apparatus for selecting annotation data, comprising:
a construction unit, configured to construct a target entity based on a knowledge graph and add it to a preset dictionary to obtain an expanded dictionary as a target dictionary, wherein all entries in the target dictionary are annotation data, and the target entity has an association relation with an entity in the preset dictionary;
a selecting unit, configured to select dictionary annotation data from the target dictionary based on an agent model;
a classification unit, configured to divide preset manual annotation data into a manual training set and a manual test set;
a training unit, configured to form a model training set from the dictionary annotation data and the manual training set, and input the model training set into a preset entity recognition model for training;
a testing unit, configured to input the manual test set into the trained entity recognition model for testing, to obtain the correct probability that the predicted labels of the manual test set are the correct labels;
and a judging unit, configured to calculate the difference between the correct probability and a preset probability, judge whether the difference is smaller than a threshold, and if not, select optimized dictionary annotation data from the target dictionary based on the agent model and re-execute the step of forming a model training set from the dictionary annotation data and the manual training set.
8. The apparatus for selecting annotation data according to claim 7, wherein the training unit comprises:
a building subunit, configured to construct a character vector and a word vector corresponding to each piece of text data in the model training set, and concatenate the character vector and the word vector corresponding to the same text data to obtain a concatenated vector;
a first output subunit, configured to input the concatenated vector into the preset entity recognition model and output a first feature vector;
a second output subunit, configured to combine the first feature vector with the concatenated vector, input the combination into the preset entity recognition model, and output a second feature vector;
and a training subunit, configured to input the second feature vector into a classification layer of the preset entity recognition model and train to optimize network parameters of the classification layer.
9. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any of claims 1 to 6.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202010592331.4A 2020-06-24 2020-06-24 Method and device for selecting annotation data, computer device and storage medium Active CN111832294B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010592331.4A CN111832294B (en) 2020-06-24 2020-06-24 Method and device for selecting annotation data, computer device and storage medium
PCT/CN2020/118533 WO2021139257A1 (en) 2020-06-24 2020-09-28 Method and apparatus for selecting annotated data, and computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010592331.4A CN111832294B (en) 2020-06-24 2020-06-24 Method and device for selecting annotation data, computer device and storage medium

Publications (2)

Publication Number Publication Date
CN111832294A CN111832294A (en) 2020-10-27
CN111832294B (en) 2022-08-16

Family

ID=72898915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010592331.4A Active CN111832294B (en) 2020-06-24 2020-06-24 Method and device for selecting marking data, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111832294B (en)
WO (1) WO2021139257A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807097B (en) * 2020-10-30 2024-07-26 北京中科凡语科技有限公司 Named entity recognition model building method and named entity recognition method
CN113158652B (en) * 2021-04-19 2024-03-19 平安科技(深圳)有限公司 Data enhancement method, device, equipment and medium based on deep learning model
CN112926697B (en) * 2021-04-21 2021-10-12 北京科技大学 Abrasive particle image classification method and device based on semantic segmentation
CN113268593A (en) * 2021-05-18 2021-08-17 Oppo广东移动通信有限公司 Intention classification and model training method and device, terminal and storage medium
CN113378570B (en) * 2021-06-01 2023-12-12 车智互联(北京)科技有限公司 Entity identification model generation method, computing device and readable storage medium
CN113434491B (en) * 2021-06-18 2022-09-02 深圳市曙光信息技术有限公司 Character model data cleaning method, system and medium for deep learning OCR recognition
CN113591467B (en) * 2021-08-06 2023-11-03 北京金堤征信服务有限公司 Event main body recognition method and device, electronic equipment and medium
CN114004233B (en) * 2021-12-30 2022-05-06 之江实验室 Remote supervision named entity recognition method based on semi-training and sentence selection
CN115757784B (en) * 2022-11-21 2023-07-07 中科世通亨奇(北京)科技有限公司 Corpus labeling method and device based on labeling model and label template screening
CN118035444A (en) * 2024-02-20 2024-05-14 安徽彼亿网络科技有限公司 Information extraction method and device based on big data
CN118332136B (en) * 2024-06-12 2024-08-16 电子科技大学长三角研究院(衢州) Chinese radical embedding method based on knowledge graph

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908085A (en) * 2010-06-28 2010-12-08 北京航空航天大学 Multi-Agent-based distributive deduction simulation system and method
WO2019071661A1 (en) * 2017-10-09 2019-04-18 平安科技(深圳)有限公司 Electronic apparatus, medical text entity name identification method, system, and storage medium
CN109697289A (en) * 2018-12-28 2019-04-30 北京工业大学 It is a kind of improved for naming the Active Learning Method of Entity recognition
CN110134969A (en) * 2019-05-27 2019-08-16 北京奇艺世纪科技有限公司 A kind of entity recognition method and device
CN110717040A (en) * 2019-09-18 2020-01-21 平安科技(深圳)有限公司 Dictionary expansion method and device, electronic equipment and storage medium
CN111178045A (en) * 2019-10-14 2020-05-19 深圳软通动力信息技术有限公司 Automatic construction method of non-supervised Chinese semantic concept dictionary based on field, electronic equipment and storage medium
CN111259134A (en) * 2020-01-19 2020-06-09 出门问问信息科技有限公司 Entity identification method, equipment and computer readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110249341A (en) * 2017-02-03 2019-09-17 皇家飞利浦有限公司 Classifier training
CN108874878B (en) * 2018-05-03 2021-02-26 众安信息技术服务有限公司 Knowledge graph construction system and method
CN110008473B (en) * 2019-04-01 2022-11-25 云知声(上海)智能科技有限公司 Medical text named entity identification and labeling method based on iteration method
CN110020438B (en) * 2019-04-15 2020-12-08 上海冰鉴信息科技有限公司 Sequence identification based enterprise or organization Chinese name entity disambiguation method and device
CN110287481B (en) * 2019-05-29 2022-06-14 西南电子技术研究所(中国电子科技集团公司第十研究所) Named entity corpus labeling training system
CN110335676A (en) * 2019-07-09 2019-10-15 泰康保险集团股份有限公司 Data processing method, device, medium and electronic equipment

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908085A (en) * 2010-06-28 2010-12-08 北京航空航天大学 Multi-Agent-based distributive deduction simulation system and method
WO2019071661A1 (en) * 2017-10-09 2019-04-18 平安科技(深圳)有限公司 Electronic apparatus, medical text entity name identification method, system, and storage medium
CN109697289A (en) * 2018-12-28 2019-04-30 北京工业大学 It is a kind of improved for naming the Active Learning Method of Entity recognition
CN110134969A (en) * 2019-05-27 2019-08-16 北京奇艺世纪科技有限公司 A kind of entity recognition method and device
CN110717040A (en) * 2019-09-18 2020-01-21 平安科技(深圳)有限公司 Dictionary expansion method and device, electronic equipment and storage medium
CN111178045A (en) * 2019-10-14 2020-05-19 深圳软通动力信息技术有限公司 Automatic construction method of non-supervised Chinese semantic concept dictionary based on field, electronic equipment and storage medium
CN111259134A (en) * 2020-01-19 2020-06-09 出门问问信息科技有限公司 Entity identification method, equipment and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Case Restoration Approach to Named Entity Tagging in Degraded Documents; Rohini K. Srihari et al.; Proceedings of the Seventh International Conference on Document Analysis and Recognition; 2003-12-31; pp. 1-6 *
An Automatic Annotation Method for Emergency-Oriented Text Corpora; Liu Wei et al.; Journal of Chinese Information Processing; 2017-03-31; pp. 76-85 *

Also Published As

Publication number Publication date
CN111832294A (en) 2020-10-27
WO2021139257A1 (en) 2021-07-15

Similar Documents

Publication Publication Date Title
CN111832294B (en) Method and device for selecting annotation data, computer device and storage medium
CN111553164A (en) Training method and device for named entity recognition model and computer equipment
CN110704588A (en) Multi-round dialogue semantic analysis method and system based on long-term and short-term memory network
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN110569500A (en) Text semantic recognition method and device, computer equipment and storage medium
CN110688853B (en) Sequence labeling method and device, computer equipment and storage medium
CN110162681B (en) Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium
CN111767400A (en) Training method and device of text classification model, computer equipment and storage medium
CN112580346B (en) Event extraction method and device, computer equipment and storage medium
CN109033427B (en) Stock screening method and device, computer equipment and readable storage medium
US20230075290A1 (en) Method for linking a cve with at least one synthetic cpe
CN113254613B (en) Dialogue question-answering method, device, equipment and storage medium
CN112395857B (en) Speech text processing method, device, equipment and medium based on dialogue system
CN111783460A (en) Enterprise abbreviation extraction method and device, computer equipment and storage medium
US20220215292A1 (en) Method to identify incorrect account numbers
CN113688215A (en) Information extraction method, information extraction device, model training method, model training device, computer equipment and storage medium
CN112380837A (en) Translation model-based similar sentence matching method, device, equipment and medium
CN111723870A (en) Data set acquisition method, device, equipment and medium based on artificial intelligence
CN113255343A (en) Semantic identification method and device for label data, computer equipment and storage medium
CN112652295A (en) Language model training method, device, equipment and medium, and video subtitle checking method, device and medium
CN115409111A (en) Training method of named entity recognition model and named entity recognition method
CN110413994B (en) Hot topic generation method and device, computer equipment and storage medium
CN114547087A (en) Method, device, equipment and medium for automatically identifying proposal and generating report
CN114462423A (en) Method and device for training intention recognition model, model and electronic equipment
CN113177405A (en) Method, device and equipment for correcting data errors based on BERT and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant