CN114022738A

CN114022738A - Training sample acquisition method and device, computer equipment and readable storage medium

Info

Publication number: CN114022738A
Application number: CN202111355498.XA
Authority: CN
Inventors: 贾乐成
Original assignee: Shenzhen United Imaging Research Institute of Innovative Medical Equipment
Current assignee: Shenzhen United Imaging Research Institute of Innovative Medical Equipment
Priority date: 2021-11-16
Filing date: 2021-11-16
Publication date: 2022-02-08

Abstract

The method comprises: obtaining an initial sample set, and training on the labeled samples in the initial sample set to obtain a target model and an identification model; inputting unlabeled samples in the initial sample set into the target model to obtain annotation data; determining input parameters of the identification model according to the annotation data and the unlabeled samples, and inputting the input parameters into the identification model to obtain an evaluation score; and determining, according to the evaluation score, whether to update the labeled samples in the initial sample set with the annotation data and the unlabeled samples. The training sample acquisition method provided by the application does not require staff to label a large number of training samples, which reduces manpower consumption and therefore speeds up the acquisition of training samples.

Description

Training sample acquisition method and device, computer equipment and readable storage medium

Technical Field

The present application relates to the field of medical image processing, and in particular, to a training sample acquisition method, apparatus, computer device, and readable storage medium.

Background

At present, methods for realizing corresponding functions through a pre-trained deep learning model are increasingly widely applied. For example, image segmentation is performed by a trained image segmentation model. To obtain an accurate deep learning model, a large number of high-quality labeled samples are required.

In the conventional technology, a worker usually labels a training sample to obtain a labeled sample. However, labeling of a large number of training samples requires a large amount of manpower.

Disclosure of Invention

In view of the above, it is necessary to provide a training sample acquiring method, apparatus, computer device and readable storage medium for solving the above technical problems.

In a first aspect, an embodiment of the present application provides a training sample acquisition method, including:

acquiring an initial sample set, and training on the labeled samples in the initial sample set to obtain a target model and an identification model; the identification model is used for performing model evaluation on the target model;

inputting unlabeled samples in the initial sample set into the target model to obtain annotation data;

determining input parameters of the identification model according to the annotation data and the unlabeled samples; inputting the input parameters into the identification model to obtain an evaluation score;

and determining, according to the evaluation score, whether to update the labeled samples of the initial sample set with the annotation data and the unlabeled samples.

In one embodiment, determining, according to the evaluation score, whether to update the labeled samples of the initial sample set with the annotation data and the unlabeled samples comprises:

if the evaluation score is greater than or equal to a preset threshold, adding a sample composed of the unlabeled sample and the annotation data to the initial sample set as a new labeled sample, wherein the preset threshold is used for characterizing the accuracy of the output result of the target model.

In one embodiment, determining input parameters of the identification model according to the annotation data and the unlabeled samples comprises:

taking the unlabeled sample and the annotation data as the input parameters;

and/or performing feature extraction on the unlabeled sample and the annotation data, and taking the extracted features as the input parameters.

In one embodiment, the identification model includes a deep learning model and/or a machine learning model.

In one embodiment, when the identification model includes a deep learning model and a machine learning model, inputting the input parameters into the identification model to obtain the evaluation score includes:

inputting the input parameters into a deep learning model and a machine learning model respectively;

and carrying out averaging processing on the output of the deep learning model and the output of the machine learning model to obtain an evaluation score.

In one embodiment, the training sample acquiring method further includes:

performing sample evaluation on a plurality of initial labeled samples, and obtaining the labeled samples from the plurality of initial labeled samples based on the evaluation result.

In one embodiment, performing sample evaluation on the plurality of initial labeled samples, and obtaining the labeled samples from the plurality of initial labeled samples based on the evaluation result, comprises:

performing the following screening process on the plurality of initial labeled samples:

determining L groups of sample sets according to the plurality of initial labeled samples, determining a training set and a verification set according to the L groups of sample sets, and training an initial model corresponding to the target model according to the training set and the verification set to obtain N groups of sample screening models; L and N are integers greater than zero;

inputting the plurality of initial labeled samples into the N groups of sample screening models, determining an evaluation result according to the output of the N groups of sample screening models, and determining abnormal samples among the plurality of initial labeled samples according to the evaluation result;

and updating the plurality of initial labeled samples according to the abnormal samples, and performing the screening process on the updated plurality of initial labeled samples until the plurality of initial labeled samples contain no abnormal samples.

In one embodiment, determining a training set and a verification set according to L groups of sample sets, and training an initial model corresponding to a target model according to the training set and the verification set to obtain N groups of sample screening models, including:

traversing each group of sample sets in the L groups of sample sets, taking one group of sample sets as the verification set, and taking the remaining sample sets in the L groups of sample sets as the training set;

training the initial model by using a training set to obtain a training result, and verifying the trained initial model by using a verification set to obtain a verification result;

determining a sample screening model corresponding to a group of sample sets according to the training result, the verification result and a preset constraint condition;

and determining N groups of sample screening models according to the sample screening model corresponding to each group of sample sets in the L groups of sample sets.

In one embodiment, the training sample acquiring method further includes:

receiving subjective evaluation scores of the annotation data and modification results of the annotation data; the subjective evaluation score is used for representing subjective evaluation on the annotation data;

and if the subjective evaluation score is smaller than the evaluation score, training the target model and the identification model according to the modification result, the unlabeled sample and the subjective evaluation score.

In a second aspect, an embodiment of the present application provides a training sample acquiring apparatus, including:

the acquisition module is used for acquiring an initial sample set, and training on the labeled samples in the initial sample set to obtain a target model and an identification model; the identification model is used for performing model evaluation on the target model;

the first determining module is used for inputting unlabeled samples in the initial sample set into the target model to obtain annotation data;

the second determining module is used for determining the input parameters of the identification model according to the annotation data and the unlabeled samples, and inputting the input parameters into the identification model to obtain an evaluation score;

and the third determining module is used for determining, according to the evaluation score, whether to update the labeled samples of the initial sample set with the annotation data and the unlabeled samples.

In a third aspect, an embodiment of the present application provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the method provided by the foregoing embodiment when executing the computer program.

In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the method provided by the foregoing embodiment.

The embodiments of the present application provide a training sample acquisition method and apparatus, a computer device, and a readable storage medium. The method acquires an initial sample set, and trains on the labeled samples in the initial sample set to obtain a target model and an identification model; inputs unlabeled samples in the initial sample set into the target model to obtain annotation data; determines input parameters of the identification model according to the annotation data and the unlabeled samples; inputs the input parameters into the identification model to obtain an evaluation score; and determines, according to the evaluation score, whether to update the labeled samples of the initial sample set with the annotation data and the unlabeled samples. With this method, the unlabeled samples can be labeled by the target model trained on a small number of labeled samples in the initial sample set, so that a large number of labeled samples, namely training samples, can be obtained. A large number of unlabeled samples therefore do not need to be labeled manually, which reduces manpower consumption and improves the efficiency of acquiring training samples. In addition, the identification model, which is also trained on the labeled samples, checks the accuracy of the annotations that the target model produces for the unlabeled samples, which improves the accuracy of the labeled samples that are finally determined. In this way, a large number of high-quality labeled samples can be obtained from a small number of labeled samples.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application or in the conventional technology more clearly, the drawings used in the description of the embodiments or the conventional technology are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a schematic flowchart illustrating steps of a training sample obtaining method according to an embodiment of the present application;

FIG. 2 is a schematic flowchart illustrating steps of a training sample obtaining method according to an embodiment of the present disclosure;

FIG. 3 is a flowchart illustrating steps of a training sample obtaining method according to an embodiment of the present application;

FIG. 4 is a flowchart illustrating steps of a training sample obtaining method according to an embodiment of the present application;

FIG. 5 is a flowchart illustrating steps of a training sample obtaining method according to an embodiment of the present application;

FIG. 6 is a flowchart illustrating steps of a training sample obtaining method according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a training sample acquiring device according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments of the present application are described in detail below with reference to the accompanying drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application can, however, be embodied in many forms other than those described herein, and those skilled in the art can make similar modifications without departing from the spirit and scope of the application; the application is therefore not limited to the specific embodiments disclosed below.

The numbering of the components as such, e.g., "first", "second", etc., is used herein only to distinguish the objects as described, and does not have any sequential or technical meaning.

At present, methods for realizing corresponding functions through a pre-trained deep learning model are increasingly widely applied. To obtain an accurate deep learning model, a large number of high-quality labeled samples are required. In the conventional technology, a worker usually has to label the training samples to obtain labeled samples, and labeling a large number of training samples requires a large amount of manpower. For example, although the technology of segmenting medical images with a trained image segmentation model has made great progress, it still depends on high-quality annotation data (that is, annotation data that is very accurate). The medical image annotation data used for model training is difficult to obtain (clinical images and annotation data from hospitals are needed, which involves patients' private information, so a strict data use agreement must be signed with the hospital) and difficult to label (different hospitals and different doctors have different annotation habits, and doctors may temporarily adjust the annotation style according to the actual condition of the patient, so the data obtained from hospitals cannot be used directly for image segmentation training). The annotation cost is also high (senior doctors are needed to revise the annotations and make them uniform), and a large amount of manpower is required. In view of this, the present application provides a training sample acquisition method.

The training sample acquisition method provided by the application can be implemented by a computer device. Computer devices include, but are not limited to, control chips, personal computers, laptops, smartphones, tablets, and portable wearable devices. The method provided by the application can be implemented in Java software, and can also be implemented in other software.

The following describes the technical solutions of the present application and how to solve the technical problems with the technical solutions of the present application in detail with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

Referring to fig. 1, an embodiment of the present application provides a training sample obtaining method. The embodiment of the application specifically describes a training sample acquisition method by taking computer equipment as an execution subject, and the method comprises the following steps:

Step 100, obtaining an initial sample set, and training on the labeled samples in the initial sample set to obtain a target model and an identification model; the identification model is used for performing model evaluation on the target model.

The computer device obtains an initial sample set that includes labeled samples and unlabeled samples; the labeled samples may be a small number of high-quality samples labeled manually by staff. After obtaining the initial sample set, the computer device uses the labeled samples in it to train an initial model of the target model, obtaining the target model, and to train an initial model of the identification model, obtaining the identification model. The initial model of the target model may be a neural network model, that is, the target model may be obtained by training a neural network model. The initial model of the identification model may be a deep learning model or a machine learning model, that is, the identification model may be obtained by training a deep learning model or a machine learning model. The target model is the model that the staff need. If the staff need to segment medical images, the target model is an image segmentation model; if the staff need to recognize medical images, the target model is an image recognition model. The identification model is used for performing model evaluation on the target model; in other words, it is used for determining whether the trained target model performs its function accurately. If the target model is an image segmentation model, the identification model is used for determining whether the image segmentation model segments the medical image accurately. The initial sample set may be stored in advance by the staff in the memory of the computer device, and the computer device may retrieve it directly from the memory when needed. The present embodiment does not limit the method for obtaining the initial sample set or the type of the target model, as long as their functions can be realized.

Step 110, inputting unlabeled samples in the initial sample set into the target model to obtain annotation data.

After obtaining the target model by training on the labeled samples in the initial sample set, the computer device inputs the unlabeled samples in the initial sample set into the target model to label them, obtaining annotation data for the unlabeled samples. Once an unlabeled sample has been labeled with its annotation data, the corresponding labeled sample is obtained.

Step 120, determining input parameters of the identification model according to the annotation data and the unlabeled samples; and inputting the input parameters into the identification model to obtain an evaluation score.

After obtaining the annotation data of the unlabeled sample, the computer device determines the input parameters of the identification model according to the annotation data and the unlabeled sample, inputs the input parameters into the identification model, and determines from the output of the identification model whether the labeled sample obtained by labeling the unlabeled sample with the target model is accurate. In this embodiment, the output of the identification model is an evaluation score, and different evaluation scores represent different degrees of accuracy of the labeled sample obtained by labeling the unlabeled sample with the target model. The present embodiment does not limit the method for determining the input parameters of the identification model from the annotation data and the unlabeled sample, as long as its function can be realized.

Step 130, determining, according to the evaluation score, whether to update the labeled samples of the initial sample set with the annotation data and the unlabeled samples.

After obtaining the evaluation score, the computer device determines, according to the evaluation score, whether to update the labeled samples of the initial sample set with the annotation data and the unlabeled sample. In other words, the computer device uses the evaluation score to determine whether the annotation data produced by the target model is accurate; if so, it forms a labeled sample from the annotation data and the unlabeled sample and adds it to the initial sample set, increasing the number of labeled samples in the initial sample set. The present embodiment does not limit the method for determining, according to the evaluation score, whether the annotation data produced by the target model is accurate, as long as its function can be realized.
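Steps 110 to 130 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation of the application: the function name `expand_labeled_set`, the callable interfaces of the two models, and the `is_accurate` decision rule are all hypothetical, and both models are assumed to have been trained already in step 100.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical type aliases: the application does not prescribe data types.
Sample = Any
Annotation = Any


def expand_labeled_set(
    labeled: List[Tuple[Sample, Annotation]],
    unlabeled: List[Sample],
    target_model: Callable[[Sample], Annotation],
    identification_model: Callable[[Sample, Annotation], float],
    is_accurate: Callable[[float], bool],
) -> Tuple[List[Tuple[Sample, Annotation]], List[Sample]]:
    """Run steps 110 to 130 once over every unlabeled sample.

    Returns the enlarged labeled list and the samples that stay unlabeled.
    """
    new_labeled = list(labeled)
    still_unlabeled = []
    for sample in unlabeled:
        annotation = target_model(sample)                 # step 110
        score = identification_model(sample, annotation)  # step 120
        if is_accurate(score):                            # step 130
            new_labeled.append((sample, annotation))
        else:
            still_unlabeled.append(sample)
    return new_labeled, still_unlabeled


# Toy stand-ins so the sketch runs; real models would be trained networks.
toy_target = lambda x: x * 2
toy_identifier = lambda x, y: 1.0 if x >= 0 else 0.0
labeled, rest = expand_labeled_set(
    [(1, 2)], [3, -4], toy_target, toy_identifier, lambda s: s >= 0.5
)
```

In this toy run the sample `3` receives a passing score and joins the labeled set together with its generated annotation, while `-4` remains unlabeled.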

The training sample acquisition method provided by the embodiment of the application acquires an initial sample set, and trains on the labeled samples in the initial sample set to obtain a target model and an identification model; inputs unlabeled samples in the initial sample set into the target model to obtain annotation data; determines input parameters of the identification model according to the annotation data and the unlabeled samples; inputs the input parameters into the identification model to obtain an evaluation score; and determines, according to the evaluation score, whether to update the labeled samples of the initial sample set with the annotation data and the unlabeled samples. With this method, the unlabeled samples can be labeled by the target model trained on a small number of labeled samples in the initial sample set, so that a large number of labeled samples, namely training samples, can be obtained. A large number of unlabeled samples therefore do not need to be labeled manually, which reduces manpower consumption and improves the efficiency of acquiring training samples. In addition, the identification model, which is also trained on the labeled samples, checks the accuracy of the annotations that the target model produces for the unlabeled samples, which improves the accuracy of the labeled samples that are finally determined. In this way, a large number of high-quality labeled samples can be obtained from a small number of labeled samples.

In one embodiment, a possible implementation of determining, according to the evaluation score, whether to update the labeled samples of the initial sample set with the annotation data and the unlabeled samples includes:

After obtaining the output result of the identification model, namely the evaluation score, the computer device compares the evaluation score with a preset threshold, where the preset threshold is used for characterizing the accuracy of the output result of the target model. The preset threshold may be a value stored in the computer device in advance by the staff. If the computer device determines that the evaluation score is greater than or equal to the preset threshold, the annotation data obtained through the target model is considered accurate; a new labeled sample (namely, a high-quality labeled sample) is formed from the unlabeled sample and the annotation data and added to the initial sample set, which increases the number of labeled samples in the initial sample set. If the computer device determines that the evaluation score is smaller than the preset threshold, the annotation data obtained through the target model is considered inaccurate, and the sample composed of the unlabeled sample and the annotation data is not added to the initial sample set.
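The threshold comparison can be sketched as follows. This is a hedged illustration: the name `update_sample_set` and the value `0.8` are assumptions made for the example, since the application only states that the threshold is preset by the staff.

```python
from typing import Any, List, Tuple

# Hypothetical value; the application leaves the threshold to the staff.
PRESET_THRESHOLD = 0.8


def update_sample_set(
    labeled_samples: List[Tuple[Any, Any]],
    candidate: Tuple[Any, Any],
    evaluation_score: float,
    threshold: float = PRESET_THRESHOLD,
) -> bool:
    """Add the (unlabeled sample, annotation data) pair when the score passes.

    Returns True if the candidate was added, False otherwise.
    """
    if evaluation_score >= threshold:  # "greater than or equal to"
        labeled_samples.append(candidate)
        return True
    return False


samples: List[Tuple[Any, Any]] = []
added_first = update_sample_set(samples, ("image_a", "mask_a"), 0.8)
added_second = update_sample_set(samples, ("image_b", "mask_b"), 0.79)
```

A score exactly equal to the threshold is accepted, which matches the "greater than or equal to" wording of the embodiment.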

In this embodiment, the method for determining whether the annotation data obtained through the target model is accurate is simple, and is easy to understand and implement.

Referring to FIG. 2, in one embodiment, a possible implementation of determining the input parameters of the identification model according to the annotation data and the unlabeled samples includes:

Step 200, taking the unlabeled sample and the annotation data as the input parameters.

After obtaining the annotation data, the computer device may directly use the annotation data and the corresponding unlabeled sample as the input parameters of the identification model. That is, the computer device inputs the unlabeled sample and the annotation data into the identification model without any further processing.

And/or, Step 210, performing feature extraction on the unlabeled sample and the annotation data, and taking the extracted features as the input parameters.

After obtaining the annotation data, the computer device may perform feature extraction on the unlabeled sample and the annotation data, and input the extracted features into the identification model as its input parameters. The extracted features depend on the types of the annotation data and the unlabeled sample. If the unlabeled sample is a medical image and the annotation data is segmentation data of the medical image, the extracted features may include features of the medical image such as its size and shape, and features of the annotation data such as its position, shape, and area. The present embodiment does not limit the specific feature extraction method, as long as its function can be realized.
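As an illustration of step 210 for the segmentation case, the sketch below derives a few simple features from a 2D image and a binary segmentation mask. The feature set (image size, mask area, centroid, bounding-box extent) and the name `extract_features` are assumptions chosen for the example; the embodiment does not limit the feature extraction method.

```python
from typing import Dict, List


def extract_features(image: List[List[float]], mask: List[List[int]]) -> Dict[str, float]:
    """Compute simple size, position, shape and area features.

    `image` is a 2D grid of pixel values and `mask` a binary grid of the
    same shape (1 = segmented region). Both formats are assumptions.
    """
    height, width = len(image), len(image[0])
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    area = len(coords)
    features = {"image_height": height, "image_width": width, "area": area}
    if area == 0:
        # Empty mask: position and shape are undefined, use sentinel values.
        features.update(centroid_row=-1.0, centroid_col=-1.0,
                        bbox_height=0, bbox_width=0)
        return features
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    features.update(
        centroid_row=sum(rows) / area,          # position
        centroid_col=sum(cols) / area,
        bbox_height=max(rows) - min(rows) + 1,  # coarse shape descriptor
        bbox_width=max(cols) - min(cols) + 1,
    )
    return features


toy_image = [[0.0] * 4 for _ in range(3)]
toy_mask = [[0, 1, 1, 0],
            [0, 1, 1, 0],
            [0, 0, 0, 0]]
toy_features = extract_features(toy_image, toy_mask)
```

The resulting dictionary could be flattened into a feature vector before being passed to a machine learning model.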

In this embodiment, two methods for determining the input parameters of the identification model according to the annotation data and the unlabeled sample are provided, so that the user can choose between them according to the actual situation, which improves the applicability of the training sample acquisition method.

In one embodiment, the identification model includes a deep learning model and/or a machine learning model.

The above embodiment provides two forms of input parameters for the identification model, and correspondingly the identification model may be of two kinds, namely a deep learning model and a machine learning model. The input parameters of the deep learning model are the unlabeled sample and the annotation data, and the input parameters of the machine learning model are the features obtained by performing feature extraction on the unlabeled sample and the annotation data. In this embodiment, the identification model may include either the deep learning model or the machine learning model, or may include both.

In this embodiment, the type of the provided identification model can be selected by the user according to the actual situation, and the applicability of the training sample acquisition method can be improved.

In one embodiment, when the identification model includes a deep learning model, the unlabeled sample and the annotation data are input into the deep learning model as input parameters, and the output of the deep learning model is the evaluation score.

When the identification model includes a machine learning model, the features obtained by performing feature extraction on the unlabeled sample and the annotation data are input into the machine learning model as input parameters, and the output of the machine learning model is the evaluation score.

Referring to FIG. 3, in one embodiment, when the identification model includes a deep learning model and a machine learning model, a possible implementation of inputting the input parameters into the identification model to obtain the evaluation score includes:

Step 300, inputting the input parameters into the deep learning model and the machine learning model respectively.

Step 310, averaging the output of the deep learning model and the output of the machine learning model to obtain the evaluation score.

The computer device inputs the unlabeled sample and the annotation data into the deep learning model as input parameters, and records the output of the deep learning model as a first evaluation score. The computer device inputs the features obtained by performing feature extraction on the unlabeled sample and the annotation data into the machine learning model as input parameters, and records the output of the machine learning model as a second evaluation score. The first evaluation score and the second evaluation score are then averaged to obtain the evaluation score.
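Steps 300 and 310 can be sketched as below. The callables `deep_model`, `ml_model` and `feature_extractor` are hypothetical placeholders for trained models and a feature extraction routine, and an unweighted arithmetic mean is assumed because the embodiment only speaks of averaging.

```python
from typing import Any, Callable, Sequence


def ensemble_evaluation_score(
    sample: Any,
    annotation: Any,
    deep_model: Callable[[Any, Any], float],
    ml_model: Callable[[Sequence[float]], float],
    feature_extractor: Callable[[Any, Any], Sequence[float]],
) -> float:
    """Average the scores of the two identification models."""
    # Step 300: the deep learning model sees the raw sample and annotation,
    # the machine learning model sees the extracted features.
    first_score = deep_model(sample, annotation)
    second_score = ml_model(feature_extractor(sample, annotation))
    # Step 310: unweighted mean of the first and second evaluation scores.
    return (first_score + second_score) / 2.0


# Toy stand-ins so the sketch runs end to end.
score = ensemble_evaluation_score(
    "image", "mask",
    deep_model=lambda s, a: 1.0,
    ml_model=lambda feats: 0.5,
    feature_extractor=lambda s, a: [0.0, 1.0],
)
```

With the toy scores `1.0` and `0.5`, the averaged evaluation score is `0.75`.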

In this embodiment, the identification model includes a deep learning model and a machine learning model, and the evaluation score obtained according to the mean value of the output results of the two models is more accurate, so that a high-quality labeled sample can be obtained, and the reliability and the practicability of the training sample obtaining method can be improved.

In one embodiment, the training sample acquisition method further comprises:

performing sample evaluation on a plurality of initial labeled samples, and obtaining the labeled samples from the plurality of initial labeled samples based on the evaluation result.

The plurality of initial labeled samples may be a small number of manually labeled samples acquired by the staff. They may be stored in advance in the memory of the computer device and retrieved directly from the memory when needed. They may also be stored in advance in a dedicated storage device, from which the computer device obtains them when needed. The present embodiment does not limit the method by which the computer device acquires the plurality of initial labeled samples, as long as its function can be realized.

After obtaining the plurality of initial labeled samples, the computer device evaluates them and obtains the labeled samples from them according to the evaluation result. In other words, the computer device evaluates each initial labeled sample and determines from the evaluation result whether it is a high-quality labeled sample. If an initial labeled sample is determined to be a high-quality labeled sample, it is added to the initial sample set as a labeled sample. If it is determined not to be a high-quality labeled sample, it is either re-labeled by the staff or rejected. If it is re-labeled, the re-labeled sample is evaluated again. The present embodiment does not limit the method by which the computer device evaluates the plurality of initial labeled samples, as long as its function can be realized.

In this embodiment, by evaluating the obtained multiple initial labeled samples, the quality of the obtained labeled samples can be ensured to be higher, so that the target model and the identification model trained by the labeled samples are more accurate, the labeling of the unlabeled samples is more accurate, and a large number of high-quality labeled samples, namely training samples, can be obtained.

Referring to fig. 4, in one embodiment, a possible implementation manner of obtaining a labeled sample from a plurality of initial labeled samples based on an evaluation result involves performing sample evaluation on the plurality of initial labeled samples, including:

After obtaining the plurality of initial marked samples, the computer device performs the following screening process on them:

Step 400, determining L groups of sample sets according to the plurality of initial marked samples, determining a training set and a verification set according to the L groups of sample sets, and training an initial model corresponding to the target model according to the training set and the verification set to obtain N groups of sample screening models; L and N are integers greater than zero.

The computer device groups the obtained initial marked samples to obtain L groups of sample sets. Specifically, the computer device divides the plurality of initial marked samples evenly, that is, each group of sample sets contains the same number of initial marked samples.
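The even grouping described above can be sketched as follows. This is a minimal illustration only: the function name `split_into_groups` is hypothetical, and rejecting a sample count that is not divisible by L is an assumption, since the text does not say how a remainder is handled.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_groups(samples: Sequence[T], num_groups: int) -> List[List[T]]:
    """Split samples into num_groups groups that all have the same size."""
    if num_groups <= 0:
        raise ValueError("num_groups (L) must be greater than zero")
    if len(samples) % num_groups != 0:
        # Assumption: the source only states that every group has the same
        # number of samples, so an uneven split is rejected here.
        raise ValueError("number of samples must be divisible by num_groups")
    size = len(samples) // num_groups
    return [list(samples[i * size:(i + 1) * size]) for i in range(num_groups)]
```

For example, six samples with L = 3 give three groups of two samples each.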

After the L groups of sample sets are determined, the computer device determines a training set and a verification set according to the L groups of sample sets, trains an initial model corresponding to the target model according to the training set, and verifies the trained initial model according to the verification set to obtain N groups of sample screening models. The plurality of initial marked samples can then be screened using the N groups of sample screening models. The present embodiment does not limit the method for determining the N groups of sample screening models, as long as the function can be realized.

Referring to fig. 5, in an embodiment, a possible implementation manner of determining a training set and a verification set according to L groups of sample sets, and training an initial model corresponding to a target model according to the training set and the verification set to obtain N groups of sample screening models includes:

Step 500, traversing each group of sample sets in the L groups of sample sets, taking one group of sample sets as a verification set, and taking the remaining sample sets in the L groups of sample sets as a training set.

After obtaining the L groups of sample sets, the computer device traverses each group of sample sets in the L groups of sample sets, uses one group of sample sets as a verification set, and uses the remaining sample sets in the L groups of sample sets as a training set. That is, each of the L groups of sample sets serves once as the verification set and otherwise as part of a training set.

Assume that there are 3 sets of sample sets, denoted as a first sample set, a second sample set, and a third sample set, respectively. The first sample set is a verification set, and the second sample set and the third sample set are training sets; the second sample set is a verification set, and the first sample set and the third sample set are training sets; the third sample set is a validation set, and the first sample set and the second sample set are training sets.
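The traversal in the three-group example can be sketched as below. The function name `leave_one_group_out` is hypothetical, and the groups are represented simply as lists of sample identifiers.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def leave_one_group_out(
    groups: Sequence[Sequence[T]],
) -> List[Tuple[List[T], List[T]]]:
    """Return one (training_set, verification_set) pair per group.

    Each group is used once as the verification set; the remaining
    groups are merged to form the corresponding training set.
    """
    folds = []
    for i, verification in enumerate(groups):
        training = [s for j, g in enumerate(groups) if j != i for s in g]
        folds.append((training, list(verification)))
    return folds
```

With the first, second, and third sample sets as input, this produces three pairs, matching the three combinations listed above.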

Step 510, training the initial model by using the training set to obtain a training result, and verifying the trained initial model by using the verification set to obtain a verification result.

The computer device first trains the initial model corresponding to the target model according to the obtained training set to obtain a training result, and then verifies the trained initial model by using the verification set corresponding to that training set to obtain a verification result. The present embodiment does not limit the specific training process and verification process, as long as the functions can be realized.

Step 520, determining a sample screening model corresponding to a group of sample sets according to the training result, the verification result and a preset constraint condition.

The computer device can determine the sample screening model corresponding to a group of sample sets according to the training result, the verification result and the preset constraint condition. The preset constraint condition is related to the number of sample screening models to be obtained for a group of sample sets. The preset constraint condition may include a constraint on the training result and a constraint on the verification result. The present embodiment does not limit the method for determining the sample screening model corresponding to a group of sample sets, as long as the function can be realized.

Step 530, determining N groups of sample screening models according to the sample screening model corresponding to each group of sample sets in the L groups of sample sets.

The computer device may determine the N groups of sample screening models by traversing each group of sample sets in the L groups of sample sets and determining the sample screening models corresponding to each group of sample sets. Each group of sample sets yields N sample screening models, so the L groups of sample sets yield N groups of sample screening models. That is, the first sample screening model obtained for each group of sample sets is placed in the first group of sample screening models, the second sample screening model obtained for each group of sample sets is placed in the second group of sample screening models, and so on, until the Nth sample screening model obtained for each group of sample sets is placed in the Nth group of sample screening models. In other words, the number of groups of sample screening models corresponding to the L groups of sample sets equals the number of sample screening models obtained for one group of sample sets.
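The regrouping described above amounts to a transposition: the i-th model from every group of sample sets is collected into the i-th group of sample screening models. A minimal sketch, assuming the models are held in a nested list (the function name `regroup_models` and this representation are illustrative, not taken from the source):

```python
from typing import List, Sequence, TypeVar

M = TypeVar("M")


def regroup_models(per_fold_models: Sequence[Sequence[M]]) -> List[List[M]]:
    """Turn L lists of N models into N groups of L models.

    per_fold_models[k][i] is the i-th sample screening model obtained
    for the k-th group of sample sets.
    """
    if not per_fold_models:
        return []
    n = len(per_fold_models[0])
    if any(len(models) != n for models in per_fold_models):
        raise ValueError("every group of sample sets must yield N models")
    # zip(*...) pairs up the i-th model of every group of sample sets.
    return [list(group) for group in zip(*per_fold_models)]
```

With L = 3 and N = 2, the result is 2 groups that each contain 3 models.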

In an optional embodiment, the preset constraint condition includes a constraint corresponding to the verification result and a constraint corresponding to the training result. The constraint corresponding to the verification result may be that an error between the verification result and the training result reaches a preset threshold, and the constraint corresponding to the training result may be that the error between the verification result and the training result no longer decreases, or that an overlap coefficient between the verification result and the training result no longer increases. In this way, 2 sample screening models corresponding to a group of sample sets can be obtained according to the training result, the verification result and the preset constraint condition, and 2 groups of sample screening models can be obtained according to the sample screening models corresponding to each group of sample sets in the L groups of sample sets.

Step 410, inputting the plurality of initial marked samples into the N groups of sample screening models, determining an evaluation result according to the output of the N groups of sample screening models, and determining abnormal samples in the plurality of initial marked samples according to the evaluation result.

After obtaining the N groups of sample screening models, the computer device inputs each initial marking sample in the plurality of initial marking samples into the N groups of sample screening models respectively, determines an evaluation result of each initial marking sample according to the output of the N groups of sample screening models, and determines an abnormal sample in the plurality of initial marking samples according to the evaluation result. The present embodiment does not limit the method of determining an abnormal sample among a plurality of initial marked samples from the evaluation result as long as the function thereof can be achieved.

In an optional embodiment, the computer device inputs each initial marked sample into the N groups of sample screening models and obtains one evaluation result per group of sample screening models. It then combines the evaluation results of the N groups into a comprehensive evaluation (if the evaluation result of a sample screening model is a numerical value, the comprehensive evaluation is the mean of the evaluation results of the N groups of sample screening models), sorts the comprehensive evaluations of all the initial marked samples from large to small, and takes the initial marked samples with the worst comprehensive evaluations as the abnormal samples. The worst comprehensive evaluations may refer to the last 10% to 15% of all comprehensive evaluations (for example, the 2 or 3 smallest means). If the comprehensive evaluations corresponding to all the initial marked samples satisfy a preset condition, there is no abnormal sample among the plurality of initial marked samples. The preset condition may be set by a worker according to the actual situation; specifically, the preset condition may be that the mean is smaller than a threshold. That is, if the means of the evaluation results corresponding to all the initial marked samples are smaller than the threshold, there is no abnormal sample among the plurality of initial marked samples.
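The comprehensive evaluation in this optional embodiment can be sketched as follows, assuming numerical evaluation results. The function name, the parameter names, and the rounding used to convert the 10% to 15% fraction into a count are assumptions. The threshold check follows the text literally: when every mean is smaller than the threshold, no abnormal sample is reported.

```python
from statistics import mean
from typing import Dict, List, Optional, Sequence


def find_abnormal_samples(
    scores: Dict[str, Sequence[float]],
    worst_fraction: float = 0.1,
    threshold: Optional[float] = None,
) -> List[str]:
    """Return the sample ids with the worst comprehensive evaluation.

    scores maps a sample id to its N evaluation results, one per group
    of sample screening models.
    """
    if not scores:
        return []
    # Comprehensive evaluation: mean of the N evaluation results.
    composite = {sid: mean(vals) for sid, vals in scores.items()}
    # Preset condition as stated in the text: all means below the threshold.
    if threshold is not None and all(v < threshold for v in composite.values()):
        return []
    # Sort from large to small; the last entries are treated as the worst.
    ranked = sorted(composite, key=lambda sid: composite[sid], reverse=True)
    # Assumption: round the fraction to a count and flag at least one sample.
    count = max(1, round(len(ranked) * worst_fraction))
    return ranked[-count:]
```

With 20 initial marked samples and `worst_fraction` between 0.10 and 0.15, this flags 2 or 3 samples, which agrees with the example in the text.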

Step 420, updating the plurality of initial marked samples according to the abnormal samples, and performing the screening process on the updated plurality of initial marked samples until the plurality of initial marked samples contain no abnormal samples.

When determining an abnormal sample existing in the plurality of initial marked samples, the computer device may remove the abnormal sample from the plurality of initial marked samples to obtain a plurality of new initial marked samples; or the staff can re-label the abnormal sample, and add the re-labeled sample into the plurality of initial labeled samples to obtain a plurality of new initial labeled samples. After obtaining the new plurality of initial marked samples, the computer device performs the above screening process on the new plurality of initial marked samples until the plurality of initial marked samples do not contain abnormal samples, and the plurality of initial marked samples are marked samples at this time.
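The repeated screening can be sketched as a loop. Only the removal variant is shown; re-labeling by a worker would replace the removal step. The callable `find_abnormal` stands in for steps 400 and 410, and `max_rounds` is a safety limit added for this sketch. Both are assumptions, not part of the described method.

```python
from typing import Callable, Hashable, List, Sequence, TypeVar

S = TypeVar("S", bound=Hashable)


def screen_until_clean(
    samples: Sequence[S],
    find_abnormal: Callable[[List[S]], List[S]],
    max_rounds: int = 100,
) -> List[S]:
    """Repeat the screening process until no abnormal sample is found."""
    current = list(samples)
    for _ in range(max_rounds):
        abnormal = set(find_abnormal(current))
        if not abnormal:
            # No abnormal samples remain: these are the marked samples.
            return current
        # Removal variant: drop the abnormal samples and screen again.
        current = [s for s in current if s not in abnormal]
    raise RuntimeError("screening did not converge within max_rounds")
```

A trivial usage example, with negative numbers playing the role of abnormal samples: `screen_until_clean([3, -1, 2, -5], lambda xs: [x for x in xs if x < 0][:1])` removes one abnormal sample per round and stops once none remain.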

The method provided by the embodiment can improve the quality of the marked samples in the initial sample set, so that accurate target models and identification models can be obtained, unmarked samples can be accurately marked, and a large number of high-quality marked samples, namely training samples, can be obtained.

Referring to fig. 6, in an embodiment, the training sample obtaining method further includes:

Step 600, receiving a subjective evaluation score of the annotation data and a modification result of the annotation data; the subjective evaluation score is used to characterize the subjective evaluation of the annotation data.

The subjective evaluation score is used for representing subjective evaluation of the annotation data, namely evaluation of the annotation data by a worker. The result of modifying the annotation data refers to the data obtained by modifying the annotation data by the staff. In other words, the staff member evaluates the annotation data obtained by the target model, modifies it, and inputs the subjective evaluation score and the modified annotation data into the computer device.

Step 610, if the subjective evaluation score is smaller than the evaluation score, training the target model and the identification model according to the modification result, the unlabeled sample and the subjective evaluation score.

After obtaining the subjective evaluation score, the computer device compares it with the evaluation score obtained by the identification model. If the subjective evaluation score is smaller than the evaluation score, the target model and the identification model may be inaccurate. In that case, the target model is trained on the marked sample composed of the modification result and the unmarked sample, and the identification model is trained on that marked sample together with the subjective evaluation score, so that both models are optimized. If the subjective evaluation score is greater than or equal to the evaluation score, the target model and the identification model are considered sufficiently accurate.
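The comparison in step 610 can be sketched as below. The `Feedback` record, the function name, and the dictionary layout of the returned training data are hypothetical. The actual training of the two models is outside the scope of this sketch.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Feedback:
    unlabeled_sample: Any      # the sample that was fed to the target model
    modified_annotation: Any   # annotation data after the worker's changes
    subjective_score: float    # the worker's score for the original annotation


def collect_retraining_data(
    feedback: Feedback, evaluation_score: float
) -> Optional[Dict[str, Any]]:
    """Return training data for both models, or None if no retraining is needed."""
    if feedback.subjective_score >= evaluation_score:
        # The models are treated as accurate enough; nothing to retrain.
        return None
    labeled = (feedback.unlabeled_sample, feedback.modified_annotation)
    return {
        # Target model: trained on the corrected marked sample.
        "target_model": labeled,
        # Identification model: corrected marked sample plus subjective score.
        "identification_model": (labeled, feedback.subjective_score),
    }
```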

In this embodiment, without introducing a large number of marked samples, the target model and the identification model are optimized by receiving the workers' evaluations and modifications of the annotation data during routine use. More accurate target and identification models can thus be obtained, which improves the practicability of the training sample acquisition method.

It should be understood that, although the steps in the flowcharts in the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and described, and may be performed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or multiple stages that are not necessarily performed at the same time, but may be performed at different times, and the order of performance of the sub-steps or stages is not necessarily sequential, but may be performed in turn or alternately with other steps or at least some of the sub-steps or stages of other steps.

Referring to fig. 7, an embodiment of the present application provides a training sample acquiring apparatus 10, which includes an acquiring module 11, a first determining module 12, a second determining module 13, and a third determining module 14, wherein:

the acquisition module 11 is configured to acquire an initial sample set, and train to obtain a target model and an identification model according to a labeled sample in the initial sample set; the identification model is used for carrying out model evaluation on the target model;

the first determining module 12 is configured to input unlabeled samples in the initial sample set into the target model to obtain labeled data;

the second determining module 13 is configured to determine an input parameter of the authentication model according to the labeled data and the unlabeled sample; inputting the input parameters into the identification model to obtain an evaluation score;

the third determination module 14 is configured to determine whether to update the marked samples of the initial sample set according to the annotation data and the unmarked samples according to the evaluation score.

In an embodiment, the third determining module 14 is specifically configured to add a sample composed of an unlabeled sample and labeled data as a new labeled sample to the initial sample set if the evaluation score is greater than or equal to a preset threshold, where the preset threshold is used to characterize accuracy of an output result of the target model.

In one embodiment, the second determination module 13 is specifically configured to use the unlabeled sample and the labeled data as input parameters; and/or performing feature extraction processing on the unmarked sample and the marked data, and taking the extracted features as input parameters.

In one embodiment, the second determining module 13 is further specifically configured to input the input parameters into the deep learning model and the machine learning model, respectively; and carrying out averaging processing on the output of the deep learning model and the output of the machine learning model to obtain an evaluation score.
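The averaging performed by the second determining module 13 and the threshold rule applied by the third determining module 14 can be sketched together as follows. The function names and the list-based sample set are assumptions. The two model outputs are taken as plain numbers.

```python
from typing import Any, List, Tuple


def evaluation_score(deep_learning_output: float, machine_learning_output: float) -> float:
    """Average the outputs of the deep learning and machine learning models."""
    return (deep_learning_output + machine_learning_output) / 2.0


def maybe_add_sample(
    labeled_samples: List[Tuple[Any, Any]],
    unlabeled_sample: Any,
    annotation: Any,
    score: float,
    threshold: float,
) -> bool:
    """Add (sample, annotation) as a new marked sample if score >= threshold."""
    if score >= threshold:
        labeled_samples.append((unlabeled_sample, annotation))
        return True
    return False
```

For instance, model outputs of 1.0 and 0.5 give an evaluation score of 0.75, so the sample is added when the preset threshold is 0.7 and skipped when it is 0.8.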

In one embodiment, the training sample acquiring device 10 further comprises an evaluation module. The evaluation module is used for carrying out sample evaluation on the plurality of initial marked samples, and obtaining marked samples from the plurality of initial marked samples based on the evaluation result.

In one embodiment, the evaluation module is specifically configured to perform the following screening process on a plurality of initially labeled samples:

determining L groups of sample sets according to the plurality of initial marked samples, determining a training set and a verification set according to the L groups of sample sets, and training an initial model corresponding to the target model according to the training set and the verification set to obtain N groups of sample screening models; L and N are integers greater than zero; inputting the plurality of initial marked samples into the N groups of sample screening models, determining an evaluation result according to the output of the N groups of sample screening models, and determining abnormal samples in the plurality of initial marked samples according to the evaluation result; and updating the plurality of initial marked samples according to the abnormal samples, and performing the screening process on the updated plurality of initial marked samples until the plurality of initial marked samples contain no abnormal samples.

In one embodiment, the evaluation module is further specifically configured to traverse each group of sample sets in the L groups of sample sets, take one group of sample sets as a verification set, and take the remaining sample sets in the L groups of sample sets as a training set; training the initial model by using a training set to obtain a training result, and verifying the trained initial model by using a verification set to obtain a verification result; determining a sample screening model corresponding to a group of sample sets according to the training result, the verification result and a preset constraint condition; and determining N groups of sample screening models according to the sample screening model corresponding to each group of sample sets in the L groups of sample sets.

In one embodiment, the training sample acquiring device 10 further comprises a receiving module and a training module.

The receiving module is used for receiving the subjective evaluation score of the annotation data and the modification result of the annotation data; the subjective evaluation score is used for representing subjective evaluation on the annotation data;

and the training module is used for training the target model and the identification model according to the modification result, the unlabeled sample and the subjective evaluation score if the subjective evaluation score is smaller than the evaluation score.

For the specific limitations of the training sample acquiring device 10, reference may be made to the limitations of the training sample acquiring method above, which are not described herein again. The modules in the training sample acquiring device 10 may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or be independent of, a processor in the computer device, or may be stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the modules.

Referring to fig. 8, in one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store the initial sample set, the initial model, and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a training sample acquisition method.

Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, there is provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the following steps when executing the computer program:

In one embodiment, the processor, when executing the computer program, further performs the steps of: and if the evaluation score is larger than or equal to a preset threshold value, adding a sample consisting of the unmarked sample and the marked data into the initial sample set as a new marked sample, wherein the preset threshold value is used for representing the accuracy of the output result of the target model.

In one embodiment, the processor, when executing the computer program, further performs the steps of: taking the unmarked sample and the marked data as input parameters; and/or performing feature extraction processing on the unmarked sample and the marked data, and taking the extracted features as input parameters.

In one embodiment, the processor, when executing the computer program, further performs the steps of: inputting the input parameters into a deep learning model and a machine learning model respectively; and carrying out averaging processing on the output of the deep learning model and the output of the machine learning model to obtain an evaluation score.

In one embodiment, the processor, when executing the computer program, further performs the steps of: and performing sample evaluation on the plurality of initial marked samples, and obtaining marked samples from the plurality of initial marked samples based on the evaluation result.

In one embodiment, the processor, when executing the computer program, further performs the steps of: performing the following screening process on a plurality of initial marked samples: determining L groups of sample sets according to the plurality of initial marked samples, determining a training set and a verification set according to the L groups of sample sets, and training an initial model corresponding to the target model according to the training set and the verification set to obtain N groups of sample screening models; L and N are integers greater than zero; inputting the plurality of initial marked samples into the N groups of sample screening models, determining an evaluation result according to the output of the N groups of sample screening models, and determining abnormal samples in the plurality of initial marked samples according to the evaluation result; and updating the plurality of initial marked samples according to the abnormal samples, and performing the screening process on the updated plurality of initial marked samples until the plurality of initial marked samples contain no abnormal samples.

In one embodiment, the processor, when executing the computer program, further performs the steps of: traversing each group of sample sets in the L groups of sample sets, taking one group of sample sets as a verification set, and taking the rest sample sets in the L groups of sample sets as training sets; training the initial model by using a training set to obtain a training result, and verifying the trained initial model by using a verification set to obtain a verification result; determining a sample screening model corresponding to a group of sample sets according to the training result, the verification result and a preset constraint condition; and determining N groups of sample screening models according to the sample screening model corresponding to each group of sample sets in the L groups of sample sets.

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:

In one embodiment, the computer program when executed by the processor further performs the steps of: and if the evaluation score is larger than or equal to a preset threshold value, adding a sample consisting of the unmarked sample and the marked data into the initial sample set as a new marked sample, wherein the preset threshold value is used for representing the accuracy of the output result of the target model.

In one embodiment, the computer program when executed by the processor further performs the steps of: taking the unmarked sample and the marked data as input parameters; and/or performing feature extraction processing on the unmarked sample and the marked data, and taking the extracted features as input parameters.

In one embodiment, the computer program when executed by the processor further performs the steps of: inputting the input parameters into a deep learning model and a machine learning model respectively; and carrying out averaging processing on the output of the deep learning model and the output of the machine learning model to obtain an evaluation score.

In one embodiment, the computer program when executed by the processor further performs the steps of: and performing sample evaluation on the plurality of initial marked samples, and obtaining marked samples from the plurality of initial marked samples based on the evaluation result.

In one embodiment, the computer program when executed by the processor further performs the steps of: performing the following screening process on a plurality of initial marked samples: determining L groups of sample sets according to the plurality of initial marked samples, determining a training set and a verification set according to the L groups of sample sets, and training an initial model corresponding to the target model according to the training set and the verification set to obtain N groups of sample screening models; L and N are integers greater than zero; inputting the plurality of initial marked samples into the N groups of sample screening models, determining an evaluation result according to the output of the N groups of sample screening models, and determining abnormal samples in the plurality of initial marked samples according to the evaluation result; and updating the plurality of initial marked samples according to the abnormal samples, and performing the screening process on the updated plurality of initial marked samples until the plurality of initial marked samples contain no abnormal samples.

In one embodiment, the computer program when executed by the processor further performs the steps of: traversing each group of sample sets in the L groups of sample sets, taking one group of sample sets as a verification set, and taking the rest sample sets in the L groups of sample sets as training sets; training the initial model by using a training set to obtain a training result, and verifying the trained initial model by using a verification set to obtain a verification result; determining a sample screening model corresponding to a group of sample sets according to the training result, the verification result and a preset constraint condition; and determining N groups of sample screening models according to the sample screening model corresponding to each group of sample sets in the L groups of sample sets.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments express only several embodiments of the present application. Their description is relatively specific and detailed, but should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A training sample acquisition method is characterized by comprising

Acquiring an initial sample set, and training according to a labeled sample in the initial sample set to obtain a target model and an identification model; the identification model is used for carrying out model evaluation on the target model;

inputting unmarked samples in the initial sample set into the target model to obtain marked data;

determining input parameters of the identification model according to the labeling data and the unlabeled sample; inputting the input parameters into the identification model to obtain an evaluation score;

and determining whether to update the marked samples of the initial sample set according to the marked data and the unmarked samples according to the evaluation score.

2. The training sample acquisition method according to claim 1, wherein the determining whether to update the labeled samples of the initial sample set according to the labeled data and the unlabeled samples according to the evaluation score includes:

and if the evaluation score is larger than or equal to a preset threshold value, adding a sample formed by the unmarked sample and the marked data into the initial sample set as a new marked sample, wherein the preset threshold value is used for representing the accuracy of the output result of the target model.

3. The method for obtaining training samples according to claim 1, wherein the determining input parameters of the identification model according to the labeled data and the unlabeled samples comprises:

taking the unlabeled sample and the labeled data as the input parameters;

and/or performing feature extraction processing on the unmarked sample and the marked data, and taking the extracted features as the input parameters.

4. The training sample acquisition method according to claim 3, wherein the identification model includes a deep learning model and/or a machine learning model.

5. The training sample acquisition method according to claim 4, wherein when the identification model includes a deep learning model and a machine learning model, the inputting the input parameters into the identification model to obtain an evaluation score includes:

inputting the input parameters into the deep learning model and the machine learning model respectively;

and averaging the output of the deep learning model and the output of the machine learning model to obtain the evaluation score.
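
As a hedged illustration of the averaging in claim 5, the sketch below takes an unweighted arithmetic mean of the model outputs. The claim does not state whether the average is weighted, so the unweighted mean is an assumption, as are the name `averaged_score` and the constant-output stand-in models.

```python
from typing import Callable, Sequence


def averaged_score(
    input_params: Sequence[float],
    scorers: Sequence[Callable[[Sequence[float]], float]],
) -> float:
    """Feed the same input parameters to every scorer and return the mean of their outputs."""
    if not scorers:
        raise ValueError("at least one scoring model is required")
    return sum(scorer(input_params) for scorer in scorers) / len(scorers)


# Hypothetical stand-ins for the deep learning model and the machine learning model.
def deep_model(params: Sequence[float]) -> float:
    return 0.75


def ml_model(params: Sequence[float]) -> float:
    return 0.25


score = averaged_score([0.1, 0.2], [deep_model, ml_model])
```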

6. The training sample acquisition method of claim 1, further comprising:

performing sample evaluation on a plurality of initial labeled samples, and obtaining the labeled samples from the plurality of initial labeled samples based on the evaluation result.

7. The training sample acquisition method according to claim 6, wherein the performing sample evaluation on a plurality of initial labeled samples, and obtaining the labeled samples from the plurality of initial labeled samples based on the evaluation result, comprises:

subjecting the plurality of initial labeled samples to the following screening process:

determining L groups of sample sets according to the initial labeled samples, determining a training set and a verification set according to the L groups of sample sets, and training an initial model corresponding to the target model according to the training set and the verification set to obtain N groups of sample screening models, wherein L and N are integers greater than zero;

inputting the plurality of initial labeled samples into the N groups of sample screening models, determining the evaluation result according to the output of the N groups of sample screening models, and determining abnormal samples in the plurality of initial labeled samples according to the evaluation result;

and updating the plurality of initial labeled samples according to the abnormal samples, and executing the screening processing on the plurality of updated initial labeled samples until the plurality of initial labeled samples do not contain abnormal samples.
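
The repeat-until-clean structure of the screening process in claim 7 can be sketched as below. This is only a sketch under stated assumptions: the claim says the samples are updated "according to the abnormal samples" without saying how, and removal is assumed here. The names `screen_until_clean` and `toy_detector`, and the `max_rounds` safeguard, are hypothetical. The detector stands in for the trained sample screening models.

```python
from typing import Callable, List, Set, TypeVar

T = TypeVar("T")


def screen_until_clean(
    samples: List[T],
    find_abnormal: Callable[[List[T]], Set[int]],
    max_rounds: int = 100,
) -> List[T]:
    """Repeat the screening pass until no abnormal sample is reported."""
    current = list(samples)
    for _ in range(max_rounds):
        abnormal = find_abnormal(current)  # indices into `current`
        if not abnormal:
            return current
        # Assumption: "updating" means dropping the abnormal samples.
        current = [s for i, s in enumerate(current) if i not in abnormal]
    raise RuntimeError("screening did not converge within max_rounds")


def toy_detector(values: List[float]) -> Set[int]:
    """Hypothetical detector: flag the largest value if it exceeds twice the mean of the rest."""
    if len(values) < 2:
        return set()
    idx = max(range(len(values)), key=values.__getitem__)
    rest = [v for i, v in enumerate(values) if i != idx]
    return {idx} if values[idx] > 2 * (sum(rest) / len(rest)) else set()


cleaned = screen_until_clean([1.0, 1.2, 0.9, 50.0, 8.0], toy_detector)
```

With the toy detector, the value 50.0 is removed in the first round and 8.0 in the second. The third round reports no abnormal sample, so the loop stops.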

8. The training sample acquisition method according to claim 7, wherein the determining a training set and a verification set according to the L groups of sample sets, and training an initial model corresponding to the target model according to the training set and the verification set to obtain N groups of sample screening models includes:

traversing each group of sample sets in the L groups of sample sets, taking the group of sample sets as the verification set, and taking the remaining sample sets in the L groups of sample sets as the training set;

training the initial model by using the training set to obtain a training result, and verifying the trained initial model by using the verification set to obtain a verification result;

determining a sample screening model corresponding to the group of sample sets according to the training result, the verification result and a preset constraint condition;

and determining the N groups of sample screening models according to the sample screening model corresponding to each group of sample sets in the L groups of sample sets.
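
The traversal in claim 8 resembles a leave-one-group-out split, which might be sketched as follows. The helper names `split_into_groups` and `leave_one_group_out` and the round-robin grouping rule are assumptions. The claims do not specify how the L groups are formed. Model training and the preset constraint condition are omitted.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_into_groups(samples: Sequence[T], num_groups: int) -> List[List[T]]:
    """Distribute samples round-robin into num_groups groups (an assumed grouping rule)."""
    if not 0 < num_groups <= len(samples):
        raise ValueError("num_groups must be between 1 and the number of samples")
    return [list(samples[i::num_groups]) for i in range(num_groups)]


def leave_one_group_out(groups: Sequence[List[T]]) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (training_set, verification_set) with each group used once for verification."""
    for i, verification in enumerate(groups):
        training = [s for j, group in enumerate(groups) if j != i for s in group]
        yield training, verification


groups = split_into_groups(list(range(6)), 3)
splits = list(leave_one_group_out(groups))
```

Each element of `splits` would then be used to train and verify one candidate sample screening model.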

9. The training sample acquisition method of claim 1, further comprising:

receiving a subjective evaluation score of the labeling data and a modification result of the labeling data, wherein the subjective evaluation score represents a subjective evaluation of the labeling data;

10. A training sample acquisition device, comprising:

the acquisition module is used for acquiring an initial sample set, and training on the labeled samples in the initial sample set to obtain a target model and an identification model, wherein the identification model is used to evaluate the target model;

the first determining module is used for inputting the unlabeled samples in the initial sample set into the target model to obtain labeling data;

the second determining module is used for determining the input parameters of the identification model according to the labeling data and the unlabeled samples, and inputting the input parameters into the identification model to obtain an evaluation score;

and the third determining module is used for determining, according to the evaluation score, whether to update the labeled samples of the initial sample set with the labeling data and the unlabeled samples.

11. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 9 when executing the computer program.

12. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 9.