CN110610197A - Method and device for mining difficult sample and training model and electronic equipment - Google Patents


Info

Publication number
CN110610197A
CN110610197A (application CN201910762856.5A; granted as CN110610197B)
Authority
CN
China
Prior art keywords
sample
samples
difficult
unlabeled
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910762856.5A
Other languages
Chinese (zh)
Other versions
CN110610197B (en)
Inventor
郝春雨
邵帅
俞刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Maigewei Technology Co Ltd
Original Assignee
Beijing Maigewei Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Maigewei Technology Co Ltd filed Critical Beijing Maigewei Technology Co Ltd
Priority to CN201910762856.5A
Publication of CN110610197A
Application granted
Publication of CN110610197B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning


Abstract

The invention provides a method, an apparatus, and electronic equipment for mining difficult samples and for training models. The method for mining difficult samples comprises the following steps: obtaining unlabeled sample data; obtaining a plurality of trained models; inputting each unlabeled sample in the unlabeled sample data into the plurality of trained models to obtain sample results; and screening difficult samples from the unlabeled samples according to the sample results. In this way, the unlabeled samples can be screened and difficult samples determined among them; compared with manual observation of the data, the screening saves a large amount of processing time. Moreover, most of the simple samples among the unlabeled samples can be removed by the screening, so the number of samples to be labeled is greatly reduced, the training speed and training effect of the model are greatly improved, and the labeling cost is reduced.

Description

Method and device for mining difficult sample and training model and electronic equipment
Technical Field
The invention relates to the technical field of deep learning, and in particular to a method and an apparatus for mining difficult samples and training models, and to electronic equipment.
Background
Deep learning, a branch of machine learning, requires large amounts of computing resources and data. On the computing side, hardware has progressed from CPUs to GPUs and TPUs. On the data side, in the big-data era data volumes are huge and data types diverse. In practical applications, most training data must be labeled. During iterative model training, it is found that a large portion of the samples in the training data are easy for the model to judge, and such easy samples are of little or no use for training the model; meanwhile, labeling this sample data is costly.
For this reason, some existing methods improve the training effect through rules applied during training, such as modifying the loss function, but such methods cannot reduce the amount of data to be labeled; other methods observe the data and selectively label it by hand, but this consumes a large amount of manual labor.
Therefore, a method and an apparatus for quickly determining a difficult sample from sample data are needed.
Disclosure of Invention
The problem solved by the invention is how to quickly determine difficult samples from sample data.
In order to solve the above problems, the present invention provides a method for mining a difficult sample, comprising:
obtaining sample data which is not marked;
obtaining a plurality of trained models;
inputting each unlabeled sample in the unlabeled sample data into a plurality of trained models to obtain a sample result;
and screening out difficult samples from the unlabeled samples according to the sample result.
Therefore, unlabeled samples can be screened, and difficult samples can be determined from the unlabeled samples; compared with manual data observation, the screening can save a large amount of processing time; and most simple samples in the unlabeled samples can be removed through screening, so that the number of samples to be labeled is greatly reduced, the training speed and the training effect of the model are greatly improved, and the labeling cost is reduced.
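As a minimal sketch (not the patented implementation itself), the four claimed steps can be expressed as a loop in which `models` are assumed to be callables returning an (output, confidence) pair and `screen` is a placeholder screening rule:

```python
def mine_difficult_samples(unlabeled, models, screen):
    """Run each unlabeled sample through every trained model and keep
    the samples the screening rule flags as difficult."""
    difficult = []
    for sample in unlabeled:
        # sample result: one (output, confidence) pair per trained model
        results = [model(sample) for model in models]
        if screen(results):
            difficult.append(sample)
    return difficult
```

The screening rule itself is developed in the optional steps that follow; here it is simply any predicate over the per-model results.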
Optionally, in the obtaining of the plurality of trained models, the number of the trained models is 3 to 5.
Optionally, in the obtaining of the plurality of trained models, the number of the trained models is 3.
In this way, the extra time and workload are minimized while the accuracy of judging the specific situation of the unlabeled data is preserved.
Optionally, each unlabeled sample in the unlabeled sample data is input into a plurality of trained models, so as to obtain a sample result, where the sample result includes an output result obtained after the unlabeled sample is input into the trained models and a confidence of the output result.
Optionally, the screening out the difficult samples from the unlabeled samples according to the sample result includes:
obtaining the sample result of the same unlabeled sample;
judging whether the number of identical output results among the output results of the unlabeled sample falls into a first number interval;
and if it falls into the first number interval, the unlabeled sample is a difficult sample.
In this way, whether an unlabeled sample is a difficult sample with training value can be judged through the first number interval, so that difficult samples can be screened out; on the one hand this saves a large amount of processing time, and on the other hand simple samples with low training value can be eliminated, greatly reducing the number of samples to be labeled, greatly improving the training speed and effect of the model, and reducing the labeling cost.
Optionally, the first number interval is (0.35n, 0.85n), and n is the number of the trained models.
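Assuming each model's output is hashable, the first-interval test on the count of identical outputs can be sketched as:

```python
from collections import Counter

def in_first_interval(outputs):
    """True when the size of the largest group of identical model outputs
    falls in the open interval (0.35n, 0.85n), n = number of models."""
    n = len(outputs)
    k = Counter(outputs).most_common(1)[0][1]  # largest agreement count
    return 0.35 * n < k < 0.85 * n
```

With n = 3 models the interval is (1.05, 2.55), so only k = 2 (two models agree, one disagrees) marks a difficult sample.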
Optionally, after obtaining the sample result of the same unlabeled sample, the method further includes:
judging whether the number of identical output results among the output results of the unlabeled sample falls into a second number interval;
if it falls into the second number interval, judging whether the confidences corresponding to the identical output results are all high confidences;
and if they are not all high confidences, the unlabeled sample is a difficult sample.
In this way, whether an unlabeled sample is a difficult sample with training value can be judged through the second number interval together with the confidences, so that difficult samples are screened out.
optionally, the second number interval is (0.85n, n), where n is the number of the trained models.
Optionally, the difficult samples include high-value samples and second-highest-value samples;
the step "if it falls into the first number interval, the unlabeled sample is a difficult sample" comprises the following steps:
if the output results fall into the first number interval, judging whether the confidences corresponding to the identical output results are all low confidences;
if they are all low confidences, the unlabeled sample is the second-highest-value sample;
and if they are not all low confidences, the unlabeled sample is the high-value sample.
Optionally, the step "if they are not all high confidences, the unlabeled sample is a difficult sample" comprises: if they are not all high confidences, the unlabeled sample is the second-highest-value sample.
Therefore, by subdividing the difficult samples, each sample in the difficult samples can be further distinguished according to the training value, and different samples can be selected for marking and training according to actual needs when the model is subsequently trained.
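A sketch of this subdivision, assuming a hypothetical numeric threshold separating "high" from "low" confidence (the patent does not fix a value):

```python
from collections import Counter

def classify_difficult(outputs, confidences, high_conf=0.5):
    """Subdivide a first-interval difficult sample by the confidences of
    its identical outputs: all low -> second-highest value, else high value.
    `high_conf` is an assumed threshold, not specified by the patent."""
    n = len(outputs)
    top, k = Counter(outputs).most_common(1)[0]
    if not 0.35 * n < k < 0.85 * n:
        return None  # not a first-interval difficult sample
    confs = [c for o, c in zip(outputs, confidences) if o == top]
    if all(c < high_conf for c in confs):
        return "second-highest-value"
    return "high-value"
```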
Optionally, the step of screening difficult samples from the unlabeled samples according to the sample results further comprises screening accurate samples from the unlabeled samples according to the sample results;
after the difficult samples are screened from the unlabelled samples according to the sample result, the method further comprises the following steps:
and generating the difficult sample according to the accurate sample and the current trained model.
Optionally, if the number falls into the second number interval, after judging whether the confidences corresponding to the identical output results are all high confidences, the method further comprises:
if they are all high confidences, the unlabeled sample is an accurate sample.
In this way, difficult samples can be generated from the accurate samples and the currently trained model, so that beyond the screened difficult samples, additional difficult samples can be generated from accurate samples. This further increases the number of difficult samples, which, once labeled, produces a better training effect and further improves the recognition or output accuracy of the trained model.
Optionally, the generating the difficult sample according to the accurate sample and the currently trained model includes:
performing data dithering processing on the accurate sample to generate dithering sample data;
acquiring a current trained model, and inputting the jitter sample data into the current trained model to obtain a jitter sample result;
acquiring a label value of the accurate sample, and judging whether the label value is the same as the jitter sample result or not;
if not, the jittered sample is a high-value sample among the difficult samples.
In this way, jittered samples can be generated by applying data jitter to the accurate samples, and difficult samples can be selected from them; beyond the screened difficult samples, this further increases the number of difficult samples, which, once labeled, produces a better training effect and further improves the recognition or output accuracy of the trained model.
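The generation steps above can be sketched as follows; `predict` (standing in for the currently trained model) and the `jitters` transforms are hypothetical callables, and a jittered copy is kept as a high-value difficult sample when the model's result no longer matches the accurate sample's label:

```python
def mine_from_accurate(samples, labels, predict, jitters):
    """Apply each jitter to each accurate sample; jittered copies the
    current model misjudges are collected as high-value difficult samples."""
    mined = []
    for sample, label in zip(samples, labels):
        for jitter in jitters:
            jittered = jitter(sample)
            if predict(jittered) != label:  # model fails on the jittered copy
                mined.append(jittered)
    return mined
```

With toy numeric "samples" and a parity classifier, `mine_from_accurate([2], ["even"], pred, [lambda x: x + 1])` keeps the jittered copy `3` because the model's answer flips.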
Optionally, the generating the difficult sample according to the accurate sample and the currently trained model further includes:
judging the common points of the difficult samples according to the accurate samples and the difficult samples generated by the accurate samples;
and adjusting the data dithering processing mode of the accurate samples based on the common point of the difficult samples.
Optionally, the data dithering processing mode includes at least one of picture angle rotation, picture size conversion, and picture brightness conversion.
Therefore, the common points of the screened difficult samples can be determined, the common points correspond to the defects of the currently trained model, and the currently trained model can be trained through the screened difficult samples, so that the corresponding defects can be eliminated.
Secondly, a model training method is provided, which comprises the following steps:
determining a difficult sample according to the difficult sample mining method;
labeling the difficult samples;
and training the current trained model according to the labeled difficult sample.
Therefore, unlabeled samples can be screened, and the model is trained after the determined difficult samples are labeled; compared with manual observation of the data, the screening saves a large amount of processing time.
Optionally, before labeling the difficult sample, the method further includes:
and acquiring the sample demand of model training and the number of high-value samples and second high-value samples in the difficult samples, and determining the difficult samples needing to be labeled.
Therefore, the difficult samples to be marked can be determined according to the actual sample demand of model training, the difficult samples to be marked can be determined in time according to actual requirements, and then model training is carried out after marking.
Optionally, the obtaining of the required sample amount of the model training and the number of the high-value samples and the second high-value samples in the difficult samples to determine the difficult samples to be labeled includes:
acquiring the sample demand of model training and the number of high-value samples and second high-value samples in the difficult samples;
comparing the sample demand with the number of high-value samples;
and if the sample demand is smaller than the number of the high-value samples, randomly selecting the high-value samples with the same number as the sample demand from the high-value samples as the difficult samples needing to be marked.
Optionally, the obtaining of the required sample amount of the model training and the number of the high-value samples and the second high-value samples in the difficult samples to determine the difficult samples to be labeled further includes:
if the sample demand is greater than the number of high-value samples, calculating the difference between the sample demand and the number of high-value samples;
and randomly selecting a number of second-highest-value samples equal to that difference, then taking the high-value samples together with the selected second-highest-value samples as the difficult samples to be labeled.
Therefore, the required unlabeled samples can be rapidly selected, and the labeling and model training can be conveniently and rapidly carried out. And the selected unmarked sample has high training value, thereby further improving the effect of model training.
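A sketch of the demand-driven selection; the fixed seed is an illustrative choice for reproducibility, not part of the method:

```python
import random

def select_for_labeling(high_value, second_value, demand, seed=0):
    """Prefer high-value samples; top up with second-highest-value samples
    only when the demand exceeds the high-value count."""
    rng = random.Random(seed)
    if demand <= len(high_value):
        return rng.sample(high_value, demand)
    shortfall = demand - len(high_value)
    return list(high_value) + rng.sample(second_value,
                                         min(shortfall, len(second_value)))
```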
There is further provided a difficult sample mining device, comprising:
the acquisition unit is used for acquiring sample data which is not marked;
and further for obtaining a plurality of trained models;
the model unit is used for inputting each unlabeled sample in the unlabeled sample data into a plurality of trained models to obtain a sample result;
and the screening unit is used for screening the difficult samples from the unlabeled samples according to the sample result.
In this way, most of the simple samples among the unlabeled samples can be removed by the screening, so the number of samples to be labeled is greatly reduced, the training speed and training effect of the model are greatly improved, and the labeling cost is reduced.
There is provided a model training apparatus, comprising:
the difficult sample mining device described above, used for determining the difficult samples;
the labeling unit is used for labeling the difficult samples;
and the training unit is used for training the currently trained model according to the labeled difficult samples.
In this way, unlabeled samples can be screened, and the model can be trained after the determined difficult samples are labeled.
There is still further provided an electronic device comprising a processor and a memory, wherein the memory stores a control program, and the control program, when executed by the processor, implements the above-mentioned difficult sample mining method or implements the above-mentioned model training method.
Finally, a computer readable storage medium is provided, storing instructions, wherein the instructions, when loaded and executed by a processor, implement the above-described difficult sample mining method, or the above-described model training method.
Drawings
FIG. 1 is a first flowchart of a method for mining a difficult sample according to an embodiment of the present invention;
FIG. 2A is an exemplary diagram of an unlabeled sample 1 according to an embodiment of the disclosure;
FIG. 2B is an exemplary diagram of an unlabeled sample 2 according to an embodiment of the disclosure;
FIG. 2C is an exemplary diagram of an unlabeled sample 3 according to an embodiment of the disclosure;
FIG. 2D is an exemplary diagram of an unlabeled sample 4 according to an embodiment of the disclosure;
FIG. 2E is an exemplary diagram of an unlabeled sample 5 according to an embodiment of the disclosure;
FIG. 2F is an exemplary diagram of an unlabeled sample 6 according to an embodiment of the disclosure;
FIG. 2G is an exemplary diagram of an unlabeled sample 7 according to an embodiment of the disclosure;
FIG. 2H is an exemplary diagram of an unlabeled sample 8 according to an embodiment of the disclosure;
FIG. 2I is an exemplary diagram of an unlabeled exemplar 9 according to an embodiment of the present invention;
FIG. 2J is an exemplary diagram of an unlabeled example 10 according to an embodiment of the disclosure;
FIG. 3 is a sample result of unlabeled sample data according to an embodiment of the present invention;
FIG. 4 is a first flowchart of a difficult sample mining method step 40 according to an embodiment of the present invention;
FIG. 5 is a flowchart of a second step 40 of the difficult sample mining method according to an embodiment of the present invention;
FIG. 6 is a flowchart of a difficult sample mining method step 43 according to an embodiment of the present invention;
FIG. 7 is a flowchart of a difficult sample mining method according to an embodiment of the present invention;
FIG. 8 is a flowchart of step 40 of a difficult sample mining method according to an embodiment of the present invention;
FIG. 9 is a first flowchart of a difficult sample mining method step 50 according to an embodiment of the present invention;
FIG. 10 is an exemplary diagram of an accurate sample according to an embodiment of the invention;
FIG. 11A is an exemplary diagram generated for an accurate sample 0.2 zoom ratio according to an embodiment of the present invention;
FIG. 11B is an exemplary diagram generated for accurate sample 1.2 scaling ratio according to an embodiment of the invention;
FIG. 12A is an exemplary diagram generated for accurate sample clockwise rotation according to an embodiment of the invention;
FIG. 12B is an exemplary diagram generated for accurate sample counterclockwise rotation according to an embodiment of the present invention;
FIG. 13A is an exemplary diagram generated for accurate sample dimming according to an embodiment of the present invention;
FIG. 13B is an exemplary diagram generated for accurate sample brightening according to an embodiment of the present invention;
FIG. 14 is a flowchart II of a difficult sample mining method step 50 according to an embodiment of the present invention;
FIG. 15 is a first flowchart of a model training method according to an embodiment of the present invention;
FIG. 16 is a second flowchart of a model training method according to an embodiment of the present invention;
FIG. 17 is a first flowchart of a first step 60 of a model training method according to an embodiment of the present invention;
FIG. 18 is a flowchart II of a step 60 of a model training method according to an embodiment of the present invention;
FIG. 19 is a block diagram of a difficult sample mining device according to an embodiment of the present invention;
FIG. 20 is a block diagram of a model training apparatus according to an embodiment of the present invention;
fig. 21 is a block diagram of an electronic device according to an embodiment of the present invention;
FIG. 22 is a block diagram of another electronic device according to an embodiment of the invention.
Description of reference numerals:
1-acquisition unit, 2-model unit, 3-screening unit, 4-generation unit, 5-quantity determination unit, 6-labeling unit, 7-training unit, 12-electronic device, 14-external device, 16-processing unit, 18-bus, 20-network adapter, 22-input/output (I/O) interface, 24-display, 28-system memory, 30-random access memory, 32-cache memory, 34-storage system, 40-utility, 42-program module.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For easy understanding, in the present invention, technical problems therein need to be elaborated.
Deep learning, a branch of machine learning, requires large amounts of computing resources and data. On the computing side, hardware has progressed from CPUs to GPUs and TPUs. On the data side, in the big-data era data volumes are huge and data types diverse.
In specific application, deep learning mainly includes continuously iterating a model; that is, firstly, a large amount of sample data is obtained, and then the contents which need to be determined in the sample data are marked; identifying by using the labeled sample data through a model, and if the identified content is the same as the label, identifying correctly; and modifying the model after analyzing the sample data with wrong identification results, namely iterating, and repeating the steps until the accuracy of the model for determining or identifying the content in the sample data reaches a certain degree.
It is clear that the large amount of sample data used to train a model must all be labeled, and this labeling accounts for a large proportion of the model-training process. In practice, however, because of the channels through which it is acquired, much of the sample data used for training is easy for the model to identify accurately, and such data contributes little or nothing to training. For existing model training, the presence of sample data of this kind therefore reduces the training effect while increasing the labeling workload; this is the problem to be solved.
Some existing solutions stick to the original line of attack and modify the loss function to improve the training effect; on the one hand this does not address the core of the problem and remains a conventional fix, and on the other hand, because of diminishing marginal returns, the actual improvement gained by modifying the loss function keeps shrinking.
Other methods selectively label data after observing it by hand; although they do address the core of the problem, the sample data used for model training commonly exceeds a hundred thousand items, and observing and selecting data at that scale manually still consumes an enormous amount of labor.
In addition, for the sake of understanding, we here illustrate the technical principles in the solution:
In this application, model training can generally be divided into training in stages and training with different iteration counts. For example: a model is trained in March of a year; in April it is improved and its model structure changes (the model structure being the order and number of the operations used to build the model, such as convolutional layers, pooling layers, and fully connected layers), so the March model and the April model are models at different stages, and the April model is kept as the optimal model. When training a model, different training rounds are generally set (the number of rounds can be set as a hyper-parameter, i.e., decided before training starts), for example 30, 40, and 50 rounds; three models are thus obtained, with iteration counts of 30, 40, and 50. Alternatively, training can continue from the initial data and model, retaining a copy of the model whenever its iteration count reaches 30, 40, or 50 (a copy is kept at 30 iterations, training continues to 40 and then 50, and a copy is kept each time). After training, the model with the best judgment or recognition effect is taken as the final model of that run. The models with 30, 40, and 50 iterations are models with different iteration counts; because their parameters differ with the number of rounds, they produce somewhat different outputs for the same input. After training, the model with the best recognition effect is retained; this best model is the currently trained model.
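The "retain the model at each target round" scheme described above can be sketched as a training loop that snapshots the model at the chosen iteration counts; `train_one_round` is a hypothetical single-round training function:

```python
import copy

def train_with_snapshots(model, train_one_round, rounds=(30, 40, 50)):
    """Train up to max(rounds) and keep a deep copy of the model at each
    target round; the snapshots later act as the plural trained models."""
    snapshots = {}
    for epoch in range(1, max(rounds) + 1):
        model = train_one_round(model)
        if epoch in rounds:
            snapshots[epoch] = copy.deepcopy(model)
    return snapshots
```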
The disclosed embodiment provides a difficult sample mining method, which can be executed by a difficult sample mining device, and the difficult sample mining device can be integrated in electronic equipment such as a mobile phone, a computer, a server and the like. FIG. 1 is a flowchart illustrating a first method for mining a difficult sample according to an embodiment of the present invention; the difficult sample mining method comprises the following steps:
step 10, obtaining sample data which is not marked;
the obtaining mode of the unlabeled sample data can be obtained from the existing database such as a monitoring and recording database or other modes, and the specific obtaining mode is subject to actual conditions. If the image recognition is performed with deep learning, the unlabeled sample data is generally an image, and as shown in fig. 2, the unlabeled sample data is the unlabeled sample data in the image recognition deep learning.
Step 20, obtaining a plurality of trained models;
In this step, if only one or two trained models were obtained, it would be difficult in the subsequent steps to determine from their outputs whether an unlabeled sample is easy or hard to judge. For example, with three trained models, if two of them produce the same output, that judgment is generally likely to be accurate; with only two trained models whose outputs differ, it is hard to tell which of them is accurate.
The plurality of trained models share the same training purpose; for example, all of them are trained to recognize license plates. If entirely different models are all used for license-plate recognition, model diversity increases, which is beneficial: when the models' recognition results for the same license plate/picture are not identical, that picture is valuable and, after labeling, can be used to further train the best model.
The obtained trained models may be models at different stages, models with different iteration counts, or a mixture of both; however, models at different stages or models with different iteration counts are preferred, so that the obtained trained models are highly similar and the actual situation of an unlabeled sample is easier to judge through them.
The obtained trained models may or may not include the optimal model; when they do not, they are models from other stages or with other iteration counts than the optimal model.
Optionally, the plurality of trained models includes an optimal model and a trained model with a different number of iterations from the optimal model; therefore, the trained models are different in training turns, the similarity is high, and non-difficult samples can be judged more easily.
Optionally, the number of trained models is 3 to 5, because as the number of trained models grows, the workload and time spent processing the unlabeled sample data through them also grow quickly; a number in this range can judge the specific situation of the unlabeled data with high accuracy while keeping the time and workload low.
Optionally, the number of the acquired trained models is 3, so that the additional time and workload are minimum on the basis of ensuring the accuracy of judging the specific condition of the unlabeled data.
Step 30, inputting each unlabeled sample in the unlabeled sample data into a plurality of trained models to obtain a sample result;
in this step, the sample result at least includes an output result of the unlabeled sample after being input into the trained model and a confidence of the output result.
We take three trained models as an example.
As shown in fig. 2, fig. 2A, fig. 2B, fig. 2C, fig. 2D, fig. 2E, fig. 2F, fig. 2G, fig. 2H, fig. 2I, and fig. 2J are 10 unlabeled picture samples, respectively; for ease of identification, the 10 pictures may be numbered sequentially as samples 1-10.
The sample results of the unlabeled sample data are shown in fig. 3. From left to right, the first column is the sample number of the unlabeled sample data (the sample number is only used to conveniently distinguish the samples in this application); the second column is the figure number and corresponds to the picture in the figure; the third and fourth columns are the output result after the unlabeled sample is input into trained model 1 and the confidence of that output result; the fifth and sixth columns are the output result after the unlabeled sample is input into trained model 2 and the confidence of that output result; and the seventh and eighth columns are the output result after the unlabeled sample is input into trained model 3 and the confidence of that output result.
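As a minimal sketch of step 30, assuming each trained model can be called as a function returning an (output result, confidence) pair; the model interface and the toy stand-in models below are assumptions for illustration, not part of this application:

```python
def collect_sample_results(unlabeled_samples, trained_models):
    """Return, per sample, the 'sample result' of step 30: a list of
    (output result, confidence) pairs, one pair per trained model."""
    results = {}
    for sample_id, sample in unlabeled_samples.items():
        results[sample_id] = [model(sample) for model in trained_models]
    return results

# Toy stand-ins for three trained models (real models would be networks).
models = [
    lambda s: ("wan CEL", 0.91),
    lambda s: ("wan CEL", 0.87),
    lambda s: ("xiang CEL", 0.62),
]
table = collect_sample_results({1: "picture of sample 1"}, models)
# table[1] now holds three (output result, confidence) pairs, matching
# the third to eighth columns of one row of fig. 3
```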
And step 40, screening out difficult samples from the unlabeled samples according to the sample results.
According to the sample results obtained through the trained models, whether the corresponding unlabeled sample is a difficult sample or not can be judged according to the characteristics shown by the sample results; therefore, unlabeled samples can be screened, and difficult samples can be screened from the unlabeled samples.
Through the steps 10-40, the unlabeled samples can be screened, and difficult samples are determined from the unlabeled samples; compared with manual data observation, the screening can save a large amount of processing time; and most simple samples (samples easy to identify) in the unlabeled samples can be removed through screening, so that the number of samples needing to be labeled is greatly reduced, the training speed and the training effect of the model are greatly improved, and the labeling cost is reduced.
In addition, the method can be used for screening difficult samples through a plurality of trained models, so that the screening accuracy is high.
FIG. 4 is a first flowchart of step 40 of a difficult sample mining method according to an embodiment of the present invention; wherein the step 40 comprises:
step 41, obtaining the sample result of the same unlabeled sample;
in the step 30, the unlabeled sample is input into the trained model to obtain a sample result, and the sample result is shown in fig. 3; in the step of obtaining the sample result, based on the above steps, the sample result in the step 30 may be directly read. Referring to fig. 3, if the sample result of sample 1 needs to be obtained, the sample result in step 30 is directly read (as shown in fig. 3).
The sample results may also be obtained through other channels, for example, as described in step 30, by directly obtaining the sample results output by models 1, 2, and 3.
Step 42, judging whether the number of the same output results in the output results of the unlabeled samples falls into a first number interval or not;
in this step, the output result is the part of the sample result obtained after the unlabeled sample is input into a trained model, such as wan CEL and xiang CEL for sample 1 in fig. 3; that is, the output results corresponding to the same unlabeled sample across the different trained models.
The number of output results of an unlabeled sample corresponds to the number of trained models: if the number of trained models is n, there are n output results. "The number of identical output results" means that some of the n output results are the same, and the size of that group is the number of identical output results. Taking sample 1 in fig. 3 as an example, two of the output results are wan CEL and one output result is xiang CEL, so the number of identical output results in the output results of sample 1 is two.
It should be clear that, among the n output results, some (assumed to be m) output results may be the same, and among the remaining output results, some (assumed to be k) output results may be the same; then the largest number of the same output results, i.e. the larger of m and k, is taken as the criterion.
For example, if an unlabeled sample has 5 output results and all of them are wan CEL1, the number of identical output results in its output results is 5; if instead three of the 5 output results are wan CEL1 and one is wan CELL, the number of identical output results is 3.
Optionally, the first number interval is (0.35n, 0.85n), and n is the number of the trained models.
When n is 3, the first number interval is (1.05, 2.55); that is, when the number of the same output results in the output results of the unlabeled sample is 2, the output results fall into the first number interval.
When n is 4, the first number interval is (1.4, 3.4); that is, when the number of the same output results in the output results of the unlabeled sample is 2 or 3, the output results fall into the first number interval.
When n is 5, the first number interval is (1.75, 4.25); that is, when the number of the same output results in the output results of the unlabeled sample is 2, 3, or 4, the unlabeled sample falls into the first number interval.
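The counting rule of step 42 and the first number interval can be sketched as follows (the function names are illustrative; the interval (0.35n, 0.85n) is taken from the text):

```python
from collections import Counter

def max_identical(outputs):
    """Size of the largest group of equal output results, i.e. the larger
    of m and k in the text."""
    return max(Counter(outputs).values())

def in_first_interval(outputs):
    """Step 42: does the count fall into the open interval (0.35n, 0.85n),
    n being the number of trained models?"""
    n = len(outputs)
    count = max_identical(outputs)
    return 0.35 * n < count < 0.85 * n

# n = 3: the first interval is (1.05, 2.55), so only a count of 2 falls in.
print(in_first_interval(["wan CEL", "wan CEL", "xiang CEL"]))  # True
print(in_first_interval(["wan CEL", "wan CEL", "wan CEL"]))    # False
```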
Step 43, if the number falls into the first number interval, the unlabeled sample is a difficult sample.
The number falling into the first number interval means that, among all the trained models, most produce the same output result and only a small part differ. In this case, the output results of the majority of trained models can generally be considered fairly accurate, while those of the small minority remain inaccurate; that is, for the trained models, most of the features in the unlabeled sample can be identified, a small part cannot yet be determined, and there is room for improvement. This means that identifying or accurately outputting such an unlabeled sample is difficult for the trained models, while at the same time the sample has the potential to be identified. That is, the unlabeled sample is a difficult sample with a high training value.
It should be noted that an unlabeled sample that cannot be identified is not automatically a difficult sample. A difficult sample has training value for the trained model: it can in principle be recognized or accurately output, just not yet. If a sample is simply unrecognizable (its data quality is poor, or its data contains gross errors such that it cannot be identified even by a real person), its training value is extremely small or even negative; using it to train the model will not improve, and may even reduce, the model's recognition accuracy.
Thus, through steps 41-43, whether the unlabeled sample is a difficult sample with a training value can be judged through the first quantity interval, so that the difficult sample is screened out; therefore, on one hand, a large amount of processing time can be saved, and on the other hand, simple samples with low training value can be eliminated, so that the number of samples needing to be labeled is greatly reduced, the training speed and the training effect of the model are greatly improved, and the labeling cost is reduced.
FIG. 5 is a flowchart II of a difficult sample mining method step 40 according to an embodiment of the present invention; in step 40, after step 41, the method further includes:
step 44, judging whether the number of the same output results in the output results of the unlabeled samples falls into a second number interval;
optionally, the second number interval is (0.85n, n], where n is the number of the trained models; the right endpoint is closed so that the case in which all n output results are identical falls into the interval.
When n is 3, the second number interval is (2.55, 3]; that is, when the number of identical output results in the output results of the unlabeled sample is 3, it falls into the second number interval.
When n is 4, the second number interval is (3.4, 4]; that is, when the number of identical output results is 4, it falls into the second number interval.
When n is 5, the second number interval is (4.25, 5]; that is, when the number of identical output results is 5, it falls into the second number interval.
Step 45, if the output result falls into the second number interval, judging whether the confidence degrees corresponding to the same output result are all high confidence degrees;
statistically, a confidence interval of a probability sample is an interval estimate of some population parameter of the sample; the confidence interval represents the extent to which the true value of the parameter has a certain probability of falling around the measurement, and thus indicates the degree of plausibility of the measured value. The "certain probability" required in advance is called the confidence level, also called the confidence.
In the present application, the confidence degree reflects the credibility of the corresponding trained model for identifying or outputting the unlabeled sample, and can be divided into: high confidence and low confidence, wherein the high confidence reflects that the corresponding trained model is very reliable for identifying or outputting the unlabeled sample; the low confidence reflects that the recognition or output of the unlabeled sample by the corresponding trained model is not very reliable, and is only a simple classification according to the features, and the trained model is not very definite to the classification.
High confidence and low confidence are very important data for evaluating the trained model; if the recognition or accurate output result is correct and has high confidence level, the corresponding trained model can be considered to be recognized or output accurately; if the result of the recognition or accurate output is correct and has low confidence, then the corresponding trained model can be considered to find the correct feature in the recognition or accurate output, but mixed with other interference features, and further improvement is needed.
Optionally, the confidence of the high confidence is greater than or equal to 0.5. Here, it should be clear that the confidence is taken to be in the range of [0,1], and the confidence of 0.5 or more is classified as high confidence.
Optionally, the confidence of the low confidence is less than 0.5, that is, the confidence of less than 0.5 is classified as the low confidence.
This way of dividing the confidence makes it possible to indirectly evaluate the recognition results or accurate output results of the trained model, thereby facilitating the evaluation and further improvement of the accuracy of the trained model.
However, the above-mentioned dividing method is imperfect for partial confidence, for example: a confidence score of 0.49999 is classified as low confidence and a confidence score of 0.5 is classified as high confidence, with essentially little or no difference between the two, and for a trained model, this classification of high and low confidence can lead to an erroneous evaluation of the trained model.
To solve this problem, a further improved division is proposed: the confidence of a high confidence is greater than or equal to 0.5, and the confidence of a low confidence is less than or equal to 0.3. In this way, a value interval is set between high confidence and low confidence, which avoids dividing confidences with very small differences into different classes, and the trained model can be evaluated more accurately on this basis. It should be noted that in this division a value interval is set and data in that interval is discarded; for an unlabeled sample in the present application, only results whose confidence is classified as high or low participate when confidence is introduced to evaluate whether the sample is a difficult sample, a high-value sample, or a next-highest-value sample.
That is, for the confidence division with a value interval, when the sample results are obtained in step 30, the unlabeled samples whose confidence is interval data may be directly excluded. Taking the unlabeled samples in fig. 3 as an example: when the sample results are obtained, the confidence corresponding to the output result of trained model 2 for unlabeled sample 5 is 0.44349215893406857, the confidence corresponding to the output result of trained model 2 for unlabeled sample 9 is 0.47514144209287335, and the confidence corresponding to the output result of trained model 3 for unlabeled sample 10 is 0.49871608639879994; these all belong to the interval data, so unlabeled samples 5, 9, and 10 are directly excluded, and only unlabeled samples 1-4 and 6-8 are retained for further analysis. Alternatively, the unlabeled samples whose confidence is interval data are not excluded in step 30, and whether they belong to the difficult samples is determined in step 40 in combination with the confidence; which exclusion method to use may be chosen according to actual needs.
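The confidence division with a value interval can be sketched as follows (the thresholds 0.5 and 0.3 follow the text; the function names are illustrative):

```python
def confidence_bucket(conf):
    """Divide a confidence into 'high' (>= 0.5) or 'low' (<= 0.3); values
    in the gap (0.3, 0.5) are interval data and are discarded."""
    if conf >= 0.5:
        return "high"
    if conf <= 0.3:
        return "low"
    return None  # interval data

def has_interval_data(sample_result):
    """True if any (output result, confidence) pair of the sample falls in
    the gap, in which case the whole unlabeled sample may be excluded in
    step 30 (as samples 5, 9 and 10 of fig. 3 are)."""
    return any(confidence_bucket(c) is None for _, c in sample_result)

print(confidence_bucket(0.44349215893406857))  # None -> sample excluded
```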
Step 46, if the confidences are not all high confidence, the unlabeled sample is a difficult sample.
When the number of identical output results falls into the second number interval but the confidences are not all high (some, or even all, of them are low), the outputs of almost all the trained models are the same (or only a single trained model differs), yet the accuracy of the output results cannot be guaranteed (low confidence). In this case, the output results of the agreeing trained models can generally be considered fairly accurate, but a small part of the trained models are not certain about their output results; that is, for the trained models, most of the features used for classification in the unlabeled sample can be identified, a small part cannot yet be determined, and there is room for improvement. This means that identifying or accurately outputting such an unlabeled sample is difficult for the trained models, while the sample still has the potential to be identified; however, these difficulties bring no improvement (or only a smaller improvement) in the accuracy of the trained model, and mainly increase the confidence of the trained model. That is, the unlabeled sample is a difficult sample with a certain training value.
Thus, through steps 44-46, whether the unlabeled sample is a difficult sample with a training value can be judged through the second quantity interval and the confidence coefficient, so that the difficult sample is screened out; therefore, on one hand, a large amount of processing time can be saved, and on the other hand, simple samples with low training value can be eliminated, so that the number of samples needing to be labeled is greatly reduced, the training speed and the training effect of the model are greatly improved, and the labeling cost is reduced.
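Steps 44-46 can be sketched as follows; the sample result is assumed to be a list of (output result, confidence) pairs, one per trained model, and since the text's examples treat a count equal to n as falling into the second number interval, the test used is 0.85n < count <= n:

```python
from collections import Counter

def is_difficult_by_second_interval(sample_result, high=0.5):
    """Sketch of steps 44-46.  The sample is a difficult sample when the
    count of identical outputs falls into the second number interval and
    the identical outputs' confidences are not all high (>= 0.5, the
    simpler high-confidence division from the text)."""
    n = len(sample_result)
    top, count = Counter(o for o, _ in sample_result).most_common(1)[0]
    if not 0.85 * n < count <= n:
        return False  # step 44: not in the second number interval
    confs = [c for o, c in sample_result if o == top]
    return not all(c >= high for c in confs)  # steps 45-46

# All three models agree, but one confidence is low -> difficult sample.
print(is_difficult_by_second_interval(
    [("wan CEL", 0.9), ("wan CEL", 0.8), ("wan CEL", 0.2)]))  # True
```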
FIG. 6 is a flowchart of the difficult sample mining method step 43 according to an embodiment of the present invention; wherein the step 43 comprises:
step 431, if the output result falls into the first quantity interval, judging whether the confidence degrees corresponding to the same output result are all low confidence degrees;
Falling within the first number interval means that the unlabeled sample has a high training value and is a difficult sample. At the same time, the trained model outputs a confidence along with its recognition or accurate output result, and different confidences carry different meanings; therefore, after the number falls into the first number interval, the corresponding confidences are judged so as to make a more detailed judgment on the unlabeled sample.
Step 432, if the confidences are all low confidence, the unlabeled sample is a next-highest-value sample;
Step 433, if the confidences are not all low confidence, the unlabeled sample is a high-value sample.
The difficult samples thus include high-value samples and next-highest-value samples.
In order to facilitate understanding of the situation corresponding to the judgment of the output results of the unlabeled samples passing through different trained models, the case when n is 3 is taken as an example:
if the three output results are different, the data quality of the unlabeled sample is poor, and the unlabeled sample cannot be identified; or the quality of the data of the unlabeled sample is relatively large, and the unlabeled sample cannot be identified even through real person identification, and the like, in this case, the training value of the unlabeled sample is extremely small, even negative.
If two output results are the same and one output result is different among the three output results, most of the output results of the trained models can be considered to be more accurate, but a small part of the output results of the trained models are still inaccurate, which means that most of the features in the unlabeled sample can be identified for the trained models, and the small part of the features cannot be determined, so that a space for improvement exists; this means that the recognition or accurate output of such unlabeled samples is difficult for the trained model and at the same time has the potential for recognition. That is, the unlabeled sample has a high training value and is a difficult sample; in the case of excluding unlabeled samples with confidence level of interval data, as shown in fig. 3, unlabeled samples 1-4, 6-8 are difficult samples.
On the basis that two of the three output results are the same, the confidences of the two identical output results can be further judged. If both confidences are high, the two identical output results are very credible (that is, with high probability the output result is an accurate identification), the unlabeled sample is identifiable, and the remaining output result differs; this means the identification of the unlabeled sample has obvious difficulty, i.e., the sample has a high training value and is a high-value sample. If one of the two confidences is high and the other is low, the two identical output results are still fairly credible (though less so than in the previous case), the unlabeled sample is identifiable, and the remaining output result differs; again, the identification has obvious difficulty, the sample has a very high training value, and it is a high-value sample. As shown in fig. 3, unlabeled samples 1-4 and 6-7 are high-value samples. If both confidences are low, the two identical output results have low credibility (the trained models may well have identified the sample through incorrect features), so the unlabeled sample has a certain possibility of being unidentifiable, while the remaining output result differs; the identification still has obvious difficulty, but the possibility of being unidentifiable reduces the training value, and the sample is a next-highest-value sample. As shown in fig. 3, both confidences of unlabeled sample 8 are low, so it is a next-highest-value sample.
If the three output results are the same, the output results of all the trained models can generally be considered accurate; on this basis, the confidences of the three identical output results can be further judged. If all three confidences are high, the identification results of the three trained models for the unlabeled sample are highly consistent, i.e., the sample is easy to identify, and such a simply identifiable unlabeled sample has no training value for the trained models. If the confidences are partly high and partly low, or all low, i.e., a small part (or all) of the trained models are not certain about the accuracy of their output results, then for the trained models most of the features used for classification in the unlabeled sample can be identified, a small part cannot yet be determined, and there is room for improvement; identifying or accurately outputting such an unlabeled sample is difficult for the trained models, but these difficulties bring no improvement (or only a smaller improvement) in accuracy and mainly increase the confidence of the trained model. That is, the unlabeled sample has a certain training value and is a next-highest-value sample.
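The n = 3 case discussed above can be summarized in a small decision function (a sketch under the simpler 0.5 high-confidence division; the category names such as "worthless" are illustrative labels for the cases in the text):

```python
from collections import Counter

HIGH = 0.5  # high-confidence threshold (simpler division from the text)

def classify_n3(sample_result):
    """Sketch of the n = 3 case.  sample_result: three (output result,
    confidence) pairs, one per trained model."""
    counts = Counter(o for o, _ in sample_result)
    top, count = counts.most_common(1)[0]
    confs = [c for o, c in sample_result if o == top]
    if count == 1:      # three different outputs: little or negative value
        return "worthless"
    if count == 2:      # two same, one different: a difficult sample
        if all(c < HIGH for c in confs):
            return "next-highest-value"   # both identical results low
        return "high-value"               # at least one high confidence
    # count == 3: all outputs identical
    if all(c >= HIGH for c in confs):
        return "accurate"                 # simple / accurate sample
    return "next-highest-value"

print(classify_n3([("wan CEL", 0.9), ("wan CEL", 0.8), ("xiang CEL", 0.6)]))
# high-value
```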
Optionally, after the difficult samples are further subdivided into high-value samples and next-highest-value samples, in step 46, if the confidences are not all high confidence, the unlabeled sample is a next-highest-value sample.
Therefore, by subdividing the difficult samples, each sample in the difficult samples can be further distinguished according to the training value, and different samples can be selected for marking and training according to actual needs when the model is subsequently trained.
FIG. 7 is a flowchart of a difficult sample mining method according to an embodiment of the present invention; in the step 40, an accurate sample is screened from the unlabeled samples according to the sample result.
Optionally, after the step 40, the method further includes:
and 50, generating the difficult sample according to the accurate sample and the current trained model.
For the difficult samples identified in steps 10-40, the selected difficult samples have randomness, that is, the difficult samples are determined on the basis of the obtained unlabeled samples, and how to obtain the unlabeled samples has a great influence on the obtained difficult samples.
Although the unlabeled samples are theoretically obtained randomly, in actual operation the acquisition method (for example, capturing video of a traffic intersection over a period of time) substantially influences the result; this influence sometimes causes the number of difficult samples screened out to be very small, so that the ideal training effect is not achieved.
In the step 40, the accurate sample refers to a sample that has been correctly labeled. In step 50, the currently trained model is the best existing model, which is the most desirable model retained after training the model, as described above.
In this way, difficult samples can be generated from the accurate samples and the current trained model; on the basis of screening out difficult samples, generating further difficult samples from the accurate samples increases their number, so that after labeling they achieve a better training effect on the model and further improve the accuracy of the trained model's identification or accurate output.
FIG. 8 is a flowchart of step 40 of a difficult sample mining method according to an embodiment of the present invention; wherein, the step 40 further includes, after the step 45:
and step 47, if the confidence level is high, the unmarked sample is an accurate sample.
In this step, if the number of identical output results in the output results of the unlabeled sample falls within the second number interval and the confidences corresponding to the identical output results are all high confidence, the unlabeled sample is an accurate sample. This means that, among all the trained models, the outputs of almost all are the same (or only individual trained models differ) and the accuracy of those output results can be ensured; that is, for this unlabeled sample, the output results of the trained models (excluding those of the differing individual models) can be regarded as accurate and can be taken as an accurate label of the unlabeled sample. For an accurate sample, the output of the trained models is thus the label of the sample.
Therefore, the accurate sample is determined through the output result and the confidence coefficient of the trained model, the accuracy is high, and the recognition efficiency is high.
In addition, the accurate samples can be screened out in other ways, for example, manual work is introduced to label the unlabeled samples, and the labeling can be regarded as accurate labeling; and considering the marked sample as an accurate sample.
FIG. 9 is a flowchart one of the difficult sample mining method steps 50 according to an embodiment of the present invention; wherein the step 50 comprises:
step 51, performing data dithering processing on the accurate sample to generate dithering sample data;
in this step, since the obtained accurate sample is generally a picture, data dithering is generally performed by transforming the picture in different ways: rotating the angle of the picture, changing the size of the picture, changing the aspect ratio of the picture, changing the brightness of the picture, etc., or a mixture of these.
The data dithering processing mode comprises at least one of picture angle rotation, picture size conversion and picture brightness conversion.
The data dithering processing method will be specifically described below with reference to specific examples.
As shown in fig. 10, it is an accurate sample, i.e., the original image in this embodiment. The picture size transformation may scale the picture: fig. 11A shows the accurate sample at a 0.2 scaling ratio, which is generated jitter sample data; fig. 11B shows the accurate sample at a 1.2 scaling ratio, which is also generated jitter sample data. The picture size may also be transformed in other ways, such as adjusting the aspect ratio of the picture.
The picture angle rotation may rotate the picture by a certain angle: fig. 12A shows the original image of fig. 10 rotated clockwise by 30°, which is generated jitter sample data; fig. 12B shows the original image of fig. 10 rotated counterclockwise by 30°, which is also generated jitter sample data. The picture angle can also be transformed in other ways, such as flipping or mirroring.
The picture brightness transformation may convert the picture from the three RGB channels to the three HSV channels and then add or subtract a constant C on the V channel. As shown in fig. 13A, the picture darkened by reducing the V channel of the accurate sample by 100 is generated jitter sample data; as shown in fig. 13B, the picture brightened by increasing the V channel of the accurate sample by 100 is also generated jitter sample data. The picture brightness can also be transformed in other ways, such as adjusting the V channel by other constants, or adjusting the values of the H or S channel.
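As a library-free sketch of the data dithering methods (a real pipeline would use an image library such as OpenCV for proper interpolation and RGB-to-HSV conversion; the simplified brightness jitter below adds the constant to all channels instead of only the V channel):

```python
import numpy as np

def jitter_brightness(img, delta):
    """Brightness jitter.  The text adds or subtracts a constant C on the
    V channel after an RGB-to-HSV conversion; as a simplified stand-in we
    add the constant to all channels and clip to [0, 255]."""
    return np.clip(img.astype(np.int16) + delta, 0, 255).astype(np.uint8)

def jitter_scale(img, factor):
    """Size jitter by an integer factor via nearest-neighbour repetition
    (a real pipeline would use proper interpolation)."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def jitter_mirror(img):
    """Mirror jitter, one of the 'other methods' for angle transforms."""
    return img[:, ::-1]

img = np.full((2, 2, 3), 200, dtype=np.uint8)  # tiny stand-in picture
print(jitter_brightness(img, 100)[0, 0, 0])  # 255 (clipped)
print(jitter_scale(img, 2).shape)            # (4, 4, 3)
```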
Step 52, obtaining a current trained model, inputting the jitter sample data into the current trained model, and obtaining a jitter sample result;
and the output result of the current trained model is the shaking sample result.
Step 53, obtaining the label value of the accurate sample, and judging whether the label value is the same as the jitter sample result;
the labeled value of the accurate sample may be obtained by manual labeling, or may be the output result of the trained model (except the output result of the individual trained model) in step 47; the method for obtaining the label value of the accurate sample may be directly reading the stored label, or may be obtained by re-determining step 47.
Step 54, if not the same, the jitter sample is a high-value sample among the difficult samples.
After data dithering is performed on the accurate sample, jitter sample data can be generated; a jitter sample contained in the jitter sample data shares many parts with the accurate sample and also differs in part. The label of the accurate sample can be regarded as the output result of inputting the accurate sample into the current trained model, while the jitter sample result is the output result of inputting the jitter sample into the current trained model. If the label is the same as the jitter sample result, the current trained model can still accurately identify or accurately output the jitter sample, and the training value of that jitter sample for the current trained model is very small. If the label differs from the jitter sample result, the current trained model cannot accurately identify or accurately output the jitter sample, and that jitter sample has a high training value for the current trained model and is a high-value sample.
Taking unlabeled sample 1 in fig. 3 as an example, assume that it is a screened accurate sample labeled wan CEL. If the output result of inputting a certain jitter sample of this accurate sample into the current trained model is wan CEL, the labeled value of the accurate sample is the same as the jitter sample result, and the corresponding jitter sample is not a difficult sample; if the output result is xiang CEL, the labeled value differs from the jitter sample result, and the corresponding jitter sample is a difficult sample, specifically a high-value sample among the difficult samples.
Through the steps 51-54, the accurate samples can be subjected to data dithering to generate dithered samples, and difficult samples can be screened out from the dithered samples, so that the difficult samples can be generated through the accurate samples on the basis of screening the difficult samples, the number of the difficult samples is further increased, a better training effect is achieved on the model after marking, and the accuracy of identification or accurate output of the trained model is further improved.
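Steps 51-54 above can be sketched as follows, assuming the current trained model is a callable returning a predicted label (the toy model and jitter functions below are illustrative stand-ins):

```python
def mine_difficult_from_jitter(accurate_sample, label, jitter_fns, model):
    """Sketch of steps 51-54: apply each jitter transform to an accurate
    sample, run the current trained model on the jittered data, and keep
    as high-value difficult samples those the model no longer labels
    correctly (jitter result differs from the labeled value)."""
    difficult = []
    for jitter in jitter_fns:
        jittered = jitter(accurate_sample)
        if model(jittered) != label:
            difficult.append(jittered)
    return difficult

# Toy demo: the 'model' reads the first element as its prediction, and one
# jitter corrupts the sample enough to change that prediction.
model = lambda s: s[0]
jitters = [lambda s: list(s), lambda s: ["xiang CEL"] + s[1:]]
found = mine_difficult_from_jitter(["wan CEL", 0], "wan CEL", jitters, model)
print(len(found))  # 1
```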
When the jitter sample is input into the current trained model to obtain the jitter sample result, the confidence corresponding to the jitter sample result can be obtained at the same time. If the label of the accurate sample is the same as the jitter sample result but the confidence corresponding to the jitter sample result is a low confidence, the jitter sample can be judged to be a next-highest-value sample.
Therefore, on the basis of screening out high-value samples, secondary high-value samples can be further screened out, the number of difficult samples is further increased, a better training effect is achieved on the model after labeling, and the accuracy of identification or accurate output of the trained model is further improved.
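The two screening rules above can be sketched as follows. This is a minimal illustration: the `model` interface (returning a predicted label and a confidence), the threshold `HIGH_CONF`, and the toy model are assumptions made for the example, not part of the claimed method.

```python
HIGH_CONF = 0.9  # assumed boundary between high and low confidence

def classify_jittered(model, accurate_label, jittered_image):
    """Classify one jittered sample generated from an accurate sample.

    Returns "high-value" when the model's output differs from the accurate
    sample's label, "second-highest-value" when the output matches but only
    with low confidence, and None when the sample has little training value.
    """
    predicted, confidence = model(jittered_image)
    if predicted != accurate_label:
        return "high-value"            # model fails on the jittered sample
    if confidence < HIGH_CONF:
        return "second-highest-value"  # correct output, but the model is unsure
    return None                        # correct and confident: low training value

# Hypothetical stand-in for the plate example in the text: the model misreads
# heavily rotated plates and is only weakly confident on mildly rotated ones.
toy_model = lambda img: ("Hunan CEL", 0.95) if img["rot"] > 30 else ("wan CEL", 0.6)

assert classify_jittered(toy_model, "wan CEL", {"rot": 45}) == "high-value"
assert classify_jittered(toy_model, "wan CEL", {"rot": 10}) == "second-highest-value"
```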
As shown in fig. 14, the step 50 further includes:
Step 55, determining the common characteristics of the difficult samples according to the accurate samples and the difficult samples generated from them;
from the screened difficult samples, the shared aspects of the jitter applied to the difficult samples generated from the accurate samples can be determined. For example, when analyzing, with reference to fig. 12A and 12B, the difficult samples among the jittered samples obtained by rotating the picture angle of the accurate samples, it may be found that the currently trained model cannot accurately identify or output any jittered sample obtained by rotating the original image more than 30° clockwise or counterclockwise.
Step 56, adjusting the data jittering mode applied to the accurate samples based on the common characteristics of the difficult samples.
After the common characteristics of the difficult samples are determined, the data jittering mode applied to the accurate samples can be adjusted so that all or most of the jittered sample data generated from the accurate samples consists of difficult samples. For example, under the above assumption and with reference to fig. 12A and 12B, applying data jittering that rotates the original image more than 30° clockwise or counterclockwise increases the probability that the accurate samples generate difficult samples.
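For the rotation example above, the adjustment of step 56 could be sketched as follows; the function name and range bounds are illustrative assumptions:

```python
import random

def sample_failing_rotation(failure_threshold_deg=30.0, max_deg=90.0):
    """Draw a rotation angle concentrated in the range where the currently
    trained model was observed to fail, so that most jittered samples
    generated from accurate samples become difficult samples."""
    magnitude = random.uniform(failure_threshold_deg, max_deg)
    # Rotate clockwise or counterclockwise with equal probability.
    return magnitude if random.random() < 0.5 else -magnitude

angle = sample_failing_rotation()
assert 30.0 <= abs(angle) <= 90.0
```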
For the difficult samples identified in steps 10-40, the selection is random; that is, the difficult samples are determined on the basis of the obtained unlabeled samples, and how the unlabeled samples are obtained has a great influence on which difficult samples are obtained.
The difficult samples essentially reflect the defects of the trained model. If the unlabeled samples are acquired randomly, the screened difficult samples are also random, meaning the reflected defects of the trained model are random as well. For the trained model, further improving the accuracy of identification or output through training essentially means eliminating defects by correcting them specifically. The randomness of the difficult samples above can broadly reflect the defects of the trained model, but the number of difficult samples for each individual defect is very small, with the following consequence: for a given defect, the small number of difficult samples makes it impossible to pinpoint the defect through them, and after the trained model is trained on so few samples, only a small portion of the corresponding defect can be eliminated, so the defect cannot be fully corrected.
In addition, although the unlabeled samples are in theory obtained randomly, in practice they are obtained in particular ways, such as capturing video of a traffic intersection over a period of time. The acquisition method has a substantial influence, and this influence sometimes results in very few difficult samples being screened from the video, so the ideal training effect is not achieved.
For example, suppose a model is trained on legally acquired daytime traffic videos of various intersections. For this trained model, if daytime traffic screenshots of those intersections are randomly acquired as unlabeled samples for screening difficult samples, the number of difficult samples screened out will be small. Conversely, if traffic screenshots of the intersections under night lighting are obtained as unlabeled samples, the number of difficult samples screened out will be large, and the training effect on the model after labeling will be closer to ideal.
Through steps 51-56, it can be determined that the screened difficult samples share common characteristics corresponding to a defect of the currently trained model, and the currently trained model can be trained on these difficult samples so that the corresponding defect is eliminated.
In this way, the number of difficult samples can be increased, providing a better training effect for the model after labeling and further improving the accuracy of the trained model's identification or output; difficult samples can also be generated for a particular defect, so that training the currently trained model eliminates the corresponding defect.
The embodiment of the present disclosure provides a model training method, which may be performed by a model training apparatus; the model training apparatus may be integrated in an electronic device such as a mobile phone. FIG. 15 is a first flowchart of a model training method according to an embodiment of the present invention. The model training method comprises the following steps:
step 10, obtaining sample data which is not marked;
step 20, obtaining a plurality of trained models;
step 30, inputting each unlabeled sample in the unlabeled sample data into a plurality of trained models to obtain a sample result;
step 40, screening out difficult samples from the unlabeled samples according to the sample results;
in the model training method, the specific contents of the steps 10 to 40 can refer to the specific description of the steps 10 to 40 in the difficult sample mining method, and are not described herein again.
Step 70, labeling the difficult samples;
step 80, training the currently trained model according to the labeled difficult samples.
In this way, unlabeled samples can be screened, and the model is trained after the determined difficult samples are labeled. Compared with manually observing the data, this screening saves a large amount of processing time. Moreover, labeling only the screened samples eliminates most of the simple samples (samples that are easy to identify) among the unlabeled samples, greatly reducing the number of samples that need to be labeled and greatly improving the training speed and effect of the trained model.
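The overall flow of steps 10-80 can be sketched as follows. This is a minimal outline; the `screen`, `label`, and `train` callables are placeholders for the operations described in the text, not a prescribed API:

```python
def train_with_difficult_samples(unlabeled, models, current_model,
                                 screen, label, train):
    """Steps 10-40: screen difficult samples with several trained models;
    step 70: label them; step 80: train the current model on the result."""
    difficult = [s for s in unlabeled if screen(s, models)]  # steps 10-40
    labeled = [(s, label(s)) for s in difficult]             # step 70
    train(current_model, labeled)                            # step 80
    return len(difficult)

# Toy demonstration: every third sample is "difficult".
collected = []
n = train_with_difficult_samples(
    unlabeled=range(9), models=[], current_model="m",
    screen=lambda s, ms: s % 3 == 0,
    label=lambda s: f"label-{s}",
    train=lambda m, data: collected.extend(data))
assert n == 3 and len(collected) == 3
```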
Before step 10, the method may further include step 00: setting the parameters of the model and training the model for multiple rounds.
The model parameters can be set randomly or according to actual conditions; the iteration number is the number of training iterations the model has undergone.
In this way, trained models with different iteration numbers can be conveniently obtained by carrying out multiple rounds of training, which facilitates the subsequent acquisition of trained models (trained models with different iteration numbers can be acquired subsequently, and trained models from other sources can also be acquired).
Optionally, in the step 40, an accurate sample is further screened from the unlabeled sample according to the sample result.
Optionally, after the step 40, the method further includes:
step 50, generating the difficult samples according to the accurate samples and the currently trained model.
In the model training method, the specific content of step 50 may refer to the specific description of step 50 in the difficult sample mining method, and is not described herein again.
In this way, difficult samples can be generated from the accurate samples and the currently trained model, and the model is trained after they are labeled. Thus, on the basis of screening difficult samples, additional difficult samples can be generated from accurate samples and labeled, further increasing the number of difficult samples for labeling, providing a better training effect for the model and further improving the accuracy of the trained model's identification or output.
As shown in fig. 16, the step 70 further includes:
Step 60, acquiring the sample demand of model training and the numbers of high-value samples and second-highest-value samples among the difficult samples, and determining the difficult samples that need to be labeled.
In this way, the difficult samples to be labeled can be determined according to the actual sample demand of model training; they can thus be determined promptly according to actual requirements, and model training is carried out after labeling.
As shown in fig. 17, wherein the step 60 comprises:
step 61, acquiring the sample demand of model training and the numbers of high-value samples and second-highest-value samples among the difficult samples;
when training a model, a training plan sometimes needs to be specified in advance so that it can be completed on time. For example, after a million pieces of unlabeled data are obtained, if there is no time limit, the model can be trained directly by first training on high-value samples, then on second-highest-value samples, and finally on the other unlabeled samples. If there is a time limit, for example this round of model training must finish within one week, the number of samples that can be labeled and trained within the limited time must be estimated; that many unlabeled samples are then selected from the unlabeled data, and the model is trained after labeling. The sample demand is the number of samples that can be labeled and trained within the limited time.
Step 62, judging the size relationship between the sample demand and the high-value sample quantity;
and 63, if the sample demand is smaller than the number of the high-value samples, randomly selecting the high-value samples with the same number as the sample demand from the high-value samples as the difficult samples needing to be marked.
For example, the number of samples that can be labeled and trained within the limited time, i.e., the sample demand, is assumed to be 10 ten thousand, and if the number of high-value samples is 13 ten thousand, 10 ten thousand samples are randomly selected from the high-value samples for labeling and training.
Through steps 61-63, the required unlabeled samples can be rapidly selected, so that the labeling and model training can be rapidly performed. And the selected unmarked sample has high training value, thereby further improving the effect of model training.
As shown in fig. 18, wherein the step 60 further comprises:
Step 64, if the sample demand is greater than the number of high-value samples, calculating the difference between the sample demand and the number of high-value samples;
step 65, randomly selecting, from the second-highest-value samples, a number of second-highest-value samples equal to the difference, and taking the high-value samples together with the selected second-highest-value samples as the difficult samples that need to be labeled.
For example, assume again that the sample demand is 100,000. If the number of high-value samples is 30,000, the difference between the sample demand and the number of high-value samples is calculated to be 70,000; 70,000 samples are then randomly selected from the second-highest-value samples, and the 70,000 second-highest-value samples and the 30,000 high-value samples are labeled and trained on.
When selecting the high-value samples and second-highest-value samples according to the sample demand, other approaches can also be used depending on actual conditions. For example, the unlabeled samples can be sorted according to certain rules, with high-value samples first, second-highest-value samples next, and the other samples last; once the sample demand is determined, that many unlabeled samples can be selected directly from the front of the ordering.
Through steps 61-65, the required unlabeled samples can be selected rapidly, facilitating rapid labeling and model training. Moreover, the selected unlabeled samples have high training value, further improving the effect of model training.
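Steps 61-65 can be sketched as follows. Function and variable names are illustrative; when the second-highest-value samples cannot cover the shortfall, this sketch simply returns everything available, a case the text leaves open:

```python
import random

def select_for_labeling(demand, high_value, second_highest):
    """Pick `demand` samples for labeling, preferring high-value ones."""
    if demand <= len(high_value):                      # steps 62-63
        return random.sample(high_value, demand)
    shortfall = demand - len(high_value)               # step 64
    extra = random.sample(second_highest, min(shortfall, len(second_highest)))
    return list(high_value) + extra                    # step 65

# Demand exceeds the 3 high-value samples, so 2 second-highest ones fill the gap.
picked = select_for_labeling(5, [0, 1, 2], list(range(100, 110)))
assert len(picked) == 5 and {0, 1, 2} <= set(picked)
```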
The embodiment of the present disclosure provides a difficult sample mining device for implementing the difficult sample mining method described in the above aspects of the present disclosure; the difficult sample mining device is described in detail below.
Fig. 19 is a structural block diagram of a difficult sample mining device according to an embodiment of the present invention, wherein the difficult sample mining device comprises:
the acquisition unit 1 is used for acquiring sample data which is not marked;
and further for obtaining a plurality of trained models;
the model unit 2 is used for inputting each unlabeled sample in the unlabeled sample data into a plurality of trained models to obtain a sample result;
and the screening unit 3 is used for screening the difficult samples from the unlabeled samples according to the sample results.
In this way, the unlabeled samples can be screened and the difficult samples determined. Compared with manually observing the data, this screening saves a large amount of processing time. Moreover, most of the simple samples (samples that are easy to identify) among the unlabeled samples can be removed through screening, greatly reducing the number of samples that need to be labeled, greatly improving the training speed and effect of the model, and reducing the labeling cost.
Optionally, in the obtaining unit 1, the number of the trained models is 3 to 5.
Optionally, in the obtaining unit 1, the number of the trained models is 3.
Optionally, in the model unit 2, the sample result includes an output result obtained after the unlabeled sample is input into the trained model and a confidence of the output result.
Optionally, the screening unit 3 is further configured to: obtain the sample results of the same unlabeled sample; judge whether the number of identical output results among the output results of the unlabeled sample falls into a first number interval; and if the number falls into the first number interval, determine that the unlabeled sample is a difficult sample.
Optionally, the first number interval is (0.35n, 0.85n), and n is the number of the trained models.
Optionally, the screening unit 3 is further configured to: judge whether the number of identical output results among the output results of the unlabeled sample falls into a second number interval; if it falls into the second number interval, judge whether the confidences corresponding to the identical output results are all high; and if they are not all high, determine that the unlabeled sample is a difficult sample.
Optionally, the second number interval is (0.85n, n), where n is the number of the trained models.
Optionally, the difficult samples include high-value samples and second-highest-value samples;
optionally, the screening unit 3 is further configured to: if the output result falls into the first quantity interval, judging whether the confidence degrees corresponding to the same output result are all low confidence degrees; if the confidence coefficient is low, the unlabeled sample is the second highest value sample; and if the confidence coefficients are not all low, the unlabeled sample is the high-value sample.
Optionally, in the screening unit 3, if the confidences are not all high, the unlabeled sample is a second-highest-value sample.
Optionally, in the screening unit 3, an accurate sample is screened from the unlabeled sample according to the sample result.
Optionally, the screening unit 3 is further configured to: if the confidences are all high, determine that the unlabeled sample is an accurate sample.
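Taken together, the screening rules of unit 3 above can be sketched as follows. The interval bounds follow the text, while the confidence threshold, the open/closed handling of the boundaries, and the treatment of very low agreement (which the text does not specify) are illustrative assumptions:

```python
from collections import Counter

HIGH_CONF = 0.9  # assumed boundary between high and low confidence

def classify_unlabeled(outputs, confidences):
    """Classify one unlabeled sample from the outputs of n trained models."""
    n = len(outputs)
    majority_label, count = Counter(outputs).most_common(1)[0]
    majority_conf = [c for o, c in zip(outputs, confidences)
                     if o == majority_label]
    if 0.35 * n < count <= 0.85 * n:            # first number interval
        if all(c < HIGH_CONF for c in majority_conf):
            return "second-highest-value"       # all low confidence
        return "high-value"                     # not all low confidence
    if count > 0.85 * n:                        # second number interval
        if all(c >= HIGH_CONF for c in majority_conf):
            return "accurate"                   # all high confidence
        return "second-highest-value"           # not all high confidence
    return "undetermined"                       # agreement too low; not covered by the text

assert classify_unlabeled(["a", "a", "b"], [0.95, 0.92, 0.40]) == "high-value"
assert classify_unlabeled(["a", "a", "a"], [0.95, 0.92, 0.93]) == "accurate"
```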
Optionally, the difficult sample excavating device further comprises:
and the generating unit 4 is used for generating the difficult samples according to the accurate samples and the currently trained model.
In this way, difficult samples can be generated from the accurate samples and the currently trained model. Thus, on the basis of screening difficult samples, additional difficult samples can be generated from the accurate samples, further increasing the number of difficult samples, providing a better training effect for the model after labeling, and further improving the accuracy of the trained model's identification or output.
Optionally, the generating unit 4 is further configured to:
performing data jittering on the accurate sample to generate jittered sample data; acquiring the currently trained model, and inputting the jittered sample data into the currently trained model to obtain a jittered sample result; acquiring the label value of the accurate sample, and judging whether the label value is the same as the jittered sample result; if not, the jittered sample is a high-value sample among the difficult samples.
Optionally, the generating unit 4 is further configured to:
determining the common characteristics of the difficult samples according to the accurate samples and the difficult samples generated from them; and adjusting the data jittering mode applied to the accurate samples based on the common characteristics of the difficult samples.
Optionally, the data jittering mode includes at least one of picture angle rotation, picture size transformation, and picture brightness transformation.
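The three jittering modes can be illustrated with a dependency-free sketch in which an image is a list of rows of grayscale values in 0-255. The rotation here is restricted to 90° steps to keep the example short; a real implementation would use an image library and arbitrary angles:

```python
def rotate_90cw(img):
    """Picture angle rotation, restricted here to 90-degree clockwise steps."""
    return [list(row) for row in zip(*img[::-1])]

def resize_nearest(img, scale):
    """Picture size transformation by nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    nh, nw = max(1, int(h * scale)), max(1, int(w * scale))
    return [[img[int(r / scale)][int(c / scale)] for c in range(nw)]
            for r in range(nh)]

def adjust_brightness(img, factor):
    """Picture brightness transformation, clipped to the valid 0-255 range."""
    return [[min(255, max(0, int(p * factor))) for p in row] for row in img]

img = [[10, 20], [30, 40]]
assert rotate_90cw(img) == [[30, 10], [40, 20]]
assert adjust_brightness(img, 2.0) == [[20, 40], [60, 80]]
```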
The embodiment of the present disclosure provides a model training apparatus, which is used for executing the model training method described above in the present disclosure, and the model training apparatus is described in detail below.
FIG. 20 is a block diagram of a model training apparatus according to an embodiment of the present invention; wherein, the model training device comprises:
the above difficult sample excavating device is used for determining the difficult sample;
the labeling unit 6 is used for labeling the difficult samples;
and the training unit 7 is used for training the currently trained model according to the labeled difficult sample.
In this way, unlabeled samples can be screened, and the model is trained after the determined difficult samples are labeled. Compared with manually observing the data, this screening saves a large amount of processing time. Moreover, labeling only the screened samples eliminates most of the simple samples (samples that are easy to identify) among the unlabeled samples, greatly reducing the number of samples that need to be labeled, greatly improving the training speed and effect of the model, and reducing the labeling cost.
Optionally, the model training apparatus further includes:
and the quantity determining unit 5 is used for acquiring the sample demand of model training and the numbers of high-value samples and second-highest-value samples among the difficult samples, and for determining the difficult samples that need to be labeled.
In this way, the difficult samples to be labeled can be determined according to the actual sample demand of model training; they can thus be determined promptly according to actual requirements, and model training is carried out after labeling.
Optionally, the quantity determining unit 5 is further configured to: acquire the sample demand of model training and the numbers of high-value samples and second-highest-value samples among the difficult samples; judge the relative magnitudes of the sample demand and the number of high-value samples; and if the sample demand is smaller than the number of high-value samples, randomly select, from the high-value samples, a number of high-value samples equal to the sample demand as the difficult samples that need to be labeled.
Optionally, the quantity determining unit 5 is further configured to: if the sample demand is greater than the number of high-value samples, calculate the difference between the sample demand and the number of high-value samples; randomly select, from the second-highest-value samples, a number of second-highest-value samples equal to the difference; and take the high-value samples together with the selected second-highest-value samples as the difficult samples that need to be labeled.
It should be noted that the above-described device embodiments are merely illustrative, for example, the division of the units is only one logical function division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The internal functions and structures of the difficult sample mining device and the model training device are described above. As shown in fig. 21, in practice, the difficult sample mining device and the model training device may be implemented as an electronic device comprising a processor and a memory, the memory storing a control program that, when executed by the processor, implements the above-described difficult sample mining method or the above-described model training method.
Fig. 22 is a block diagram illustrating another electronic device according to an embodiment of the invention. The electronic device 12 shown in fig. 22 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 22, the electronic device 12 may be implemented in the form of a general-purpose electronic device. The components of electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus, to name a few.
Electronic device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 30 and/or cache Memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer-readable storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown, but commonly referred to as a "hard drive"). Although not shown in FIG. 22, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk Read Only memory (CD-ROM), a Digital versatile disk Read Only memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.
Electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any devices (e.g., network card, modem, etc.) that enable the computer system/server 12 to communicate with one or more other electronic devices. Such communication may be through an input/output (I/O) interface 22. Also, the electronic device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public Network such as the Internet) via the Network adapter 20. As shown, the network adapter 20 communicates with other modules of the electronic device 12 via the bus 18. It is noted that although not shown, other hardware and/or software modules may be used in conjunction with electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing, for example, implementing the methods mentioned in the foregoing embodiments, by executing programs stored in the system memory 28.
The electronic device of the invention can be a server or a terminal device with limited computing power, and the lightweight network structure of the invention is particularly suitable for the latter. Specific implementations of the terminal device include, but are not limited to: intelligent mobile communication terminals, unmanned aerial vehicles, robots, portable image processing devices, security devices, and the like. The embodiment of the present disclosure provides a computer-readable storage medium storing instructions that, when loaded and executed by a processor, implement the above-described difficult sample mining method or the above-described model training method.
The technical solution of the embodiments of the present invention, in essence or in the part that contributes to the prior art, or in whole or in part, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
Although the present disclosure has been described above, the scope of the present disclosure is not limited thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present disclosure, and these changes and modifications are intended to be within the scope of the present disclosure.

Claims (23)

1. A method of mining a difficult sample, comprising:
obtaining sample data which is not marked;
obtaining a plurality of trained models;
inputting each unlabeled sample in the unlabeled sample data into a plurality of trained models to obtain a sample result;
and screening out difficult samples from the unlabeled samples according to the sample result.
2. The difficult sample mining method according to claim 1, wherein, in the obtaining of the plurality of trained models, the number of trained models is 3 to 5.
3. The difficult sample mining method according to claim 1, wherein, in the obtaining of the plurality of trained models, the number of trained models is 3.
4. The method according to any one of claims 1 to 3, wherein each unlabeled sample in the unlabeled sample data is input into a plurality of trained models, and a sample result is obtained, wherein the sample result includes an output result of the unlabeled sample after being input into the trained models and a confidence of the output result.
5. The method for mining difficult samples according to claim 4, wherein the screening out difficult samples from the unlabeled samples according to the sample results comprises:
obtaining the sample result of the same unlabeled sample;
judging whether the number of the same output results in the output results of the unlabeled samples falls into a first number interval or not;
and if the number falls into the first number interval, the unlabeled sample is the difficult sample.
6. The difficult sample mining method of claim 5, wherein the first number interval is (0.35n, 0.85n), n being the number of trained models.
7. The method for mining difficult samples according to claim 5, wherein after obtaining the sample result of the same unlabeled sample, the method further comprises:
judging whether the number of the same output results in the output results of the unlabeled samples falls into a second number interval or not;
if the number falls into the second number interval, judging whether the confidences corresponding to the same output results are all high confidences;
and if the confidences are not all high, the unlabeled sample is the difficult sample.
8. The method of hard sample mining as claimed in claim 7, wherein the second number interval is (0.85n, n), n being the number of trained models.
9. The difficult sample mining method of claim 5, wherein the difficult samples include high-value samples and second-highest-value samples;
and wherein, if the number falls into the first number interval, determining that the unlabeled sample is the difficult sample comprises:
if the number falls into the first number interval, judging whether the confidences corresponding to the same output results are all low confidences;
if the confidences are all low, the unlabeled sample is the second-highest-value sample;
and if the confidences are not all low, the unlabeled sample is the high-value sample.
10. The difficult sample mining method of claim 7, wherein, when the unlabeled sample is the difficult sample because the confidences are not all high, the unlabeled sample is a second-highest-value sample.
11. The difficult sample mining method according to any one of claims 1 to 3, wherein, in the screening of difficult samples from the unlabeled samples according to the sample results, accurate samples are further screened from the unlabeled samples according to the sample results;
and after the difficult samples are screened from the unlabeled samples according to the sample results, the method further comprises:
generating the difficult samples according to the accurate samples and the currently trained model.
12. The difficult sample mining method according to claim 7, wherein, if the number falls within the second number interval, after judging whether the confidences corresponding to the same output results are all high, the method further comprises:
if the confidences are all high, the unlabeled sample is an accurate sample.
13. The method of claim 11, wherein the generating of difficult samples from the accurate samples and the currently trained model comprises:
performing data dithering on the accurate samples to generate dithered sample data;
acquiring the currently trained model, and inputting the dithered sample data into the currently trained model to obtain a dithered sample result;
acquiring the label value of the accurate sample, and judging whether the label value is the same as the dithered sample result;
and if so, the dithered sample is a high-value sample among the difficult samples.
14. The difficult sample mining method of claim 13, wherein the generating of difficult samples from the accurate samples and the currently trained model further comprises:
determining the characteristics common to the difficult samples generated from the accurate samples;
and adjusting the data dithering applied to the accurate samples based on those common characteristics.
15. The method of claim 13, wherein the data dithering comprises at least one of picture angle rotation, picture size transformation, and picture brightness transformation.
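The dither-and-check loop of claims 13 and 15 can be sketched as below. This is an illustrative stand-in, not the patent's code: only brightness dithering is shown (claim 15 also names angle rotation and size transformation), the image is a plain nested list of grayscale values, and `model` is a hypothetical callable standing in for the currently trained model.

```python
import random

def jitter(image, brightness=0.2, seed=None):
    """Brightness dithering on a grayscale image given as nested lists of
    values in [0, 255]. One of the transformations named in claim 15; the
    data layout and the 20% range are assumptions for illustration.
    """
    rng = random.Random(seed)
    factor = 1.0 + rng.uniform(-brightness, brightness)
    return [[min(255, max(0, int(px * factor))) for px in row] for row in image]

def mine_from_accurate(sample, label, model, n_jitters=5):
    """Claim 13 sketch: dither an accurate sample several times, run the
    currently trained model on each dithered copy, and keep the copies whose
    result matches the label value as high-value difficult samples.
    `model` is a hypothetical callable (image -> predicted label).
    """
    mined = []
    for i in range(n_jitters):
        dithered = jitter(sample, seed=i)
        if model(dithered) == label:  # label value same as dithered result
            mined.append(dithered)
    return mined
```

Claim 14's feedback step would then inspect the mined copies for shared traits (e.g. all from darker variants) and bias the dithering accordingly.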
16. A model training method, comprising:
determining difficult samples according to the difficult sample mining method of any one of claims 1 to 15;
labeling the difficult samples;
and training the currently trained model with the labeled difficult samples.
17. The model training method of claim 16, further comprising, before labeling the difficult samples:
acquiring the sample demand of the model training and the numbers of high-value samples and second-highest-value samples among the difficult samples, and determining the difficult samples to be labeled.
18. The model training method of claim 17, wherein the acquiring of the sample demand and the numbers of high-value and second-highest-value samples to determine the difficult samples to be labeled comprises:
acquiring the sample demand of the model training and the numbers of high-value samples and second-highest-value samples among the difficult samples;
comparing the sample demand with the number of high-value samples;
and if the sample demand is smaller than the number of high-value samples, randomly selecting from the high-value samples a number of samples equal to the sample demand as the difficult samples to be labeled.
19. The model training method of claim 18, wherein the acquiring of the sample demand and the numbers of high-value and second-highest-value samples to determine the difficult samples to be labeled further comprises:
if the sample demand is greater than the number of high-value samples, calculating the difference between the sample demand and the number of high-value samples;
and randomly selecting from the second-highest-value samples a number of samples equal to the difference, and taking the high-value samples together with the selected second-highest-value samples as the difficult samples to be labeled.
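The demand-driven selection of claims 18 and 19 can be sketched as follows. The random-selection policy follows the claims; the function name and the fixed seed (added only to keep the sketch reproducible) are assumptions.

```python
import random

def select_for_labeling(high_value, second_highest, demand, seed=0):
    """Claims 18-19 sketch: when the sample demand is smaller than the number
    of high-value samples, randomly pick `demand` of them; otherwise take all
    high-value samples and top up with randomly chosen second-highest-value
    samples until the demand is met.
    """
    rng = random.Random(seed)
    if demand < len(high_value):
        return rng.sample(high_value, demand)       # claim 18
    shortfall = demand - len(high_value)            # claim 19: the difference
    return high_value + rng.sample(second_highest, shortfall)
```

This ordering spends the labeling budget on high-value samples first, falling back to second-highest-value samples only when the budget exceeds their supply.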
20. A difficult sample mining apparatus, comprising:
an acquisition unit (1) for acquiring unlabeled sample data, and further for acquiring a plurality of trained models;
a model unit (2) for inputting each unlabeled sample in the unlabeled sample data into the plurality of trained models to obtain sample results;
and a screening unit (3) for screening difficult samples from the unlabeled samples according to the sample results.
21. A model training apparatus, comprising:
the difficult sample mining apparatus of claim 20, for determining the difficult samples;
a labeling unit (6) for labeling the difficult samples;
and a training unit (7) for training the currently trained model with the labeled difficult samples.
22. An electronic device comprising a processor and a memory, wherein the memory stores a control program that, when executed by the processor, implements the difficult sample mining method of any one of claims 1 to 15 or the model training method of any one of claims 16 to 19.
23. A computer-readable storage medium storing instructions that, when loaded and executed by a processor, implement the difficult sample mining method of any one of claims 1 to 15 or the model training method of any one of claims 16 to 19.
CN201910762856.5A 2019-08-19 2019-08-19 Method and device for mining difficult sample and training model and electronic equipment Active CN110610197B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910762856.5A CN110610197B (en) 2019-08-19 2019-08-19 Method and device for mining difficult sample and training model and electronic equipment

Publications (2)

Publication Number Publication Date
CN110610197A (en) 2019-12-24
CN110610197B (en) 2022-09-27

Family

ID=68890533

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102012017644A1 (en) * 2012-09-03 2014-05-15 YouCredit AG Method for automatically assessing risk of borrower in loan application process, involves combining information from applicant with information obtained from external data sources upon evaluation of credit-worthiness of applicant
CN107679564A (en) * 2017-09-20 2018-02-09 北京百度网讯科技有限公司 Sample data recommends method and its device
CN108596958A (en) * 2018-05-10 2018-09-28 安徽大学 Target tracking method based on difficult positive sample generation
CN109582793A (en) * 2018-11-23 2019-04-05 深圳前海微众银行股份有限公司 Model training method, customer service system and data labeling system, readable storage medium storing program for executing
CN109978017A (en) * 2019-03-06 2019-07-05 开易(北京)科技有限公司 Difficult specimen sample method and system

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113496227A (en) * 2020-04-08 2021-10-12 顺丰科技有限公司 Training method and device of character recognition model, server and storage medium
CN113554146A (en) * 2020-04-26 2021-10-26 华为技术有限公司 Method for verifying labeled data, method and device for model training
CN111814582B (en) * 2020-06-15 2022-06-28 开易(北京)科技有限公司 Method and device for processing driver behavior monitoring image
CN111814582A (en) * 2020-06-15 2020-10-23 开易(北京)科技有限公司 Method and device for processing driver behavior monitoring image
CN111859953A (en) * 2020-06-22 2020-10-30 北京百度网讯科技有限公司 Training data mining method and device, electronic equipment and storage medium
CN111859953B (en) * 2020-06-22 2023-08-22 北京百度网讯科技有限公司 Training data mining method and device, electronic equipment and storage medium
WO2022001501A1 (en) * 2020-06-29 2022-01-06 华为技术有限公司 Data annotation method and apparatus, and computer device and storage medium
CN112435230A (en) * 2020-11-20 2021-03-02 哈尔滨市科佳通用机电股份有限公司 Deep learning-based data set generation method and system
CN112435230B (en) * 2020-11-20 2021-07-16 哈尔滨市科佳通用机电股份有限公司 Deep learning-based data set generation method and system
CN112508130A (en) * 2020-12-25 2021-03-16 商汤集团有限公司 Clustering method and device, electronic equipment and storage medium
CN112687369A (en) * 2020-12-31 2021-04-20 杭州依图医疗技术有限公司 Medical data training method and device and storage medium
CN112733860B (en) * 2021-01-27 2021-09-10 上海微亿智造科技有限公司 Method and system for mining difficult samples of two-classification segmentation network
CN112733860A (en) * 2021-01-27 2021-04-30 上海微亿智造科技有限公司 Method and system for improving accuracy and recall rate of two-classification segmentation network
CN113098012A (en) * 2021-05-24 2021-07-09 东南大学 Regional autonomy capability assessment method for DG-containing power distribution network
CN113098012B (en) * 2021-05-24 2022-07-12 东南大学 Regional autonomy capability assessment method for DG-containing power distribution network
WO2023011470A1 (en) * 2021-08-05 2023-02-09 上海高德威智能交通系统有限公司 Machine learning system and model training method
CN114118305A (en) * 2022-01-25 2022-03-01 广州市玄武无线科技股份有限公司 Sample screening method, device, equipment and computer medium

Similar Documents

Publication Publication Date Title
CN110610197B (en) Method and device for mining difficult sample and training model and electronic equipment
CN108960409B (en) Method and device for generating annotation data and computer-readable storage medium
CN108229588B (en) Machine learning identification method based on deep learning
CN108090499B (en) Data active labeling method and system based on maximum information triple screening network
CN109117863B (en) Insulator sample expansion method and device based on deep convolution generation countermeasure network
CN110472494A (en) Face feature extracts model training method, facial feature extraction method, device, equipment and storage medium
CN111259772A (en) Image annotation method, device, equipment and medium
CN112613553A (en) Picture sample set generation method and device, computer equipment and storage medium
CN110599453A (en) Panel defect detection method and device based on image fusion and equipment terminal
CN110727816A (en) Method and device for determining interest point category
CN111414951B (en) Fine classification method and device for images
CN112926621A (en) Data labeling method and device, electronic equipment and storage medium
CN108830302B (en) Image classification method, training method, classification prediction method and related device
CN113569687B (en) Scene classification method, system, equipment and medium based on double-flow network
CN113344079B (en) Image tag semi-automatic labeling method, system, terminal and medium
CN113762303B (en) Image classification method, device, electronic equipment and storage medium
CN113780116A (en) Invoice classification method and device, computer equipment and storage medium
CN111144466B (en) Image sample self-adaptive depth measurement learning method
CN108345943B (en) Machine learning identification method based on embedded coding and contrast learning
CN108229693B (en) Machine learning identification device and method based on comparison learning
CN111311393A (en) Credit risk assessment method, device, server and storage medium
CN117011539A (en) Target detection method, training method, device and equipment of target detection model
CN113569684B (en) Short video scene classification method, system, electronic equipment and storage medium
CN111753625B (en) Pedestrian detection method, device, equipment and medium
CN112328951B (en) Processing method of experimental data of analysis sample

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant