CN111062440A - Sample selection method, device, equipment and storage medium - Google Patents

Info

Publication number
CN111062440A
CN111062440A
Authority
CN
China
Prior art keywords
sample data
clustering
sample
determining
full
Prior art date
Legal status
Granted
Application number
CN201911310273.5A
Other languages
Chinese (zh)
Other versions
CN111062440B (en)
Inventor
文心杰 (Wen Xinjie)
王晓利 (Wang Xiaoli)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201911310273.5A
Publication of CN111062440A
Application granted
Publication of CN111062440B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The embodiment of the application discloses a sample selection method, apparatus, device and storage medium, wherein the method comprises the following steps: extracting, for a plurality of sample data to be audited, feature vectors corresponding to the plurality of sample data; clustering the feature vectors corresponding to the plurality of sample data to obtain at least one cluster category, and determining a cluster center of the at least one cluster category; determining sampling probabilities corresponding to the plurality of sample data according to the cluster center of the at least one cluster category and the feature vectors corresponding to the plurality of sample data; and selecting, according to the sampling probability corresponding to each sample data, a preset number of sample data from the plurality of sample data as target sample data to be audited. In this way, the time cost and economic cost consumed in auditing the sample data are saved while the reliability of the sample data audit is maintained.

Description

Sample selection method, device, equipment and storage medium
Technical Field
The present application relates to the technical field of Artificial Intelligence (AI), and in particular, to a sample selection method, apparatus, device and storage medium based on Artificial Intelligence.
Background
In the broad context of artificial intelligence, various requirements related to machine learning have emerged. At present, a large amount of labeled sample data is generally required when training a model based on a machine learning algorithm, and the main way of obtaining such data is manual labeling by a team. Such a team generally comprises annotators and auditors: the annotators are responsible for labeling the sample data, and the auditors are responsible for auditing the labeling quality of the sample data.
Currently, an auditor mainly audits labeled sample data through the following two ways: in the first mode, all the labeled sample data are audited in full, namely, the quality of all the labeled sample data is audited; and in the second mode, part of the labeled sample data is randomly extracted from all the labeled sample data, and the quality of the extracted labeled sample data is checked.
For the first mode, the amount of labeled sample data required for training a model based on a machine learning algorithm is currently huge, and the time cost and economic cost required to audit all of it are correspondingly huge. For the second mode, randomly extracted labeled sample data can hardly reflect the distribution of all labeled sample data accurately, so the labeling accuracy determined on its basis has poor reliability.
In summary, how to save the time cost and economic cost consumed in auditing sample data while maintaining the reliability of the sample data audit has become a problem to be solved urgently.
Disclosure of Invention
The embodiment of the application provides a sample selection method, apparatus, device and storage medium, so that the time cost and economic cost consumed in auditing sample data can be saved while the reliability of the sample data audit is maintained.
In view of the above, a first aspect of the present application provides a sample selection method, including:
extracting, for a plurality of sample data to be audited, feature vectors corresponding to the plurality of sample data;
clustering the feature vectors corresponding to the plurality of sample data to obtain at least one cluster category, and determining a cluster center of the at least one cluster category;
determining sampling probabilities corresponding to the plurality of sample data according to the cluster center of the at least one cluster category and the feature vectors corresponding to the plurality of sample data;
and selecting, according to the sampling probability corresponding to each sample data, a preset number of sample data from the plurality of sample data as target sample data to be audited.
A second aspect of the present application provides a sample selection device, the device comprising:
a feature vector extraction module, configured to extract, for a plurality of sample data to be audited, feature vectors corresponding to the plurality of sample data;
a clustering module, configured to cluster the feature vectors corresponding to the plurality of sample data to obtain at least one cluster category, and determine a cluster center of the at least one cluster category;
a sampling probability determining module, configured to determine, according to the cluster center of the at least one cluster category and the feature vectors corresponding to the plurality of sample data, sampling probabilities corresponding to the plurality of sample data;
and a selection module, configured to select, according to the sampling probability corresponding to each of the plurality of sample data, a preset number of sample data from the plurality of sample data as target sample data to be audited.
A third aspect of the application provides an electronic device comprising a processor and a memory:
the memory is used for storing a computer program;
the processor is adapted to perform the steps of the sample selection method according to the first aspect as described above, according to the computer program.
A fourth aspect of the present application provides a computer-readable storage medium for storing a computer program for performing the steps of the sample selection method of the first aspect described above.
A fifth aspect of the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the steps of the sample selection method of the first aspect described above.
According to the technical scheme, the embodiment of the application has the following advantages:
in the sample selection method provided in the embodiment of the present application, feature vectors corresponding to a plurality of sample data to be audited are first extracted; then, the feature vectors corresponding to the plurality of sample data are clustered to obtain at least one cluster category, and the cluster center of the at least one cluster category is determined; further, a coreset construction algorithm is adopted to determine the sampling probability corresponding to each sample data according to the cluster center of the at least one cluster category and the feature vectors corresponding to the plurality of sample data, and target sample data to be audited are selected from the plurality of sample data according to these sampling probabilities. The method combines a clustering algorithm with a coreset construction algorithm to extract part of the sample data from the full-size sample data to be audited as the target sample data, that is, to extract a coreset from the full-size sample data. Because a coreset reflects the distribution characteristics of the full-size data, the target sample data selected by the method reflect the distribution characteristics of the full-size sample data to be audited. It is therefore unnecessary to audit the full-size sample data, which greatly saves the time cost and economic cost consumed in auditing, while the reliability of the sample data audit is guaranteed on the premise that the target sample data reflect the distribution characteristics of the full-size sample data to be audited.
Drawings
Fig. 1 is a schematic view of an application scenario of a sample selection method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a sample selection method according to an embodiment of the present application;
fig. 3 is a schematic diagram illustrating an implementation process of a sample selection method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a first sample selection device provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a second sample selection device provided in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a third sample selection device provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
When the prior art is adopted to audit labeled sample data, there are mainly the following two technical problems: first, auditing all labeled sample data in full ensures the reliability of the audit, but the time cost and economic cost that must be consumed are extremely high; second, auditing sample data randomly extracted from all labeled sample data saves time cost and economic cost, but the randomly extracted sample data can hardly reflect the distribution of the full-size sample data accurately, which reduces the reliability of the audit.
In view of the above technical problems, the embodiments of the present application provide a sample selection method in which, taking advantage of the property that a coreset reflects the overall distribution characteristics of the full-size data, a clustering algorithm is combined with a coreset construction algorithm to extract part of the sample data from the full-size sample data to be audited as the target sample data to be audited; this saves the time cost and economic cost consumed in auditing the sample data while ensuring the reliability of the audit.
Specifically, in the sample selection method provided in the embodiment of the present application, feature extraction processing is performed on full-size sample data to be audited first, so as to obtain feature vectors corresponding to the full-size sample data; then, clustering processing is carried out on the basis of the characteristic vectors corresponding to the full-scale sample data respectively to obtain at least one clustering category, and a clustering center of the at least one clustering category is determined; and then, determining the sampling probability corresponding to each sample data in the full-volume sample data according to the clustering center of the at least one clustering category and the characteristic vector corresponding to each full-volume sample data by adopting a coreset core set construction method, and selecting a preset number of sample data from the full-volume sample data according to the sampling probability corresponding to each sample data to be used as the target sample data needing to be audited.
Compared with the mode of examining and verifying the full amount of sample data one by one in the prior art, the method provided by the embodiment of the application can greatly reduce the amount of the sample data needing to be examined, thereby saving the time cost and the economic cost for examining and verifying the sample data; compared with the mode of auditing the sample data randomly extracted from the full-size sample data in the prior art, the target sample data selected by the method provided by the embodiment of the application can accurately reflect the overall distribution characteristics of the full-size sample data, so that the reliability of the auditing of the sample data can be ensured.
It should be understood that the sample selection method provided in the embodiments of the present application may be generally applied to a device with data processing capability, and the device may specifically be a terminal device or a server. The terminal device may be a computer, a Personal Digital Assistant (PDA), a tablet computer, a smart phone, or the like. The server may specifically be an application server or a Web server, and in actual deployment, the server may be an independent server or a cluster server.
In order to facilitate understanding of the technical solution provided in the embodiment of the present application, an application scenario to which the sample selection method provided in the embodiment of the present application is applied is described below.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a sample selection method provided in an embodiment of the present application. As shown in fig. 1, the application scenario includes: a server 110. The server 110 is configured to execute the sample selection method provided in the embodiment of the present application, select a preset number of sample data from the full-size sample data to be audited, and use the sample data as target sample data to be audited, where the selected target sample data can generally reflect the overall distribution characteristics of the full-size sample data.
Specifically, the server 110 may first obtain full-size sample data to be audited, and perform feature extraction processing on each sample data in the full-size sample data to obtain a feature vector corresponding to each sample data. Then, in a vector space corresponding to the full amount of sample data, clustering processing is carried out based on the characteristic vector corresponding to each sample data to obtain at least one clustering category, and the clustering center of each obtained clustering category is determined. And then, determining the sampling probability corresponding to each sample data in the full-scale sample data according to the clustering center of at least one clustering class and the characteristic vector corresponding to each full-scale sample data by adopting a coreset core set building algorithm, and extracting partial sample data from the full-scale sample data according to the sampling probability corresponding to each sample data in the full-scale sample data to be used as target sample data needing to be audited.
In the sample selection process, the full-size sample data to be audited acquired by the server 110 is generally labeled sample data, and part of it needs to be extracted and provided to the auditor, so that the auditor can audit the labeling quality based on that part. To ensure that the labeling accuracy determined from the extracted part is close to the labeling accuracy corresponding to the full-size sample data, the coreset construction algorithm is integrated into the sample selection process, so that the extracted part reflects the overall distribution characteristics of the full-size sample data; the reliability of the sample data audit is thus ensured while the time cost and economic cost required for the audit are saved.
It should be understood that the scenario shown in fig. 1 is only an example, and in practical applications, the sample selection method provided in the embodiment of the present application is not only applicable to the scenario shown in fig. 1, but also applicable to other scenarios; that is, the method provided by the embodiment of the present application may be applied to a scenario in which part of sample data that needs to be audited is extracted from full-size sample data, and may also be applied to other scenarios in which samples need to be extracted, and an application scenario to which the method for selecting samples provided by the embodiment of the present application is applied is not limited at all.
The sample selection method provided in the present application is described below by way of example.
Referring to fig. 2, fig. 2 is a schematic flow chart of a sample selection method provided in an embodiment of the present application. For convenience of description, the following embodiments describe the implementation process of the sample selection method by taking a terminal device as an example. As shown in fig. 2, the sample selection method includes the steps of:
step 201: and extracting the characteristic vectors corresponding to the plurality of sample data aiming at the plurality of sample data to be audited.
After the labeling of the full-size sample data is finished, the auditing stage of the labeled full-size sample data begins, in which the labeling quality is audited. In the auditing stage, the terminal device needs to obtain the full-size sample data to be audited and perform feature extraction processing on each sample data, thereby obtaining the feature vectors corresponding to the full-size sample data.
It should be noted that, in the face of different processing tasks (specifically, different processing tasks corresponding to the model to be trained), the full-size sample data to be acquired usually belongs to different data types. Specifically, when models for processing images, such as an image classification model or an image segmentation model, need to be trained with the full-size sample data, the full-size sample data may be image data; when models for processing text, such as a text classification model, a semantic segmentation model or a text translation model, need to be trained, the full-size sample data may be text data; when models for processing video, such as a video classification model, need to be trained, the full-size sample data may be video data; and when models for processing speech, such as a speech recognition model, need to be trained, the full-size sample data may be speech data.
The present application does not limit the data type to which the full-scale sample data belongs. It should be understood that no matter what model is trained with full-size sample data, it is necessary to ensure that the data type of each sample data in the full-size sample data is the same.
Specifically, when performing feature extraction on the sample data, the following processing may be performed for each sample data in the full-size sample data: the sample data is input into a pre-trained convolutional neural network, and the feature vector output by the last fully connected layer of the convolutional neural network is obtained as the feature vector corresponding to the sample data.
Taking image samples as an example of the data type to which the full-size sample data belongs, Inception v3 can be used as the convolutional neural network for extracting feature vectors. Because the Inception v3 network has a large number of parameters, a transfer learning approach can be adopted, directly applying the parameters trained by Inception v3 on the ImageNet data set to the Inception v3 used for extracting feature vectors in this application. In the process of training Inception v3, the last softmax layer containing 1000 neurons in the original Inception v3 is removed and replaced with a softmax layer adapted to the training task, after which Inception v3 is fine-tuned (finetune). When Inception v3 is used to extract the feature vectors of the image samples in the full-size sample data, the softmax layer added during training is removed, and the output of the fully connected layer before the softmax layer is used directly as the feature vector of the image sample; in this way, the feature vector of each image sample in the full-size sample data can be extracted by Inception v3.
It should be understood that, in practical applications, other model structures besides Inception v3 may be used to extract the feature vectors of the sample data; the model structure of the convolutional neural network used for feature extraction is not limited in this application.
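To make this feature-extraction step concrete, the following is a minimal sketch assuming TensorFlow/Keras is available; the input size, pooling choice and function names are illustrative assumptions, not details mandated by this application.

```python
import numpy as np
import tensorflow as tf

# Inception v3 pretrained on ImageNet, with the classification (softmax) head
# removed (include_top=False); global average pooling yields one 2048-dim
# feature vector per image, standing in for the fully connected layer output.
backbone = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg"
)

def extract_features(images: np.ndarray) -> np.ndarray:
    # images: array of shape (n, 299, 299, 3) with pixel values in [0, 255]
    x = tf.keras.applications.inception_v3.preprocess_input(images.astype("float32"))
    return backbone.predict(x, verbose=0)  # shape (n, 2048)
```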
Step 202: cluster the feature vectors corresponding to the plurality of sample data to obtain at least one cluster category, and determine the cluster center of the at least one cluster category.
After extracting the feature vector corresponding to each sample data in the full sample data, the terminal device may perform clustering processing based on the feature vector corresponding to each sample data in the vector space corresponding to the full sample data according to the preset number of cluster categories, thereby obtaining at least one cluster category (the number of cluster categories obtained through clustering processing is equal to the preset number of cluster categories), and determining the cluster center of each cluster category.
It should be understood that the preset number of cluster categories may be set according to actual requirements, and is usually set to be an integer greater than 1, and the preset number of cluster categories is not specifically limited herein.
It should be noted that, in practical applications, the terminal device may generally adopt the k-means clustering algorithm to cluster the feature vectors corresponding to the full-size sample data in the vector space corresponding to the full-size sample data, and find the cluster center of each cluster category after clustering. Of course, the terminal device may also use other clustering algorithms to perform the clustering and find the cluster center of each cluster category; the clustering algorithm used here is not limited in this application.
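As an illustration of this step, a sketch using scikit-learn's KMeans follows; the feature matrix and the number of cluster categories are placeholder values standing in for the quantities described above.

```python
import numpy as np
from sklearn.cluster import KMeans

features = np.random.rand(1000, 2048)   # stand-in for the extracted feature vectors
k = 10                                  # preset number of cluster categories (example)

kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
centers = kmeans.cluster_centers_       # cluster center of each cluster category
labels = kmeans.labels_                 # cluster category index of each sample
```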
Step 203: determine the sampling probabilities corresponding to the plurality of sample data according to the cluster center of the at least one cluster category and the feature vectors corresponding to the plurality of sample data.
After the terminal device finishes clustering the feature vectors corresponding to the full-size sample data, a coreset construction method may be adopted: based on the cluster center of the at least one cluster category obtained by clustering and the feature vectors corresponding to the full-size sample data, the sampling probability corresponding to each sample data in the full-size sample data is determined, and a number of sample data are then extracted from the full-size sample data according to those sampling probabilities to form the coreset.
When the corresponding sampling probability of a certain sample data in the full sample data is determined, the cluster type of the sample data can be determined firstly, and the number of the sample data in the cluster type is counted; then, determining the distance between the sample data and the clustering center of the clustering class to which the sample data belongs as the distance corresponding to the sample data, and determining the distances between other sample data in the clustering class to which the sample data belongs and the clustering center as the respective distances corresponding to other sample data in the clustering class; and then, determining the sensitivity corresponding to the sample data according to the distance corresponding to the sample data, the distances corresponding to other sample data in the cluster type to which the sample data belongs, the number of the sample data included in the cluster type to which the sample data belongs and the number of the sample data in the full-amount sample data. Thus, the sensitivity corresponding to each sample data in the full sample data is determined according to the method. And then, according to the sensitivity corresponding to the sample data and the sensitivity corresponding to each full sample data, calculating the sampling probability corresponding to the sample data.
More specifically, when determining the corresponding sensitivity of a sample data in the full sample data, the class weight may be determined according to the preset number of cluster classes; determining the distance corresponding to each sample data in the full sample data, namely determining the distance between the feature vector corresponding to each sample data and the clustering center of the clustering class to which the feature vector belongs; then, calculating a distance normalization value according to the distance corresponding to the full sample data and the number of the sample data contained in the full sample data; and then, determining the sensitivity corresponding to the sample data according to the distance corresponding to the sample data, the distances corresponding to other sample data in the cluster type to which the sample data belongs, the number of the sample data in the cluster type to which the sample data belongs and the number of the sample data in the full amount of sample data by combining the class weight and the distance normalization value.
In the above, the distance between a sample data and the cluster center of the cluster category to which it belongs generally refers to the Euclidean distance.
In order to further explain how the sampling probability p(x) corresponding to each sample data is determined, the calculation of p(x) is described below with reference to specific formulas.
1. The class weight is calculated by equation (1):

α = 16(log(k+2))  (1)

where α is the class weight and k is the preset number of cluster categories.

2. A set B_i is constructed for each cluster category: B_i denotes the set of feature vectors belonging to cluster category i, and the cluster center of B_i is noted b_i.

3. The distance normalization value is calculated by equation (2):

c_φ = (1/|X|) · ∑_{x∈X} d(x, B)²  (2)

where c_φ is the distance normalization value; X denotes the set of feature vectors corresponding to the full-size sample data, and |X| denotes the number of sample data included in the full-size sample data; x denotes one sample data in the full-size sample data, and d(x, B) denotes the distance between the feature vector corresponding to sample data x and the cluster center of the cluster category to which it belongs.

4. For each sample data, the corresponding sensitivity is calculated by equation (3):

s(x) = α · d(x, B)²/c_φ + 2α · (∑_{x′∈B_i} d(x′, B)²)/(|B_i| · c_φ) + 4|X|/|B_i|  (3)

where s(x) denotes the sensitivity corresponding to sample data x; B_i denotes the cluster category to which the feature vector corresponding to sample data x belongs, and |B_i| denotes the number of sample data included in cluster category B_i.

5. For each sample data, the corresponding sampling probability is calculated by equation (4):

p(x) = s(x) / ∑_{x′∈X} s(x′)  (4)
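Put together, equations (1) through (4) can be computed in vectorized form as in the sketch below; it assumes the `features`, `centers` and `labels` arrays from the clustering sketch above, reads d(x, B) as the Euclidean distance to the assigned cluster center, and mirrors the equations as reconstructed here.

```python
import numpy as np

def coreset_sampling_probabilities(features, centers, labels, k):
    n = len(features)
    # squared distance d(x, B)^2 of each feature vector to its cluster center
    d2 = np.sum((features - centers[labels]) ** 2, axis=1)

    alpha = 16.0 * np.log(k + 2)            # equation (1)
    c_phi = d2.sum() / n                    # equation (2)

    # per-cluster totals: sum of d(x', B)^2 over B_i, and cluster size |B_i|
    cluster_sum = np.bincount(labels, weights=d2, minlength=k)
    cluster_size = np.bincount(labels, minlength=k).astype(float)

    # equation (3): sensitivity s(x) of each sample
    s = (alpha * d2 / c_phi
         + 2.0 * alpha * cluster_sum[labels] / (cluster_size[labels] * c_phi)
         + 4.0 * n / cluster_size[labels])

    return s / s.sum()                      # equation (4): sampling probability p(x)
```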
step 204: and selecting a preset number of sample data from the plurality of sample data according to the sampling probability corresponding to each sample data as target sample data needing to be audited.
After determining the sampling probability corresponding to each sample data in the full-size sample data, the terminal device may select a preset number of sample data from the full-size sample data according to those sampling probabilities as the target sample data to be audited; that is, a preset number of target sample data are selected from the full-size sample data to form a coreset, and the overall distribution characteristics of the target sample data in the coreset are close to those of the full-size sample data.
It should be noted that the preset number may be manually set according to actual requirements, or may be calculated based on a preset confidence level and a confidence interval, and the determination method of the preset number is not limited in this application.
Specifically, when selecting the target sample data, the terminal device may select a preset number of target sample data from the full-size sample data based on a random weighting algorithm according to the sampling probability corresponding to each sample data. The sampling probability determined in step 203 is essentially the probability with which a sample data is expected to be extracted; for example, if the sampling probabilities corresponding to sample data A, B and C are 60%, 20% and 20% respectively, then when the target sample data are extracted based on the random weighting algorithm, sample data A is extracted with probability 60%, sample data B with probability 20%, and sample data C with probability 20%.
It should be understood that, in practical applications, the terminal device may extract the target sample data through other manners besides the above random weighting algorithm, and the manner of extracting the target sample data is not limited in this application.
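A sketch of the weighted draw described above, using NumPy's generator API; drawing without replacement is one reasonable reading of "selecting a preset number of sample data" and is an assumption on our part.

```python
import numpy as np

def select_target_samples(probs: np.ndarray, m: int, seed: int = 0) -> np.ndarray:
    # draw m distinct indices, each sample weighted by its sampling probability p(x)
    rng = np.random.default_rng(seed)
    return rng.choice(len(probs), size=m, replace=False, p=probs)
```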
Optionally, in order to ensure that the extracted target sample data can relatively accurately reflect the overall distribution condition of the full-size sample data, that is, ensure that the labeling accuracy corresponding to the extracted target sample data is close to the labeling accuracy corresponding to the full-size sample data, the method provided in the embodiment of the present application may further verify the extracted target sample data based on a likelihood function.
Specifically, the terminal device may determine a first likelihood function corresponding to the full-size sample data, where the first likelihood function is used to represent the labeling accuracy corresponding to the full-size sample data; determining a second likelihood function corresponding to the preset number of target sample data, wherein the second likelihood function is used for representing the marking accuracy corresponding to the extracted preset number of target sample data; then, a difference between the second likelihood function and the first likelihood function is calculated, and if the difference is within a preset threshold range, it may be determined that the preset number of target sample data extracted in steps 201 to 204 passes verification.
It should be noted that a likelihood function is, in statistics, a function of the parameters of a statistical model: given an outcome x, the likelihood function L(θ|x) of the parameter θ equals the probability that the variable X takes the value x given θ, that is, L(θ|x) = P(X = x | θ). In the method provided in this application, the likelihood function is used to represent the labeling accuracy of the sample data.
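Since both likelihood functions here boil down to labeling-accuracy estimates, the verification step can be sketched as the simple check below; the function name and the threshold value are illustrative assumptions.

```python
def verify_target_samples(full_accuracy: float, target_accuracy: float,
                          threshold: float = 0.01) -> bool:
    # accept the extracted target samples if the accuracy they imply (second
    # likelihood) differs from the full-data accuracy (first likelihood) by
    # no more than the preset threshold
    return abs(target_accuracy - full_accuracy) <= threshold
```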
The sample selection method above combines a clustering algorithm with a coreset construction algorithm to extract part of the sample data from the full-size sample data to be audited as the target sample data, that is, to extract a coreset from the full-size sample data. Because a coreset reflects the distribution characteristics of the full-size data, the target sample data selected by the method reflect the distribution characteristics of the full-size sample data to be audited; it is therefore unnecessary to audit the full-size sample data, which greatly saves the time cost and economic cost consumed in auditing, while the reliability of the sample data audit is guaranteed on the premise that the target sample data reflect the distribution characteristics of the full-size sample data to be audited.
In order to further explain the sample selection method provided in the embodiment of the present application, the method is described below by way of example, taking the full-size sample data to be picture sample data used for training an image classification model.
Referring to fig. 3, fig. 3 is a schematic view of an implementation process of the sample selection method provided in the embodiment of the present application in a scene of extracting picture sample data.
As shown in fig. 3, after acquiring a large amount of labeled picture sample data to be audited, the terminal device inputs the acquired picture sample data one by one into an Inception v3 model and obtains the feature vector output by the last fully connected layer as the feature vector corresponding to the picture sample data. The parameter values of the Inception v3 model are obtained by training on ImageNet; when the model is applied, the last softmax layer is removed, and the feature vector output by the fully connected layer before the softmax layer is obtained directly as the feature vector corresponding to the picture sample data.
After the feature vectors corresponding to all picture sample data are obtained, they are clustered with the k-means clustering algorithm, and the cluster center of each cluster category is found. Then, according to the cluster center of each cluster category, the sensitivity corresponding to each picture sample data is calculated through the coreset construction algorithm, and the sensitivities are normalized to obtain the sampling probability corresponding to each picture sample data. Finally, a preset number m of picture sample data are extracted from all picture sample data through a random weighting algorithm according to those sampling probabilities to form a coreset, and the m picture sample data are provided to the auditor for auditing the labeling quality.
The input of the coreset construction algorithm comprises X, K, B and m, where X is the set of feature vectors corresponding to all picture sample data; K is the number of cluster categories obtained after clustering; B is the set of cluster centers of the cluster categories; and m is the size of the coreset finally constructed.
The coreset construction algorithm first calculates the category weight α and then constructs B_i, where B_i is the set of feature vectors x that are close to the cluster center b_i and belong to cluster category i. Next, the distance normalization value c_φ is calculated from the distances d(x, B) corresponding to all picture sample data, where d(x, B) denotes the Euclidean distance between the feature vector corresponding to picture sample data x and the cluster center of the cluster category to which it belongs. Then, the sensitivity corresponding to each picture sample data is calculated, and the sampling probability corresponding to each picture sample data is obtained by normalizing the sensitivities. Finally, a coreset of size m is constructed according to the sampling probability corresponding to each picture sample data.
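Tying the pieces together, an end-to-end usage sketch follows; it reuses the `coreset_sampling_probabilities` helper defined in the earlier sketch, and the sizes and random features are placeholder assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

features = np.random.rand(9794, 2048)   # stand-in for X, the extracted feature vectors
K = 23                                  # number of cluster categories (example)
m = 700                                 # size of the coreset to construct (example)

kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(features)
probs = coreset_sampling_probabilities(features, kmeans.cluster_centers_,
                                       kmeans.labels_, K)   # helper defined earlier
# indices of the m target samples forming the coreset
chosen = np.random.default_rng(0).choice(len(probs), size=m,
                                         replace=False, p=probs)
```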
In order to further prove that the sample selection method provided by the embodiment of the present application can bring better technical effects, the inventors performed experiments based on the sample data extracted by the sample selection method provided by the embodiment of the present application and the sample data randomly extracted in the prior art, and the experimental processes and experimental results are specifically as follows:
First, feature extraction is performed on 9794 pieces of picture sample data to be audited using an Inception v3 model trained on ImageNet, obtaining the feature vectors corresponding to the 9794 pieces of picture sample data. Using 869 pieces of picture sample data as training data, a 23-class classifier is trained, whose structure is a 23-class softmax layer added on the fully connected layer of the Inception model. The trained classifier is used in place of a human annotator: it determines the labeling result of each of the 9794 pieces of picture sample data, fitting the labeling that annotators would perform on the picture sample data in reality, and the labeling accuracy corresponding to the 9794 pieces of picture sample data is thereby obtained.
Using the sample selection method provided in the embodiment of the present application, three coresets are extracted from the 9794 pieces of picture sample data, containing 700, 800 and 900 pieces of picture sample data respectively; meanwhile, three sample data sets containing 700, 800 and 900 pieces of picture sample data respectively are extracted from the 9794 pieces by random sampling.
The trained classifier is then used to test the two types of data sets (the sample data sets extracted by the sample selection method provided in the embodiment of the present application and the sample data sets extracted by random sampling), and the resulting labeling accuracy is compared with the previously determined labeling accuracy of the 9794 pieces of picture sample data, so as to judge which better represents the labeling accuracy of all pictures: the coreset selected by the sample selection method provided in the embodiment of the present application, or the randomly extracted sample data set.
The data shown in Tables 1, 2 and 3 below are the average values over five experiments.
The error value mean is calculated by equation (5), and the error value variance by equation (6):

error value mean = abs(full-data labeling accuracy - sample labeling accuracy)  (5)

error value variance = (full-data labeling accuracy - sample labeling accuracy)²  (6)
TABLE 1 (sample set size 700)

                    Error value mean    Error value variance
This application    0.006507811         7.06252E-05
Prior art           0.013538318         0.000297349

TABLE 2 (sample set size 800)

                    Error value mean    Error value variance
This application    0.006007518         6.42214E-05
Prior art           0.007247396         6.65562E-05

TABLE 3 (sample set size 900)

                    Error value mean    Error value variance
This application    0.00571237          4.9851E-05
Prior art           0.01093664          0.00013633
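For reference, the scores behind Tables 1 through 3 follow equations (5) and (6); the sketch below computes them, with averaging over the five experiment runs assumed from the description above.

```python
import numpy as np

def error_statistics(full_accuracy, sample_accuracies):
    # one accuracy per experiment run; errors against the full-data accuracy
    errors = full_accuracy - np.asarray(sample_accuracies)
    error_mean = np.mean(np.abs(errors))    # equation (5), averaged over runs
    error_variance = np.mean(errors ** 2)   # equation (6), averaged over runs
    return error_mean, error_variance
```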
The comparison shows that, for all three sample data set sizes, the labeling accuracy obtained by the sample selection method provided in the embodiment of the present application is closer to the labeling accuracy of the full-size sample data than that of the prior-art method, and its error value variance is smaller, indicating that the sample selection method provided in the embodiment of the present application is also more stable.
For the sample selection method described above, the present application also provides a corresponding sample selection device, so that the sample selection method described above can be applied and implemented in practice.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a sample selection apparatus 400 corresponding to the sample selection method shown in fig. 2, where the sample selection apparatus 400 includes:
a feature vector extraction module 401, configured to extract, for a plurality of sample data to be audited, a feature vector corresponding to each of the plurality of sample data;
a clustering module 402, configured to perform clustering processing on feature vectors corresponding to the multiple sample data to obtain at least one cluster category, and determine a cluster center of the at least one cluster category;
a sampling probability determining module 403, configured to determine, according to the cluster center of the at least one cluster category and the feature vector corresponding to each of the plurality of sample data, a sampling probability corresponding to each of the plurality of sample data;
a selecting module 404, configured to select, according to the sampling probabilities corresponding to the multiple sample data, a preset number of sample data from the multiple sample data, where the sample data is used as target sample data to be audited.
Alternatively, on the basis of the sample selection device shown in fig. 4, referring to fig. 5, fig. 5 is a schematic structural diagram of another sample selection device 500 provided in the embodiment of the present application. As shown in fig. 5, the sampling probability determination module 403 includes:
a sensitivity calculation submodule 501, configured to determine a sensitivity corresponding to the sample data according to the distance corresponding to the sample data, the distances corresponding to the other sample data in the cluster category to which the sample data belongs, the number of the sample data included in the cluster category to which the sample data belongs, and the number of the plurality of sample data; the distance corresponding to the sample data is used for representing the distance between the sample data and the clustering center of the clustering class to which the sample data belongs;
the sampling probability calculating submodule 502 is configured to determine the sampling probability corresponding to the sample data according to the sensitivity corresponding to the sample data and the sensitivities corresponding to the plurality of sample data.
Optionally, on the basis of the sample selection apparatus shown in fig. 5, the sensitivity calculation sub-module 501 is specifically configured to:
calculating a distance normalization value according to the respective corresponding distances of the plurality of sample data;
calculating the sensitivity corresponding to the sample data according to the class weight, the distance normalization value, the distance corresponding to the sample data, the respective distances corresponding to other sample data in the cluster class to which the sample data belongs, the number of the sample data included in the class to which the sample data belongs and the number of the plurality of sample data; the class weight is determined according to the number of the cluster classes.
Optionally, on the basis of the sample selection apparatus shown in fig. 4, the selection module 404 is specifically configured to:
and select, based on a random weighting algorithm, the preset number of sample data from the plurality of sample data according to the sampling probability corresponding to each sample data, as the target sample data.
Optionally, on the basis of the sample selection apparatus shown in fig. 4, the feature vector extraction module 401 is specifically configured to:
and, for each sample data in the plurality of sample data, input the sample data into a pre-trained convolutional neural network, and obtain the feature vector output by the last fully connected layer of the convolutional neural network as the feature vector corresponding to the sample data.
Alternatively, on the basis of the sample selection device shown in fig. 4, referring to fig. 6, fig. 6 is a schematic structural diagram of another sample selection device provided in the embodiment of the present application. As shown in fig. 6, the apparatus further includes:
a first likelihood function determining module 601, configured to determine first likelihood functions corresponding to the plurality of sample data, where the first likelihood functions are used to characterize labeling accuracy corresponding to the plurality of sample data;
a second likelihood function determining module 602, configured to determine a second likelihood function corresponding to the preset number of target sample data, where the second likelihood function is used to represent the labeling accuracy corresponding to the preset number of target sample data;
a verification module 603, configured to determine that the target sample data in the preset number passes verification if a difference between the second likelihood function and the first likelihood function is within a preset threshold range.
Optionally, on the basis of the sample selection apparatus shown in fig. 4, the plurality of sample data are used for training any one of the following types of models: an image classification model, a video classification model, a text classification model, an image segmentation model, a semantic segmentation model, a speech recognition model, and a text translation model.
The sample selection device combines a clustering algorithm with a coreset core set construction algorithm, and extracts partial sample data from the full sample data to be audited as target sample data to be audited, namely, the coreset core set is extracted from the full sample data, and the coreset core set has the characteristic of reflecting the distribution characteristics of the full data, so that the distribution characteristics of the full sample data to be audited can be reflected by a plurality of target sample data selected by the device; therefore, the full-scale sample data is not required to be audited, time cost and economic cost consumed by auditing the sample data are greatly saved, and the reliability of the audit of the sample data is ensured on the premise that the target sample data can reflect the distribution characteristics of the full-scale sample data to be audited.
The embodiment of the present application further provides a device for selecting a sample, where the device may specifically be a terminal or a server, and the device provided in the embodiment of the present application will be described below from the perspective of hardware implementation.
An apparatus is further provided in the embodiment of the present application. As shown in fig. 7, for convenience of description, only the portion related to the embodiment of the present application is shown; for specific technical details not disclosed, please refer to the method portion of the embodiment of the present application. The terminal may be any terminal device, including a computer, a tablet computer, a Personal Digital Assistant (PDA), a point-of-sale terminal (POS), a vehicle-mounted computer, and the like; the terminal being a computer is taken as an example:
fig. 7 is a block diagram illustrating a partial structure of a computer related to a terminal provided in an embodiment of the present application. Referring to fig. 7, the computer includes: radio Frequency (RF) circuit 710, memory 720, input unit 730, display unit 740, sensor 750, audio circuit 760, wireless fidelity (WiFi) module 770, processor 780, and power supply 790. Those skilled in the art will appreciate that the computer architecture shown in FIG. 7 is not intended to be limiting of computers, and may include more or fewer components than those shown, or some components in combination, or a different arrangement of components.
The memory 720 may be used to store software programs and modules, and the processor 780 performs various functional applications of the computer and data processing by operating the software programs and modules stored in the memory 720. The memory 720 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the computer, etc. Further, the memory 720 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The processor 780 is a control center of the computer, connects various parts of the entire computer using various interfaces and lines, performs various functions of the computer and processes data by operating or executing software programs and/or modules stored in the memory 720 and calling data stored in the memory 720, thereby monitoring the entire computer. Optionally, processor 780 may include one or more processing units; preferably, the processor 780 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 780.
In the embodiment of the present application, the processor 780 included in the terminal further has the following functions:
extracting, for a plurality of sample data to be audited, feature vectors corresponding to the plurality of sample data;
clustering the feature vectors corresponding to the plurality of sample data to obtain at least one cluster category, and determining a cluster center of the at least one cluster category;
determining sampling probabilities corresponding to the plurality of sample data according to the cluster center of the at least one cluster category and the feature vectors corresponding to the plurality of sample data;
and selecting, according to the sampling probability corresponding to each sample data, a preset number of sample data from the plurality of sample data as target sample data to be audited.
Optionally, the processor 780 is further configured to perform the steps of any implementation manner of the sample selection method provided in the embodiment of the present application.
Fig. 8 is a schematic structural diagram of a server provided in this embodiment, where the server 800 may have a relatively large difference due to different configurations or performances, and may include one or more Central Processing Units (CPUs) 822 (e.g., one or more processors) and a memory 832, and one or more storage media 830 (e.g., one or more mass storage devices) for storing applications 842 or data 844. Memory 832 and storage medium 830 may be, among other things, transient or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, a central processor 822 may be provided in communication with the storage medium 830 for executing a series of instruction operations in the storage medium 830 on the server 800.
The server 800 may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input/output interfaces 858, and/or one or more operating systems 841, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 8.
The CPU 822 is configured to execute the following steps:
extracting, for a plurality of sample data to be audited, feature vectors corresponding to the plurality of sample data;
clustering the feature vectors corresponding to the plurality of sample data to obtain at least one cluster category, and determining a cluster center of the at least one cluster category;
determining sampling probabilities corresponding to the plurality of sample data according to the cluster center of the at least one cluster category and the feature vectors corresponding to the plurality of sample data;
and selecting, according to the sampling probability corresponding to each of the plurality of sample data, a preset number of sample data from the plurality of sample data as target sample data to be audited.
Optionally, CPU 822 may also be configured to execute the steps of any implementation of the sample selection method in the embodiment of the present application.
The embodiments of the present application further provide a computer-readable storage medium for storing a computer program, where the computer program is configured to execute any one implementation of a sample selection method described in the foregoing embodiments.
The present application further provides a computer program product including instructions which, when run on a computer, cause the computer to perform any one of the implementations of the sample selection method described in the foregoing embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, for the specific working processes of the above-described systems, apparatuses, and units, reference may be made to the corresponding processes in the foregoing method embodiments; they are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is merely a logical function division, and other division manners may be used in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing a computer program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It should be understood that in the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of the listed items, including any combination of a single item or multiple items. For example, "at least one of a, b, or c" may represent: a; b; c; a and b; a and c; b and c; or a, b, and c, where a, b, and c may each be singular or plural.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present application.

Claims (10)

1. A method of sample selection, the method comprising:
extracting, for a plurality of sample data to be audited, feature vectors corresponding to the plurality of sample data;
clustering the feature vectors corresponding to the plurality of sample data to obtain at least one cluster category, and determining a cluster center of the at least one cluster category;
determining sampling probabilities corresponding to the plurality of sample data according to the cluster center of the at least one cluster category and the feature vectors corresponding to the plurality of sample data;
and selecting, according to the sampling probability corresponding to each of the plurality of sample data, a preset number of sample data from the plurality of sample data as target sample data to be audited.
2. The method of claim 1, wherein, for each sample data in the plurality of sample data, the sampling probability corresponding to the sample data is determined by:
determining the sensitivity corresponding to the sample data according to the distance corresponding to the sample data, the distances corresponding to the other sample data in the cluster category to which the sample data belongs, the number of sample data included in that cluster category, and the total number of the plurality of sample data; the distance corresponding to a sample data represents the distance between the sample data and the cluster center of the cluster category to which it belongs;
and determining the sampling probability corresponding to the sample data according to the sensitivity corresponding to the sample data and the sensitivities corresponding to each of the plurality of sample data.
3. The method according to claim 2, wherein determining the sensitivity corresponding to the sample data according to the distance corresponding to the sample data, the distances corresponding to the other sample data in the cluster category to which the sample data belongs, the number of sample data included in that cluster category, and the total number of the plurality of sample data comprises:
calculating a distance normalization value according to the distances corresponding to each of the plurality of sample data;
and calculating the sensitivity corresponding to the sample data according to a class weight, the distance normalization value, the distance corresponding to the sample data, the distances corresponding to the other sample data in the cluster category to which the sample data belongs, the number of sample data included in that cluster category, and the total number of the plurality of sample data; the class weight is determined according to the number of cluster categories.
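The claims fix the inputs to the sensitivity but not an exact formula. As a hedged reconstruction (our inference, not stated in the application), every enumerated element matches the k-means coreset sensitivity of Bachem et al., which this application cites among its non-patent references:

$$
s(x)=\frac{\alpha\,d(x)^{2}}{\bar{c}}
+\frac{2\alpha\sum_{x'\in C_{x}}d(x')^{2}}{|C_{x}|\,\bar{c}}
+\frac{4n}{|C_{x}|},
\qquad
\bar{c}=\frac{1}{n}\sum_{x'}d(x')^{2},
\qquad
p(x)=\frac{s(x)}{\sum_{x'}s(x')}
$$

Here $d(x)$ is the distance from sample $x$ to its cluster center, $C_{x}$ is the cluster category containing $x$, $n$ is the total number of sample data, $\bar{c}$ plays the role of the distance normalization value, and $\alpha$ is the class weight, set to $16(\ln k+2)$ for $k$ cluster categories in Bachem et al.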
4. The method according to claim 1, wherein selecting, according to the sampling probability corresponding to each of the plurality of sample data, a preset number of sample data from the plurality of sample data as target sample data to be audited comprises:
selecting, based on a weighted random sampling algorithm and according to the sampling probability corresponding to each of the plurality of sample data, the preset number of sample data from the plurality of sample data as the target sample data.
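For illustration only (the probabilities and the seed below are hypothetical), this weighted random draw maps directly onto NumPy's weighted sampling without replacement:

```python
import numpy as np

# Example sampling probabilities for five sample data (must sum to 1).
probs = np.array([0.05, 0.40, 0.15, 0.30, 0.10])

rng = np.random.default_rng(seed=0)
# Draw a preset number (here 3) of distinct samples, weighted by probs.
selected = rng.choice(len(probs), size=3, replace=False, p=probs)
print(selected)  # indices of the target sample data to audit
```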
5. The method according to claim 1, wherein extracting the feature vectors corresponding to each of the plurality of sample data comprises:
for each sample data in the plurality of sample data, inputting the sample data into a pre-trained convolutional neural network, and obtaining the feature vector output by the last fully-connected layer of the convolutional neural network as the feature vector corresponding to the sample data.
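A minimal PyTorch sketch of this step, assuming ResNet-18 as a stand-in for the unnamed pre-trained convolutional neural network and torchvision ≥ 0.13 for the `weights` argument; a forward hook captures the output of the network's last fully-connected layer:

```python
import torch
import torchvision.models as models

# ResNet-18 is our stand-in; the application does not name an architecture.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

captured = {}
def grab_fc_output(module, inputs, output):
    # Record the output of the last fully-connected layer (per claim 5).
    captured["vec"] = output.detach()

model.fc.register_forward_hook(grab_fc_output)

# Stand-in for one preprocessed sample image: batch of 1, 3 x 224 x 224.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    model(x)

feature_vector = captured["vec"]  # shape (1, 1000) for this stand-in model
```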
6. The method of claim 1, further comprising:
determining a first likelihood function corresponding to the plurality of sample data, wherein the first likelihood function represents the labeling accuracy corresponding to the plurality of sample data;
determining a second likelihood function corresponding to the preset number of target sample data, wherein the second likelihood function represents the labeling accuracy corresponding to the preset number of target sample data;
and if the difference between the second likelihood function and the first likelihood function is within a preset threshold range, determining that the preset number of target sample data passes verification.
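One simple reading of this check (our interpretation; the claim does not fix the likelihood form) compares the average log-likelihood of the annotated labels over the full sample set against the same quantity over the selected subset:

```python
import numpy as np

def mean_log_likelihood(p_true_label):
    """Average log-likelihood of annotated labels under a model --
    one way to quantify the 'labeling accuracy' of claim 6."""
    return float(np.mean(np.log(np.clip(p_true_label, 1e-12, 1.0))))

# Hypothetical model probabilities of the annotated label per sample.
p_all = np.array([0.9, 0.8, 0.95, 0.7, 0.85, 0.6])  # all sample data
p_sel = np.array([0.8, 0.9, 0.7])                   # selected target samples

threshold = 0.1  # hypothetical preset threshold range
diff = abs(mean_log_likelihood(p_sel) - mean_log_likelihood(p_all))
print("verification passed:", diff <= threshold)
```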
7. The method of any one of claims 1 to 6, wherein the plurality of sample data is used to train any one of the following types of model: an image classification model, a video classification model, a text classification model, an image segmentation model, a semantic segmentation model, a speech recognition model, or a text translation model.
8. A sample selection device, the device comprising:
a feature vector extraction module, configured to extract feature vectors corresponding to a plurality of sample data to be audited;
a clustering module, configured to cluster the feature vectors corresponding to the plurality of sample data to obtain at least one cluster category, and to determine a cluster center of the at least one cluster category;
a sampling probability determining module, configured to determine, according to the cluster center of the at least one cluster category and the feature vectors corresponding to the plurality of sample data, sampling probabilities corresponding to the plurality of sample data;
and a selection module, configured to select, according to the sampling probability corresponding to each of the plurality of sample data, a preset number of sample data from the plurality of sample data as target sample data to be audited.
9. An electronic device, comprising: a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to perform the sample selection method of any one of claims 1 to 7 in accordance with the computer program.
10. A computer-readable storage medium for storing a computer program for performing the sample selection method of any one of claims 1 to 7.
CN201911310273.5A 2019-12-18 2019-12-18 Sample selection method, device, equipment and storage medium Active CN111062440B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911310273.5A CN111062440B (en) 2019-12-18 2019-12-18 Sample selection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111062440A true CN111062440A (en) 2020-04-24
CN111062440B CN111062440B (en) 2024-02-02

Family

ID=70302270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911310273.5A Active CN111062440B (en) 2019-12-18 2019-12-18 Sample selection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111062440B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018137358A1 (en) * 2017-01-24 2018-08-02 北京大学 Deep metric learning-based accurate target retrieval method
CN108710894A (en) * 2018-04-17 2018-10-26 中国科学院软件研究所 A kind of Active Learning mask method and device based on cluster representative point
CN110569856A (en) * 2018-08-24 2019-12-13 阿里巴巴集团控股有限公司 sample labeling method and device, and damage category identification method and device
CN109389162A (en) * 2018-09-28 2019-02-26 北京达佳互联信息技术有限公司 Sample image screening technique and device, electronic equipment and storage medium
CN110288468A (en) * 2019-04-19 2019-09-27 平安科技(深圳)有限公司 Data characteristics method for digging, device, electronic equipment and storage medium
CN110287324A (en) * 2019-06-27 2019-09-27 成都冰鉴信息科技有限公司 A kind of data dynamic label placement method and device for coarseness text classification
CN110413856A (en) * 2019-08-05 2019-11-05 腾讯科技(深圳)有限公司 Classification annotation method, apparatus, readable storage medium storing program for executing and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
OLIVIER BACHEM et al.: "Practical Coreset Constructions for Machine Learning", arXiv, pages 1-39 *
QIAN Pengjiang: "Research and Application of Clustering Methods for Large-Scale Datasets", China Doctoral Dissertations Full-text Database, Information Science and Technology, pages 138-75 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111242245A (en) * 2020-04-26 2020-06-05 杭州雄迈集成电路技术股份有限公司 Design method of classification network model of multi-class center
CN111242245B (en) * 2020-04-26 2020-10-20 杭州雄迈集成电路技术股份有限公司 Design method of classification network model of multi-class center
CN113095397A (en) * 2021-04-03 2021-07-09 国家计算机网络与信息安全管理中心 Image data compression method based on hierarchical clustering method
CN117194791A (en) * 2023-09-18 2023-12-08 上海鱼尔网络科技有限公司 Sampling method, sampling device, computer equipment and storage medium for recommendation algorithm



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40021984; Country of ref document: HK)
GR01 Patent grant