CN116306977A - Sample selection method, device, equipment and storage medium - Google Patents

Sample selection method, device, equipment and storage medium

Info

Publication number
CN116306977A
Authority
CN
China
Prior art keywords
unlabeled
disturbance
sample
samples
vector
Prior art date
Legal status
Pending
Application number
CN202310273873.9A
Other languages
Chinese (zh)
Inventor
李兴建
吴昊宇
熊昊一
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310273873.9A
Publication of CN116306977A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/091 Active learning

Abstract

The disclosure provides a sample selection method, apparatus, device, and storage medium, relating to the field of data processing, and in particular to the fields of artificial intelligence, big data, and machine learning. The specific implementation scheme is as follows: a plurality of unlabeled samples is obtained. For any unlabeled sample of the plurality of unlabeled samples, a disturbance vector of the unlabeled sample is determined, where the disturbance vector is used to characterize the degree to which the unlabeled sample is affected by noise; the larger the disturbance vector of the unlabeled sample, the higher the complexity of the unlabeled sample. A target unlabeled sample is then selected from the plurality of unlabeled samples according to the disturbance vector of each unlabeled sample, where the target unlabeled sample includes unlabeled samples with different disturbance vectors.

Description

Sample selection method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing, and in particular to the fields of artificial intelligence, big data, and machine learning, and more particularly to a sample selection method, apparatus, and device, and a storage medium.
Background
When training a model with training data, the training data may in some cases include a large number of unlabeled samples. Appropriate unlabeled samples need to be selected from this large pool and labeled, so that model training can be performed on labeled data.
Disclosure of Invention
The present disclosure provides a method, apparatus, device and storage medium for selecting samples.
According to an aspect of the present disclosure, there is provided a sample selection method including:
a plurality of unlabeled samples is obtained. For any unlabeled sample of the plurality of unlabeled samples, a disturbance vector of the unlabeled sample is determined, where the disturbance vector is used to characterize the degree to which the unlabeled sample is affected by noise; the larger the disturbance vector of the unlabeled sample, the higher the complexity of the unlabeled sample. A target unlabeled sample is then selected from the plurality of unlabeled samples according to the disturbance vector of each unlabeled sample, where the target unlabeled sample includes unlabeled samples with different disturbance vectors.
According to another aspect of the present disclosure, there is provided a sample selection apparatus including:
and the acquisition unit is used for a plurality of unlabeled samples.
The determining unit is used for determining a disturbance vector of an unlabeled sample aiming at any unlabeled sample in the plurality of unlabeled samples, wherein the disturbance vector of the unlabeled sample is used for representing the influence degree of noise on the unlabeled sample. The larger the disturbance vector of the unlabeled sample is, the higher the complexity of the unlabeled sample is.
And the selection unit is used for selecting a target unlabeled sample from the unlabeled samples according to the disturbance vector of each unlabeled sample in the unlabeled samples, wherein the target unlabeled sample comprises unlabeled samples with different disturbance vectors.
According to still another aspect of the present disclosure, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the methods of the first aspect.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions, comprising:
the computer instructions are for causing a computer to perform any of the methods of the first aspect.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising:
a computer program which, when executed by a processor, performs any of the methods of the first aspect.
According to the technical solution of the present disclosure, representative unlabeled samples can be selected from a plurality of unlabeled samples, which reduces the amount of data to be labeled and saves data labeling time. A model can then be trained on the selected samples once they are labeled, so the performance of the model can be improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic diagram of an application scenario of data annotation provided in an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a method of selecting a sample provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of another sample selection method provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a method for determining a disturbance model provided by an embodiment of the present disclosure;
FIG. 5 is a schematic illustration of another sample selection method provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a sample selection apparatus provided by an embodiment of the present disclosure;
fig. 7 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In an environment where deep learning technology is applied very widely, labeled samples (which may also be referred to as annotated samples, labeled sample data, or labeled data) play a vital role in training a model. Although obtaining unlabeled samples has become increasingly easy in some cases, labeling them often incurs significant human-resource and time costs. Active learning technology can select, from the unlabeled samples, those that are more valuable for improving the model training effect. Labeling the unlabeled samples selected by active learning and adding them to the model's training yields a better improvement in model performance.
In some application scenarios, for example an image recognition scenario, images/image data must first be labeled for model training, and the labeled images/image data are then used to train a model with an image recognition function. However, during model training, a large amount of image/image data is required, and that data may contain many unlabeled images/image data with little variability among them; labeling each of these near-duplicate images/image data would require a great deal of time and effort.
In some embodiments, samples worth labeling can be selected from a plurality of unlabeled samples based on an information-entropy-based active learning sample selection method. In this method, the information entropy of the model's output is calculated, and unlabeled samples are selected according to the entropy value. The selected unlabeled samples may be data for which the current model cannot accurately obtain a result: for example, after the unlabeled sample is input to the current model, no output is obtained, the output is inaccurate, or multiple results are output.
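For illustration, a minimal sketch of this entropy-based selection (Python; the function names and the `budget` parameter are illustrative assumptions, and the model is assumed to output a per-class probability distribution for each sample):

```python
import numpy as np

def entropy_scores(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy of each row of class probabilities; higher
    entropy means the current model is less certain about the sample."""
    eps = 1e-12  # guard against log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_by_entropy(probs: np.ndarray, budget: int) -> np.ndarray:
    """Indices of the `budget` unlabeled samples with the highest entropy."""
    return np.argsort(-entropy_scores(probs))[:budget]
```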
In still other embodiments, unlabeled samples may be selected for labeling from a plurality of unlabeled samples by a random sample selection method.
However, the above method has the following problems:
the first problem is unstable performance.
When the redundancy of the unlabeled samples is very high (i.e., many similar unlabeled samples exist among them), the data selected by the information-entropy-based active learning method improves model performance more than data selected by the random sample selection method.
This is because the unlabeled samples selected by the random method may contain many similar items. Training on such similar data means the training data is of a single type, and the resulting model performs poorly.
Conversely, when the redundancy of the unlabeled samples is low (i.e., few similar unlabeled samples exist among them), the data selected by the random sample selection method improves model performance more than data selected by the information-entropy-based active learning method.
This is because, when redundancy is low, the information-entropy-based method selects fewer types of data, so the selected data is not very representative, and the model trained on it also performs poorly.
As can be seen from the above, both selection methods have shortcomings and unstable performance when facing unlabeled samples of different redundancy.
The second problem is that dirty data is easily selected.
Dirty data may refer to outliers, anomalies, or noisy data, among other things.
In the field of active learning, one often faces data generated in actual scenarios rather than publicly released datasets. Because the probability that dirty data belongs to any particular class is relatively low, its predicted class distribution is close to uniform and its information entropy is high, so when data is selected with the information-entropy-based active learning method, the probability that dirty data is selected is relatively high. However, dirty data does not help model training or improve its effectiveness.
In view of this, the embodiments of the present application provide a sample selection method: for each unlabeled sample, a disturbance vector reflecting the sample's complexity is determined from its initial feature vector and disturbance feature vector, and then, according to the disturbance vectors of the plurality of unlabeled samples, samples of different complexity are selected as the unlabeled samples for subsequent model training.
The above method may be applied to a selection device. In combination with the application scenario shown in fig. 1, the selection device may screen representative image data from a plurality of image data to be annotated, annotate the screened image data to obtain annotated image data, and then use the annotated image data for model training.
The selection device in fig. 1 may be an electronic device or a server. The electronic device may be a tablet, a cell phone, a desktop computer, a laptop, a handheld computer, a notebook, an ultra-mobile personal computer (UMPC), a netbook, a cellular telephone, a personal digital assistant (PDA), an augmented reality (AR) device, a virtual reality (VR) device, or the like.
The server can be an entity server or a cloud server.
In some scenarios, the server may be a separate server, or may be a server cluster made up of multiple servers.
The following describes in detail a sample selection method provided by an embodiment of the present disclosure with reference to the accompanying drawings.
As shown in fig. 2, a method for selecting a sample provided by an embodiment of the present disclosure may include: s201 to S203.
S201, obtaining a plurality of unlabeled samples.
An unlabeled sample is a sample that has no label or whose type has not been annotated. For example, an unlabeled sample may be image/image data, point cloud data, search data (e.g., text or character strings), and the like. Unlabeled samples may also be referred to as unlabeled data, unlabeled sample data, and so on.
In one application scenario, when the unlabeled sample is image/image data, the unlabeled sample may be an image (such as a face image or an article image) in the image classification field (such as face recognition or image recognition). For example, the unlabeled sample may be image/image data acquired by an image acquisition device, such as a video camera or a still camera.
In another application scenario, when the unlabeled sample is point cloud data, the unlabeled sample may be data in the driving field (e.g., automatic driving). For example, an unlabeled sample may be point cloud data of a road collected by a vehicle's radar system; such point cloud data may include point cloud data of vehicles on the road, point cloud data of the road surface state, and the like.
In still another application scenario, when the unlabeled sample is search data, the unlabeled sample may be data in the search field (e.g., keyword matching, search information matching). For example, unlabeled samples may be words, phrases, character strings, etc. entered by a user in a search engine.
S202, for any unlabeled sample of the plurality of unlabeled samples, determining a disturbance vector of the unlabeled sample.
The disturbance vector may also be referred to as a disturbance value, a disturbance change value, or the like. The disturbance vector of a sample can be used to characterize the extent to which the sample is affected by noise: the larger the disturbance vector of a sample, the higher the complexity of that sample.
In a possible implementation manner, after the selecting device in fig. 1 obtains a plurality of unlabeled samples, for each unlabeled sample, the selecting device may determine a disturbance vector of the unlabeled sample according to an initial feature vector and a disturbance feature vector of the unlabeled sample. For example, the disturbance vector for an unlabeled sample may be the difference between the initial feature vector for the unlabeled sample and the disturbance feature vector.
Because the difference between the initial feature vector and the disturbance feature vector of an unlabeled sample reflects its complexity, the disturbance vector of the unlabeled sample can be obtained quickly and accurately from these two vectors.
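A minimal sketch of this computation (Python; it assumes that "larger disturbance vector" is measured by the Euclidean norm of the difference, and the names are illustrative):

```python
import numpy as np

def disturbance_score(initial_vec: np.ndarray, disturbed_vec: np.ndarray) -> float:
    """Magnitude of one sample's disturbance vector (the difference between
    its initial and disturbance feature vectors); larger means more complex."""
    disturbance_vector = initial_vec - disturbed_vec
    return float(np.linalg.norm(disturbance_vector))
```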
Wherein the initial feature vector of the sample may be used to characterize feature values corresponding to a plurality of elements of the sample.
For example, the initial feature vector of an image may refer to the red, green, and blue intensity values of the image's pixels. Suppose the image is 64×64 pixels and each pixel is represented by the intensity values of the three colors red, green, and blue. The initial feature vector of the image may then include conversions of three 64×64 matrices (e.g., each matrix may be converted into 1×64 or 64×1 arrays), where each 64×64 matrix corresponds to the intensity values of one color of the image.
For another example, the initial feature vector of the point cloud data may be coordinate data of a plurality of valid points of the point cloud data. The plurality of valid points of the point cloud data may be point clouds other than noise data in the point cloud data. For example, if the point cloud data is the point cloud data of a sign on a road, the noise data may refer to the point cloud data of other objects (such as vehicles, road surfaces, etc.) on the road other than the sign.
For another example, the initial feature vector of the character string may be a vector composed of feature numbers at respective positions of the character string. For example, the feature vector of the string abcdaabcab may be (0,0,0,0,1,1,2,3,1,2).
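The example values coincide with the classic prefix function (the failure function from string matching), in which the feature number at each position is the length of the longest proper prefix that is also a suffix ending there; the disclosure does not name the scheme, so the following Python sketch is offered under that assumption:

```python
def prefix_function(s: str) -> list[int]:
    """pi[i] = length of the longest proper prefix of s[:i+1]
    that is also a suffix of s[:i+1]."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]  # fall back to the next shorter border
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi

assert prefix_function("abcdaabcab") == [0, 0, 0, 0, 1, 1, 2, 3, 1, 2]
```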
In some scenarios, after the initial feature vectors and disturbance feature vectors of the plurality of unlabeled samples are obtained, the initial feature vectors can be spliced into an initial feature matrix of the plurality of unlabeled samples, and the disturbance feature vectors can be spliced into a disturbance feature matrix of the plurality of unlabeled samples, with the unlabeled samples in the same order in both matrices. A disturbance vector matrix, containing the disturbance vector of each unlabeled sample, can then be determined from the initial feature matrix and the disturbance feature matrix. Compared with computing each unlabeled sample's disturbance vector separately from its own initial feature vector and disturbance feature vector, splicing the feature vectors allows the disturbance vectors of all unlabeled samples to be calculated at once, which improves processing efficiency.
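A sketch of the spliced, one-pass computation (Python; names are illustrative, and each element of the input lists is assumed to be a per-sample feature vector in the same sample order):

```python
import numpy as np

def batch_disturbance_vectors(initial_vecs, disturbed_vecs):
    """Splice per-sample feature vectors into matrices (same sample order)
    and compute every disturbance vector with one subtraction."""
    initial_matrix = np.stack(initial_vecs)      # shape: (num_samples, dim)
    disturbed_matrix = np.stack(disturbed_vecs)  # shape: (num_samples, dim)
    return initial_matrix - disturbed_matrix     # row i = disturbance vector of sample i
```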
In one example, the selection device may input the plurality of unlabeled samples into the initial model in batches to obtain each sample's initial feature vector, and input the plurality of unlabeled samples into the disturbance model in batches to obtain each sample's disturbance feature vector.
In this example, the labeled samples used to train the initial model and the plurality of unlabeled samples belong to the same scene and the same data domain as the unlabeled samples of S201, for example road image/image data, road point cloud data, face image/image data, and the like.
It should be noted that, in the embodiments of the present application, after simple data is input into the disturbance model, the obtained result changes little, whereas after complex data is input into the disturbance model, the obtained result changes greatly. This is because simple data contains few elements and has relatively obvious characteristics, so even under disturbance processing the overall result changes little; complex data behaves in the opposite way.
For example, taking the data to be image/image data: a simple image includes few objects (e.g., its elements include only a face) and is easily recognized by the model, so even after its objects undergo disturbance processing, the processed image is still easily recognized. A complex image includes many objects (e.g., its elements include faces, background, and other objects) and is not easily recognized by the model, so after disturbance processing, the processed image is even harder to recognize.
S203, selecting a target unlabeled sample from the plurality of unlabeled samples according to the disturbance vector of each unlabeled sample in the plurality of unlabeled samples.
The target unlabeled sample is a representative sample, for example, the target unlabeled sample may include a plurality of data with different disturbance vectors.
In one possible implementation, the selecting device may select the target unlabeled sample from the plurality of unlabeled samples according to a preset screening algorithm.
The preset screening algorithm can be set as needed. For example, it may be the K-Center Greedy algorithm, the K-Center algorithm, etc., without limitation.
In an example, taking the K-Center Greedy algorithm as the preset screening algorithm, the selection device may define the labeled samples as a set S and divide the plurality of unlabeled samples into a plurality of clusters according to the values of their disturbance vectors. The selection device may then select the target unlabeled samples through multiple rounds of iterative computation. Compared with other active learning algorithms, the unlabeled samples selected by the K-Center Greedy algorithm have better data richness and information content, so the model training effect can be improved more.
A cluster may include one or more unlabeled samples whose disturbance vectors fall within a preset range; different clusters correspond to different preset ranges.
For example, the selection device may divide the plurality of unlabeled samples into K clusters, K being a positive integer. In each round of iterative computation, the selection device takes the unlabeled sample in each cluster with the largest distance from the set S as target data and adds that target data to S; in the next round of iterative computation, target data is determined again in each cluster in the same way. The selection device thus obtains a plurality of target data and may take them as the target unlabeled samples.
The distance between an unlabeled sample and the set S may refer to the minimum of the distances between the unlabeled sample and the labeled samples in S. The distance may be a Euclidean distance.
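A sketch of the core K-Center Greedy iteration described above (Python; it omits the per-cluster partitioning, which can be applied on top, assumes a non-empty labeled set S, and uses illustrative names):

```python
import numpy as np

def k_center_greedy(unlabeled: np.ndarray, labeled: np.ndarray, budget: int) -> list[int]:
    """Greedily pick `budget` rows of `unlabeled`, each maximizing its
    minimum Euclidean distance to the growing labeled set S."""
    # minimum distance of every unlabeled sample to the current set S
    min_dist = np.min(
        np.linalg.norm(unlabeled[:, None, :] - labeled[None, :, :], axis=-1), axis=1
    )
    selected = []
    for _ in range(budget):
        idx = int(np.argmax(min_dist))  # the sample farthest from S
        selected.append(idx)
        # adding this sample to S can only shrink the other minimum distances
        dist_to_new = np.linalg.norm(unlabeled - unlabeled[idx], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
        min_dist[idx] = -np.inf  # never re-select the same sample
    return selected
```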
In another possible implementation, the selection device may take any unlabeled sample as a reference sample, compute the distance between each unlabeled sample's disturbance vector and the reference sample's disturbance vector, and place unlabeled samples with equal or similar distances into the same set, thereby obtaining a plurality of sets. Finally, the selection device may select one or more unlabeled samples from each set as the target unlabeled samples.
Unlabeled samples with similar distances are those whose distances to the reference sample differ by less than a preset value.
Based on the technical solution of fig. 2, after a plurality of unlabeled samples are obtained, the disturbance vector of each unlabeled sample can be determined. Because a sample's disturbance vector reflects its complexity, unlabeled samples of different complexity can be selected from the plurality of unlabeled samples based on their disturbance vectors. Unlabeled samples of the same or similar complexity may belong to the same class of data; therefore, the selected target unlabeled samples are representative data among the plurality of unlabeled samples, and in subsequent model training they can better improve the model's performance.
In some examples, to obtain the initial feature vector and disturbance feature vector of each unlabeled sample, and from them the sample's disturbance vector, the selection device may input the unlabeled data into the initial model and the disturbance model respectively. For details, refer to the description of the embodiments below.
In some embodiments, as shown in fig. 3, the method for selecting a sample provided in the embodiments of the present application may include S301 to S305.
S301, acquiring a plurality of unlabeled samples.
For S301, refer to the description of S201 above; it is not repeated here.
S302, inputting the unlabeled sample into the initial model to obtain an initial feature vector of the unlabeled sample.
The initial model has the function of determining a plurality of feature values of a sample. It may be a pre-trained model: for example, it may be trained on a plurality of labeled samples with a preset algorithm, or obtained by self-supervised learning on a plurality of unlabeled samples. The preset algorithm may be set as needed, for example a machine learning algorithm (such as a convolutional neural network algorithm), without limitation; self-supervised learning follows the prior art and is not described in detail here. The disturbance model is the model obtained by performing disturbance processing on the initial model.
S303, performing disturbance processing on the initial model to obtain a disturbance model, and inputting the unlabeled sample into the disturbance model to obtain a disturbance feature vector of the unlabeled sample.
Wherein the disturbance model has the function of determining a plurality of eigenvalues of the samples affected by noise.
In a possible implementation manner, after the selection device acquires the initial model, the selection device may adjust target parameters of the initial model, and determine the adjusted initial model as the disturbance model. Thus, the disturbance model can be obtained rapidly.
The target parameters may be those parameters of the initial model that are used to extract the feature vector of an input sample.
In one example, when the initial model is trained with a convolutional neural network algorithm, the model obtained from that algorithm (which may also be referred to as a convolutional neural network model, i.e., the initial model) may include a plurality of convolutional layers and a fully connected layer. Adjusting the parameters of the initial model may then refer to adjusting the parameters of one or more of the plurality of convolutional layers.
For example, as shown in fig. 4, the initial model may include three convolutional layers (convolutional layer 1, convolutional layer 2, and convolutional layer 3), and the target parameter may be a parameter in any one or more of convolutional layers 1-3. Because the convolutional layers of the initial model extract the feature vector of the input sample, adjusting the parameters in the convolutional layers yields the parameter-adjusted initial model, i.e., the disturbance model, quickly and conveniently.
For example, the selection device may increase or decrease the parameters in a convolutional layer by a preset value or by a preset ratio. The preset value and the preset ratio can be set as needed; for example, the preset value may be a fixed value, and the preset ratio may be 1%, 5%, or the like. Other values are also possible, without limitation. Adjusting the parameters of the convolutional layer in this way amounts to adding noise such as Gaussian noise to the initial model, so the disturbance model is obtained from the adjusted parameters simply and conveniently.
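A sketch of this parameter adjustment, assuming a PyTorch-style convolutional model; the 1% ratio follows the example above, and the function name is illustrative:

```python
import copy
import torch
import torch.nn as nn

def make_disturbance_model(initial_model: nn.Module, ratio: float = 0.01) -> nn.Module:
    """Copy the initial model and adjust its convolutional-layer weights
    by a preset ratio (1% here), yielding the disturbance model."""
    disturbed = copy.deepcopy(initial_model)
    with torch.no_grad():
        for module in disturbed.modules():
            if isinstance(module, nn.Conv2d):
                module.weight.mul_(1.0 + ratio)  # increase weights by the preset ratio
                # Alternative: inject Gaussian noise of comparable scale, e.g.
                # module.weight.add_(ratio * torch.randn_like(module.weight))
    return disturbed
```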
S304, determining a disturbance vector of any unlabeled sample according to the initial feature vector and the disturbance feature vector of the unlabeled sample aiming at any unlabeled sample in the plurality of unlabeled samples.
S305, selecting a target unlabeled sample from the unlabeled samples according to the disturbance vector of each unlabeled sample in the unlabeled samples.
For S304 and S305, refer to the descriptions of S202 and S203 above; they are not repeated here.
Based on the technical solution of fig. 3, to determine the disturbance vector of an unlabeled sample, the initial model is obtained first and then subjected to disturbance processing to obtain the disturbance model. The selection device can thus obtain the unlabeled sample's initial feature vector through the initial model and its disturbance feature vector through the disturbance model, and then obtain the sample's disturbance vector from the initial feature vector and the disturbance feature vector.
In some embodiments, to ensure that the models operate properly, the selection device may instantiate them after acquiring the initial model and the disturbance model.
Instantiating a model may mean that the selection device allocates running memory for the model, runs the model, and stores the model's running results.
In still other embodiments, after the initial model and the disturbance model are acquired, the models may be set to an evaluation mode to ensure that their functions execute normally. In the evaluation mode, a model has the function of determining/outputting the feature values of a sample.
In one example, the initial model is in an evaluation mode, and when the input of the initial model is an unlabeled sample, the output of the initial model is an initial feature vector of the unlabeled sample.
In yet another example, the perturbation model is in the evaluation mode, and when the input of the perturbation model is an unlabeled sample, the output of the perturbation model is a perturbation feature vector of the unlabeled sample.
It should be noted that, in the embodiment of the present disclosure, the mode of the model may further include a training mode. When the model is in the training mode, the model may be trained using the input data.
In one scenario, the model is a convolutional neural network model. When the model is in the training mode, the data transmitted between convolutional layers needs to be normalized so that the processed data has a variance of 1. When the model is in the evaluation mode, the data transmitted between the convolutional layers does not need to be normalized.
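Continuing the PyTorch-style sketch above, setting both models to evaluation mode before feature extraction might look as follows (variable names are illustrative):

```python
# Evaluation mode: layers such as batch normalization reuse stored statistics
# instead of renormalizing each batch, so feature extraction is deterministic.
initial_model.eval()
disturbance_model.eval()
with torch.no_grad():
    initial_vecs = initial_model(unlabeled_batch)        # initial feature vectors
    disturbed_vecs = disturbance_model(unlabeled_batch)  # disturbance feature vectors
```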
In one possible implementation, the selection device may adjust the mode of the model in response to the instruction.
Wherein the instructions may be for controlling a mode of the model. The instruction may be an instruction entered through an input device (e.g., a keyboard) of the selection apparatus. For example, the instruction may include a code or a string of characters, or the like.
In still other embodiments, after obtaining the target unlabeled samples, the selection device further trains the initial model according to the target unlabeled samples to obtain a trained model. For details, refer to the descriptions of S504 and S505 below; they are not repeated here.
In some embodiments, as shown in fig. 5, the method for selecting a sample provided in the embodiments of the present application may include S501 to S505.
S501, obtaining a plurality of unlabeled samples.
S502, determining a disturbance value of any unlabeled sample in the plurality of unlabeled samples.
S503, selecting a target unlabeled sample from the unlabeled samples according to the disturbance value of each unlabeled sample in the unlabeled samples.
The descriptions of S501 to S503 may be referred to as S201 to S203, and will not be repeated.
S504, labeling the target unlabeled sample to obtain labeled data.
The labeling of the target unlabeled sample may refer to setting a label for the target unlabeled sample or determining a type to which the target unlabeled sample belongs.
In one possible implementation manner, the selecting device may respond to the labeling operation to label the target unlabeled sample, so as to obtain labeled data.
The following describes labeling for each of the application scenarios introduced in S201.
In an application scenario, when the unlabeled sample is image data, labeling the target unlabeled sample may refer to setting an attribute tag of an image for the image data. For example, the attribute tags of the image may include the gender, skin tone, age, etc. of the faces in the image, or may include the category, color, size, etc. of the objects in the image.
In still another application scenario, when the unlabeled sample is point cloud data, labeling the target unlabeled sample may refer to setting a type tag to which the point cloud data belongs. For example, the type tag to which the point cloud data belongs may include the size, the number, the kind (e.g., car, bus, etc.), the color, etc. of the vehicles on the road, or may include the number of pedestrians on the road, the vehicle density, the traffic accident, etc.
In still another application scenario, when the unlabeled sample is search data, labeling the target unlabeled sample may refer to setting a corresponding search result for the search data. For example, search results for search data may include web pages, articles, links, and the like.
S505, setting the initial model to the training mode, and retraining the initial model with the labeled data in the training mode to obtain a trained initial model.
The training mode and the setting manner of the training mode may refer to the description in the foregoing embodiments, which is not repeated.
In one example, after obtaining the labeled data, the selection device may iteratively train the initial model according to a preset algorithm to obtain the trained initial model. The preset algorithm may be consistent with the algorithm used to train the initial model, for example a convolutional neural network algorithm.
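Continuing the same PyTorch-style sketch, S505 might look as follows (the optimizer, loss, `num_epochs`, and `labeled_loader` are illustrative choices not specified by the disclosure):

```python
# S505 sketch: switch the initial model back to training mode and retrain it
# on the newly labeled data.
initial_model.train()
optimizer = torch.optim.SGD(initial_model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(num_epochs):              # iterative training
    for inputs, labels in labeled_loader:    # loader over the labeled target samples
        optimizer.zero_grad()
        loss = loss_fn(initial_model(inputs), labels)
        loss.backward()
        optimizer.step()
```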
Based on the embodiment of fig. 5, compared with labeling every unlabeled sample, labeling only the target unlabeled samples among the plurality of unlabeled data reduces the workload and improves labeling efficiency. In addition, because the target unlabeled samples are representative data among the plurality of unlabeled samples, training on them can improve the performance of the model.
In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user's personal information comply with the relevant laws and regulations and do not violate public order and good morals.
The foregoing description of the embodiments of the present disclosure has been presented primarily in terms of computer apparatus. It will be appreciated that, in order to carry out the functions described above, the computer device includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative method steps described in connection with the embodiments disclosed herein may be implemented as hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer-software-driven hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The embodiments of the present disclosure may divide the functional modules or functional units according to the above method examples; for example, each functional module or functional unit may be divided corresponding to one function, or two or more functions may be integrated in one processing module. The integrated module may be implemented in hardware, or in a software functional module or functional unit. The division of modules or units in the embodiments of the present disclosure is merely a logical function division; other division manners may be used in actual implementations.
Fig. 6 is a schematic structural diagram of a sample selecting device according to an embodiment of the disclosure. The sample selection device may include: an acquisition unit 601, a determination unit 602, and a selection unit 603.
The obtaining unit 601 is configured to obtain a plurality of unlabeled samples.
The determining unit 602 is configured to determine, for any unlabeled sample of the plurality of unlabeled samples, a disturbance vector of the unlabeled sample, where the disturbance vector of the unlabeled sample is used to characterize a degree to which the unlabeled sample is affected by noise. The larger the disturbance vector of the unlabeled sample is, the higher the complexity of the unlabeled sample is.
The selection unit 603 is configured to select a target unlabeled sample from the plurality of unlabeled samples according to the disturbance vector of each unlabeled sample in the plurality of unlabeled samples, where the target unlabeled sample includes unlabeled samples with different disturbance vectors.
Optionally, the determining unit 602 is specifically configured to: determine the disturbance vector of the unlabeled sample according to the initial feature vector and the disturbance feature vector of the unlabeled sample. The disturbance vector of the unlabeled sample includes elements corresponding to a plurality of elements of the unlabeled sample. The initial feature vector of the unlabeled sample is used for characterizing the feature values corresponding to the plurality of elements of the unlabeled sample, and the disturbance feature vector of the unlabeled sample is used for characterizing the feature values corresponding to the plurality of elements of the scrambled unlabeled sample.
Optionally, as shown in fig. 6, the apparatus may further include a processing unit 604, configured to input the unlabeled sample into the initial model obtained by training in advance to obtain an initial feature vector of the unlabeled sample, perform disturbance processing on the initial model to obtain a disturbance model, and input the unlabeled sample into the disturbance model to obtain a disturbance feature vector of the unlabeled sample. The initial model has a function of determining a plurality of characteristic values of the sample.
Optionally, the processing unit 604 is specifically configured to adjust a target parameter of the initial model and determine the adjusted initial model as the disturbance model, where the target parameter includes a parameter, among the parameters of the initial model, that is used to extract the feature vector of an input sample of the initial model.
Optionally, the initial model comprises a plurality of convolution layers, and the target parameter comprises a parameter of one or more of the plurality of convolution layers.
Optionally, as shown in fig. 6, the apparatus further includes a stitching unit 605, configured to stitch initial feature vectors of the plurality of unlabeled samples to obtain initial feature matrices of the plurality of unlabeled samples, and stitch disturbance feature vectors of the plurality of unlabeled samples to obtain disturbance feature matrices of the plurality of unlabeled samples, where the initial feature matrices and the disturbance feature matrices are the same in order of the plurality of unlabeled samples. The determining unit 602 is further configured to determine a disturbance vector matrix corresponding to the plurality of unlabeled samples according to the initial feature matrix and the disturbance feature matrix. The disturbance vector matrix includes a disturbance vector for each unlabeled sample of the plurality of unlabeled samples. The selecting unit 603 is specifically configured to select a target unlabeled sample from a plurality of unlabeled samples according to the disturbance vector matrix.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
FIG. 7 illustrates a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunication networks. For example, the communication unit 709 may be used to perform S201 in fig. 2.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, e.g., S202 and S203 in fig. 2. For example, in some embodiments, the methods of fig. 2, 3, and 5 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more of the steps of fig. 2, 3, and 5 described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the solutions of fig. 2, 3, and 5 by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (15)

1. A method of selecting a sample, comprising:
obtaining a plurality of unlabeled samples;
determining, for any unlabeled sample of the plurality of unlabeled samples, a disturbance vector of the unlabeled sample, wherein the disturbance vector is used for characterizing the degree to which the unlabeled sample is affected by noise; the larger the disturbance vector of the unlabeled sample, the higher the complexity of the unlabeled sample;
selecting a target unlabeled sample from the unlabeled samples according to the disturbance vector of each unlabeled sample in the unlabeled samples, wherein the target unlabeled sample comprises unlabeled samples with different disturbance vectors.
2. The method of claim 1, wherein the unlabeled exemplar includes a plurality of elements, the determining a disturbance vector for the unlabeled exemplar comprising:
determining a disturbance vector of the unlabeled sample according to the initial feature vector and the disturbance feature vector of the unlabeled sample, wherein elements included in the disturbance vector correspond to a plurality of elements included in the unlabeled sample;
the initial feature vector of the unlabeled sample is used for representing feature values corresponding to a plurality of elements of the unlabeled sample, and the disturbance feature vector of the unlabeled sample is used for representing the feature values corresponding to a plurality of elements of the unlabeled sample after scrambling.
3. The method of claim 2, wherein the method further comprises:
inputting the unlabeled sample into an initial model to obtain an initial feature vector of the unlabeled sample;
performing disturbance processing on the initial model to obtain a disturbance model, and inputting the unlabeled sample into the disturbance model to obtain a disturbance feature vector of the unlabeled sample; the initial model has a function of determining feature values of a plurality of elements of the sample.
4. The method according to claim 3, wherein the performing disturbance processing on the initial model to obtain a disturbance model comprises:
adjusting target parameters of the initial model, and determining the adjusted initial model as the disturbance model; the target parameters comprise parameters, among the parameters of the initial model, that are used for extracting a feature vector of an input sample of the initial model.
5. The method of claim 4, wherein the initial model comprises a plurality of convolutional layers and the target parameter comprises a parameter of one or more of the plurality of convolutional layers.
6. The method of any of claims 2-5, wherein the method further comprises:
splicing the initial feature vectors of the plurality of unlabeled samples to obtain an initial feature matrix of the plurality of unlabeled samples, and splicing the disturbance feature vectors of the plurality of unlabeled samples to obtain a disturbance feature matrix of the plurality of unlabeled samples, wherein the order of the plurality of unlabeled samples in the initial feature matrix and the disturbance feature matrix is the same;
determining a disturbance vector matrix corresponding to the unlabeled samples according to the initial feature matrix and the disturbance feature matrix, wherein the disturbance vector matrix comprises a disturbance vector of each unlabeled sample in the unlabeled samples;
wherein the selecting a target unlabeled sample from the plurality of unlabeled samples according to the disturbance vector of each unlabeled sample in the plurality of unlabeled samples comprises:
selecting the target unlabeled sample from the plurality of unlabeled samples according to the disturbance vector matrix.
7. A sample selection apparatus comprising:
the acquisition unit is used for obtaining a plurality of unlabeled samples;
the determining unit is used for determining, for any unlabeled sample of the plurality of unlabeled samples, a disturbance vector of the unlabeled sample, wherein the disturbance vector is used for characterizing the degree to which the unlabeled sample is affected by noise; the larger the disturbance vector of the unlabeled sample, the higher the complexity of the unlabeled sample;
and the selection unit is used for selecting a target unlabeled sample from the unlabeled samples according to the disturbance vector of each unlabeled sample in the unlabeled samples, wherein the target unlabeled sample comprises unlabeled samples with different disturbance vectors.
8. The apparatus of claim 7, wherein the unlabeled exemplar includes a plurality of elements, the determining unit being specifically configured to:
determining a disturbance vector of the unlabeled sample according to an initial feature vector and a disturbance feature vector of the unlabeled sample, wherein the elements of the disturbance vector correspond to the plurality of elements of the unlabeled sample;
the initial feature vector is used for representing the feature values corresponding to the plurality of elements of the unlabeled sample, and the disturbance feature vector of the unlabeled sample is used for representing the feature values corresponding to the plurality of elements of the unlabeled sample after perturbation.
9. The apparatus of claim 8, wherein the apparatus further comprises a processing unit; the processing unit is used for:
inputting the unlabeled sample into an initial model to obtain an initial feature vector of the unlabeled sample;
performing disturbance processing on the initial model to obtain a disturbance model, and inputting the unlabeled sample into the disturbance model to obtain a disturbance feature vector of the unlabeled sample; the initial model has a function of determining the feature values of a plurality of elements of a sample.
10. The apparatus of claim 9, wherein the processing unit is specifically configured to:
adjusting target parameters of the initial model, and determining the adjusted initial model as the disturbance model; the target parameters include, among the parameters of the initial model, the parameters used for extracting the feature vector of an input sample.
11. The apparatus of claim 10, wherein the initial model comprises a plurality of convolutional layers, and the target parameters comprise parameters of one or more of the plurality of convolutional layers.
12. The apparatus according to any one of claims 8-11, wherein the apparatus further comprises:
the splicing unit is used for splicing the initial feature vectors of the plurality of unlabeled samples to obtain an initial feature matrix of the plurality of unlabeled samples, and splicing the disturbance feature vectors of the plurality of unlabeled samples to obtain a disturbance feature matrix of the plurality of unlabeled samples, wherein the order of the plurality of unlabeled samples is the same in the initial feature matrix and the disturbance feature matrix;
the determining unit is further configured to determine a disturbance vector matrix corresponding to the plurality of unlabeled samples according to the initial feature matrix and the disturbance feature matrix, where the disturbance vector matrix includes a disturbance vector of each unlabeled sample in the plurality of unlabeled samples;
the selecting unit is specifically configured to select the target unlabeled sample from the plurality of unlabeled samples according to the disturbance vector matrix.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-6.
CN202310273873.9A 2023-03-20 2023-03-20 Sample selection method, device, equipment and storage medium Pending CN116306977A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310273873.9A CN116306977A (en) 2023-03-20 2023-03-20 Sample selection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310273873.9A CN116306977A (en) 2023-03-20 2023-03-20 Sample selection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116306977A true CN116306977A (en) 2023-06-23

Family

ID=86786713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310273873.9A Pending CN116306977A (en) 2023-03-20 2023-03-20 Sample selection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116306977A (en)

Similar Documents

Publication Publication Date Title
CN113033537B (en) Method, apparatus, device, medium and program product for training a model
CN111598164B (en) Method, device, electronic equipment and storage medium for identifying attribute of target object
CN113379627B (en) Training method of image enhancement model and method for enhancing image
CN112991180B (en) Image stitching method, device, equipment and storage medium
CN113792851B (en) Font generation model training method, font library building method, font generation model training device and font library building equipment
CN113361710B (en) Student model training method, picture processing device and electronic equipment
CN114677565B (en) Training method and image processing method and device for feature extraction network
CN114494784A (en) Deep learning model training method, image processing method and object recognition method
CN112580733B (en) Classification model training method, device, equipment and storage medium
CN114429633B (en) Text recognition method, training method and device of model, electronic equipment and medium
CN113627536B (en) Model training, video classification method, device, equipment and storage medium
CN113378911B (en) Image classification model training method, image classification method and related device
EP4343616A1 (en) Image classification method, model training method, device, storage medium, and computer program
CN112651459A (en) Defense method, device, equipment and storage medium for confrontation sample of deep learning image
CN114511743B (en) Detection model training, target detection method, device, equipment, medium and product
CN113657087B (en) Information matching method and device
CN113902899A (en) Training method, target detection method, device, electronic device and storage medium
CN115880506B (en) Image generation method, model training method and device and electronic equipment
CN115457365B (en) Model interpretation method and device, electronic equipment and storage medium
CN116052288A (en) Living body detection model training method, living body detection device and electronic equipment
CN113139463B (en) Method, apparatus, device, medium and program product for training a model
CN116306977A (en) Sample selection method, device, equipment and storage medium
CN115019057A (en) Image feature extraction model determining method and device and image identification method and device
CN113887630A (en) Image classification method and device, electronic equipment and storage medium
CN114093006A (en) Training method, device and equipment of living human face detection model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination