CN110147845B - Sample collection method and sample collection system based on feature space

Info

Publication number: CN110147845B
Application number: CN201910435700.6A
Authority: CN
Inventors: 徐化永
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-05-23
Filing date: 2019-05-23
Publication date: 2021-08-06
Anticipated expiration: 2039-05-23
Also published as: CN110147845A

Abstract

The present disclosure provides a sample collection method based on a feature space, including: determining a first feature extraction algorithm and constructing a first feature space; obtaining a first sample to be processed; extracting a feature vector of a first sample to be processed; calculating the similarity between the feature vector of the first sample to be processed and the feature vector of each sample in the first feature space, and recording the maximum value as the maximum value of the first similarity; judging whether the maximum value of the first similarity is smaller than a preset first similarity threshold value or not; when the maximum value of the first similarity is judged to be smaller than a first similarity threshold value, placing a first sample to be processed in a first feature space, and detecting whether the first feature space meets a preset acquisition condition; when the first feature space is detected to meet the preset acquisition condition, taking a sample in the first feature space as a training sample aiming at a preset task; and when the first feature space is detected not to meet the preset acquisition condition, continuing to execute the step of acquiring the first sample to be processed.

Description

Sample collection method and sample collection system based on feature space

Technical Field

The present disclosure relates to the field of deep learning technologies, and in particular, to a sample collection method, a sample collection system, a server, and a computer-readable medium based on a feature space.

Background

In deep learning projects, such as segmentation, gesture recognition, or limb recognition, a large number of training samples must be collected for a predetermined task, and a detection model for that task is then trained on those samples. The richer the collected training samples and the more uniform their data distribution, the better the trained detection model performs.

Disclosure of Invention

The present disclosure aims to solve at least one of the technical problems in the prior art, and provides a sample collection method, a sample collection system, a server, and a computer-readable medium based on a feature space.

In a first aspect, an embodiment of the present disclosure provides a sample collection method based on a feature space, including:

determining a first feature extraction algorithm for a predetermined task, and constructing a first feature space based on the first feature extraction algorithm;

obtaining a first sample to be processed;

extracting a feature vector of the first sample to be processed according to the first feature extraction algorithm;

calculating the similarity between the feature vector of the first sample to be processed and the feature vector of each sample in the first feature space, and determining the maximum value of the similarity, wherein the maximum value is marked as the maximum value of the first similarity;

judging whether the maximum value of the first similarity is smaller than a preset first similarity threshold value or not;

when the maximum value of the first similarity is judged to be smaller than the first similarity threshold, the first sample to be processed is placed in the first feature space to update the first feature space, and whether the first feature space meets a preset acquisition condition is further detected;

when the first feature space is detected to meet a preset acquisition condition, taking a sample in the first feature space as a training sample for the preset task;

and when the first feature space is detected not to meet the preset acquisition condition, continuing to execute the step of acquiring the first sample to be processed.

In some embodiments, when it is determined that the maximum value of the first similarity is greater than or equal to the first similarity threshold, the first sample to be processed is discarded, and the step of obtaining the first sample to be processed is executed again.
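As a non-limiting illustration, the steps of the first aspect can be sketched in Python as follows. The names `collect`, `similarity`, and `condition_met` are hypothetical and do not appear in the disclosure. The mapping S = 1 / (1 + d) from Euclidean distance to similarity is an assumption made only for this sketch, as is the use of 0.0 as the similarity maximum when the feature space is still empty.

```python
import math

def similarity(u, v):
    # ASSUMPTION: S = 1 / (1 + d), with d the Euclidean distance.
    # The disclosure only states that S is calculated from d.
    return 1.0 / (1.0 + math.dist(u, v))

def collect(samples, extract, threshold, condition_met):
    """Keep a sample only if it is not too similar to any sample already kept."""
    feature_space = []  # list of (sample, feature_vector) pairs
    for sample in samples:                      # obtain a sample to be processed
        vec = extract(sample)                   # extract its feature vector
        max_sim = max(
            (similarity(vec, kept_vec) for _, kept_vec in feature_space),
            default=0.0,                        # assumed preset value for an empty space
        )
        if max_sim >= threshold:
            continue                            # too similar: discard and fetch the next one
        feature_space.append((sample, vec))     # place the sample in the feature space
        if condition_met(feature_space):        # predetermined acquisition condition
            break
    return [s for s, _ in feature_space]

# Usage: 2-D points act as both samples and feature vectors (hypothetical data).
points = [(0, 0), (0.5, 0), (3, 0), (3, 0.2), (10, 10)]
kept = collect(points, lambda p: p, 0.5, lambda space: len(space) >= 3)
print(kept)  # [(0, 0), (3, 0), (10, 10)]
```

In this sketch the acquisition condition is passed in as a callable, so a count-based or coverage-based condition can be substituted without changing the loop.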

In some embodiments, the step of determining a first feature extraction algorithm for a predetermined task comprises:

constructing a second feature space based on a predetermined second feature extraction algorithm;

obtaining a second sample to be processed;

extracting a feature vector of the second sample to be processed according to the second feature extraction algorithm;

calculating the similarity between the feature vector of the second sample to be processed and the feature vector of each sample in the second feature space, and determining the maximum value of the similarity, wherein the maximum value is marked as the maximum value of the second similarity;

judging whether the second similarity maximum value is smaller than a preset second similarity threshold value or not;

when the maximum value of the second similarity is judged to be smaller than the second similarity threshold, the second sample to be processed is placed in the second feature space to update the second feature space, and whether the total number of samples in the second feature space reaches a preset total number threshold is further detected;

when the total number of the samples is detected not to reach the total number threshold value, the step of obtaining a second sample to be processed is continuously executed;

when the total number of the samples is detected to reach the total number threshold value, training a preliminary detection model aiming at the preset task according to the samples in the second feature space;

and acquiring a feature extraction algorithm corresponding to a feature extraction part in the preliminary detection model to serve as the first feature extraction algorithm.

In some embodiments, when it is determined that the second similarity maximum is greater than or equal to the second similarity threshold, the second to-be-processed sample is discarded, and the step of obtaining the second to-be-processed sample is continuously performed.
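The two-stage determination of the first feature extraction algorithm described above might be sketched as follows. This is only an illustration: `PreliminaryModel`, `determine_first_extractor`, and the `train` callable are hypothetical names, the training itself is left to the caller, and the similarity mapping S = 1 / (1 + d) is an assumption rather than the formula of the disclosure.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence

Vector = Sequence[float]

@dataclass
class PreliminaryModel:
    feature_extractor: Callable[[object], Vector]  # feature extraction part
    head: Callable[[Vector], object]               # part that performs the task

def determine_first_extractor(stream, second_extract, second_threshold,
                              total_threshold, train):
    """Fill the second feature space, train a preliminary model, return its extractor."""
    space = []  # second feature space: (sample, feature_vector) pairs
    for sample in stream:
        vec = second_extract(sample)
        # ASSUMPTION: S = 1 / (1 + Euclidean distance); 0.0 when the space is empty.
        max_sim = max((1.0 / (1.0 + math.dist(vec, v)) for _, v in space), default=0.0)
        if max_sim >= second_threshold:
            continue                               # discard near-duplicates
        space.append((sample, vec))
        if len(space) >= total_threshold:          # total number threshold reached
            model = train([s for s, _ in space])   # caller-supplied training routine
            return model.feature_extractor, space
    raise ValueError("stream exhausted before the total number threshold was reached")
```

Returning the second feature space together with the extractor lets the caller reuse those samples later, for example as the third samples to be processed.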

In some embodiments, after the step of constructing the first feature space according to the feature extraction algorithm and before the step of obtaining the first sample to be processed is performed for the first time, the method further includes:

taking the samples in the second feature space as third samples to be processed, and judging whether the maximum value of the similarity between the feature vector of the third sample to be processed and the feature vector of each sample in the first feature space is smaller than the first similarity threshold value or not for each third sample to be processed;

when the maximum value of the similarity between the feature vector of the third sample to be processed and the feature vector of each sample in the first feature space is judged to be smaller than the first similarity threshold, placing the third sample to be processed in the first feature space to update the first feature space;

and when the maximum value of the similarity between the feature vector of the third sample to be processed and the feature vector of each sample in the first feature space is judged to be greater than or equal to the first similarity threshold, discarding the third sample to be processed.

In some embodiments, the step of detecting whether the first feature space satisfies a predetermined acquisition condition specifically includes:

judging whether the sample in the first feature space completely covers a preset test data set;

when the sample in the first feature space is judged to completely cover the test data set, detecting that the first feature space meets the preset acquisition condition;

and when the sample in the first feature space is judged not to completely cover the test data set, detecting that the first feature space does not meet the preset acquisition condition.

In some embodiments, the step of judging whether the sample in the first feature space completely covers the preset test data set comprises:

extracting a feature vector of each test sample in the test data set according to the first feature extraction algorithm, and calculating the maximum value of the similarity between the feature vector of each test sample and the feature vector of each sample in the first feature space;

detecting whether, for every test sample in the test data set, the maximum value of the similarity between the feature vector of the test sample and the feature vector of each sample in the first feature space is greater than or equal to the first similarity threshold;

when it is detected that, for every test sample in the test data set, the maximum value of the similarity between the feature vector of the test sample and the feature vector of each sample in the first feature space is greater than or equal to the first similarity threshold, judging that the first feature space completely covers the test data set; when it is detected that, for at least one test sample in the test data set, the maximum value of the similarity between the feature vector of the test sample and the feature vector of each sample in the first feature space is smaller than the first similarity threshold, judging that the first feature space does not completely cover the test data set.
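The coverage judgment above amounts to requiring that every test sample has at least one sufficiently similar sample in the first feature space. A minimal sketch follows, with hypothetical function names and an assumed similarity mapping S = 1 / (1 + d) over Euclidean distance (the disclosure does not fix this mapping in the text here).

```python
import math

def max_similarity(vec, space_vectors):
    # ASSUMPTION: S = 1 / (1 + d), d being the Euclidean distance.
    # An empty feature space yields 0.0, so it covers nothing.
    return max((1.0 / (1.0 + math.dist(vec, v)) for v in space_vectors), default=0.0)

def covers_test_set(space_vectors, test_vectors, threshold):
    """True only if every test vector is within the similarity threshold of the space."""
    return all(max_similarity(t, space_vectors) >= threshold for t in test_vectors)
```

A single uncovered test sample is enough to make the check fail, which corresponds to the "at least one test sample" branch of the judgment.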

In some embodiments, the similarity S between two feature vectors is calculated from the distance d between the two feature vectors.

In some embodiments, the distance comprises the Euclidean distance.

In a second aspect, an embodiment of the present disclosure further provides a sample acquisition system based on a feature space, including:

a determination module for determining a first feature extraction algorithm for a predetermined task;

a construction module for constructing a first feature space based on the first feature extraction algorithm;

the acquisition module is used for acquiring a first sample to be processed;

the extraction module is used for extracting the feature vector of the first sample to be processed according to the first feature extraction algorithm;

the calculation module is used for calculating the similarity between the feature vector of the first sample to be processed and the feature vector of each sample in the first feature space, and determining the maximum value of the similarity, wherein the maximum value is marked as the maximum value of the first similarity;

the first judgment module is used for judging whether the maximum value of the first similarity is smaller than a preset first similarity threshold value or not;

the placement detection module is configured to, when the first judgment module judges that the first similarity maximum value is smaller than the first similarity threshold value, place the first sample to be processed in the first feature space to update the first feature space, and further detect whether the first feature space meets a predetermined acquisition condition;

the first processing module is used for taking a sample in the first feature space as a training sample aiming at the preset task when the placement detection module detects that the first feature space meets a preset acquisition condition; and when the placement detection module detects that the first feature space does not meet the preset acquisition condition, controlling the acquisition module to continue to execute the operation of acquiring the first sample to be processed.

In some embodiments, the system further comprises:

a second processing module, configured to discard the first sample to be processed and control the acquisition module to continue to execute the operation of acquiring the first sample to be processed when the first judgment module judges that the maximum value of the first similarity is greater than or equal to the first similarity threshold.

In some embodiments, the determining module comprises:

a construction unit for constructing a second feature space based on a predetermined second feature extraction algorithm;

the acquisition unit is used for acquiring a second sample to be processed;

the extraction unit is used for extracting the feature vector of the second sample to be processed according to the second feature extraction algorithm;

the calculating unit is used for calculating the similarity between the feature vector of the second sample to be processed and the feature vector of each sample in the second feature space, and determining the maximum value of the similarity, wherein the maximum value is marked as the maximum value of the second similarity;

the first judging unit is used for judging whether the second similarity maximum value is smaller than a preset second similarity threshold value or not;

a placement detection unit, configured to, when the first determination unit determines that the second similarity maximum value is smaller than the second similarity threshold, place the second to-be-processed sample in the second feature space to update the second feature space, and further detect whether a total number of samples in the second feature space reaches a predetermined total number threshold;

the first processing unit is used for controlling the obtaining unit to continuously execute the operation of obtaining the second sample to be processed when the placement detection unit detects that the total number of the samples does not reach the total number threshold value;

a training unit, configured to train a preliminary detection model for the predetermined task according to the samples in the second feature space when the placement detection unit detects that the total number of the samples reaches the total number threshold;

and the determining unit is used for acquiring a feature extraction algorithm corresponding to the feature extraction part in the preliminary detection model to serve as the first feature extraction algorithm.

In some embodiments, the determining module further comprises:

a second processing unit, configured to discard the second sample to be processed and control the acquisition unit to continue to execute the operation of acquiring the second sample to be processed when the first judging unit judges that the maximum value of the second similarity is greater than or equal to the second similarity threshold.

In some embodiments, the system further comprises:

a second judging module, configured to, after the constructing module constructs the first feature space based on the feature extraction algorithm, use a sample in the second feature space as a third sample to be processed, and judge, for each third sample to be processed, whether a maximum value of a similarity between a feature vector of the third sample to be processed and a feature vector of each sample in the first feature space is smaller than the first similarity threshold;

a third processing module, configured to place the third sample to be processed in the first feature space to update the first feature space when the second determining module determines that a maximum value of similarity between the feature vector of the third sample to be processed and the feature vector of each sample in the first feature space is smaller than the first similarity threshold;

and the fourth processing module is configured to discard the third sample to be processed when the second determining module determines that the maximum value of the similarity between the feature vector of the third sample to be processed and the feature vector of each sample in the first feature space is greater than or equal to the first similarity threshold.

In some embodiments, the placement detection module comprises:

the placing unit is used for placing the first sample to be processed in the first feature space to update the first feature space when the first judging module judges that the maximum value of the first similarity is smaller than the first similarity threshold;

a second judging unit, configured to judge whether a sample in the first feature space completely covers a predetermined test data set;

when the second judging unit judges that the sample in the first feature space completely covers the test data set, the placement detection module detects that the first feature space meets a preset acquisition condition; when the second judging unit judges that the sample in the first feature space does not completely cover the test data set, the placement detection module detects that the first feature space does not meet a preset acquisition condition.

In some embodiments, the second determination unit includes:

the calculation subunit is used for extracting a feature vector of each test sample in the test data set according to the first feature extraction algorithm and calculating the maximum value of the similarity between the feature vector of the test sample and the feature vector of each sample in the first feature space;

a detecting subunit, configured to detect whether, for every test sample in the test data set, the maximum value of the similarity between the feature vector of the test sample and the feature vector of each sample in the first feature space is greater than or equal to the first similarity threshold;

when the detecting subunit detects that, for every test sample in the test data set, the maximum value of the similarity between the feature vector of the test sample and the feature vector of each sample in the first feature space is greater than or equal to the first similarity threshold, the second judging unit judges that the first feature space completely covers the test data set; when the detecting subunit detects that, for at least one test sample in the test data set, the maximum value of the similarity between the feature vector of the test sample and the feature vector of each sample in the first feature space is smaller than the first similarity threshold, the second judging unit judges that the first feature space does not completely cover the test data set.

In some embodiments, the similarity S between two feature vectors is calculated from the distance d between the two feature vectors.

In some embodiments, the distance comprises the Euclidean distance.

In a third aspect, an embodiment of the present disclosure further provides a server, including:

one or more processors;

a storage device having one or more programs stored thereon;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement a method as provided by any of the preceding embodiments.

In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method provided in any one of the foregoing embodiments.

The present disclosure has the following beneficial effects:

The embodiments of the present disclosure provide a sample collection method, a sample collection system, a server, and a computer-readable medium based on a feature space. They enable positive sample collection for a predetermined task, objectively and effectively screen out repeated and highly similar samples during collection, and at the same time ensure a uniform distribution of the various types of samples, which helps improve the detection performance of a subsequently trained model.

Drawings

Fig. 1 is a flowchart of a sample collection method based on a feature space according to an embodiment of the present disclosure;

Fig. 2 is a flowchart of a specific implementation of step S8 in the present disclosure;

Fig. 3 is a flowchart of another sample collection method based on a feature space according to an embodiment of the present disclosure;

Fig. 4 is a block diagram of a sample collection system based on a feature space according to an embodiment of the present disclosure;

Fig. 5 is a block diagram of one configuration of a determination module of the present disclosure;

Fig. 6 is a block diagram of a sample collection system based on a feature space according to an embodiment of the present disclosure;

Fig. 7 is a block diagram of a placement detection module according to the present disclosure.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present disclosure, a sample collection method and a sample collection system based on a feature space provided in the present disclosure are described in detail below with reference to the accompanying drawings.

Example embodiments will be described more fully hereinafter with reference to the accompanying drawings; they may, however, be embodied in different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. Thus, a first element, component, or section discussed below could be termed a second element, component, or section without departing from the teachings of the present disclosure.

Embodiments described herein may be described with reference to plan and/or cross-sectional views in light of idealized schematic illustrations of the disclosure. Accordingly, the example illustrations can be modified in accordance with manufacturing techniques and/or tolerances. Accordingly, the embodiments are not limited to the embodiments shown in the drawings, but include modifications of configurations formed based on a manufacturing process. Thus, the regions illustrated in the figures have schematic properties, and the shapes of the regions shown in the figures illustrate specific shapes of regions of elements, but are not intended to be limiting.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

The sample collection method is used to collect positive samples for a predetermined task. The predetermined task may be any task to which deep learning technology is applicable, such as a segmentation task, a classification task, a positioning task, or an identification task; the technical solution of the present disclosure does not limit the specific type of the predetermined task. Based on deep learning technology, a detection model for a predetermined task can be trained on positive samples for that task, and the number and data distribution of the positive samples directly affect the performance of the finally trained detection model.

The specific type of the detection model is determined by the predetermined task. For example, if the predetermined task is a segmentation task, the trained detection model is a segmentation model; if it is a classification task, the trained detection model is a classification model; and if it is a positioning task, the trained detection model is a positioning model. Further examples are not enumerated here.

As is well known to those skilled in the art, any trained detection model contains two parts: a feature extraction part and an operation processing part. The feature extraction part extracts the feature vector of an input sample based on a feature extraction algorithm (one determined from the training samples). The operation processing part executes the corresponding detection task (such as a segmentation, classification, positioning, or identification task) on the feature vector extracted by the feature extraction part and outputs the detection result.

In addition, the "sample to be processed" and the "test sample" referred to in the following technical solutions are positive samples for a predetermined task.

Fig. 1 is a flowchart of a sample collection method based on a feature space according to an embodiment of the present disclosure. As shown in Fig. 1, the method includes the following steps.

Step S1, determining a first feature extraction algorithm for the predetermined task, and constructing a first feature space based on the first feature extraction algorithm.

The feature space in the present disclosure refers to a multidimensional space established based on a feature extraction algorithm, and the dimensions of the feature space and the feature attributes represented by the dimensions are determined by the feature extraction algorithm; specifically, the dimension of the feature space is equal to the dimension of the feature vector extracted by the feature extraction algorithm, and the feature attribute of each dimension of the feature space is equal to the attribute corresponding to each feature in the feature vector.

In the present disclosure, a process of extracting a feature vector from raw data using a feature extraction algorithm may be regarded as a process of mapping the raw data to a corresponding feature space, and each feature in the feature vector obtained by the feature extraction algorithm corresponds to a one-dimensional coordinate in the feature space. Thus, the feature space can be considered as a multi-dimensional coordinate system, and the sample in the feature space can be considered as a coordinate point in the multi-dimensional coordinate system.

In step S1, because different tasks rely on different features, a feature extraction algorithm suited to the predetermined task must be determined so that the finally trained model performs well on that task. For example, for an image recognition task, commonly used feature extraction algorithms include the Histogram of Oriented Gradients (HOG) algorithm, the Local Binary Pattern (LBP) algorithm, and the Haar-like feature algorithm. For a text classification task, commonly used feature extraction algorithms include the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm.
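To make the relationship between a feature extraction algorithm and its feature space concrete, the sketch below maps text to a term-frequency vector over a fixed vocabulary. This is a deliberately simplified stand-in and not full TF-IDF (there is no inverse document frequency weighting); the function name and the vocabulary are hypothetical. The point it illustrates is that the dimension of the feature space equals the length of the extracted vector, and each dimension corresponds to one feature attribute (here, one vocabulary term).

```python
def term_frequency_vector(text, vocabulary):
    """Map a text to a point in a len(vocabulary)-dimensional feature space."""
    tokens = text.lower().split()
    total = len(tokens) or 1  # avoid division by zero for empty text
    # One coordinate per vocabulary term: relative frequency of that term.
    return [tokens.count(term) / total for term in vocabulary]

vocabulary = ["cat", "dog", "fish"]          # hypothetical 3-dimensional feature space
point = term_frequency_vector("Cat dog cat", vocabulary)
print(len(point))  # 3
```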

It should be noted that, the technical solution of the present disclosure does not limit the determination process of the first feature extraction algorithm, and the existing feature extraction algorithm may be directly selected, or the feature extraction algorithm determined by a certain means may also be used.

After a first feature extraction algorithm for a predetermined task is determined, dimensions of feature vectors extracted by the feature extraction algorithm and feature attributes corresponding to the dimensions are determined, and at the moment, a first feature space matched with the first feature extraction algorithm can be constructed. It should be noted that, in step S1, the constructed first feature space does not include any samples, i.e., is an empty feature space.

Step S2, acquiring a first sample to be processed.

In step S2, a positive sample may be selected, randomly or according to a certain rule, from a preset positive sample database to serve as the first sample to be processed. The positive sample database is built from positive samples uploaded by users and positive samples acquired by the system in advance. The positive samples in the database are sufficient in number and diverse, but the database also contains a large number of repeated or extremely similar samples.

Step S3, extracting a feature vector of the first sample to be processed according to the first feature extraction algorithm.

In step S3, feature extraction is performed on the first to-be-processed sample acquired in step S2 using a first feature extraction algorithm to obtain a feature vector of the first to-be-processed sample. The process of extracting feature vectors from samples using a feature extraction algorithm is conventional in the art and will not be described in detail here.

Step S4, calculating the similarity between the feature vector of the first sample to be processed and the feature vector of each sample in the first feature space, and determining the maximum value of the similarity, where the maximum value is recorded as the maximum value of the first similarity.

It should be noted that if the first feature space is empty when step S4 is executed, the maximum value of the first similarity for the first sample to be processed is set directly to a preset value that is smaller than the predetermined first similarity threshold used subsequently.

Of course, to avoid the first feature space being empty when step S4 is executed, a positive sample may be randomly selected from the positive sample database after step S1 is finished and before step S2 is started, and the positive sample may be placed in the first feature space, and then steps S2 to S4 are performed.
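The empty-space case can be handled with an explicit preset value, as in the following sketch. The function name, the default preset of 0.0, and the similarity mapping S = 1 / (1 + d) over Euclidean distance are all assumptions made for illustration; the only requirement taken from the description is that the preset value be smaller than the first similarity threshold.

```python
import math

def first_similarity_maximum(vec, space_vectors, preset=0.0):
    """Return the maximum similarity to the feature space, or `preset` if it is empty."""
    if not space_vectors:
        # Empty feature space: use the preset value, which must be chosen
        # below the first similarity threshold so the sample is accepted.
        return preset
    # ASSUMPTION: S = 1 / (1 + d), d being the Euclidean distance.
    return max(1.0 / (1.0 + math.dist(vec, v)) for v in space_vectors)
```

The alternative described above, seeding the first feature space with one randomly chosen positive sample before step S2, makes the empty branch unreachable and gives the same result for every later sample.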

In step S4, the similarity between the feature vector of the first sample to be processed and the feature vector of each sample in the first feature space is calculated first. The resulting similarities are then compared to determine their maximum, which is recorded as the maximum value of the first similarity. This value is the similarity between the first sample to be processed and the sample in the first feature space that is most similar to it, where the similarity between two samples is the similarity between their feature vectors.

As an alternative, the similarity S between two feature vectors is calculated based on the distance d between the two feature vectors.

In the present disclosure, as a specific alternative, the distance between two feature vectors is the Euclidean distance between them.

It should be noted that calculating the similarity between two feature vectors from the Euclidean distance between them is only one alternative in the present disclosure and does not limit its technical solution. The similarity between two feature vectors may also be calculated with other vector similarity algorithms, such as algorithms based on the Pearson correlation coefficient or the Tanimoto coefficient, which are not enumerated here.
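Two possible similarity functions are sketched below for comparison. Both mappings are assumptions chosen so that the result lies in a range where a threshold such as 90% is meaningful: the distance-based one uses S = 1 / (1 + d), and the Pearson-based one rescales the correlation coefficient r from [-1, 1] to [0, 1] as (r + 1) / 2. Neither formula is taken from the disclosure, and the function names are hypothetical.

```python
import math

def euclidean_similarity(u, v):
    # ASSUMPTION: S = 1 / (1 + d); identical vectors give S = 1, distant ones approach 0.
    return 1.0 / (1.0 + math.dist(u, v))

def pearson_similarity(u, v):
    # Pearson correlation coefficient r, rescaled to [0, 1] (assumed rescaling).
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # r is undefined for a constant vector; assumed fallback
    return (cov / (norm_u * norm_v) + 1.0) / 2.0
```

Note that the two measures behave differently: the distance-based similarity is sensitive to scale, whereas the Pearson-based similarity treats `[1, 2, 3]` and `[2, 4, 6]` as maximally similar, so the similarity threshold generally needs to be tuned per measure.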

Step S5, determining whether the first similarity maximum value is smaller than a predetermined first similarity threshold value.

The first similarity threshold can be set and adjusted according to actual needs. For example, the first similarity threshold value is 90%.

In step S5, when the first similarity maximum value is determined to be smaller than the first similarity threshold, it indicates that every sample in the first feature space has a relatively low similarity to the first sample to be processed, and step S7 is then performed. When the first similarity maximum value is determined to be greater than or equal to the first similarity threshold, it indicates that at least one sample in the first feature space has a relatively high similarity to the first sample to be processed, and step S6 is then executed.

And step S6, discarding the first sample to be processed.

After the end of step S6, the above-described step S2 is executed again.

Step S7, a first sample to be processed is placed in the first feature space to update the first feature space.
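The decision made in steps S5 to S7 can be sketched as below. This is a minimal illustration and not the implementation of the disclosure: the similarity mapping 1 / (1 + d) is an assumed placeholder, the function name `admit_or_discard` is hypothetical, and the default threshold of 0.9 follows the 90% example given above.

```python
import math


def similarity(u, v):
    # ASSUMPTION: illustrative distance-to-similarity mapping, not the
    # formula of the disclosure.
    return 1.0 / (1.0 + math.dist(u, v))


def admit_or_discard(candidate, feature_space, threshold=0.9):
    """Steps S5 to S7: keep the candidate only if it is dissimilar to all stored vectors.

    Returns True when the candidate is placed in the feature space (S7) and
    False when it is discarded (S6).
    """
    # For an empty space, the preset value -1.0 is below any threshold,
    # so the first candidate is always admitted.
    s_max = max((similarity(candidate, v) for v in feature_space), default=-1.0)
    if s_max < threshold:
        feature_space.append(candidate)
        return True
    return False
```

A near-duplicate of a stored vector yields a similarity close to 1 and is rejected, while a vector far from all stored vectors is added to the feature space.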

And step S8, detecting whether the first feature space meets a preset acquisition condition.

Fig. 2 is a flowchart of a specific implementation of step S8 in the present disclosure. As an alternative embodiment, as shown in Fig. 2, in order to detect whether the first feature space meets the predetermined acquisition condition, a test data set is preset. The test data set includes test samples that are manually screened in advance; the number of test samples in the test data set is relatively small, but the test samples cover many types and their data distribution is uniform. In this case, the "predetermined acquisition condition" is specifically "covering the test data set"; that is, step S8 is: determining whether the samples in the first feature space completely cover the predetermined test data set.

Step S8 includes:

step S801, for each test sample in the test data set, extracting a feature vector of the test sample according to a first feature extraction algorithm, and calculating a maximum value of similarity between the feature vector of the test sample and the feature vector of each sample in the first feature space.

Step S802, detecting whether, for every test sample in the test data set, the maximum value of the similarity between the feature vector of the test sample and the feature vector of each sample in the first feature space is greater than or equal to the first similarity threshold.

Taking a certain test sample in the test data set as an example, when the maximum value of the similarity between the feature vector of the test sample and the feature vector of each sample in the first feature space is greater than or equal to the first similarity threshold, it indicates that at least one sample with relatively high similarity to the test sample exists in the first feature space, that is, the samples in the first feature space can cover the test sample; when the maximum value of the similarity between the feature vector of the test sample and the feature vector of each sample in the first feature space is smaller than the first similarity threshold, it indicates that the similarity between each sample in the first feature space and the test sample is relatively low, that is, the samples in the first feature space cannot cover the test sample.

Based on the above principle, when it is detected that, for every test sample in the test data set, the maximum value of the similarity between the feature vector of the test sample and the feature vector of each sample in the first feature space is greater than or equal to the first similarity threshold, it is determined that the samples in the first feature space completely cover the test data set, that is, step S8 detects that the first feature space meets the predetermined acquisition condition; when it is detected that the maximum value of the similarity between the feature vector of at least one test sample in the test data set and the feature vector of each sample in the first feature space is smaller than the first similarity threshold, it is determined that the samples in the first feature space do not completely cover the test data set, that is, step S8 detects that the first feature space does not satisfy the predetermined acquisition condition.
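The coverage check of steps S801 and S802 may be illustrated with the following sketch. It assumes that the test samples have already been converted into feature vectors by the first feature extraction algorithm; the similarity mapping and the name `covers_test_set` are assumptions introduced for the example.

```python
import math


def similarity(u, v):
    # ASSUMPTION: illustrative distance-to-similarity mapping, not the
    # formula of the disclosure.
    return 1.0 / (1.0 + math.dist(u, v))


def covers_test_set(feature_space, test_vectors, threshold):
    """Return True only if every test vector is covered by the feature space.

    A test vector is covered when at least one stored vector has a
    similarity to it that reaches the threshold.
    """
    if not feature_space:
        return not test_vectors  # an empty space covers only an empty test set
    for t in test_vectors:
        best = max(similarity(t, v) for v in feature_space)  # step S801
        if best < threshold:  # step S802: one uncovered test sample fails the check
            return False
    return True
```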

It should be noted that the above case, in which the "predetermined acquisition condition" is "covering the test data set", is only an alternative in the present disclosure and does not limit the technical solution of the present disclosure; the "predetermined acquisition condition" in the present disclosure may also be set and adjusted according to actual needs. For example, when the "predetermined acquisition condition" is that "the total number of samples in the first feature space reaches a predetermined threshold value", in step S8 the total number of samples in the first feature space may be counted and compared with the predetermined threshold value: if the total number of samples is greater than or equal to the predetermined threshold value, it is detected that the first feature space meets the predetermined acquisition condition; otherwise, it is detected that the first feature space does not meet the predetermined acquisition condition. Other cases are not enumerated here.

In step S8, when it is detected that the first feature space satisfies the predetermined acquisition condition, performing step S9; when it is detected that the first feature space does not satisfy the predetermined acquisition condition, the above step S2 is executed again.

And step S9, taking the samples in the first feature space as training samples for a predetermined task.

After the first feature space meets the preset acquisition condition, the samples in the first feature space can be directly used as training samples for a preset task.

Through the steps S1 to S9, positive sample collection aiming at a preset task can be realized, repeated samples and samples with high similarity can be objectively and effectively screened out in the collection process, and meanwhile, the uniform distribution of various types of samples can be ensured, and the detection performance of a subsequently trained model can be improved.
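The overall flow of steps S2 to S9 can be summarized by the following sketch, given under stated assumptions: the similarity mapping is a placeholder, the acquisition condition is passed in as a callable, and the names `collect_samples`, `extract`, `condition`, `max_draws` and `seed` are hypothetical. The cap on the number of draws is a safety measure of this example and is not part of the disclosure.

```python
import math
import random


def similarity(u, v):
    # ASSUMPTION: illustrative distance-to-similarity mapping, not the
    # formula of the disclosure.
    return 1.0 / (1.0 + math.dist(u, v))


def collect_samples(sample_db, extract, threshold, condition,
                    max_draws=10_000, seed=0):
    """Draw samples, keep the dissimilar ones, stop when the condition holds."""
    rng = random.Random(seed)
    kept_samples, kept_vectors = [], []
    for _ in range(max_draws):                    # hypothetical safety cap
        sample = rng.choice(sample_db)            # S2: obtain a sample
        vec = extract(sample)                     # S3: extract its feature vector
        s_max = max((similarity(vec, v) for v in kept_vectors),
                    default=-1.0)                 # S4: preset value if space is empty
        if s_max >= threshold:                    # S5 -> S6: too similar, discard
            continue
        kept_samples.append(sample)               # S7: place in the feature space
        kept_vectors.append(vec)
        if condition(kept_vectors):               # S8: acquisition condition
            break
    return kept_samples                           # S9: use as training samples


# Usage: three clusters of near-duplicates; one sample per cluster is kept.
db = [(0.0, 0.0), (0.01, 0.0), (1.0, 0.0), (1.0, 0.02), (0.0, 1.0)]
result = collect_samples(db, extract=list, threshold=0.9,
                         condition=lambda vectors: len(vectors) >= 3)
```

In the usage example the count-based condition is used; a coverage-based condition could be supplied through the same `condition` parameter.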

Fig. 3 is a flowchart of another method for acquiring a sample based on a feature space according to an embodiment of the present disclosure, and as shown in fig. 3, the method includes:

and S101, constructing a second feature space based on a predetermined second feature extraction algorithm.

It should be noted that the second feature extraction algorithm used in step S101 may be any existing feature extraction algorithm. For example, the feature extraction algorithm used by the feature extraction part in the existing classification model, the feature extraction algorithm used by the feature extraction part in the existing segmentation model, and the feature extraction algorithm used by the feature extraction part in the existing positioning model. This is not exemplified here.

And step S102, acquiring a second sample to be processed.

In step S102, a positive sample may be selected, randomly or according to a certain rule, from a preset positive sample database to be used as the second sample to be processed.

And S103, extracting a feature vector of a second sample to be processed according to a second feature extraction algorithm.

And step S104, calculating the similarity between the feature vector of the second sample to be processed and the feature vector of each sample in the second feature space, and determining the maximum value of the similarity, wherein the maximum value is marked as the maximum value of the second similarity.

In step S104, as a scenario, when the second feature space is empty, the maximum value of the second similarity corresponding to the second sample to be processed is directly set as a preset value, where the preset value is smaller than a predetermined second similarity threshold used subsequently.

Of course, in order to avoid the second feature space being empty when step S104 is executed, after step S101 is finished and before step S102 is started, a positive sample may be randomly selected from the positive sample database, and the positive sample may be placed in the second feature space, and then step S102 to step S104 may be performed.

In step S104, first, the similarity between the feature vector of the second sample to be processed and the feature vector of each sample in the second feature space is calculated; then, the similarities are compared to determine the maximum value, which is recorded as the second similarity maximum value and represents the similarity between the second sample to be processed and the sample in the second feature space that is most similar to it (that is, the similarity between the two feature vectors).

And step S105, judging whether the maximum value of the second similarity is smaller than a preset second similarity threshold value.

The second similarity threshold can be set and adjusted according to actual needs. As an alternative, the second similarity threshold is equal to the first similarity threshold.

In step S105, when the second similarity maximum value is smaller than the second similarity threshold value, step S107 is executed; when the second similarity maximum value is judged to be greater than or equal to the second similarity threshold value, step S106 is executed.

And step S106, discarding the second sample to be processed.

After step S106 ends, step S102 is executed again, that is, the step of obtaining a second sample to be processed continues to be performed.

And S107, placing the second sample to be processed in the second feature space to update the second feature space.

And step S108, detecting whether the total number of the samples in the second feature space reaches a preset total number threshold value.

The predetermined total threshold value can be designed and adjusted according to actual needs.

In step S108, when it is detected that the total number of samples does not reach the total number threshold, the above step S102 is continued again. When it is detected that the total number of samples reaches the total number threshold, step S109 is performed.

And step S109, training a preliminary detection model aiming at the predetermined task according to the samples in the second feature space.

In step S109, based on the deep learning technique, a preliminary detection model for a predetermined task may be trained from all samples in the second feature space. The process of training a corresponding model according to a sample based on a deep learning technique is conventional in the art and will not be described in detail herein.

The trained preliminary detection model for the predetermined task includes: a feature extraction part for the predetermined task and an operation processing part for the predetermined task, wherein the feature extraction part stores a feature extraction algorithm that is determined based on the samples in the second feature space and is matched with the predetermined task.

Step S110, a feature extraction algorithm corresponding to a feature extraction part in the preliminary detection model is obtained to serve as a first feature extraction algorithm, and a first feature space is constructed based on the first feature extraction algorithm.

In step S110, a feature extraction algorithm corresponding to the feature extraction part in the preliminary detection model is used as a first feature extraction algorithm, and a first feature space is constructed based on the first feature extraction algorithm.
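The idea of reusing the feature extraction part of a trained model can be illustrated with a toy example. The disclosure does not specify a model architecture, so the class `PreliminaryModel`, its single linear layer with ReLU, and the helper `first_feature_extractor` are all hypothetical stand-ins chosen only to show the split between the two parts.

```python
class PreliminaryModel:
    """Hypothetical stand-in for a trained preliminary detection model."""

    def __init__(self, feature_weights, head_weights):
        self.feature_weights = feature_weights  # feature extraction part
        self.head_weights = head_weights        # task-specific operation part

    def extract_features(self, x):
        # One linear layer followed by ReLU; a real model would be deeper.
        return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
                for row in self.feature_weights]

    def predict(self, x):
        # The operation processing part consumes the extracted features.
        features = self.extract_features(x)
        return sum(w * f for w, f in zip(self.head_weights, features))


def first_feature_extractor(model):
    """Step S110: take only the feature extraction part of the model."""
    return model.extract_features


# Usage with made-up weights standing in for trained parameters.
model = PreliminaryModel([[1.0, 0.0], [0.0, -1.0]], [0.5, 0.5])
extract = first_feature_extractor(model)
```

The returned callable maps a sample to a feature vector and can then play the role of the first feature extraction algorithm when the first feature space is constructed.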

Through the steps S101 to S110, a first feature extraction algorithm matched with the predetermined task can be automatically generated based on the samples in the second feature space, and a foundation is laid for the subsequent collection of the samples aiming at the predetermined task. Generally, the first feature extraction algorithm determined in steps S101 to S110 is different from the second feature extraction algorithm used in step S101.

It should be noted that the above process of determining the first feature extraction algorithm for the predetermined task based on steps S101 to S110 is only one preferred embodiment of the present disclosure, and does not limit the technical solution of the present disclosure.

When the above-mentioned step S101 to step S110 are adopted to realize step S1, the method further includes, after step S1:

step S1a, taking the samples in the second feature space as third samples to be processed, and determining, for each third sample to be processed, whether the maximum value of the similarity between the feature vector of the third sample to be processed and the feature vector of each sample in the first feature space is smaller than the first similarity threshold.

Step S1b, when it is determined that the maximum value of the similarity between the feature vector of the third to-be-processed sample and the feature vector of each sample in the first feature space is smaller than the first similarity threshold, placing the third to-be-processed sample in the first feature space to update the first feature space.

Step S1c, when it is determined that the maximum value of the similarity between the feature vector of the third to-be-processed sample and the feature vector of each sample in the first feature space is greater than or equal to the first similarity threshold, discarding the third to-be-processed sample.

Assume that, after step S110 is finished, the number of samples in the second feature space is N, and the feature vector of the i-th sample in the second feature space is denoted as A_i, where i is a positive integer less than or equal to N; the execution logic of the above step S1a to step S1c is as follows:

Step one: set i = 1.

Step two: determine whether the maximum value of the similarity between the feature vector A_i of the i-th sample in the second feature space and the feature vector of each sample in the first feature space is smaller than the first similarity threshold.

It should be noted that, when step two is executed for the first time and the first feature space is empty, the maximum value of the similarity between the feature vector of the 1st sample and the feature vector of each sample in the first feature space is directly set to a preset value, and the preset value is smaller than the first similarity threshold.

When the determination result of step two is "yes", step S1b is executed, that is, the i-th sample is placed in the first feature space to update the first feature space; thereafter, it is further determined whether i is smaller than N. If i is smaller than N, i is incremented to i + 1 and step two is executed again; if i is greater than or equal to N, the following step S2 is executed.

When the determination result of step two is "no", step S1c is executed, that is, the i-th sample is discarded; thereafter, it is further determined whether i is smaller than N. If i is smaller than N, i is incremented to i + 1 and step two is executed again; if i is greater than or equal to N, the following step S2 is executed.
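The iteration over the N samples of the second feature space in steps S1a to S1c can be sketched as follows. The similarity mapping is an assumed placeholder, and the names `rescreen` and `extract_first` are hypothetical; `extract_first` stands for the first feature extraction algorithm obtained in step S110.

```python
import math


def similarity(u, v):
    # ASSUMPTION: illustrative distance-to-similarity mapping, not the
    # formula of the disclosure.
    return 1.0 / (1.0 + math.dist(u, v))


def rescreen(second_space_samples, extract_first, threshold):
    """Seed the first feature space with samples from the second feature space."""
    first_samples, first_vectors = [], []
    for sample in second_space_samples:           # i = 1 .. N
        vec = extract_first(sample)
        # The preset value -1.0 covers the case of an empty first feature space.
        s_max = max((similarity(vec, v) for v in first_vectors), default=-1.0)
        if s_max < threshold:                     # S1b: place in the first feature space
            first_samples.append(sample)
            first_vectors.append(vec)
        # otherwise S1c: the sample is discarded
    return first_samples, first_vectors


# Usage: the toy extractor keeps only the second coordinate, so two samples
# that differ only in the first coordinate become duplicates.
samples = [(0.0, 0.0), (3.0, 0.0), (0.0, 2.0)]
kept, vectors = rescreen(samples, lambda s: [s[1]], threshold=0.9)
```

The usage example shows why the re-screening can matter: samples that were distinct under the second feature extraction algorithm may be redundant under the first one.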

The execution of the above steps S1 a-S1 c can be regarded as an optimized screening of the samples in the second feature space in step S109 for the predetermined task.

It should be noted that, in some embodiments, the step S2 may be directly executed after the step S110 is finished without performing the step S1a to the step S1c, and this also belongs to the protection scope of the present disclosure.

And step S2, acquiring a first sample to be processed.

And step S6, discarding the first sample to be processed.

After the end of step S6, the above-described step S2 is executed again.

For specific descriptions of the steps S2 to S9 in this embodiment, reference may be made to the corresponding contents in the foregoing embodiments, and details are not described here again.

In this embodiment, through the steps S101 to S9, the first feature extraction algorithm for the predetermined task can be automatically determined, and the positive samples for the predetermined task can be collected based on the determined first feature extraction algorithm, so that the repeated samples and the samples with a large similarity can be objectively and effectively screened out in the collection process, and meanwhile, the uniform distribution of each type of sample can be ensured, which is beneficial to improving the detection performance of the subsequently trained model.

Fig. 4 is a block diagram of a structure of a sample acquisition system based on a feature space according to an embodiment of the present disclosure, and as shown in fig. 4, the sample acquisition system may be used to implement the sample acquisition method according to any of the foregoing embodiments, and the sample acquisition system includes: the device comprises a determining module 1, a constructing module 2, an obtaining module 3, an extracting module 4, a calculating module 5, a first judging module 6, a placing detecting module 7, a first processing module 8 and a second processing module 9.

Wherein the determination module 1 is configured to determine a first feature extraction algorithm for a predetermined task.

The construction module 2 is configured to construct a first feature space based on a first feature extraction algorithm.

The obtaining module 3 is used for obtaining a first sample to be processed.

The extraction module 4 is configured to extract a feature vector of the first sample to be processed according to a first feature extraction algorithm.

The calculating module 5 is configured to calculate similarity between the feature vector of the first sample to be processed and the feature vector of each sample in the first feature space, and determine a maximum value of the similarity, where the maximum value is recorded as a maximum value of the first similarity.

The first judging module 6 is configured to judge whether the first similarity maximum value is smaller than a predetermined first similarity threshold value.

The placement detection module 7 is configured to, when the first determination module 6 determines that the first similarity maximum value is smaller than the first similarity threshold value, place the first sample to be processed in the first feature space to update the first feature space, and further detect whether the first feature space meets a predetermined acquisition condition.

The first processing module 8 is configured to, when the placement detection module 7 detects that the first feature space meets the predetermined acquisition condition, take the samples in the first feature space as training samples for the predetermined task; and, when the placement detection module 7 detects that the first feature space does not meet the predetermined acquisition condition, control the obtaining module 3 to continue to perform the operation of obtaining the first sample to be processed.

The second processing module 9 is configured to discard the first to-be-processed sample and control the obtaining module 3 to perform an operation of obtaining the first to-be-processed sample when the first determining module 6 determines that the first similarity maximum value is greater than or equal to the first similarity threshold.

Fig. 5 is a block diagram of a determining module according to the present disclosure, and as shown in fig. 5, in some embodiments, the determining module 1 includes: the system comprises a construction unit 101, an acquisition unit 102, an extraction unit 103, a calculation unit 104, a first judgment unit 105, a placement detection unit 106, a first processing unit 107, a training unit 108, a determination unit 109 and a second processing unit 110.

Wherein the construction unit 101 is configured to construct the second feature space based on a predetermined second feature extraction algorithm.

The obtaining unit 102 is configured to obtain a second sample to be processed.

The extracting unit 103 is configured to extract a feature vector of the second sample to be processed according to a second feature extraction algorithm.

The calculating unit 104 is configured to calculate a similarity between the feature vector of the second sample to be processed and the feature vector of each sample in the second feature space, and determine a maximum value of the similarity, where the maximum value is marked as a maximum value of the second similarity.

The first judging unit 105 is configured to judge whether the second similarity maximum value is smaller than a predetermined second similarity threshold value.

The placement detection unit 106 is configured to, when the first determination unit 105 determines that the second similarity maximum value is smaller than the second similarity threshold value, place the second to-be-processed sample in the second feature space to update the second feature space, and further detect whether the total number of samples in the second feature space reaches a predetermined total number threshold value.

The first processing unit 107 is configured to control the acquiring unit 102 to continue to perform the operation of acquiring the second sample to be processed when the placement detecting unit 106 detects that the total number of samples does not reach the total number threshold.

The training unit 108 is configured to train a preliminary detection model for the predetermined task according to the samples in the second feature space when the placement detection unit 106 detects that the total number of the samples reaches the total number threshold.

The determining unit 109 is configured to obtain a feature extraction algorithm corresponding to a feature extraction part in the preliminary detection model as a first feature extraction algorithm.

The second processing unit 110 is configured to discard the second to-be-processed sample and control the obtaining unit 102 to continue to perform the operation of obtaining the second to-be-processed sample when the first determining unit 105 determines that the second similarity maximum value is greater than or equal to the second similarity threshold value.

Fig. 6 is a block diagram of a sample acquisition system based on a feature space according to an embodiment of the present disclosure, and as shown in Fig. 6, when the determining module 1 adopts the structure shown in Fig. 5, in some embodiments, the sample acquisition system further includes: a second judging module 1a, a third processing module 1b and a fourth processing module 1c.

The second determining module 1a is configured to, after the constructing module 2 constructs the first feature space based on the feature extraction algorithm, use the samples in the second feature space as third samples to be processed, and determine, for each third sample to be processed, whether a maximum value of similarity between the feature vector of the third sample to be processed and the feature vector of each sample in the first feature space is smaller than a first similarity threshold.

The third processing module 1b is configured to, when the second determining module 1a determines that the maximum value of the similarity between the feature vector of the third sample to be processed and the feature vector of each sample in the first feature space is smaller than the first similarity threshold, place the third sample to be processed in the first feature space to update the first feature space.

The fourth processing module 1c is configured to discard the third sample to be processed when the second determining module 1a determines that the maximum value of the similarity between the feature vector of the third sample to be processed and the feature vector of each sample in the first feature space is greater than or equal to the first similarity threshold.

Fig. 7 is a block diagram of a placement detection module according to the present disclosure, and as shown in fig. 7, in some embodiments, the placement detection module 7 includes: a placing unit 701 and a second judging unit 702.

The placing unit 701 is configured to place the first sample to be processed in the first feature space to update the first feature space when the first determining module 6 determines that the first similarity maximum value is smaller than the first similarity threshold value.

The second determining unit 702 is configured to determine whether the sample in the first feature space completely covers the predetermined test data set; when the second judging unit 702 judges that the sample in the first feature space completely covers the test data set, the placement detection module 7 detects that the first feature space meets the predetermined acquisition condition; when the second determining unit 702 determines that the sample in the first feature space does not completely cover the test data set, the placement detection module 7 detects that the first feature space does not satisfy the predetermined acquisition condition.

Further, the second determination unit 702 includes: a calculation subunit and a detection subunit.

The calculation subunit is configured to, for each test sample in the test data set, extract a feature vector of the test sample according to a first feature extraction algorithm, and calculate a maximum value of a similarity between the feature vector of the test sample and the feature vector of each sample in the first feature space.

The detection subunit is configured to detect whether, for every test sample in the test data set, the maximum value of the similarity between the feature vector of the test sample and the feature vector of each sample in the first feature space is greater than or equal to the first similarity threshold. When the detection subunit detects that, for every test sample in the test data set, the maximum value of the similarity between the feature vector of the test sample and the feature vector of each sample in the first feature space is greater than or equal to the first similarity threshold, the second determining unit 702 determines that the first feature space completely covers the test data set; when the detection subunit detects that the maximum value of the similarity between the feature vector of at least one test sample in the test data set and the feature vector of each sample in the first feature space is smaller than the first similarity threshold, the second determining unit 702 determines that the first feature space does not completely cover the test data set.

In some embodiments, the similarity between two feature vectors, S:

where d is the distance between the two feature vectors; further, the distance includes the Euclidean distance.

For specific descriptions of each module, unit, and sub-unit in this embodiment, reference may be made to the content of the description of the corresponding step in the foregoing method embodiment, and details are not described here again.

The embodiment of the present disclosure further provides a server, which includes the sample collection system provided in the foregoing embodiment.

An embodiment of the present disclosure further provides a server, where the server includes: one or more processors and storage; the storage device stores one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors implement the sample collection method provided in the foregoing embodiments.

The embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed, implements the sample collection method provided in the foregoing embodiments.

It will be understood by those of ordinary skill in the art that all or some of the steps of the methods disclosed above, and the functional modules/units in the apparatus, may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purposes of limitation. In some instances, features, characteristics and/or elements described in connection with a particular embodiment may be used alone or in combination with features, characteristics and/or elements described in connection with other embodiments, unless expressly stated otherwise, as would be apparent to one skilled in the art. Accordingly, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as set forth in the appended claims.

It is to be understood that the above embodiments are merely exemplary embodiments that are employed to illustrate the principles of the present disclosure, and that the present disclosure is not limited thereto. It will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the disclosure, and these changes and modifications are also to be considered within the scope of the disclosure.

Claims

1. A sample collection method based on a feature space is characterized by comprising the following steps:

determining a first feature extraction algorithm for a predetermined task, and constructing a first feature space based on the first feature extraction algorithm, wherein the feature space is a multidimensional space established based on the feature extraction algorithm, the dimension of the feature space is equal to the dimension of a feature vector extracted based on the feature extraction algorithm, and the feature attribute of each dimension in the feature space is equal to the attribute corresponding to each feature in the feature vector;

obtaining a first sample to be processed;

when the maximum value of the first similarity is judged to be smaller than the first similarity threshold, the first sample to be processed is placed in the first feature space according to the dimension of the feature vector of the first sample to be processed and the attribute corresponding to each feature in the feature vector so as to update the first feature space, and whether the first feature space meets a preset acquisition condition is further detected;

2. The method according to claim 1, wherein when the maximum value of the first similarity is judged to be greater than or equal to the first similarity threshold, the first sample to be processed is discarded, and the step of obtaining the first sample to be processed continues to be executed.

3. The method of claim 1, wherein the step of determining a first feature extraction algorithm for a predetermined task comprises:

obtaining a second sample to be processed;

4. The method according to claim 3, wherein when the maximum value of the second similarity is judged to be greater than or equal to the second similarity threshold, the second sample to be processed is discarded, and the step of obtaining the second sample to be processed continues to be executed.

5. The method according to claim 3, further comprising, after the step of constructing a first feature space according to the feature extraction algorithm and before the step of obtaining the first to-be-processed sample is performed for the first time:

6. The method according to claim 1, wherein the step of detecting whether the first feature space satisfies a predetermined acquisition condition specifically comprises:

7. The method of claim 6, wherein the step of determining whether the first feature space completely covers a predetermined test data feature space comprises:

8. The method according to any of claims 1-7, wherein the similarity S between two feature vectors is:

where d is the distance between the two feature vectors.

9. The method of claim 8, wherein the distance comprises: the Euclidean distance.

10. A sample acquisition system based on a feature space, comprising:

a constructing module, configured to construct a first feature space based on the first feature extraction algorithm, wherein the feature space is a multidimensional space established based on the feature extraction algorithm, the dimension of the feature space is equal to the dimension of a feature vector extracted based on the feature extraction algorithm, and the feature attribute of each dimension in the feature space is equal to the attribute corresponding to each feature in the feature vector;

an acquisition module, configured to acquire a first sample to be processed;

a placement detection module, configured to, when the first determination module determines that the maximum value of the first similarity is smaller than the first similarity threshold, place the first to-be-processed sample in the first feature space according to the dimension of the feature vector of the first to-be-processed sample and the attribute corresponding to each feature in the feature vector to update the first feature space, and further detect whether the first feature space meets a predetermined acquisition condition;

11. The system of claim 10, further comprising:

12. The system of claim 10, wherein the determining module comprises:

an acquisition unit, configured to acquire a second sample to be processed;

13. The system of claim 12, wherein the determining module further comprises:

14. The system of claim 12, further comprising:

a second judging module, configured to, after the constructing module constructs the first feature space based on the feature extraction algorithm, use a sample in the second feature space as a third sample to be processed, and judge, for each third sample to be processed, whether a maximum value of similarity between a feature vector of the third sample to be processed and feature vectors of samples in the first feature space is smaller than the first similarity threshold;

15. The system of claim 10, wherein the placement detection module comprises:

16. The system according to claim 15, wherein the second determination unit comprises:

17. The system according to any of claims 10-16, wherein the similarity S between two feature vectors is:

where d is the distance between the two feature vectors.

18. The system of claim 17, wherein the distance comprises: the Euclidean distance.

19. A server, comprising:

one or more processors;

a storage device having one or more programs stored thereon;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-9.

20. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1-9.
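For illustration only, the judgment performed on the third samples to be processed (samples taken from the second feature space and compared against the first feature space) might be sketched in Python as follows. The names `seed_first_space`, `extract` and `first_space` are hypothetical, and the mapping S = 1 / (1 + d) is an assumption rather than the formula of the disclosure. That a sample passing the judgment is then placed in the first feature space is likewise an assumption of this sketch.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def similarity(a: Vector, b: Vector) -> float:
    # Assumption: S = 1 / (1 + d), with d the Euclidean distance.
    return 1.0 / (1.0 + math.dist(a, b))


def seed_first_space(
    third_samples: Sequence[object],
    extract: Callable[[object], Vector],
    first_space: List[Vector],
    first_threshold: float,
) -> List[bool]:
    """Judge each third sample against the first feature space.

    Returns one flag per sample: True when the maximum similarity between
    its feature vector and the vectors already in first_space is below
    first_threshold.
    """
    flags: List[bool] = []
    for sample in third_samples:
        vec = extract(sample)  # re-extracted with the first algorithm
        max_sim = max((similarity(vec, v) for v in first_space), default=0.0)
        passed = max_sim < first_threshold
        if passed:
            # Assumption: a passing sample is placed in the first feature space.
            first_space.append(vec)
        flags.append(passed)
    return flags


# Usage: three 2-D points stand in for samples from the second feature space.
first_space: List[Vector] = []
flags = seed_first_space(
    [(0.0, 0.0), (0.0, 0.2), (3.0, 4.0)], lambda p: p, first_space, 0.5
)
```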