CN114444576A - Data sampling method and device, electronic equipment and storage medium - Google Patents

Data sampling method and device, electronic equipment and storage medium

Info

Publication number
CN114444576A
CN114444576A (application CN202111665473.XA)
Authority
CN
China
Prior art keywords
value
difficult
group
samples
negative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111665473.XA
Other languages
Chinese (zh)
Inventor
吴曙楠
王方舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202111665473.XA
Publication of CN114444576A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/211 Selection of the most significant subset of features


Abstract

The present disclosure relates to a data sampling method, apparatus, electronic device and storage medium, including: acquiring original data, where each record comprises an index value and at least one feature value, and the original data is divided into positive samples, whose index values are greater than a preset threshold, and negative samples, whose index values are not greater than it; dividing the negative samples into a plurality of groups according to their index values; calculating, for each group, a difference value between the feature values of its negative samples and those of the positive samples; dividing the groups into difficult groups and non-difficult groups according to the difference values; and performing hierarchical sampling on the non-difficult groups to obtain non-difficult samples, with the positive samples, the negative samples in the difficult groups, and the non-difficult samples taken as the sampling result of the original data. In this way, the full volume of original data need not be processed during sampling, so fewer computing resources and less time are consumed.

Description

Data sampling method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a data sampling method and apparatus, an electronic device, and a storage medium.
Background
In some scenarios, it is often necessary to sample original data according to a certain rule to obtain a positive sample and a negative sample, and then perform model training and verification using the positive sample and the negative sample, so that the trained model can be used for performing two-classification on the index value of the data to be analyzed, that is, predicting whether the index value of the data to be analyzed can reach a certain threshold, for example, predicting whether a certain video can reach a predetermined playing amount, and the like.
In some cases, due to the complexity of the raw data's composition, there is no clear boundary between the positive and negative samples; many hard-to-distinguish "gray zone" records lie near that boundary, which can bias the model's learning and leave the model ineffective on the minority of samples.
In the prior art, various algorithms can be used to downsample the original data to avoid biased learning of the model. For example, data-level sampling methods can downsample the original data with a distance-based N-nearest-neighbor algorithm; cost-sensitive learning methods assign higher weights to positive and negative samples so that the final cost calculation is biased in a specific direction; and ensemble learning methods combine data-level or algorithm-level approaches with ensemble learning to obtain a strong ensemble classifier.
However, all of these algorithms need to process the full volume of original data. In practical application scenarios, where the original data volume is large, their computation cost is too high: they take a long time and consume substantial computing resources.
Disclosure of Invention
The present disclosure provides a data sampling method, an apparatus, an electronic device, and a storage medium, so as to solve at least the problems of high computation cost, long time consumption, and large consumption of computation resources in the related art. The technical scheme of the disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, there is provided a data sampling method, including:
acquiring original data, wherein each original data comprises an index value and at least one characteristic value, the original data is divided into a positive sample and a negative sample, the index value of the positive sample is greater than a preset threshold value, and the index value of the negative sample is not greater than the preset threshold value;
dividing the negative samples into a plurality of groups according to the index values of the negative samples;
calculating a difference value between the characteristic value of the negative sample and the characteristic value of the positive sample in each group;
dividing the groups into difficult groups and non-difficult groups according to the difference values;
and performing hierarchical sampling on the non-difficult groups to obtain non-difficult samples, and taking the positive samples, the negative samples in the difficult groups, and the non-difficult samples as the sampling result of the original data.
Optionally, the dividing the negative examples into a plurality of groups according to the index values of the negative examples includes:
dividing the negative samples into a plurality of groups according to the index values of the negative samples and a preset grouping rule; or, alternatively,
performing equal segmentation on the negative samples according to their index values, dividing them into a preset number of groups.
Optionally, the calculating a difference value between the feature value of the negative sample and the feature value of the positive sample in each group includes:
calculating a first statistical parameter of any characteristic value of each negative sample in any group;
calculating a second statistical parameter of the any characteristic value of the positive sample;
calculating the Euclidean distance between the first statistical parameter and the second statistical parameter to obtain a target difference value between any characteristic value of any group and any characteristic value of the positive sample;
and averaging the target difference values, and normalizing the average result to obtain the difference value between the characteristic value of the negative sample and the characteristic value of the positive sample in any group.
Optionally, the first statistical parameter and the second statistical parameter include any one or more of the following:
the mean, the variance, the 25th percentile, the 50th percentile, and the 75th percentile.
Optionally, the averaging the target difference values and normalizing the average result to obtain the difference value between the feature value of the negative sample and the feature value of the positive sample in any group includes:
and performing a weighted average of the target difference values according to the preset weight of each feature value, and normalizing the averaged result to obtain the difference value between the feature values of the negative samples in the group and those of the positive samples.
Optionally, the dividing the group into a difficult group and a non-difficult group according to the difference value includes:
sorting the groups according to the sequence of the difference values from large to small, and drawing a second-order difference curve of the groups based on a sorting result;
taking the difference value of the group corresponding to the first peak value in the reverse order of the second-order difference curve as a screening threshold value;
and taking the group with the difference value smaller than the screening threshold value as a difficult group, and taking the group with the difference value not smaller than the screening threshold value as a non-difficult group.
Optionally, the performing hierarchical sampling on the non-difficult group to obtain a non-difficult sample includes:
calculating the sampling number of any one non-difficult group according to the difference value of any one non-difficult group and the number of negative samples in any one non-difficult group;
and sampling the non-difficult groups according to the sampling number of the non-difficult groups to obtain non-difficult samples.
Optionally, the calculating the sampling number of any one of the non-difficult groups according to the difference value of any one of the non-difficult groups and the number of negative samples in any one of the non-difficult groups includes:
if the difference value of any one non-difficult group is not 1, calculating the difference between 1 and the difference value of any one non-difficult group, and calculating the product of the difference and the number of negative samples in any one non-difficult group as the sampling number of any one non-difficult group;
and under the condition that the difference value of any one non-difficult group is 1, setting the difference value of any one non-difficult group as a preset value or the maximum value of the difference values of all the non-difficult groups, calculating the difference between 1 and the difference value of any one non-difficult group, and calculating the product of the difference and the number of negative samples in any one non-difficult group as the sampling number of any one non-difficult group.
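As an illustration only (the patent gives no code), the sampling-count rule in the two clauses above can be sketched in Python; the preset fallback value 0.9 and the rounding to the nearest integer are assumptions of this sketch:

```python
def sampling_count(diff_value, group_size, fallback_diff=0.9):
    """Number of negatives to draw from a non-difficult group:
    (1 - difference value) * group size, so groups that differ more from
    the positives contribute fewer samples.  A difference value of exactly 1
    (the most different group after normalization) would yield zero samples,
    so it is replaced by a preset value here; the patent alternatively allows
    the maximum difference value of the other non-difficult groups."""
    if diff_value == 1:
        diff_value = fallback_diff  # hypothetical preset
    return round((1 - diff_value) * group_size)
```

For example, a group of 100 negatives with a difference value of 0.5 contributes 50 samples, while a group at the normalization maximum of 1 falls back to the preset and contributes 10.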
Optionally, after the performing hierarchical sampling on the non-difficult group to obtain a non-difficult sample, the method further includes:
and if the number of the non-difficult samples is smaller than the number of the positive samples, re-sampling the non-difficult groups according to the number of the positive samples so as to enable the number of the obtained non-difficult samples to be the same as the number of the positive samples.
Optionally, after the performing hierarchical sampling on the non-difficult group to obtain non-difficult samples, and taking the positive samples, the negative samples in the difficult group, and the non-difficult samples as sampling results of the original data, the method further includes:
performing model training by using the sampling result to obtain a target model;
and processing the test sample by using the target model, and judging that the sampling result meets the requirement if the accuracy of the processing result reaches a preset accuracy threshold value.
According to a second aspect of the embodiments of the present disclosure, there is provided a data sampling apparatus comprising:
an acquisition unit configured to acquire raw data, where each record comprises an index value and at least one feature value, the raw data being divided into positive samples, whose index values are greater than a preset threshold, and negative samples, whose index values are not greater than the preset threshold;
a grouping unit configured to perform grouping of the negative examples into a plurality of groups according to the index values of the negative examples;
a calculation unit configured to perform calculation of a difference value between a feature value of a negative sample and a feature value of the positive sample in each group;
a screening unit configured to perform a classification of the groups into difficult groups and non-difficult groups according to the difference values;
the sampling unit is configured to perform hierarchical sampling on the non-difficult groups to obtain non-difficult samples, and the positive samples, the negative samples in the difficult groups and the non-difficult samples are used as sampling results of the original data.
Optionally, the grouping unit is further configured to perform:
dividing the negative samples into a plurality of groups according to the index values of the negative samples and a preset grouping rule; or, alternatively,
performing equal segmentation on the negative samples according to their index values, dividing them into a preset number of groups.
Optionally, the computing unit is further configured to perform:
calculating a first statistical parameter of any characteristic value of each negative sample in any group;
calculating a second statistical parameter of the any characteristic value of the positive sample;
calculating the Euclidean distance between the first statistical parameter and the second statistical parameter to obtain a target difference value between any characteristic value of any group and any characteristic value of the positive sample;
and averaging the target difference values, and normalizing the average result to obtain the difference value between the characteristic value of the negative sample and the characteristic value of the positive sample in any group.
Optionally, the first statistical parameter and the second statistical parameter include any one or more of the following:
the mean, the variance, the 25th percentile, the 50th percentile, and the 75th percentile.
Optionally, the computing unit is further configured to perform:
and performing a weighted average of the target difference values according to the preset weight of each feature value, and normalizing the averaged result to obtain the difference value between the feature values of the negative samples in the group and those of the positive samples.
Optionally, the screening unit is configured to perform:
sorting the groups according to the sequence of the difference values from large to small, and drawing a second-order difference curve of the groups based on a sorting result;
taking the difference value of the group corresponding to the first peak value in the reverse order of the second-order difference curve as a screening threshold value;
and taking the group with the difference value smaller than the screening threshold value as a difficult group, and taking the group with the difference value not smaller than the screening threshold value as a non-difficult group.
Optionally, the sampling unit is configured to perform:
calculating the sampling number of any one non-difficult group according to the difference value of any one non-difficult group and the number of negative samples in any one non-difficult group;
and sampling the non-difficult groups according to the sampling number of the non-difficult groups to obtain non-difficult samples.
Optionally, the sampling unit is further configured to perform:
under the condition that the difference value of any one non-difficult group is not 1, calculating the difference between 1 and the difference value of any one non-difficult group, and calculating the product of the difference and the number of negative samples in any one non-difficult group as the sampling number of any one non-difficult group;
and under the condition that the difference value of any one non-difficult group is 1, setting the difference value of any one non-difficult group as a preset value or the maximum value of the difference values of all the non-difficult groups, calculating the difference between 1 and the difference value of any one non-difficult group, and calculating the product of the difference and the number of negative samples in any one non-difficult group as the sampling number of any one non-difficult group.
Optionally, the sampling unit is further configured to perform:
and if the number of the non-difficult samples is smaller than the number of the positive samples, re-sampling the non-difficult groups according to the number of the positive samples so as to enable the number of the obtained non-difficult samples to be the same as the number of the positive samples.
Optionally, the apparatus further comprises:
the verification unit is also configured to execute model training by using the sampling result to obtain a target model; and processing the test sample by using the target model, and judging that the sampling result meets the requirement if the accuracy of the processing result reaches a preset accuracy threshold value.
According to a third aspect of embodiments of the present disclosure, there is provided a data sampling electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement any of the data sampling methods described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having instructions which, when executed by a processor of a data sampling electronic device, enable the data sampling electronic device to perform any of the data sampling methods described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the data sampling method of any one of the above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
acquiring original data, where each record comprises an index value and at least one feature value, and the original data is divided into positive samples, whose index values are greater than a preset threshold, and negative samples, whose index values are not greater than it; dividing the negative samples into a plurality of groups according to their index values; calculating, for each group, a difference value between the feature values of its negative samples and those of the positive samples; dividing the groups into difficult groups and non-difficult groups according to the difference values; and performing hierarchical sampling on the non-difficult groups to obtain non-difficult samples, with the positive samples, the negative samples in the difficult groups, and the non-difficult samples taken as the sampling result of the original data.
Therefore, by grouping the original data, the negative samples are divided into difficult groups and non-difficult groups. The difficult groups have small difference values relative to the positive samples and are hard to distinguish from them, so all of their negative samples are retained; the non-difficult groups have large difference values, so hierarchical sampling can be applied and only part of their samples retained. Because the full volume of original data need not be processed during sampling, fewer computing resources and less time are consumed.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow chart illustrating a method of data sampling in accordance with an exemplary embodiment.
Fig. 2 is a graph of difference values sorted in descending order, according to an exemplary embodiment.
Fig. 3 is a first order difference plot based on the difference value plot shown in fig. 2, according to an example embodiment.
Fig. 4 is a second order difference plot based on the first order difference plot shown in fig. 3, according to an example embodiment.
Fig. 5 is a block diagram illustrating a data sampling apparatus according to an example embodiment.
FIG. 6 is a block diagram illustrating an electronic device for data sampling in accordance with an exemplary embodiment.
Fig. 7 is a block diagram illustrating an apparatus for data sampling in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flow chart illustrating a data sampling method according to an exemplary embodiment, the data sampling method including the following steps, as shown in fig. 1.
In step S11, raw data are obtained, where each raw data includes an index value and at least one feature value, the raw data are divided into positive samples and negative samples, the index value of the positive sample is greater than a preset threshold, and the index value of the negative sample is not greater than the preset threshold.
In the present disclosure, the raw data is data for training a model to be trained, each raw data includes a plurality of data items, where a data item corresponding to a business index of the model to be trained may be referred to as an index value, and a data item input to the model to be trained for predicting the index value may be referred to as a feature value. The feature value may be one or more, and is not particularly limited.
According to the magnitude of the index value, the original data can be divided into positive samples, whose index values are greater than the preset threshold, and negative samples, whose index values are not greater than it. The preset threshold can be determined according to the business index of the model to be trained.
For example, if the task of the model to be trained is to predict, from input video features, whether a video's play count will reach 1,000,000, then the video features are the feature values, the video play count is the index value, and the preset threshold is 1,000,000.
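A minimal sketch of this positive/negative split, assuming a hypothetical record layout with an `index` field (the index value, e.g. play count) and feature fields:

```python
def split_samples(records, threshold=1_000_000):
    """Split raw records into positive samples (index value above the
    threshold) and negative samples (index value at or below it)."""
    positives = [r for r in records if r["index"] > threshold]
    negatives = [r for r in records if r["index"] <= threshold]
    return positives, negatives

# toy records: an index value plus one feature value each
records = [{"index": 2_500_000, "f": 0.9}, {"index": 40_000, "f": 0.1},
           {"index": 999_999, "f": 0.5}]
pos, neg = split_samples(records)
```

Note that a record exactly at the threshold lands in the negatives, matching the "not greater than" wording above.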
In step S12, the negative examples are divided into a plurality of groups according to their index values.
In this step, dividing the negative samples into a plurality of groups according to their index values may specifically be: dividing the negative samples into a plurality of groups according to their index values and a preset grouping rule. The preset grouping rule may be determined from business experience with the model to be trained; for example, negative samples with a play count below 1,000,000 may be divided into 4 groups: 0 to 100,000, 100,000 to 500,000, 500,000 to 800,000, and 800,000 to 1,000,000. Grouping in this way is more reasonable, can be tuned to the task indexes of the model to be trained, is more targeted, and makes subsequent model training more efficient.
Alternatively, in another implementation, dividing the negative samples into a plurality of groups according to their index values may include: performing equal segmentation on the negative samples according to their index values, dividing them into a preset number of groups. Continuing the example above, video data with a play count of 1,000,000 or less can be split into ten equal parts, yielding ten groups: 0 to 100,000, 100,000 to 200,000, 200,000 to 300,000, and so on up to 900,000 to 1,000,000. This reduces the dependence on business experience and enables data sampling for new indexes of new businesses.
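The equal segmentation of the example (ten 100,000-wide play-count groups) might be sketched as follows; the equal-width reading, the helper name, and the bin edges are assumptions of this sketch:

```python
import numpy as np

def split_into_bins(index_values, num_groups=10, upper=1_000_000):
    """Equal-width split of negative samples by index value, mirroring the
    example's ten 100,000-wide play-count groups (0-100k, ..., 900k-1M)."""
    interior_edges = np.linspace(0, upper, num_groups + 1)[1:-1]
    # bin id per sample: edge[i-1] < value <= edge[i]
    bin_ids = np.digitize(index_values, interior_edges, right=True)
    return [np.flatnonzero(bin_ids == g) for g in range(num_groups)]

# play counts of four negative samples (all at or below 1,000,000)
groups = split_into_bins(np.array([50_000, 150_000, 950_000, 100_000]))
```

Each entry of `groups` holds the sample indices falling into one play-count range; a boundary value such as 100,000 goes to the lower group, matching the closed upper ends above.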
In step S13, a difference value between the feature value of the negative sample and the feature value of the positive sample in each group is calculated.
Each group contains at least one negative sample, and each negative sample has at least one feature value. For each feature value, the difference between the positive samples and the group's negative samples can be calculated; the per-feature differences are then combined into the group's overall difference value with respect to the positive samples.
Specifically, first, a first statistical parameter of any feature value of the negative samples in a group may be calculated, along with a second statistical parameter of the same feature value of the positive samples. For example, the first and second statistical parameters may include any one or more of: the mean, the variance, and the 25th, 50th, and 75th percentiles. To obtain the percentiles, the samples in a group are sorted by the feature value in ascending order and split into four equal parts; the feature values at the three split points are the 25th, 50th, and 75th percentiles, respectively.
In the present disclosure, the first statistical parameters may be denoted as follows: a first mean avg_group, a first variance std_group, a first 25th percentile 25_group, a first 50th percentile 50_group, and a first 75th percentile 75_group; and the second statistical parameters as: a second mean avg_top, a second variance std_top, a second 25th percentile 25_top, a second 50th percentile 50_top, and a second 75th percentile 75_top.
Then, calculating the Euclidean distance between the first statistical parameter and the second statistical parameter to obtain a target difference value between any characteristic value of any group and any characteristic value of the positive sample.
For example, the euclidean distance between the first statistical parameter and the second statistical parameter may be calculated using the following formula:
diff_n = sqrt((avg_top - avg_group)^2 + (std_top - std_group)^2 + (25_top - 25_group)^2 + (50_top - 50_group)^2 + (75_top - 75_group)^2)
where diff_n denotes the target difference value between the group and the positive samples for feature value n; each feature value has its own target difference value.
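A sketch of the per-feature target difference value diff_n in NumPy; whether std_top/std_group denote a variance or a standard deviation is ambiguous in the translation, and np.std is assumed here:

```python
import numpy as np

def target_difference(group_vals, positive_vals):
    """diff_n: Euclidean distance between five statistics (mean, std, and
    the 25th/50th/75th percentiles) of one feature over a negative group
    and the same statistics over the positive samples."""
    def stats(v):
        v = np.asarray(v, dtype=float)
        return np.array([v.mean(), v.std(),
                         np.percentile(v, 25),
                         np.percentile(v, 50),
                         np.percentile(v, 75)])
    return float(np.linalg.norm(stats(positive_vals) - stats(group_vals)))

# identical distributions give 0; shifting every value by 1 moves the mean
# and all three percentiles by 1 while leaving std unchanged -> sqrt(4) = 2
a = np.array([1.0, 2.0, 3.0, 4.0])
d_same = target_difference(a, a)
d_shift = target_difference(a, a + 1.0)
```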
And further, averaging the target difference values, and normalizing the average result to obtain the difference value between the characteristic value of the negative sample and the characteristic value of the positive sample in any group.
The target difference values can be averaged using the following formula, and the per-group averages then normalized; during normalization, the group with the largest average is assigned a difference value of 1 and the group with the smallest average a difference value of 0, yielding the difference value of each group:
diff_scale_i = (diff_1 + diff_2 + ... + diff_n) / n
where diff_scale_i denotes the difference value between group i and the positive samples; diff_1, diff_2, ..., diff_n denote the target difference values of feature values 1 through n; and n denotes the number of feature values.
Or, in another implementation manner, the target difference values may be weighted and averaged by using a preset weight of each feature value, and then, the average result is normalized to obtain the difference value between the feature value of the negative sample and the feature value of the positive sample in any group.
In this way, data sampling can be adjusted to the business requirements of the model to be trained: if a certain feature value matters strongly, its preset weight can be increased appropriately, making the sampling more reasonable and further improving the training effect. Continuing the example above, the follower count of a video's author correlates strongly with its play count, so a large preset weight may be set for that feature value.
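The averaging (weighted or not) and min-max normalization described above might look like this sketch; the (num_groups, num_features) array layout is an assumption:

```python
import numpy as np

def group_difference_values(target_diffs, feature_weights=None):
    """target_diffs: shape (num_groups, num_features), each row holding one
    group's per-feature target difference values diff_1..diff_n.  Returns
    one difference value per group: a (weighted) mean over features, min-max
    normalized so the most different group maps to 1 and the least to 0."""
    avg = np.average(target_diffs, axis=1, weights=feature_weights)
    return (avg - avg.min()) / (avg.max() - avg.min())

# three groups, two features; unweighted row means are 3, 1 and 4
diffs = group_difference_values(np.array([[2.0, 4.0], [1.0, 1.0], [3.0, 5.0]]))
```

Passing `feature_weights` implements the weighted variant; `np.average` normalizes the weights internally, so they need not sum to 1.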
In step S14, the groups are divided into difficult groups and non-difficult groups according to the difference values.
In this step, a preset difference value threshold is used. For a group whose difference value is greater than the threshold, the negative samples differ substantially from the positive samples and are easy to distinguish, so the group is treated as a non-difficult group. For a group whose difference value is not greater than the threshold, the negative samples differ little from the positive samples and are hard to distinguish, so the group is treated as a difficult group.
Alternatively, in another implementation, the following steps may be taken to divide the groups into difficult and non-difficult groups:
First, sort the groups in descending order of difference value, and draw a second-order difference curve of the groups based on the sorting result. Then take the difference value of the group corresponding to the first peak, in reverse order, of the second-order difference curve as a screening threshold. Finally, take the groups whose difference value is smaller than the screening threshold as difficult groups, and the groups whose difference value is not smaller than the screening threshold as non-difficult groups.
If a group corresponds to a peak point of the second-order difference curve, the difference value of the next group drops abruptly relative to this group (the values are sorted from large to small). A smaller difference value means a smaller gap between the positive and negative samples in a group and greater difficulty in distinguishing them, so this group's difference value can be used as the screening threshold, and the groups whose difference value is smaller than the threshold, in other words the groups sorted after this group, are taken as difficult groups.
For example, as shown in fig. 2, a graph of the difference values is drawn in descending order. It is formed by connecting 10 points; points 1 to 10 represent the 10 groups sorted by difference value from large to small, namely group 1, group 2, ..., group 10, and the ordinate is the difference value of each group. As shown in fig. 3, a first-order difference graph is drawn based on fig. 2: the first point is a null value, the ordinate of the second point is the difference between the ordinates of the second and first points in fig. 2, that is, the first-order difference corresponding to group 1, and so on; the ordinate is the first-order difference value. As shown in fig. 4, a second-order difference graph is drawn based on fig. 3: the first and second points are null values, the ordinate of the third point is the difference between the ordinates of the third and second points in fig. 3, that is, the second-order difference corresponding to group 1, and so on; the ordinate is the second-order difference value.
In fig. 4, the first peak point in the reverse order is the seventh point; it corresponds to the sixth point in fig. 3 and to group 5 in fig. 2. Therefore, the difference value of group 5 is the screening threshold, and the groups whose difference value is smaller than that of group 5, that is, group 6 and the groups sorted after it in fig. 2, are the difficult groups.
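The threshold selection of this implementation can be sketched as follows. The peak test used here (a point strictly larger than both neighbours, scanned from the end of the curve) is an assumption, since the text only speaks of "the first peak value in the reverse order":

```python
def screening_threshold(diff_values):
    """diff_values: per-group difference values (any order).

    Sorts descending, builds first- and second-order differences, scans the
    second-order curve from the end for the first local peak, and returns
    the difference value of the corresponding group as the threshold.
    """
    v = sorted(diff_values, reverse=True)
    d1 = [v[i] - v[i - 1] for i in range(1, len(v))]      # first-order differences
    d2 = [d1[i] - d1[i - 1] for i in range(1, len(d1))]   # second-order differences
    # d2[j] corresponds to the group at index j of the sorted list
    for j in range(len(d2) - 2, 0, -1):                   # reverse-order scan
        if d2[j] > d2[j - 1] and d2[j] > d2[j + 1]:       # local peak (assumed test)
            return v[j]
    return v[-1]  # fallback: no interior peak found
```

For the ten-group example above, the abrupt drop after the fifth group makes its difference value the returned threshold, and the groups sorted after it become the difficult groups.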
In step S15, the non-difficult groups are hierarchically sampled to obtain non-difficult samples, and the positive samples, the negative samples in the difficult groups, and the non-difficult samples are used as the sampling results of the original data.
In this step, the non-difficult groups are hierarchically sampled, and the same number of negative samples may be respectively collected in each non-difficult group, or the number of samples corresponding to each non-difficult group may be determined according to the difference value of each non-difficult group. In each non-difficult group, random sampling may be performed, or negative samples in the group may be sampled after being screened, which is not limited specifically.
In one implementation, the hierarchical sampling of the non-difficult groups to obtain non-difficult samples may include the following steps:
first, the sampling number of any non-difficult group is calculated according to the difference value of any non-difficult group and the number of negative samples in any non-difficult group.
For example, the number of samples for a non-difficult group can be calculated by taking the following equation:
sample_cnt_i=cnt_i*(1-diff_scale_i)
where sample_cnt_i represents the sampling number of non-difficult group i, diff_scale_i represents the difference value of the group, and cnt_i represents the number of negative samples in the group.
It can be understood that, when the difference values of the groups are determined, the normalization sets the difference value of the group with the largest average result to 1, so one of the non-difficult groups has a difference value of 1; in this case, that group's difference value can be adjusted.
Specifically, when the difference value of a non-difficult group is 1, it may be reset to a preset value or to the maximum of the other non-difficult groups' difference values; for example, it may be adjusted to 0.9, or the difference value of the group ranked immediately after it may be used. The above formula is then applied: the difference between 1 and the group's difference value is calculated, and its product with the number of negative samples in the group is taken as the group's sampling number.
For a non-difficult group whose difference value is not 1, the above formula can be applied directly without adjustment: the difference between 1 and the group's difference value is calculated, and its product with the number of negative samples in the group is taken as the group's sampling number.
And then, sampling the non-difficult groups according to the sampling number of the non-difficult groups to obtain non-difficult samples. In this way, a sampling result of the original data is obtained, which includes positive samples, negative samples in the difficult group, and non-difficult samples.
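A sketch of the sampling-count computation and the draw, assuming random sampling within each group (which the text leaves open), rounding to the nearest integer (the rounding rule is not specified), and a fallback value of 0.9 for the group whose difference value is 1:

```python
import random

def sample_non_difficult(groups, diff_values, fallback=0.9):
    """groups: lists of negative samples per non-difficult group;
    diff_values: matching difference values in [0, 1]."""
    picked = []
    for samples, d in zip(groups, diff_values):
        if d == 1:
            d = fallback                        # assumed preset value; avoids a count of 0
        cnt = round(len(samples) * (1 - d))     # sample_cnt_i = cnt_i * (1 - diff_scale_i)
        picked.extend(random.sample(samples, cnt))
    return picked
```

So a group of 100 negative samples with a difference value of 1 contributes 10 samples (after the 0.9 adjustment), while a group of 50 with a difference value of 0.5 contributes 25.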
As can be seen from the foregoing, the positive samples and the non-difficult samples, whose difference values from the positive samples are larger, belong to the samples that are easy to distinguish (and the number of positive samples is often small), while the negative samples in the difficult groups, whose difference values from the positive samples are smaller, belong to the samples that are hard to distinguish.
In one implementation, after the non-difficult samples are obtained by hierarchically sampling the non-difficult groups, their number may be further adjusted. Specifically, if the number of non-difficult samples is smaller than the number of positive samples, the non-difficult groups are re-sampled according to the number of positive samples so that the obtained number of non-difficult samples equals the number of positive samples; alternatively, the number of non-difficult samples may be adjusted to a preset sample threshold. This avoids the model to be trained failing to learn the negative samples because there are too few non-difficult samples.
The resampling for the non-difficult groups may be performed by performing the same number of samplings from each non-difficult group, or may also be performed by determining the resampling number corresponding to each non-difficult group according to a difference value between the positive sample and each non-difficult group, which is not limited specifically.
In the disclosure, after the sampling result of the original data is obtained, model training can be further performed by using the sampling result to obtain a target model; and processing the test sample by using the target model, and judging that the sampling result meets the requirement if the accuracy of the processing result reaches a preset accuracy threshold value, thereby realizing the inspection of the sampling result.
As can be seen from the above, in the technical scheme provided by the embodiments of the disclosure, the original data is grouped and the negative samples are divided into difficult groups and non-difficult groups. The difficult groups have small difference values from the positive samples and are hard to distinguish, so all of their negative samples are retained; the non-difficult groups have large difference values from the positive samples, so they can be hierarchically sampled and only part of their negative samples needs to be retained. Since the full amount of original data does not need to be processed during sampling, both the computing resources and the time consumed are small.
FIG. 5 is a block diagram illustrating a data sampling apparatus according to an exemplary embodiment, the apparatus comprising:
an obtaining unit 201 configured to perform obtaining raw data, where each raw data includes an index value and at least one feature value, the raw data is divided into a positive sample and a negative sample, the index value of the positive sample is greater than a preset threshold, and the index value of the negative sample is not greater than the preset threshold;
a grouping unit 202 configured to perform grouping of the negative examples into a plurality of groups according to the index values of the negative examples;
a calculating unit 203 configured to perform calculating a difference value between the feature value of the negative sample and the feature value of the positive sample in each group;
a screening unit 204 configured to perform a classification of the groups into difficult groups and non-difficult groups according to the difference values;
a sampling unit 205 configured to perform hierarchical sampling on the non-difficult group, obtain non-difficult samples, and use the positive samples, the negative samples in the difficult group, and the non-difficult samples as sampling results of the original data.
In one implementation, the grouping unit 202 is further configured to perform:
dividing the negative samples into a plurality of groups according to the index values of the negative samples and a preset grouping rule; or, alternatively,
and performing equivalent segmentation on the negative samples according to the index values of the negative samples, and dividing the negative samples into a preset number of groups.
In one implementation, the computing unit 203 is further configured to perform:
calculating a first statistical parameter of any characteristic value of each negative sample in any group;
calculating a second statistical parameter of the any characteristic value of the positive sample;
calculating the Euclidean distance between the first statistical parameter and the second statistical parameter to obtain a target difference value between any characteristic value of any group and any characteristic value of the positive sample;
and normalizing the target difference value to obtain a difference value between the characteristic value of the negative sample and the characteristic value of the positive sample in any group.
In one implementation, the first statistical parameter and the second statistical parameter include any one or more of:
mean, variance, 25 quantile, 50 quantile, and 75 quantile.
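As an illustration of these statistical parameters and the Euclidean-distance step, assuming a simple index-based quantile (the quantile method is not specified in the text) and illustrative function names:

```python
import statistics

def stat_params(values):
    """Parameter vector (mean, variance, 25/50/75 quantile) for one feature value."""
    s = sorted(values)
    q = lambda p: s[min(len(s) - 1, int(p * len(s)))]   # simple index-based quantile
    return [statistics.mean(s), statistics.pvariance(s), q(0.25), q(0.5), q(0.75)]

def target_difference(neg_feature_values, pos_feature_values):
    """Euclidean distance between a group's negative-sample parameter vector
    and the positive-sample parameter vector for the same feature value."""
    a, b = stat_params(neg_feature_values), stat_params(pos_feature_values)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```

Identical distributions give a target difference of 0, and the distance grows as the group's negative samples drift away from the positive samples on that feature.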
In one implementation, the computing unit 203 is further configured to perform:
normalizing the target difference value to obtain a normalized target difference value;
and according to the preset weight of each characteristic value, carrying out weighting and averaging on the normalized target difference values to obtain the difference value between the characteristic value of the negative sample and the characteristic value of the positive sample in any group.
In one implementation, the screening unit 204 is configured to perform:
sorting the groups according to the sequence of the difference values from large to small, and drawing a second-order difference curve of the groups based on a sorting result;
taking the difference value of the group corresponding to the first peak value in the reverse order of the second-order difference curve as a screening threshold value;
and taking the group with the difference value smaller than the screening threshold value as a difficult group, and taking the group with the difference value not smaller than the screening threshold value as a non-difficult group.
In one implementation, the sampling unit 205 is configured to perform:
calculating the sampling number of any one non-difficult group according to the difference value of any one non-difficult group and the number of negative samples in any one non-difficult group;
and sampling the non-difficult groups according to the sampling number of the non-difficult groups to obtain non-difficult samples.
In one implementation, the sampling unit 205 is further configured to perform:
under the condition that the difference value of any one non-difficult group is not 1, calculating the difference between 1 and the difference value of any one non-difficult group, and calculating the product of the difference and the number of negative samples in any one non-difficult group as the sampling number of any one non-difficult group;
and under the condition that the difference value of any one non-difficult group is 1, setting the difference value of any one non-difficult group as a preset value or the maximum value of the difference values of all the non-difficult groups, calculating the difference between 1 and the difference value of any one non-difficult group, and calculating the product of the difference and the number of negative samples in any one non-difficult group as the sampling number of any one non-difficult group.
In one implementation, the sampling unit 205 is further configured to perform:
and if the number of the non-difficult samples is smaller than the number of the positive samples, re-sampling the non-difficult groups according to the number of the positive samples so as to enable the number of the obtained non-difficult samples to be the same as the number of the positive samples.
In one implementation, the apparatus further includes:
the verification unit is also configured to execute model training by using the sampling result to obtain a target model; and processing the test sample by using the target model, and judging that the sampling result meets the requirement if the accuracy of the processing result reaches a preset accuracy threshold.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
FIG. 6 is a block diagram illustrating an electronic device for data sampling in accordance with an exemplary embodiment.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as a memory comprising instructions, executable by a processor of an electronic device to perform the above-described method is also provided. Alternatively, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, which, when run on a computer, causes the computer to implement the above-described method of data sampling.
Fig. 7 is a block diagram illustrating an apparatus 800 for data sampling in accordance with an example embodiment.
For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 7, the apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 806 provides power to the various components of the device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800. The sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The apparatus 800 may access a wireless network based on a communication standard, such as WiFi, an operator network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the methods of the first and second aspects.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. Alternatively, for example, the storage medium may be a non-transitory computer-readable storage medium, such as a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the data sampling method as described in any of the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of data sampling, comprising:
acquiring original data, wherein each original data comprises an index value and at least one characteristic value, the original data is divided into a positive sample and a negative sample, the index value of the positive sample is greater than a preset threshold value, and the index value of the negative sample is not greater than the preset threshold value;
dividing the negative samples into a plurality of groups according to the index values of the negative samples;
calculating a difference value between the characteristic value of the negative sample and the characteristic value of the positive sample in each group;
dividing the groups into difficult groups and non-difficult groups according to the difference values;
and carrying out layered sampling on the non-difficult groups to obtain non-difficult samples, and taking the positive samples, the negative samples in the difficult groups and the non-difficult samples as sampling results of the original data.
2. The data sampling method of claim 1, wherein the dividing the negative examples into a plurality of groups according to the index values of the negative examples comprises:
dividing the negative samples into a plurality of groups according to the index values of the negative samples and a preset grouping rule; or, alternatively,
and performing equivalent segmentation on the negative samples according to the index values of the negative samples, and dividing the negative samples into a preset number of groups.
3. The data sampling method of claim 1, wherein the calculating a difference value between the eigenvalue of the negative sample and the eigenvalue of the positive sample in each group comprises:
calculating a first statistical parameter of any characteristic value of each negative sample in any group;
calculating a second statistical parameter of the any characteristic value of the positive sample;
calculating the Euclidean distance between the first statistical parameter and the second statistical parameter to obtain a target difference value between any characteristic value of any group and any characteristic value of the positive sample;
and averaging the target difference values, and normalizing the average result to obtain the difference value between the characteristic value of the negative sample and the characteristic value of the positive sample in any group.
4. The data sampling method of claim 3, wherein the first statistical parameter and the second statistical parameter comprise any one or more of:
mean, variance, 25 quantile, 50 quantile, and 75 quantile.
5. The data sampling method of claim 3, wherein the averaging the target difference values and normalizing the average result to obtain the difference value between the feature value of the negative sample and the feature value of the positive sample in any group comprises:
weighting and averaging the target difference values according to the preset weight of each characteristic value, and normalizing the averaged result to obtain the difference value between the characteristic value of the negative sample and the characteristic value of the positive sample in any group.
6. The data sampling method of claim 1, wherein the dividing the groups into difficult and non-difficult groups according to the difference values comprises:
sorting the groups according to the sequence of the difference values from large to small, and drawing a second-order difference curve of the groups based on a sorting result;
taking the difference value of the group corresponding to the first peak value in the reverse order of the second-order difference curve as a screening threshold value;
and taking the group with the difference value smaller than the screening threshold value as a difficult group, and taking the group with the difference value not smaller than the screening threshold value as a non-difficult group.
7. A data sampling apparatus, comprising:
an acquisition unit configured to acquire raw data, wherein each item of raw data comprises an index value and at least one feature value, the raw data is divided into positive samples and negative samples, the index value of a positive sample is greater than a preset threshold, and the index value of a negative sample is not greater than the preset threshold;
a grouping unit configured to divide the negative samples into a plurality of groups according to the index values of the negative samples;
a calculation unit configured to calculate, for each group, a difference value between the feature values of the negative samples in the group and the feature values of the positive samples;
a screening unit configured to divide the groups into difficult groups and non-difficult groups according to the difference values; and
a sampling unit configured to perform stratified sampling on the non-difficult groups to obtain non-difficult samples, and to take the positive samples, the negative samples in the difficult groups, and the non-difficult samples as the sampling result of the raw data.
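The sampling unit's assembly step can be sketched minimally as below, assuming a single hypothetical per-group sampling rate `rate` for the stratified pass; the claims do not specify how each stratum's quota is chosen.

```python
import random

def sample_dataset(positives, difficult_groups, non_difficult_groups,
                   rate, seed=0):
    """Assemble the sampling result: keep all positive samples and all
    negatives in difficult groups; sample each non-difficult group
    (stratum) independently at `rate` (an assumed per-group rate).
    """
    rng = random.Random(seed)
    sampled = list(positives)
    for group in difficult_groups:
        # Difficult negatives are kept whole.
        sampled.extend(group)
    for group in non_difficult_groups:
        # Each stratum is down-sampled independently, at least one sample.
        k = max(1, round(len(group) * rate))
        sampled.extend(rng.sample(group, k))
    return sampled
```

Keeping the difficult negatives intact while thinning only the easy strata preserves the informative boundary cases that motivate the method, while still shrinking the overall training set.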
8. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the data sampling method of any one of claims 1 to 6.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of a data sampling electronic device, enable the data sampling electronic device to perform the data sampling method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the data sampling method of any one of claims 1 to 6.
CN202111665473.XA 2021-12-30 2021-12-30 Data sampling method and device, electronic equipment and storage medium Pending CN114444576A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111665473.XA CN114444576A (en) 2021-12-30 2021-12-30 Data sampling method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114444576A true CN114444576A (en) 2022-05-06

Family

ID=81366512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111665473.XA Pending CN114444576A (en) 2021-12-30 2021-12-30 Data sampling method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114444576A (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086412A (en) * 2018-08-03 2018-12-25 北京邮电大学 Imbalanced data classification method based on adaptive weighted Bagging-GBDT
CN109614967A (en) * 2018-10-10 2019-04-12 浙江大学 License plate detection method based on negative sample data value resampling
CN109978302A (en) * 2017-12-28 2019-07-05 中移信息技术有限公司 Credit scoring method and device
CN111401307A (en) * 2020-04-08 2020-07-10 中国人民解放军海军航空大学 Satellite remote sensing image target association method and device based on deep metric learning
CN111858999A (en) * 2020-06-24 2020-10-30 北京邮电大学 Retrieval method and device based on hard-to-classify sample generation
CN112508609A (en) * 2020-12-07 2021-03-16 深圳市欢太科技有限公司 Crowd expansion prediction method, device, equipment and storage medium
CN112907295A (en) * 2021-03-19 2021-06-04 恩亿科(北京)数据科技有限公司 Similar population expansion method and device based on computing advertisement background
CN113095411A (en) * 2021-04-14 2021-07-09 应急管理部通信信息中心 Classification model tendency detection method and system based on layering technology
WO2021164232A1 (en) * 2020-02-17 2021-08-26 平安科技(深圳)有限公司 User identification method and apparatus, and device and storage medium
US11157813B1 (en) * 2020-04-24 2021-10-26 StradVision, Inc. Method and device for on-vehicle active learning to be used for training perception network of autonomous vehicle
CN113705589A (en) * 2021-10-29 2021-11-26 腾讯科技(深圳)有限公司 Data processing method, device and equipment
CN113705072A (en) * 2021-04-13 2021-11-26 腾讯科技(深圳)有限公司 Data processing method, data processing device, computer equipment and storage medium
CN113762579A (en) * 2021-01-07 2021-12-07 北京沃东天骏信息技术有限公司 Model training method and device, computer storage medium and equipment
CN113821657A (en) * 2021-06-10 2021-12-21 腾讯科技(深圳)有限公司 Artificial intelligence-based image processing model training method and image processing method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YUCEL KOCYIGIT et al.: "Imbalanced data classifier by using ensemble fuzzy c-means clustering", Proceedings of 2012 IEEE-EMBS International Conference on Biomedical and Health Informatics, 7 January 2012 (2012-01-07), pages 952-955, XP032449232, DOI: 10.1109/BHI.2012.6211746 *
WANG Chunyu et al.: "A classification method for imbalanced data sets based on oversampling techniques", Computer Engineering and Applications, vol. 47, no. 01, 1 January 2011 (2011-01-01), pages 139-143 *
XIAO Lianjie et al.: "An undersampling ensemble imbalanced-data classification algorithm based on fuzzy C-means clustering", Data Analysis and Knowledge Discovery, vol. 03, no. 04, 30 April 2019 (2019-04-30), pages 90-96 *

Similar Documents

Publication Publication Date Title
CN113743535B (en) Neural network training method and device and image processing method and device
CN109871896B (en) Data classification method and device, electronic equipment and storage medium
CN108629354B (en) Target detection method and device
RU2577188C1 (en) Method, apparatus and device for image segmentation
CN109389162B (en) Sample image screening technique and device, electronic equipment and storage medium
CN111539443B (en) Image recognition model training method and device and storage medium
CN109359056B (en) Application program testing method and device
CN111259967B (en) Image classification and neural network training method, device, equipment and storage medium
CN111428032B (en) Content quality evaluation method and device, electronic equipment and storage medium
CN111461304A (en) Training method for classifying neural network, text classification method, text classification device and equipment
CN110941727B (en) Resource recommendation method and device, electronic equipment and storage medium
CN109981624B (en) Intrusion detection method, device and storage medium
CN111753917A (en) Data processing method, device and storage medium
CN113779257A (en) Method, device, equipment, medium and product for analyzing text classification model
CN111046927A (en) Method and device for processing labeled data, electronic equipment and storage medium
CN107135494B (en) Spam short message identification method and device
CN105551047A (en) Picture content detecting method and device
CN113609380A (en) Label system updating method, searching method, device and electronic equipment
CN111047049B (en) Method, device and medium for processing multimedia data based on machine learning model
CN111242205B (en) Image definition detection method, device and storage medium
CN111797746A (en) Face recognition method and device and computer readable storage medium
CN111428806A (en) Image tag determination method and device, electronic equipment and storage medium
CN114547073B (en) Aggregation query method and device for time series data and storage medium
CN112884040B (en) Training sample data optimization method, system, storage medium and electronic equipment
CN110659726B (en) Image processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination