CN113780365A - Sample generation method and device - Google Patents

Info

Publication number
CN113780365A
Authority
CN
China
Prior art keywords
sample
feature
disturbed
target sample
structured data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110952742.4A
Other languages
Chinese (zh)
Inventor
张长浩
傅欣艺
王维强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202110952742.4A
Publication of CN113780365A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features

Abstract

Embodiments of this specification provide a sample generation method and apparatus. A target sample of structured data is first obtained; the target sample comprises at least one feature value, and each feature value corresponds to one feature of the structured data. A feature to be perturbed is then determined from the at least one feature of the structured data, and a perturbation range corresponding to that feature is determined. Finally, the feature value corresponding to the feature to be perturbed in the target sample is perturbed within the perturbation range to obtain a new sample.

Description

Sample generation method and device
Technical Field
One or more embodiments of the present description relate to electronic information technology, and more particularly, to a sample generation method and apparatus.
Background
With the development of artificial intelligence technology, machine recognition models have come into use in various business fields. A machine recognition model is trained using sample data. To improve its recognition accuracy, the model needs to be trained with as much sample data as possible.
However, the number of samples that can be obtained from real business applications is often limited, so the machine recognition model cannot be trained as well as desired.
Disclosure of Invention
One or more embodiments of the present specification describe a sample generation method and apparatus that can generate more training samples.
According to a first aspect, there is provided a sample generation method comprising:
obtaining a target sample of structured data, wherein the target sample comprises at least one feature value, and each feature value corresponds to one feature of the structured data;
determining a feature to be perturbed from at least one feature of the structured data;
determining a perturbation range corresponding to the feature to be perturbed;
and perturbing, within the perturbation range, the feature value corresponding to the feature to be perturbed in the target sample to obtain a new sample.
Wherein determining the feature to be perturbed from the at least one feature of the structured data comprises:
inputting the target sample into a machine recognition model that needs to be trained, so that the machine recognition model learns each feature in the structured data according to the label of the target sample;
determining the importance of each feature in the structured data when the machine recognition model learns the target sample;
and selecting, from the features of the structured data, the N features ranked highest in importance as the features to be perturbed, wherein N is an integer not less than 1.
Wherein determining the importance of each feature in the structured data when the machine recognition model learns the target sample comprises:
calculating, by using a SHAP algorithm or a LINE algorithm, a contribution score of each feature in the structured data when the machine recognition model learns the target sample, wherein a greater contribution score indicates a higher feature importance.
After determining the importance of each feature in the structured data when the machine recognition model learns the target sample, the method further comprises:
selecting, from the features of the structured data, the M features ranked lowest in importance as additional features to be perturbed, wherein M is an integer not less than 1.
The target sample is located in a sample set, the sample set includes at least two original samples, and the target sample is a sample selected from the at least two original samples.
Wherein selecting the target sample from the at least two original samples comprises:
inputting the at least two original samples into the machine recognition model that needs to be trained to obtain a score output by the machine recognition model for each original sample; and taking an original sample with a high score value as the target sample.
Wherein determining the perturbation range corresponding to the feature to be perturbed comprises:
selecting a minimum feature value and a maximum feature value from the at least two feature values that the at least two original samples have for the feature to be perturbed;
determining the feature range of the feature to be perturbed from the minimum feature value and the maximum feature value;
and obtaining the perturbation range of the feature to be perturbed from the feature range of the feature to be perturbed.
Wherein perturbing, within the perturbation range, the feature value corresponding to the feature to be perturbed in the target sample comprises at least one of the following:
calculating a median from the lower limit value and the upper limit value of the perturbation range, and replacing the feature value corresponding to the feature to be perturbed in the target sample with the median;
determining a value within the perturbation range, and replacing the feature value corresponding to the feature to be perturbed in the target sample with that value;
calculating an average of the at least two feature values that the at least two original samples have for the feature to be perturbed, and replacing the feature value corresponding to the feature to be perturbed in the target sample with the average.
Wherein the sample is a black sample;
and/or,
after the new sample is obtained, the method further comprises: inputting the new sample into the machine recognition model and judging whether the recognition result output by the machine recognition model meets the label requirement of the target sample; if so, taking the new sample as a training sample of the machine recognition model; otherwise, discarding the new sample.
According to a second aspect, there is provided a sample generation apparatus comprising:
a target sample acquisition module configured to obtain a target sample of structured data, wherein the target sample comprises at least one feature value, and each feature value corresponds to one feature of the structured data;
a perturbation feature determination module configured to determine a feature to be perturbed from at least one feature of the structured data;
a perturbation range determination module configured to determine a perturbation range corresponding to the feature to be perturbed;
and a perturbation processing module configured to perturb, within the perturbation range, the feature value corresponding to the feature to be perturbed in the target sample to obtain a new sample.
Wherein the perturbation feature determination module is configured to perform:
inputting the target sample into a machine recognition model that needs to be trained, so that the machine recognition model learns each feature in the structured data according to the label of the target sample;
determining the importance of each feature in the structured data when the machine recognition model learns the target sample;
and selecting, from the features of the structured data, the N features ranked highest in importance as the features to be perturbed, wherein N is an integer not less than 1.
In one embodiment of the apparatus of this specification, the perturbation feature determination module is configured to perform: calculating, by using a SHAP algorithm or a LINE algorithm, a contribution score of each feature in the structured data when the machine recognition model learns the target sample, wherein a greater contribution score indicates a higher feature importance.
In an embodiment of the apparatus of this specification, the perturbation feature determination module is further configured to select, from the features of the structured data, the M features ranked lowest in importance as additional features to be perturbed, where M is an integer not less than 1.
The target sample is located in a sample set, the sample set includes at least two original samples, and the target sample is a sample selected from the at least two original samples.
Wherein the target sample acquisition module is configured to perform: inputting the at least two original samples into the machine recognition model that needs to be trained to obtain a score output by the machine recognition model for each original sample; and taking an original sample with a high score value as the target sample.
Wherein the perturbation range determination module is configured to perform: selecting a minimum feature value and a maximum feature value from the at least two feature values that the at least two original samples have for the feature to be perturbed; determining the feature range of the feature to be perturbed from the minimum feature value and the maximum feature value; and obtaining the perturbation range of the feature to be perturbed from the feature range of the feature to be perturbed.
Wherein the perturbation processing module is configured to perform at least one of:
calculating a median from the lower limit value and the upper limit value of the perturbation range, and replacing the feature value corresponding to the feature to be perturbed in the target sample with the median;
determining a value within the perturbation range, and replacing the feature value corresponding to the feature to be perturbed in the target sample with that value;
calculating an average of the at least two feature values that the at least two original samples have for the feature to be perturbed, and replacing the feature value corresponding to the feature to be perturbed in the target sample with the average.
In one embodiment of the apparatus of this specification, the sample is a black sample;
and/or,
in an embodiment of the apparatus of this specification, the apparatus further includes a verification module configured to input the new sample into the machine recognition model and determine whether the recognition result output by the machine recognition model meets the label requirement of the target sample; if so, the new sample is taken as a training sample of the machine recognition model; otherwise, the new sample is discarded.
According to a third aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements a method as described in any of the embodiments of the present specification.
In the sample generation method and apparatus of the embodiments of this specification, the features in the target sample are not perturbed randomly. Instead, a perturbation range is determined and the feature to be perturbed is perturbed within that range, so that the new sample keeps the sample label unchanged and remains interpretable. The machine recognition model can in turn be trained better.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present specification, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a sample generation method in one embodiment of the present description.
Fig. 2 is a schematic structural diagram of a sample generation device in one embodiment of the present disclosure.
Fig. 3 is a schematic structural diagram of a sample generation device in another embodiment of the present disclosure.
Detailed Description
As previously mentioned, more training samples need to be acquired to better train the machine recognition model. A currently common approach is to increase the number of samples by adding random perturbations to existing samples to generate new samples. For example, in the field of image recognition, if a face picture already exists in a database, adding a random perturbation to any pixel in the picture yields a new face picture, and both the original face picture and the new face picture can be used as samples for training a machine recognition model.
It can be seen that the prior art mainly generates new samples by adding random perturbations to existing samples. For a picture, because its pixels are continuous, samples can be generated in this way. However, in business scenarios that use structured data, adding random perturbations often fails to produce a usable new sample. For example, consider a structured data sample in the form of a table in which one feature indicates the number of transactions a user made during one day, with a feature value of 2. Under random perturbation, the feature value can be modified arbitrarily, for example to 1 million. The original value 2 indicates that the user traded twice in one day, and the sample carries a white label indicating that the user is a non-risk user. If perturbation changes the value to 1 million, then in some scenarios a user who trades 1 million times in one day would instead be given a black label indicating a risk user. The random perturbation has therefore changed the label of the sample, and the generated sample cannot be used to train the machine recognition model. As another example, suppose the original value 2 is changed to -3. Because the number of transactions a user makes in one day cannot be negative, such a perturbed value is not interpretable, and the generated sample again cannot be used to train the machine recognition model.
Analysis shows that, for a picture, because its pixels are continuous, perturbing a pixel (for example, changing its value) neither changes the label of the perturbed picture nor makes the perturbed picture uninterpretable. For example, if the image shows a human face before perturbation, then after one or a few pixels are randomly perturbed (for example, the value of one pixel is changed from 2 to 1 million), the label of the resulting image is still a human face; the label is unchanged and the image remains interpretable. Structured data is different: even a change to the value of a single feature in structured data, such as a table, tends to change the label of the structured data or render it uninterpretable, as shown in the example above. Therefore, a sample generation method suitable for structured data is needed.
Analysis of the sample generation process shows that generating a new sample from an existing sample requires two conditions to be met. Condition 1: the label of the sample must not change. Condition 2: the generated sample must be interpretable. For structured data, meeting both conditions means that samples cannot be generated by random perturbation; instead, a reasonable perturbation range needs to be determined based on the existing samples, and the perturbation is performed within that range.
Specific implementations of the above concepts are described below.
FIG. 1 shows a flow diagram of a sample generation method in one embodiment of the present description. The method is performed by a sample generation apparatus. It is to be understood that the method may also be performed by any apparatus, device, platform, or device cluster having computing and processing capabilities. Referring to FIG. 1, the method includes:
Step 101: obtaining a target sample of structured data; the target sample comprises at least one feature value, and each feature value corresponds to one feature of the structured data.
Step 103: determining a feature to be perturbed from at least one feature of the structured data.
Step 105: determining a perturbation range corresponding to the feature to be perturbed.
Step 107: perturbing, within the perturbation range, the feature value corresponding to the feature to be perturbed in the target sample to obtain a new sample.
As can be seen from the flow shown in FIG. 1, the embodiment of the present specification does not perturb the features in the target sample randomly. Instead, a perturbation range is determined and the feature to be perturbed is perturbed within that range, so that the new sample keeps the sample label unchanged and remains interpretable. The machine recognition model can in turn be trained better.
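The four steps can be sketched end to end as follows. This is only a minimal illustration under stated assumptions, not the implementation described in this specification: samples are represented as Python dictionaries, the feature names `tx_per_day` and `devices` are hypothetical, the feature to be perturbed is passed in directly (step 103 is detailed later), and the new value is drawn uniformly from the observed minimum-to-maximum range.

```python
import random


def generate_sample(samples, target_index, feature, rng):
    """Sketch of steps 101 to 107 for one target sample and one feature."""
    # Step 101: obtain the target sample from the original samples.
    target = samples[target_index]
    # Step 103 is assumed done: `feature` is the feature to be perturbed.
    # Step 105: use the observed minimum and maximum as the perturbation range.
    values = [s[feature] for s in samples]
    low, high = min(values), max(values)
    # Step 107: copy the target and replace only the perturbed feature value.
    new_sample = dict(target)
    new_sample[feature] = rng.randint(low, high)
    return new_sample


# Hypothetical sample set; all rows are assumed to carry the same label.
samples = [
    {"tx_per_day": 2, "devices": 1},
    {"tx_per_day": 100, "devices": 3},
    {"tx_per_day": 40, "devices": 2},
]
rng = random.Random(0)
new_sample = generate_sample(samples, 0, "tx_per_day", rng)
```

Because the new value is bounded by values that already occur in the sample set, it cannot become negative or jump to an implausible magnitude, which is the point of constraining the perturbation.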
The respective steps in fig. 1 are explained below.
First, in step 101, a target sample of structured data is obtained; the target sample comprises at least one feature value, and each feature value corresponds to one feature of the structured data.
Structured data refers to data that exists in a fixed format in a record file, such as a Resilient Distributed Dataset (RDD) or tabular data. A sample set format for structured data can be found in Table 1 below.
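[Table image not reproduced: Table 1 lists the L original samples (sample 1 to sample L) with their feature values for feature 1 to feature P.]
TABLE 1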
The L samples shown in Table 1 (L is an integer greater than 1) are referred to as original samples and are typically stored in a sample set used for training a machine recognition model such as a risk control model. Since the number of original samples that can be obtained from actual business is limited, the number of samples needs to be increased by perturbing the original samples to obtain new samples.
At step 101, at least one target sample may be obtained from the L original samples shown in table 1, so that a new sample is generated in a subsequent process using the target sample. The manner in which the target sample of structured data is obtained may include, but is not limited to, the following:
Mode 1: random selection.
For example, a sample is randomly selected from the sample set of original samples shown in Table 1 as the target sample. For example, sample 2 is randomly selected as the target sample.
Mode 2: selecting a sample that is strongly associated with the label.
To train the machine recognition model, each sample is labeled. For example, in a risk control business, the label of a sample can be black (i.e., a risky user) or white (i.e., a non-risky user), and the machine recognition model learns the features in the sample based on the label. Although different samples may carry the same label, such as black, their actual risk levels differ. For example, sample 1 may exhibit all of the features that characterize a risky user and is therefore labeled as a black sample, while sample 2 may exhibit only some of those features and is also labeled as a black sample. Sample 1 can then be understood to be more strongly associated with the black label than sample 2.
In step 101, a sample strongly associated with the label may be selected as the target sample. A new sample generated from such a target sample is more likely to keep its label unchanged, and it trains the machine recognition model more effectively. A specific implementation of mode 2 includes: inputting at least two original samples into the machine recognition model that needs to be trained to obtain a score output by the machine recognition model for each original sample; and taking an original sample with a high score value as the target sample. The machine recognition model may be a basic machine recognition model trained in advance on a sample set that includes the original samples.
A "high score value" is relative and needs to be interpreted according to the type of the sample.
If the samples to be generated are white samples, that is, more white samples are needed to train the machine recognition model (for example, a risk control model), then a high score value refers to a score evaluated against the white label, and the smaller the score output by the model, the higher the score value. For example, if the machine recognition model outputs scores of 10 and 20 for two original samples, the original sample scored 10 has the higher score value and better represents the characteristics of a white sample, so it can be used as the target sample.
If the samples to be generated are black samples, that is, more black samples are needed to train the machine recognition model (for example, a risk control model), then a high score value refers to a score evaluated against the black label, and the larger the score output by the model, the higher the score value. For example, if the machine recognition model outputs scores of 80 and 90 for two original samples, the original sample scored 90 has the higher score value and better represents the characteristics of a black sample, so it can be used as the target sample.
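The score-based selection of mode 2 can be sketched as follows. This is an illustrative sketch only: `select_target`, `toy_score`, and the dictionary layout are hypothetical names introduced here, and `toy_score` merely stands in for the score that a pre-trained machine recognition model would output.

```python
def select_target(samples, score_fn, kind):
    """Pick the original sample most strongly associated with its label.

    kind="black": the largest model score marks the strongest black sample.
    kind="white": the smallest model score marks the strongest white sample.
    """
    if kind not in ("black", "white"):
        raise ValueError("kind must be 'black' or 'white'")
    pick = max if kind == "black" else min
    return pick(samples, key=score_fn)


def toy_score(sample):
    # Hypothetical stand-in for a pre-trained model: the score is stored
    # directly on the sample instead of being predicted.
    return sample["score"]


black_samples = [{"id": 1, "score": 80}, {"id": 2, "score": 90}]
white_samples = [{"id": 3, "score": 10}, {"id": 4, "score": 20}]

black_target = select_target(black_samples, toy_score, "black")
white_target = select_target(white_samples, toy_score, "white")
```

With the scores used in the text, the black sample scored 90 and the white sample scored 10 are selected.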
Next, in step 103, a feature to be perturbed is determined from at least one feature of the structured data.
In one embodiment of the present description, a feature strongly associated with the label may be selected as the feature to be perturbed. A sample may include a plurality of features, such as feature 1 through feature P (P is an integer greater than 1) shown in Table 1 above. Taking a black sample as an example, the P features contribute differently to the black label: feature 1 may fully characterize a black sample, feature 2 may characterize it to some extent, and feature P may not characterize it at all. Therefore, features strongly associated with the label can be selected as the features to be perturbed. A new sample generated by perturbing such features keeps its label unchanged and trains the machine recognition model more effectively.
Specifically, one implementation of selecting the features strongly associated with the label as the features to be perturbed includes:
Step 1031: inputting the target sample into the machine recognition model that needs to be trained, so that the machine recognition model learns each feature in the structured data according to the label of the target sample.
Step 1033: determining the importance of each feature in the structured data when the machine recognition model learns the target sample.
In this step, the importance of the features may be determined in a variety of ways. For example, the SHAP algorithm or the LINE algorithm may be used to calculate the contribution score of each feature in the structured data when the machine recognition model learns the target sample, where a greater contribution score indicates a more important feature.
Step 1035: selecting, from the features of the structured data, the N features ranked highest in importance as the features to be perturbed, where N is an integer not less than 1.
The N features ranked highest in importance are the features that best represent the label, that is, the features most strongly associated with the label, which is why they are selected as the features to be perturbed in step 1035. N may, for example, be 10% of the total number of features in the structured data.
In one embodiment of this specification, only the N features ranked highest in importance are selected as the features to be perturbed. In actual business practice, however, the more features of a sample are perturbed, the more the sample changes, so it may be desirable to perturb more features while still ensuring that the label of the sample is the same before and after perturbation. In step 1035, therefore, in addition to the N features ranked highest in importance, the M features ranked lowest in importance may also be selected from the features of the structured data as features to be perturbed, where M is an integer not less than 1. Because the M lowest-ranked features are only weakly associated with the label, that is, they hardly reflect the characteristics of the label, perturbing them as well increases the number of perturbed features without changing the sample label.
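The ranking in step 1035 can be illustrated with a short sketch. It assumes the contribution scores have already been computed (for example by a feature attribution method); the scores, the feature names `f1` to `f4`, and the function `select_features` are hypothetical and introduced only for this example.

```python
def select_features(importance, n, m=0):
    """Return the n highest-ranked and the m lowest-ranked feature names.

    `importance` maps a feature name to its contribution score; a greater
    score means a more important feature.
    """
    if n < 1 or m < 0 or n + m > len(importance):
        raise ValueError("need n >= 1, m >= 0 and n + m <= number of features")
    # Rank feature names from most to least important.
    ranked = sorted(importance, key=importance.get, reverse=True)
    top = ranked[:n]
    # Slicing from len(ranked) - m yields an empty list when m is 0.
    bottom = ranked[len(ranked) - m:]
    return top + bottom


# Hypothetical contribution scores for four features.
importance = {"f1": 0.90, "f2": 0.40, "f3": 0.05, "f4": 0.01}

top_only = select_features(importance, n=2)
top_and_bottom = select_features(importance, n=1, m=1)
```

Requiring `n + m` not to exceed the number of features keeps the two groups disjoint, so no feature is selected twice.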
Next, in step 105, a perturbation range corresponding to the feature to be perturbed is determined.
As previously mentioned, for structured data, features cannot be perturbed randomly. Therefore, a disturbance range needs to be determined by the process of this step 105.
In one embodiment of the present specification, the process of determining the perturbation range corresponding to the feature to be perturbed includes: selecting a minimum feature value and a maximum feature value from the at least two feature values that the at least two original samples have for the feature to be perturbed; determining the feature range of the feature to be perturbed from the minimum feature value and the maximum feature value; and obtaining the perturbation range of the feature to be perturbed from the feature range.
For example, for the structured data shown in Table 1, suppose the feature to be perturbed is feature 1. The L original samples, sample 1 to sample L, contain L values of feature 1, of which the minimum is 2 and the maximum is 100, so the feature range of feature 1 is [2,100]. A perturbation range is then derived from the feature range [2,100]. For example, the feature range [2,100] itself may be used as the perturbation range; a narrower range contained in it, such as [10,100], may be used; or a range slightly expanded from it, such as [1,100], may be used.
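A minimal sketch of step 105 follows. The text does not specify how the feature range is narrowed or widened, so the `lower_offset` and `upper_offset` parameters below are an assumption made for illustration, as are the function names and the key `feature_1`.

```python
def feature_range(samples, feature):
    """Minimum and maximum value of `feature` over the original samples."""
    values = [s[feature] for s in samples]
    return min(values), max(values)


def perturbation_range(frange, lower_offset=0, upper_offset=0):
    """Derive a perturbation range from a feature range.

    Zero offsets reuse the feature range unchanged. A positive lower_offset
    narrows the range from below; a negative one widens it. These offsets
    are a hypothetical way of expressing the narrowing and widening.
    """
    low, high = frange[0] + lower_offset, frange[1] + upper_offset
    if low > high:
        raise ValueError("offsets produce an empty range")
    return low, high


# Hypothetical values of feature 1 in three original samples.
samples = [{"feature_1": 2}, {"feature_1": 100}, {"feature_1": 18}]

frange = feature_range(samples, "feature_1")        # (2, 100)
same = perturbation_range(frange)                    # (2, 100)
narrower = perturbation_range(frange, lower_offset=8)   # (10, 100)
wider = perturbation_range(frange, lower_offset=-1)     # (1, 100)
```

The three calls reproduce the three example ranges in the text: [2,100], [10,100], and [1,100].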
Next, in step 107, the feature value corresponding to the feature to be perturbed in the target sample is perturbed within the obtained perturbation range to obtain a new sample.
The perturbation mode of this step 107 includes, but is not limited to, one or more of the following:
Mode A: calculating a median from the lower limit value and the upper limit value of the perturbation range, and replacing the feature value corresponding to the feature to be perturbed in the target sample with the median.
A perturbation range is a range of values bounded by a lower limit value and an upper limit value.
For example, if step 105 yields the perturbation range [2,100] for feature 1, the median is calculated as (2+100)/2 = 51. Referring to Table 1 above, the value of feature 1 in the target sample, such as sample 1, is then replaced with 51, while the values of the features that are not to be perturbed (those other than feature 1) remain unchanged. This yields a new sample, such as sample S1 shown in Table 2 below.
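[Table image not reproduced: Table 2 shows new sample S1, in which the value of feature 1 is replaced with 51 and the other feature values are unchanged.]
TABLE 2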
Mode B: determining a value within the perturbation range, and replacing the feature value corresponding to the feature to be perturbed in the target sample with that value.
For example, if step 105 yields the perturbation range [2,100] for feature 1, any value between 2 and 100 may be chosen. The value may be the value of feature 1 in some original sample in the sample set, such as the value 100 of feature 1 in sample 2, or it may be unrelated to the original samples, such as the value 33, which does not appear in Table 1. Referring to Table 1 above, the value of feature 1 in the target sample, such as sample 1, is then replaced with the chosen value, such as 33, while the values of the features that are not to be perturbed (those other than feature 1) remain unchanged. This yields a new sample, such as sample S2 shown in Table 3 below.
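[Table image not reproduced: Table 3 shows new sample S2, in which the value of feature 1 is replaced with 33 and the other feature values are unchanged.]
TABLE 3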
Mode C: calculating an average of the at least two feature values that the at least two original samples have for the feature to be perturbed, and replacing the feature value corresponding to the feature to be perturbed in the target sample with the average.
For example, step 105 yields the perturbation range [2,100] for the feature to be perturbed, feature 1. Referring to Table 1, the values of all samples in the feature 1 column are summed and divided by L to obtain an average, such as 40. The value of feature 1 in the target sample, such as sample 1, is then replaced with 40, while the values of the features that are not to be perturbed (those other than feature 1) remain unchanged. This yields a new sample, such as sample S3 shown in Table 4 below.
TABLE 4: new sample S3 (table image not reproduced)
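Mode C can be sketched as below. The three original samples are hypothetical and were chosen only so that feature 1 averages 40, the value used in the example; they are not the contents of Table 1, and the function name `perturb_with_mean` is likewise an assumption.

```python
def perturb_with_mean(target, original_samples, feature):
    """Return a copy of `target` whose `feature` is the mean of that feature over the sample set."""
    values = [s[feature] for s in original_samples]
    mean_value = sum(values) / len(values)  # sum of the column divided by the number of samples L
    new_sample = dict(target)
    new_sample[feature] = mean_value
    return new_sample


# Hypothetical sample set with L = 3 original samples.
original_samples = [
    {"feature_1": 2, "feature_2": 7},
    {"feature_1": 100, "feature_2": 3},
    {"feature_1": 18, "feature_2": 5},
]

s3 = perturb_with_mean(original_samples[0], original_samples, "feature_1")
print(s3["feature_1"])  # 40.0
```

Because the mean of a set of values always lies between their minimum and maximum, the result stays inside a perturbation range built from those extremes.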
After the above step 107 is performed, a new sample is obtained. In one embodiment of the present description, the new sample may be directly added to the sample set as a training sample for the machine recognition model.
In another embodiment of the present specification, the new sample may be further verified, and only after it is verified to be a qualified sample (i.e., the label of the new sample is the same as the label of the target sample) is it added to the sample set as a training sample of the machine recognition model. In this embodiment, the method further includes, after the new sample is obtained in step 107: inputting the new sample into the machine recognition model, and judging whether the recognition result output by the machine recognition model meets the label requirement of the target sample (that is, the recognition result is consistent with the label of the target sample; for example, when the target sample is a black sample, the score output as the recognition result is higher than a set value, and when the target sample is a white sample, the score is lower than another set value); if so, using the new sample as a training sample of the machine recognition model; otherwise, discarding the new sample.
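The verification step can be sketched as a simple filter. Everything concrete here is an assumption for illustration: the thresholds 0.8 and 0.2 stand in for the two unspecified "set values", and `toy_score` is a stand-in for the machine recognition model, not a real model.

```python
def is_qualified(new_sample, target_label, score_fn, black_threshold=0.8, white_threshold=0.2):
    """Check whether the model's score for `new_sample` agrees with the target sample's label."""
    score = score_fn(new_sample)
    if target_label == "black":
        return score > black_threshold  # black sample: score must exceed one set value
    if target_label == "white":
        return score < white_threshold  # white sample: score must fall below another set value
    raise ValueError(f"unknown label: {target_label!r}")


def keep_qualified(candidates, target_label, score_fn):
    """Keep the new samples that pass verification; the rest are discarded."""
    return [c for c in candidates if is_qualified(c, target_label, score_fn)]


def toy_score(sample):
    """Hypothetical scoring function standing in for the machine recognition model."""
    return min(max(sample["feature_1"] / 100, 0.0), 1.0)


# Two hypothetical new samples derived from a black target sample.
candidates = [{"feature_1": 90, "feature_2": 7}, {"feature_1": 51, "feature_2": 7}]
training_additions = keep_qualified(candidates, "black", toy_score)
print(training_additions)  # [{'feature_1': 90, 'feature_2': 7}]
```

With these assumed thresholds, the candidate scored 0.9 is kept and the candidate scored 0.51 is discarded.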
It can be seen that, through steps 105 to 107 described above, the perturbation process is constrained rather than random as in the prior art. Because the perturbation range is determined according to the relevant feature values of the original samples in the sample set, the new sample obtained after perturbation is ensured to be interpretable.
In one embodiment of the present description, the machine recognition model may be an XGBoost (xgb) model.
In one embodiment of the present description, there is provided a sample generation apparatus, see fig. 2, the apparatus 200 comprising:
a target sample acquisition module 201 configured to obtain a target sample of structured data; the target sample comprises at least one characteristic value, and each characteristic value corresponds to one characteristic of the structured data;
a perturbation feature determination module 202 configured to determine a feature to be perturbed from at least one feature of the structured data;
a perturbation range determination module 203 configured to determine a perturbation range corresponding to the feature to be perturbed;
a perturbation processing module 204 configured to perturb, within the perturbation range, the feature value corresponding to the feature to be perturbed in the target sample, to obtain a new sample.
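How the four modules fit together can be sketched as a small pipeline. The class `SampleGenerator`, its field names, and the strategies plugged in below are assumptions for illustration; the specification describes the modules functionally and does not prescribe this structure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, float]
Bounds = Tuple[float, float]


@dataclass
class SampleGenerator:
    get_target: Callable[[List[Sample]], Sample]       # role of module 201
    choose_feature: Callable[[Sample], str]            # role of module 202
    get_range: Callable[[List[Sample], str], Bounds]   # role of module 203
    perturb: Callable[[Sample, str, Bounds], Sample]   # role of module 204

    def generate(self, sample_set: List[Sample]) -> Sample:
        target = self.get_target(sample_set)
        feature = self.choose_feature(target)
        bounds = self.get_range(sample_set, feature)
        return self.perturb(target, feature, bounds)


def midpoint_perturb(sample: Sample, feature: str, bounds: Bounds) -> Sample:
    """Replace `feature` with the midpoint of the perturbation range."""
    low, high = bounds
    return {**sample, feature: (low + high) / 2}


# Hypothetical strategies: first sample as target, a fixed feature, min/max range.
generator = SampleGenerator(
    get_target=lambda samples: samples[0],
    choose_feature=lambda sample: "feature_1",
    get_range=lambda samples, f: (min(s[f] for s in samples), max(s[f] for s in samples)),
    perturb=midpoint_perturb,
)

sample_set = [{"feature_1": 2.0, "feature_2": 7.0}, {"feature_1": 100.0, "feature_2": 3.0}]
new_sample = generator.generate(sample_set)
print(new_sample)  # {'feature_1': 51.0, 'feature_2': 7.0}
```

Keeping each module as a replaceable callable mirrors the statement that the components may be combined, separated, or arranged differently.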
In one embodiment of the apparatus of the present description, the perturbation feature determination module 202 is configured to perform:
inputting the target sample into a machine recognition model that needs to be trained, so that the machine recognition model learns each feature in the structured data according to the label of the target sample;
determining the importance of each feature in the structured data when the machine recognition model learns the target sample;
and selecting, from the features of the structured data, the first N features ranked by importance as the features to be perturbed, where N is an integer not less than 1.
In one embodiment of the apparatus of the present description, the perturbation feature determination module 202 is configured to perform: calculating, by using a SHAP algorithm or a LIME algorithm, a contribution score for each feature in the structured data when the machine recognition model learns the target sample, where a greater contribution score indicates a higher feature importance.
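Once contribution scores are available, picking the features to perturb is a ranking step. The sketch below assumes the scores have already been computed by some attribution method and are non-negative magnitudes; the score values and the function name `top_n_features` are hypothetical, and no attribution library is called.

```python
def top_n_features(contribution_scores, n):
    """Return the names of the N features with the largest contribution scores."""
    if n < 1:
        raise ValueError("N must be an integer not less than 1")
    ranked = sorted(contribution_scores, key=contribution_scores.get, reverse=True)
    return ranked[:n]


# Hypothetical per-feature contribution scores for one target sample.
scores = {"feature_1": 0.42, "feature_2": 0.05, "feature_3": 0.31}

features_to_perturb = top_n_features(scores, 2)
print(features_to_perturb)  # ['feature_1', 'feature_3']
```

If the attribution method returns signed contributions, ranking by absolute value instead would be one option, but that choice is not stated in the text.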
In an embodiment of the apparatus of the present specification, the perturbation feature determining module 202 is further configured to select, as the feature to be perturbed, M features with the top-ranked degree of importance from the features of the structured data, where M is an integer not less than 1.
In one embodiment of the apparatus of the present specification, the target sample is located in a sample set, the sample set includes at least two original samples, and the target sample is a sample selected from the at least two original samples.
In one embodiment of the apparatus of the present specification, the target sample acquisition module 201 is configured to perform: inputting the at least two original samples into a machine recognition model that needs to be trained, to obtain a score output by the machine recognition model for each original sample; and taking an original sample with a high score as the target sample.
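Target sample selection can be sketched as scoring every original sample and keeping the high scorers. The text does not define "high", so the threshold of 0.9 is an assumption, as are the sample values and the `toy_score` function standing in for the machine recognition model.

```python
def select_target_samples(original_samples, score_fn, threshold):
    """Return the original samples whose model score is at least `threshold`, highest first."""
    scored = [(score_fn(s), s) for s in original_samples]
    high = [(score, s) for score, s in scored if score >= threshold]
    high.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in high]


def toy_score(sample):
    """Hypothetical scoring function standing in for the machine recognition model."""
    return sample["feature_1"] / 100


# Hypothetical original samples.
original_samples = [
    {"id": "sample_1", "feature_1": 95},
    {"id": "sample_2", "feature_1": 40},
    {"id": "sample_3", "feature_1": 99},
]

targets = select_target_samples(original_samples, toy_score, threshold=0.9)
print([t["id"] for t in targets])  # ['sample_3', 'sample_1']
```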
In one embodiment of the apparatus of the present description, the perturbation range determination module 203 is configured to perform: selecting a minimum characteristic value and a maximum characteristic value from at least two characteristic values of at least two original samples corresponding to the characteristics to be disturbed; determining the characteristic range of the characteristic to be disturbed by utilizing the minimum characteristic value and the maximum characteristic value; and obtaining the disturbance range of the feature to be disturbed by utilizing the feature range of the feature to be disturbed.
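The range determination can be sketched as follows. How the perturbation range is derived from the feature range is not spelled out in this passage, so the sketch assumes the simplest case, in which the feature range is used unchanged; the function names and sample values are hypothetical.

```python
def feature_range(original_samples, feature):
    """Return (minimum, maximum) of `feature` over the original samples."""
    values = [s[feature] for s in original_samples]
    if len(values) < 2:
        raise ValueError("at least two original samples are required")
    return min(values), max(values)


def perturbation_range(original_samples, feature):
    """Assumption: the perturbation range equals the feature range."""
    return feature_range(original_samples, feature)


# Hypothetical sample set; only the extremes 2 and 100 mirror the example range [2, 100].
original_samples = [
    {"feature_1": 2, "feature_2": 7},
    {"feature_1": 100, "feature_2": 3},
    {"feature_1": 18, "feature_2": 5},
]

print(perturbation_range(original_samples, "feature_1"))  # (2, 100)
```

Deriving the bounds from observed values is what ties every perturbed value back to something that occurs in the sample set.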
In one embodiment of the apparatus of the present description, the perturbation processing module 204 is configured to perform at least one of:
calculating a median by using the lower limit value and the upper limit value of the disturbance range, and replacing the characteristic value corresponding to the characteristic to be disturbed in the target sample with the median;
determining a numerical value in the disturbance range, and replacing the characteristic value corresponding to the feature to be disturbed in the target sample with the numerical value;
calculating an average value of at least two characteristic values of the at least two original samples, which correspond to the characteristics to be disturbed, and replacing the characteristic values in the target sample, which correspond to the characteristics to be disturbed, with the average value.
In one embodiment of the apparatus of the present disclosure, the sample referred to is a black sample.
In an embodiment of the apparatus of the present specification, referring to fig. 3, the apparatus further includes a verification module 301 configured to: input the new sample into the machine recognition model; determine whether the recognition result output by the machine recognition model meets the label requirement of the target sample; and if so, take the new sample as a training sample of the machine recognition model, and otherwise discard the new sample.
An embodiment of the present specification provides a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of the embodiments of the specification.
One embodiment of the present specification provides a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor implementing a method in accordance with any one of the embodiments of the specification when executing the executable code.
It is to be understood that the illustrated construction of the embodiments herein is not to be construed as limiting the embodiments herein specifically. In other embodiments of the specification, the sample generation apparatus may include more or fewer components than illustrated, or some components may be combined, some components may be separated, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The information interaction, execution processes and other details of the modules in the above apparatus and system are based on the same concept as the method embodiments of this specification; for specifics, refer to the description of the method embodiments, which is not repeated here.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this disclosure may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on, or transmitted as one or more instructions or code over, a computer-readable medium.
The specific embodiments described above further explain the objects, technical solutions and advantages of the present invention in detail. It should be understood that the above are only exemplary embodiments of the present invention and are not intended to limit its scope; any modification, equivalent substitution, improvement and the like made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims (15)

1. A sample generation method, comprising:
obtaining a target sample of the structured data; the target sample comprises at least one characteristic value, and each characteristic value corresponds to one characteristic of the structured data;
determining a feature to be perturbed from at least one feature of the structured data;
determining a disturbance range corresponding to the feature to be disturbed;
and in the disturbance range, disturbing the characteristic value corresponding to the characteristic to be disturbed in the target sample to obtain a new sample.
2. The method of claim 1, wherein the determining a feature to be perturbed from at least one feature of the structured data comprises:
inputting the target sample into a machine recognition model needing to be trained, and learning each feature in the structured data by the machine recognition model according to the label of the target sample;
determining the importance of each feature in the structured data in learning the target sample by the machine recognition model;
and selecting, from the features of the structured data, the first N features ranked by importance as the features to be perturbed, where N is an integer not less than 1.
3. The method of claim 2, wherein the determining a degree of importance of each feature in the structured data in the machine recognition model learning the target sample comprises:
calculating, by using a SHAP algorithm or a LIME algorithm, a contribution score for each feature in the structured data when the machine recognition model learns the target sample, where a greater contribution score indicates a higher feature importance.
4. The method of claim 2, further comprising, after said determining the degree of importance of each feature in the structured data in learning of the machine recognition model:
and selecting M characteristics with the highest importance degree from all the characteristics of the structured data as the characteristics to be disturbed, wherein M is an integer not less than 1.
5. The method of claim 1, wherein the target sample is in a sample set comprising at least two original samples, and the target sample is a sample selected from the at least two original samples.
6. The method of claim 5, wherein selecting a target sample from at least two original samples comprises:
inputting the at least two original samples into a machine recognition model needing to be trained to obtain a score output by the machine recognition model aiming at each original sample; and taking the original sample with high score value as the target sample.
7. The method of claim 5, wherein the determining a perturbation range corresponding to the feature to be perturbed comprises:
selecting a minimum characteristic value and a maximum characteristic value from at least two characteristic values of at least two original samples corresponding to the characteristics to be disturbed;
determining the characteristic range of the characteristic to be disturbed by utilizing the minimum characteristic value and the maximum characteristic value;
and obtaining the disturbance range of the feature to be disturbed by utilizing the feature range of the feature to be disturbed.
8. The method according to claim 7, wherein the perturbing the feature value corresponding to the feature to be perturbed in the target sample within the perturbation range includes at least one of:
calculating a median by using the lower limit value and the upper limit value of the disturbance range, and replacing a characteristic value corresponding to the feature to be disturbed in the target sample with the median;
determining a numerical value in the disturbance range, and replacing the characteristic value corresponding to the feature to be disturbed in the target sample with the numerical value;
calculating an average value of at least two characteristic values of the at least two original samples, which correspond to the characteristics to be disturbed, and replacing the characteristic values in the target sample, which correspond to the characteristics to be disturbed, with the average value.
9. The method of any one of claims 1 to 8, wherein the sample is a black sample;
and/or,
after the obtaining of the new sample, further comprising: inputting the new sample into the machine recognition model, judging whether the recognition result output by the machine recognition model meets the label requirement of the target sample, if so, taking the new sample as a training sample of the machine recognition model, otherwise, discarding the new sample.
10. A sample generation device, comprising:
a target sample acquisition module configured to obtain a target sample of structured data; the target sample comprises at least one characteristic value, and each characteristic value corresponds to one characteristic of the structured data;
a perturbation feature determination module configured to determine a feature to be perturbed from at least one feature of the structured data;
a disturbance range determination module configured to determine a disturbance range corresponding to the feature to be disturbed;
and a perturbation processing module configured to perturb, within the perturbation range, the feature value corresponding to the feature to be perturbed in the target sample, to obtain a new sample.
11. The apparatus of claim 10, wherein the perturbation feature determination module is configured to perform:
inputting the target sample into a machine recognition model needing to be trained, and learning each feature in the structured data by the machine recognition model according to the label of the target sample;
determining the importance of each feature in the structured data in learning the target sample by the machine recognition model;
and selecting, from the features of the structured data, the first N features ranked by importance as the features to be perturbed, where N is an integer not less than 1.
12. The apparatus of claim 10, wherein the target sample is in a sample set comprising at least two original samples, and the target sample is a sample selected from the at least two original samples.
13. The apparatus of claim 12, wherein the target sample acquisition module is configured to perform: inputting the at least two original samples into a machine recognition model needing to be trained to obtain a score output by the machine recognition model aiming at each original sample; taking an original sample with high score value as the target sample;
and/or,
the disturbance range determination module is configured to perform: selecting a minimum characteristic value and a maximum characteristic value from at least two characteristic values of at least two original samples corresponding to the characteristics to be disturbed; determining the characteristic range of the characteristic to be disturbed by utilizing the minimum characteristic value and the maximum characteristic value; and obtaining the disturbance range of the feature to be disturbed by utilizing the feature range of the feature to be disturbed.
14. The apparatus of claim 13, wherein the perturbation processing module is configured to perform at least one of:
calculating a median by using the lower limit value and the upper limit value of the disturbance range, and replacing a characteristic value corresponding to the feature to be disturbed in the target sample with the median;
determining a numerical value in the disturbance range, and replacing the characteristic value corresponding to the feature to be disturbed in the target sample with the numerical value;
calculating an average value of at least two characteristic values of the at least two original samples, which correspond to the characteristics to be disturbed, and replacing the characteristic values in the target sample, which correspond to the characteristics to be disturbed, with the average value.
15. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-9.
CN202110952742.4A 2021-08-19 2021-08-19 Sample generation method and device Pending CN113780365A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110952742.4A CN113780365A (en) 2021-08-19 2021-08-19 Sample generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110952742.4A CN113780365A (en) 2021-08-19 2021-08-19 Sample generation method and device

Publications (1)

Publication Number Publication Date
CN113780365A true CN113780365A (en) 2021-12-10

Family

ID=78838322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110952742.4A Pending CN113780365A (en) 2021-08-19 2021-08-19 Sample generation method and device

Country Status (1)

Country Link
CN (1) CN113780365A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination