WO2023035362A1

WO2023035362A1 - Polluted sample data detecting method and apparatus for model training

Info

Publication number: WO2023035362A1
Application number: PCT/CN2021/124044
Authority: WO
Inventors: 刘胜; 魏国富; 夏玉明; 周晓勇; 马影; 殷钱安; 梁淑云; 余贤喆; 陶景龙; 王启凡; 徐�明
Original assignee: 上海观安信息技术股份有限公司
Priority date: 2021-09-07
Filing date: 2021-10-15
Publication date: 2023-03-16
Also published as: CN113495886A

Abstract

Provided are a polluted sample data detecting method and apparatus for model training. The method comprises: acquiring sample attribute data corresponding to platform users to be detected (101); respectively hashing the sample attribute data into corresponding hash buckets in different hash tables by means of a preset locality-sensitive hashing algorithm (102); determining data, which is located in the same hash buckets as the sample attribute data, in the different hash tables as first sample attribute data (103); selecting second sample attribute data similar to the sample attribute data from the first sample attribute data (104); and separately determining, on the basis of the amount of sample data corresponding to the second sample attribute data, whether the sample attribute data is polluted sample data (105).

Description

Method and device for detecting polluted sample data used for model training

Technical field

The invention relates to the field of information technology, in particular to a method and device for detecting polluted sample data used for model training.

Background art

As the Internet has developed, more and more people shop online, and e-commerce platforms frequently launch promotions to attract visitors. These promotions attract not only normal users but also various criminals, so identifying the abnormal behavior of criminals is of great significance to the security of network platforms. With the development of artificial intelligence, large amounts of user sample data can be used to train models for abnormal behavior detection. Because artificial intelligence algorithms learn from user sample data, an attacker who pollutes the sample data can make the algorithm learn wrong data features. This changes the classification boundary of the model and seriously degrades the performance of the abnormal behavior detection model. Therefore, before model training, it is necessary to perform pollution detection on the user sample data.

At present, when pollution detection is performed on the training set of an abnormal behavior detection model, the sample data is usually checked for duplicates to determine whether it contains polluted sample data. However, sample data can be polluted by copying, subtle transformation, synthesis, and other means. Simple duplicate checking cannot detect transformed or synthesized polluted sample data, so the detection accuracy for polluted sample data is low and the security of the sample data cannot be guaranteed, which in turn affects the detection accuracy of the abnormal behavior detection model.

Summary of the invention

The invention provides a method and device for detecting polluted sample data for model training, which mainly aims to improve the detection accuracy of polluted sample data, ensure the safety of sample data, and thereby improve the detection accuracy of abnormal behavior detection models.

According to a first aspect of the present invention, a method for detecting contaminated sample data for model training is provided, including:

acquiring sample attribute data corresponding to each platform user to be detected, the sample attribute data at least including device attribute data, risk control data, and business data corresponding to each platform user;

hashing each piece of sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hashing algorithm, wherein each hash table includes multiple hash buckets;

determining, in the different hash tables, the data located in the same hash buckets as the respective sample attribute data as first sample attribute data;

screening out, from the first sample attribute data, second sample attribute data similar to the respective sample attribute data; and

determining, based on the amount of sample data corresponding to the second sample attribute data, whether each piece of sample attribute data is polluted sample data.

According to a second aspect of the present invention, there is provided a detection device for contaminated sample data for model training, comprising:

An acquisition unit, configured to acquire sample attribute data corresponding to each platform user to be detected, the sample attribute data at least including device attribute data, risk control data, and business data corresponding to each platform user;

A hashing unit, configured to hash each piece of sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hashing algorithm, wherein each hash table includes multiple hash buckets;

A determining unit, configured to determine the data in the same hash bucket as the respective sample attribute data in the different hash tables as the first sample attribute data;

A screening unit, configured to screen out second sample attribute data similar to the respective sample attribute data from the first sample attribute data;

A judging unit, configured to respectively judge whether each sample attribute data is polluted sample data based on the amount of sample data corresponding to the second sample attribute data.

According to a third aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, and when the program is executed by a processor, the following steps are implemented:

According to a fourth aspect of the present invention, a computer device is provided, including a memory, a processor, and a computer program stored on the memory and operable on the processor, and the processor implements the following steps when executing the program:

The invention provides a method and device for detecting polluted sample data for model training. Compared with the current approach of detecting polluted sample data by duplicate checking, the present invention acquires sample attribute data corresponding to each platform user to be detected, the sample attribute data at least including the device attribute data, risk control data, and business data corresponding to each platform user; uses a preset locality-sensitive hashing algorithm to hash each piece of sample attribute data into corresponding hash buckets in different hash tables, wherein each hash table includes multiple hash buckets; determines, in the different hash tables, the data located in the same hash buckets as the respective sample attribute data as first sample attribute data; screens out, from the first sample attribute data, second sample attribute data similar to the respective sample attribute data; and finally determines, based on the amount of sample data corresponding to the second sample attribute data, whether each piece of sample attribute data is polluted sample data. This improves the detection accuracy for polluted sample data, ensures the security of the sample attribute data, and improves the detection accuracy of the abnormal behavior detection model.

Description of drawings

The accompanying drawings described here provide a further understanding of the present invention and constitute a part of the application. The schematic embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute improper limitations on it. In the drawings:

FIG. 1 shows a flow chart of a method for detecting contaminated sample data used for model training provided by an embodiment of the present invention;

FIG. 2 shows a flow chart of another method for detecting contaminated sample data used for model training provided by an embodiment of the present invention;

FIG. 3 shows a schematic diagram of a hash table provided by an embodiment of the present invention;

FIG. 4 shows a schematic structural diagram of a detection device for contaminated sample data used for model training provided by an embodiment of the present invention;

FIG. 5 shows a schematic structural diagram of another detection device for contaminated sample data used for model training provided by an embodiment of the present invention;

FIG. 6 shows a schematic diagram of a physical structure of a computer device provided by an embodiment of the present invention.

Detailed description

Hereinafter, the present invention will be described in detail with reference to the drawings and examples. It should be noted that, in the case of no conflict, the embodiments in the present application and the features in the embodiments can be combined with each other.

At present, when pollution detection is performed on sample data, the sample data is usually checked for duplicates to determine whether it contains polluted sample data. However, sample data can be polluted by copying, transformation, synthesis, and other means. Simple duplicate checking cannot detect transformed or synthesized polluted sample data, so the detection accuracy for polluted sample data is low and the security of the sample data cannot be ensured, which in turn affects the detection accuracy of the abnormal behavior detection model.

In order to solve the above problems, an embodiment of the present invention provides a method for detecting contaminated sample data for model training, as shown in FIG. 1 , the method includes:

101. Obtain sample attribute data corresponding to each platform user to be detected.

The sample attribute data includes at least the device attribute data, risk control data, and business data corresponding to each platform user. When a user operates the e-commerce APP, business personnel instrument important checkpoints in the operation flow with tracking points; after the user triggers a preset tracking point, corresponding device attribute information is generated. The device attribute information specifically includes the device ID, device model, APP name, APP version number, and so on. The risk control data includes all request information and personal information of the user, where the personal information includes the user name, mobile phone number, and so on. The business data includes information such as all of the user's orders, refunds, and order details. To overcome the defect in the prior art that the detection accuracy for polluted sample data is low, which in turn affects the detection accuracy for abnormal user behavior, the embodiment of the present invention hashes the sample attribute data of each platform user into corresponding hash buckets in different hash tables, screens out second sample attribute data similar to each piece of sample attribute data from the first sample attribute data located in the same hash buckets, and determines, based on the amount of sample data corresponding to the second sample attribute data, whether each piece of sample attribute data is polluted sample data. Compared with the simple duplicate checking of the prior art, the embodiment of the present invention detects polluted sample data more accurately, thereby improving the detection accuracy of the abnormal behavior detection model.

The embodiment of the present invention is mainly applied to scenarios in which pollution detection is performed on sample attribute data. The embodiment is executed by an apparatus or device capable of performing pollution detection on sample attribute data, which can be deployed on the client side or the server side.

For the embodiment of the present invention, the device attribute data, risk control data, and business data of a large number of platform users are collected in advance and used as the users' sample attribute data. Because the sample attribute data may have been maliciously polluted, polluted data in the sample attribute data must be detected in order to ensure the detection accuracy of the subsequent abnormal behavior detection model. For example, sample attribute data is collected from 200 platform users: whenever a platform user operates on the e-commerce platform, that user's sample attribute data is collected. If 5 pieces of sample attribute data are collected per user, 1,000 pieces are collected in total, and these 1,000 pieces are used as the training data for the abnormal behavior detection model. Before formal training, it is necessary to first detect whether the 1,000 pieces of sample attribute data contain polluted data, so that a polluted sample set does not impair the later training of the abnormal behavior detection model.

102. Using a preset locality-sensitive hashing algorithm, respectively hash each sample attribute data into corresponding hash buckets in different hash tables.

Each hash table includes multiple hash buckets, and the specific number of hash buckets can be set according to actual business requirements. For the embodiment of the present invention, pollution detection on sample attribute data requires counting similar sample data through sample matching. Because the training set contains a large amount of sample attribute data, computing the similarity between every pair of sample attribute data would require a very large amount of computation. To reduce the computation in the matching process, the embodiment of the present invention uses a preset locality-sensitive hashing algorithm to hash each piece of sample attribute data into corresponding hash buckets in different hash tables. Because sample attribute data in the same hash bucket is similar with high probability, the second sample attribute data similar to a given piece of sample attribute data can be determined from the first sample attribute data located in the same hash buckets. This greatly narrows the matching range within massive data and reduces the amount of computation.

Specifically, to hash each piece of sample attribute data into corresponding hash buckets in different hash tables by using the preset locality-sensitive hashing algorithm, a hash function that satisfies the following conditions must be found:

If d(x, y) ≤ R, then the probability that h(x) = h(y) is not less than P₁;

If d(x, y) ≥ cR, then the probability that h(x) = h(y) is not greater than P₂;

Here d(x, y) is the distance between any two pieces of sample attribute data x and y in the high-dimensional space, h is the hash function, h(x) and h(y) are the hash transformations of x and y, and c is a constant, with c > 1 and P₁ > P₂ required. The probability values P₁ and P₂ are preset according to the required accuracy with which similar samples are found: the higher the required accuracy, the larger P₁ and the smaller P₂; conversely, the lower the required accuracy, the smaller P₁ and the larger P₂, while P₁ > P₂ must always hold.

In the embodiment of the present invention, the number of hash tables and the number of hash functions corresponding to each hash table are determined according to the required accuracy of the similar-sample search. It should be noted that the more hash tables there are, and the more hash functions correspond to each hash table, the higher the accuracy of finding similar sample attribute data; however, more hash tables and hash functions also increase the amount of computation on the sample attribute data. The number of hash tables and the number of hash functions per hash table are therefore determined by weighing search accuracy against computation. For example, the number of hash tables is set to 3 and each hash table corresponds to two hash functions: hash table 1 corresponds to hash functions h₁ and h₂, hash table 2 corresponds to hash functions h₃ and h₄, and hash table 3 corresponds to hash functions h₅ and h₆.
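The trade-off between the number of hash tables and the number of hash functions per table can be illustrated with the standard locality-sensitive hashing amplification bound. The following is a minimal sketch, not part of the patent text. It assumes that the hash functions are independent and that two samples share a bucket in a table only when all of that table's hash functions agree. The function name `collision_probability` and its parameters are illustrative.

```python
def collision_probability(p: float, k: int, num_tables: int) -> float:
    """Probability that two samples share a bucket in at least one table.

    p          -- probability that a single hash function agrees on the pair
    k          -- hash functions per table (assumed: all k must agree)
    num_tables -- number of hash tables (assumed independent)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if k < 1 or num_tables < 1:
        raise ValueError("k and num_tables must be positive")
    # All k functions of one table must agree for the pair to share a bucket.
    same_bucket_in_one_table = p ** k
    # The pair is missed only if it is separated in every table.
    return 1.0 - (1.0 - same_bucket_in_one_table) ** num_tables
```

With the example configuration of 3 tables and 2 hash functions per table, a pair with p = 0.9 shares at least one bucket with probability of about 0.993, while a pair with p = 0.3 does so with probability of about 0.246. Under these assumptions, adding tables raises both figures and adding hash functions per table lowers both, which is why the two numbers are tuned together.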

Specifically, after the number of hash tables and the hash functions corresponding to each hash table are determined, the hash functions of each hash table are used to calculate the hash value of each piece of sample attribute data in that hash table. Each piece of sample attribute data is then hashed into the corresponding hash bucket of each hash table according to its hash value in that table. Each hash table has multiple hash buckets, and each hash bucket corresponds to a different hash value.

For example, hash table 1, hash table 2, and hash table 3 each include 4 hash buckets, and in all three hash tables the first hash bucket corresponds to the hash value 00, the second to 01, the third to 10, and the fourth to 11. If the hash functions h₁ and h₂ of hash table 1 give sample attribute data A the hash value 00, which is the hash value of the first hash bucket, sample attribute data A is hashed into the first hash bucket of hash table 1. If the hash functions h₃ and h₄ of hash table 2 give sample attribute data A the hash value 01, which is the hash value of the second hash bucket, sample attribute data A is hashed into the second hash bucket of hash table 2. If the hash functions h₅ and h₆ of hash table 3 give sample attribute data A the hash value 10, which is the hash value of the third hash bucket, sample attribute data A is hashed into the third hash bucket of hash table 3. In this way, each piece of sample attribute data can be hashed into corresponding hash buckets in different hash tables by using the preset locality-sensitive hashing algorithm.
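The bucket assignment in this example could be sketched as follows. This is an illustrative sketch only: it assumes each hash table is represented as a dictionary from a hash value such as "00" to the list of sample identifiers in that bucket, and that each table's hash functions are wrapped in one callable that returns the concatenated hash value. The names `build_hash_tables` and `table_hashers` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence

HashTable = Dict[str, List[Hashable]]


def build_hash_tables(
    samples: Dict[Hashable, object],
    table_hashers: Sequence[Callable[[object], str]],
) -> List[HashTable]:
    """Place every sample into exactly one bucket per hash table.

    table_hashers[i] maps a sample to its hash value in table i, for
    example the concatenated outputs of that table's two hash functions.
    """
    tables: List[HashTable] = [defaultdict(list) for _ in table_hashers]
    for sample_id, sample in samples.items():
        for table, hasher in zip(tables, table_hashers):
            table[hasher(sample)].append(sample_id)
    return tables


# Stand-in data: each sample is represented directly by its three hash
# values, so sample A reproduces the example (00, 01, 10).
precomputed = {"A": ("00", "01", "10")}
hashers = [lambda s, i=i: s[i] for i in range(3)]
example_tables = build_hash_tables(precomputed, hashers)
```

Here `example_tables[0]["00"]`, `example_tables[1]["01"]`, and `example_tables[2]["10"]` each contain sample A, matching the first, second, and third hash buckets of the example.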

103. Determine, in the different hash tables, the data in the same hash bucket as the respective sample attribute data as the first sample attribute data.

For the embodiment of the present invention, step 102 has already hashed each piece of sample attribute data into corresponding hash buckets in different hash tables. Because sample attribute data in the same hash bucket is similar with high probability, the first sample attribute data located in the same hash buckets as each piece of sample attribute data can be determined in the different hash tables, and the second sample attribute data similar to each piece of sample attribute data can then be screened out from its first sample attribute data.

For example, sample attribute data A is in the first hash bucket of hash table 1, the third hash bucket of hash table 2, and the fourth hash bucket of hash table 3. For sample attribute data A, all sample attribute data in the first hash bucket of hash table 1, in the third hash bucket of hash table 2, and in the fourth hash bucket of hash table 3 is therefore obtained and used as the first sample attribute data corresponding to sample attribute data A, from which the second sample attribute data similar to sample attribute data A is screened out. As another example, sample attribute data B is in the third hash bucket of hash table 1, the fourth hash bucket of hash table 2, and the first hash bucket of hash table 3. For sample attribute data B, all sample attribute data in the third hash bucket of hash table 1, in the fourth hash bucket of hash table 2, and in the first hash bucket of hash table 3 is therefore obtained and used as the first sample attribute data corresponding to sample attribute data B, from which the second sample attribute data similar to sample attribute data B is screened out. In this way, the first sample attribute data located in the same hash buckets as each piece of sample attribute data can be obtained, which greatly narrows the matching range.
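Collecting the first sample attribute data amounts to taking the union of the buckets that contain a given sample across all hash tables. The sketch below assumes each hash table is a dictionary from hash value to a list of sample identifiers; the function name `candidates_for` and the bucket contents other than A and B are hypothetical.

```python
from typing import Dict, Hashable, List, Set


def candidates_for(
    sample_id: Hashable,
    tables: List[Dict[str, List[Hashable]]],
) -> Set[Hashable]:
    """Union of all samples sharing a bucket with sample_id in any table."""
    found: Set[Hashable] = set()
    for table in tables:
        for members in table.values():
            if sample_id in members:
                found.update(members)
                break  # a sample sits in exactly one bucket per table
    found.discard(sample_id)  # a sample is not its own candidate
    return found


# Bucket positions of A and B follow the example (buckets 1 to 4 have
# hash values 00, 01, 10, 11); samples C and D are made up.
tables = [
    {"00": ["A", "C"], "10": ["B"]},
    {"10": ["A"], "11": ["B", "D"]},
    {"11": ["A", "D"], "00": ["B"]},
]
first_for_a = candidates_for("A", tables)
first_for_b = candidates_for("B", tables)
```

Scanning every bucket keeps the sketch short; an implementation that stores each sample's bucket keys when the tables are built could look the buckets up directly instead.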

104. Filter out second sample attribute data similar to the respective sample attribute data from the first sample attribute data.

For the embodiment of the present invention, after the matching range (the first sample attribute data) corresponding to each piece of sample attribute data is determined, a further screening is needed: the preset locality-sensitive hashing algorithm only ensures that sample attribute data in the same hash bucket is similar with high probability, not that it is necessarily similar. Therefore, to further improve the detection accuracy for polluted sample data, the second sample attribute data that is truly similar to each piece of sample attribute data is screened out from its first sample attribute data.

Specifically, the sample distance between each piece of sample attribute data and each piece of its first sample attribute data is calculated, and the second sample attribute data similar to the sample attribute data is screened out from the first sample attribute data according to the sample distance. The sample distance can be calculated as the Euclidean distance, cosine distance, Hamming distance, or another measure, which is not specifically limited in the embodiment of the present invention. The larger the sample distance, the lower the similarity between sample attribute data; conversely, the smaller the sample distance, the higher the similarity. Therefore, the first sample attribute data whose sample distance is smaller than a preset sample distance is determined as the second sample attribute data similar to the respective sample attribute data.

For example, the sample attribute data located in the same hash buckets as sample attribute data A in the different hash tables includes sample attribute data B, C, and D; that is, the first sample attribute data corresponding to sample attribute data A includes sample attribute data B, C, and D. The sample distances between sample attribute data A and each of sample attribute data B, C, and D are calculated. By comparison, the sample distance between A and B is less than the preset distance, and the sample distance between A and C is less than the preset distance, so sample attribute data B and C are determined to be similar to sample attribute data A; that is, the second sample attribute data similar to sample attribute data A is sample attribute data B and C. As another example, the sample attribute data located in the same hash buckets as sample attribute data B in the different hash tables includes sample attribute data A, E, and F; that is, the first sample attribute data corresponding to sample attribute data B includes sample attribute data A, E, and F.

Because sample attribute data A precedes sample attribute data B in the processing order, the sample distance between A and B was already calculated while processing sample attribute data A, and A was already determined to be similar to B. For sample attribute data B, therefore, only the sample distances between B and E and between B and F need to be calculated. By comparison, the sample distance between B and E is less than the preset distance, so sample attribute data A and E are determined to be similar to sample attribute data B; that is, the second sample attribute data similar to sample attribute data B is sample attribute data A and E. In this way, the second sample attribute data similar to each piece of sample attribute data can be determined.
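The distance-based screening described above could look like the following sketch. It uses the Euclidean distance, one of the measures the text permits, and assumes each piece of sample attribute data has already been converted to a numeric vector. The function name `similar_samples`, the coordinates, and the threshold of 1.5 are illustrative assumptions, not values from the patent.

```python
import math
from typing import Dict, Hashable, Iterable, List, Sequence


def similar_samples(
    sample_id: Hashable,
    candidate_ids: Iterable[Hashable],
    vectors: Dict[Hashable, Sequence[float]],
    max_distance: float,
) -> List[Hashable]:
    """Keep the candidates whose distance to sample_id is below max_distance."""
    target = vectors[sample_id]
    return [
        candidate
        for candidate in candidate_ids
        if math.dist(target, vectors[candidate]) < max_distance
    ]


# Hypothetical vectors: B and C lie close to A, D lies farther away.
vectors = {"A": (1, 1), "B": (2, 1), "C": (1, 2), "D": (4, 3)}
second_for_a = similar_samples("A", ["B", "C", "D"], vectors, max_distance=1.5)
```

With these made-up values the result is B and C, mirroring the outcome of the example for sample attribute data A. Caching distances that were already computed for an earlier sample, as the text describes for the pair A and B, is left out to keep the sketch short.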

105. Based on the amount of sample data corresponding to the second sample attribute data, respectively determine whether each of the sample attribute data is contaminated sample data.

For the embodiment of the present invention, after the second sample attribute data similar to each piece of sample attribute data is determined, the amount of second sample attribute data is counted and compared with how the sample attribute data was actually collected to determine whether that amount is normal. If the amount does not exceed the normal range, the sample attribute data is not polluted; if the amount exceeds the normal range, the sample attribute data has been polluted, and this data needs to be removed from the training set.

For example, the training set consists of 1,000 pieces of sample attribute data collected from 200 platform users, 5 pieces per platform user. Under normal circumstances, the 5 pieces of sample attribute data of the same platform user should be similar. If the amount of second sample attribute data similar to sample attribute data A is 50, far exceeding the normal range of 5, the second sample attribute data similar to sample attribute data A has been polluted: an attacker has probably created data similar to sample attribute data A by copying, subtle transformation, or similar means and added it to the training set. To keep the training data free of pollution and ensure the detection accuracy of the abnormal behavior detection model, sample attribute data A and the second sample attribute data similar to it must be removed from the training set. In this way, by counting the amount of second sample attribute data corresponding to each piece of sample attribute data, it can be determined whether the sample attribute data and its corresponding second sample attribute data have been polluted.
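The count-based judgment can be sketched as below. The sketch assumes the "normal range" is a single upper bound (5 in the example) and that a count strictly greater than this bound marks pollution; the text does not state whether the comparison is strict, so that choice is an assumption. The names `flag_polluted` and `normal_count` are hypothetical.

```python
from typing import Dict, Hashable, List, Set


def flag_polluted(
    similar: Dict[Hashable, List[Hashable]],
    normal_count: int,
) -> Set[Hashable]:
    """Return the samples to remove from the training set.

    similar maps each sample to its second sample attribute data, that is,
    the samples found to be truly similar to it.
    """
    polluted: Set[Hashable] = set()
    for sample_id, neighbours in similar.items():
        if len(neighbours) > normal_count:  # assumed: strictly above the bound
            polluted.add(sample_id)
            polluted.update(neighbours)
    return polluted


# Sample A has 50 similar samples (far above 5); sample B has 2.
similar = {
    "A": [f"a{i}" for i in range(50)],
    "B": ["x", "y"],
}
to_remove = flag_polluted(similar, normal_count=5)
```

In this made-up input, sample A and its 50 similar samples are flagged for removal, while sample B and its neighbours are kept.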

The embodiment of the present invention provides a method for detecting polluted sample data for model training. Compared with the current approach of detecting polluted sample data by duplicate checking, the present invention acquires sample attribute data corresponding to each platform user to be detected, the sample attribute data at least including the device attribute data, risk control data, and business data corresponding to each platform user; uses a preset locality-sensitive hashing algorithm to hash each piece of sample attribute data into corresponding hash buckets in different hash tables, wherein each hash table includes multiple hash buckets; determines, in the different hash tables, the data located in the same hash buckets as the respective sample attribute data as first sample attribute data; screens out, from the first sample attribute data, second sample attribute data similar to the respective sample attribute data; and finally determines, based on the amount of sample data corresponding to the second sample attribute data, whether each piece of sample attribute data is polluted sample data. This improves the detection accuracy for polluted sample data, ensures the security of the sample attribute data, and improves the detection accuracy of the abnormal behavior detection model.

Further, to better illustrate the detection process described above, as a refinement and extension of the above embodiment, the embodiment of the present invention provides another method for detecting polluted sample data for model training. As shown in FIG. 2, the method includes:

201. Acquire sample attribute data corresponding to each platform user to be detected.

For the embodiment of the present invention, to ensure the security of the training set, each piece of sample attribute data in the training set must be obtained and checked to determine whether it is polluted data. The specific process of obtaining the sample attribute data is exactly the same as in step 101 and is not repeated here.

202. Using a preset locality-sensitive hashing algorithm, respectively hash each sample attribute data into corresponding hash buckets in different hash tables.

Each hash table includes multiple hash buckets, and the specific number of hash buckets can be set according to actual business requirements. For the embodiment of the present invention, to hash each piece of sample attribute data into corresponding hash buckets in different hash tables, step 202 specifically includes: using the preset locality-sensitive hashing algorithm to calculate the hash values of the respective sample attribute data in the different hash tables; and, based on the hash values, hashing the respective sample attribute data into corresponding hash buckets in the different hash tables. Further, calculating the hash values of the respective sample attribute data in the different hash tables by using the preset locality-sensitive hashing algorithm includes: determining the data dimension and coordinate values corresponding to the respective sample attribute data; determining, based on the data dimension and the coordinate values, the Hamming code corresponding to each piece of sample attribute data; and using the hash functions corresponding to the different hash tables to extract the codes at the corresponding positions in the Hamming code, the extracted codes being the hash values of the respective sample attribute data in the different hash tables.

Specifically, the preset locality-sensitive hashing algorithm adopted in the embodiment of the present invention is mainly locality-sensitive hashing under the Hamming distance. First, the data dimension of the sample attribute data and the coordinate values at its different positions are determined. The maximum coordinate value across the sample attribute data is then determined from these coordinate values, and the maximum coordinate value is multiplied by the data dimension to obtain the number of bits of the Hamming code. Each piece of sample attribute data is Hamming-encoded with that number of bits. Further, the hash functions corresponding to the different hash tables are used to extract the codes at the corresponding positions in the Hamming code of each piece of sample attribute data, and the extracted codes are determined as the hash values of that sample attribute data.

As shown in Figure 3, the training set includes 6 sample attribute data: A=(1,1), B=(2,1), C=(1,2), D=(2,2), E=(4,2) and F=(4,3). From the coordinate values of these sample attribute data at the different positions, the maximum coordinate value is determined to be 4, and the data dimension is 2, so the number of Hamming code bits is 4*2=8, and each sample attribute data is encoded into an 8-bit Hamming code. The specific formula of the Hamming coding is as follows:

$$v(p) = \mathrm{Unary}_C(x_1)\,\mathrm{Unary}_C(x_2)\cdots\mathrm{Unary}_C(x_n)$$

Here, $v(p)$ is the Hamming code corresponding to a sample attribute data $p$; $x_1, x_2, \ldots, x_n$ are the coordinate values of that sample attribute data; $n$ is its data dimension; and $C$ is the maximum coordinate value of the sample attribute data. $\mathrm{Unary}_C(x)$ is a binary code of length $C$ whose first $x$ bits are 1 and whose remaining $C - x$ bits are 0. After the code corresponding to each coordinate value of the sample attribute data has been determined, the codes are concatenated to obtain the Hamming code $v(p)$ corresponding to the sample attribute data.
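The encoding rule above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the embodiment: the function names `unary_code` and `hamming_code` and the dictionary `samples` are hypothetical, and the six coordinate pairs are the ones listed for Figure 3.

```python
def unary_code(x: int, c: int) -> str:
    """Length-c code of x: the first x bits are 1, the remaining c - x bits are 0."""
    if not 0 <= x <= c:
        raise ValueError("coordinate value must lie between 0 and c")
    return "1" * x + "0" * (c - x)


def hamming_code(point, c: int) -> str:
    """Concatenate the codes of all coordinate values of one sample."""
    return "".join(unary_code(x, c) for x in point)


# The six samples of the worked example (names are illustrative).
samples = {"A": (1, 1), "B": (2, 1), "C": (1, 2),
           "D": (2, 2), "E": (4, 2), "F": (4, 3)}

c = max(max(point) for point in samples.values())  # maximum coordinate value
n = len(samples["A"])                              # data dimension
codes = {name: hamming_code(point, c) for name, point in samples.items()}

print(c * n)       # number of code bits
print(codes["A"])  # code of sample A
```

With these inputs the number of code bits is 8 and the code of sample A is 10001000, which matches the values stated in the description.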

In the above example, $v(A) = \mathrm{Unary}_C(x_1)\,\mathrm{Unary}_C(x_2)$, where $x_1 = 1$, $x_2 = 1$, and $C$ is the maximum coordinate value 4 among all sample attribute data, so the 8-bit Hamming code corresponding to sample attribute data A is v(A)=10001000. Similarly, the 8-bit Hamming codes corresponding to the other sample attribute data are v(B)=11001000, v(C)=10001100, v(D)=11001100, v(E)=11111100 and v(F)=11111110. Further, according to the required accuracy of the similar sample attribute data search, 3 hash tables are set, and each hash table has two hash functions. Specifically, the first hash table is composed of hash functions $h_1$ and $h_2$, which extract the 2nd and 4th bits of the Hamming code respectively; the second hash table is composed of hash functions $h_3$ and $h_4$, which extract the 1st and 6th bits respectively; and the third hash table is composed of hash functions $h_5$ and $h_6$, which extract the 3rd and 8th bits respectively. These hash functions are used to extract the bits at the corresponding positions in the Hamming code of each sample attribute data, which yields the hash values of the sample attribute data in the different hash tables. For example, sample attribute data A has a hash value of 00 in the first hash table, 10 in the second hash table, and 00 in the third hash table.

Furthermore, since the extracted bits have only 4 possible values, namely 00, 01, 10 and 11, each hash table can be set to include 4 hash buckets whose hash values are 00, 01, 10 and 11 respectively. According to the hash values of sample attribute data A in the different hash tables, sample attribute data A is hashed into the first hash bucket of the first hash table, the third hash bucket of the second hash table, and the first hash bucket of the third hash table. Similarly, the other sample attribute data can be hashed into the corresponding hash buckets of the different hash tables, as shown in Figure 3.
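The bit extraction and bucket assignment can be sketched as follows. The names `hash_value`, `build_tables` and `table_positions` are hypothetical, the code for sample B is derived from B=(2,1) with the encoding rule rather than quoted, and the bucket contents are recomputed here rather than copied from Figure 3.

```python
from collections import defaultdict

# 8-bit codes of the six samples in the worked example.
codes = {
    "A": "10001000", "B": "11001000", "C": "10001100",
    "D": "11001100", "E": "11111100", "F": "11111110",
}

# 1-based bit positions read by the two hash functions of each hash table.
table_positions = [(2, 4), (1, 6), (3, 8)]


def hash_value(code: str, positions) -> str:
    """Extract the bits at the given 1-based positions and join them."""
    return "".join(code[p - 1] for p in positions)


def build_tables(codes, table_positions):
    """Return one dict per hash table, mapping hash value -> sample names."""
    tables = []
    for positions in table_positions:
        buckets = defaultdict(list)
        for name, code in codes.items():
            buckets[hash_value(code, positions)].append(name)
        tables.append(dict(buckets))
    return tables


tables = build_tables(codes, table_positions)
a_hashes = [hash_value(codes["A"], positions) for positions in table_positions]
print(a_hashes)  # hash values of sample A in tables 1, 2 and 3
```

For sample A this yields 00, 10 and 00, in agreement with the text. Buckets that receive no sample (for example 01) are simply absent from the dictionaries in this sketch.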

203. Determine, in the different hash tables, the data in the same hash bucket as the respective sample attribute data as the first sample attribute data.

For the embodiment of the present invention, the method of extracting the first sample attribute data in the same hash bucket as the sample attribute data in different hash tables is completely the same as that of step 103, and will not be repeated here.
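As a rough illustration of this step, the sketch below collects the first sample attribute data of one sample from several hash tables. It assumes that a sample sharing a bucket in any one of the tables counts as a candidate (a union over tables); the text of this section does not spell out whether a union or a stricter rule is intended. The function name `first_sample_candidates` is hypothetical, and the bucket contents are recomputed from the worked example rather than taken from Figure 3.

```python
def first_sample_candidates(name, tables):
    """Return every other sample that shares a bucket with `name` in any table."""
    candidates = set()
    for buckets in tables:
        for members in buckets.values():
            if name in members:
                candidates.update(members)
    candidates.discard(name)  # a sample is not its own candidate
    return sorted(candidates)


# Bucket contents of the three hash tables in the worked example
# (hash value -> names of the samples hashed into that bucket).
tables = [
    {"00": ["A", "C"], "10": ["B", "D"], "11": ["E", "F"]},
    {"10": ["A", "B"], "11": ["C", "D", "E", "F"]},
    {"00": ["A", "B", "C", "D"], "10": ["E", "F"]},
]

print(first_sample_candidates("A", tables))
print(first_sample_candidates("E", tables))
```

Under the union assumption, the candidates of sample A are B, C and D, and the candidates of sample E are C, D and F. Only these candidates need an exact distance computation in the next step.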

204. Filter out second sample attribute data similar to the respective sample attribute data from the first sample attribute data.

In the embodiment of the present invention, in order to determine the second sample attribute data similar to each sample attribute data, step 204 specifically includes: calculating the sample distance between each sample attribute data and each of its corresponding first sample attribute data; and determining the first sample attribute data whose sample distance is smaller than a preset distance as the second sample attribute data similar to that sample attribute data. The sample distance may specifically be a Hamming distance. Calculating the Hamming distance between a sample attribute data and its corresponding first sample attribute data includes: comparing the Hamming code corresponding to the sample attribute data with the Hamming code corresponding to the first sample attribute data, and determining the number of bit positions at which the two codes differ; and determining that number as the Hamming distance between the sample attribute data and the first sample attribute data. The preset distance may be set according to actual service requirements.

For example, the Hamming code corresponding to sample attribute data A is 110011, and the first sample attribute data corresponding to sample attribute data A include sample attribute data B, whose Hamming code is 111011, and sample attribute data C, whose Hamming code is 000001; the preset Hamming distance is 2. To determine the second sample attribute data similar to sample attribute data A, the Hamming distance between A and B is calculated first: each bit of the Hamming code of A is compared with the bit at the same position in the Hamming code of B, and the number of positions whose bit values differ is counted. The comparison shows that the codes of A and B differ in one bit, so the Hamming distance between them is 1. Because this distance is smaller than the preset Hamming distance 2, sample attribute data B is similar to sample attribute data A. Similarly, the Hamming distance between A and C is 3, which is greater than the preset Hamming distance 2, so sample attribute data C is not similar to sample attribute data A. That is, among the first sample attribute data, the second sample attribute data similar to sample attribute data A is sample attribute data B.
In this way, the second sample attribute data similar to the respective sample attribute data can be respectively determined in the manner described above.
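The distance comparison in the example above can be reproduced with a short sketch. The function names `hamming_distance` and `similar_samples` are hypothetical; the codes and the preset distance of 2 are the ones used in the example.

```python
def hamming_distance(code_a: str, code_b: str) -> int:
    """Count the bit positions at which two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


def similar_samples(query_code, candidate_codes, preset_distance):
    """Keep the candidates whose distance to the query is below the preset distance."""
    return [name for name, code in candidate_codes.items()
            if hamming_distance(query_code, code) < preset_distance]


code_a = "110011"
first_samples = {"B": "111011", "C": "000001"}

print(hamming_distance(code_a, first_samples["B"]))
print(hamming_distance(code_a, first_samples["C"]))
print(similar_samples(code_a, first_samples, preset_distance=2))
```

The two distances come out as 1 and 3, so only sample B is kept as second sample attribute data for sample A.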

Further, in the embodiment of the present invention, the sample attribute data can be sorted in advance. For example, the sorted order is sample attribute data A, sample attribute data B, sample attribute data C and sample attribute data D. The first sample attribute data corresponding to each sample attribute data can then be determined, and the sample distances calculated, in that order. For example, the first sample attribute data corresponding to sample attribute data A is determined first and the sample distance between sample attribute data A and its first sample attribute data is calculated; the first sample attribute data corresponding to sample attribute data B is determined next, and the sample distance between sample attribute data B and its first sample attribute data is calculated.

In a specific application scenario, in order to further reduce the amount of sample distance calculation, before the sample distance between a sample attribute data and one of its first sample attribute data is calculated, it is determined whether the sorting position of the first sample attribute data is before the sorting position of the sample attribute data. If it is before, the sample distance between the first sample attribute data and the sample attribute data has already been calculated and does not need to be calculated again; if it is after, the sample distance between the sample attribute data and the first sample attribute data needs to be calculated.

For example, the training set includes sample attribute data A, sample attribute data B, sample attribute data C and sample attribute data D; the first sample attribute data corresponding to sample attribute data A include sample attribute data B and sample attribute data C, and the first sample attribute data corresponding to sample attribute data B include sample attribute data A and sample attribute data D. If, in the earlier sample distance calculation, sample attribute data B has already been determined to be the second sample attribute data similar to sample attribute data A, then, when the second sample attribute data similar to sample attribute data B is determined, the sample distance between sample attribute data B and its first sample attribute data A does not need to be calculated again: the earlier result can be reused directly to determine that sample attribute data B is similar to sample attribute data A. Only the sample distance between sample attribute data B and its first sample attribute data D needs to be calculated in order to determine whether the two are similar. In this way, the amount of sample distance calculation can be further reduced and the efficiency of detecting polluted sample attribute data improved.
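One possible way to realize this reuse is sketched below. Instead of testing the sorting position explicitly, the sketch stores each computed distance under an order-independent pair key, which gives the same saving and also covers a pair that was not met earlier. Everything here is illustrative: the function names are hypothetical, the code of sample D and the candidate lists of C and D are assumptions added to complete the example, and the codes of A, B and C are borrowed from the earlier distance example.

```python
def hamming_distance(code_a: str, code_b: str) -> int:
    """Count the bit positions at which two equal-length codes differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


def similar_with_reuse(order, candidates, codes, preset_distance):
    """Process samples in sorted order, computing each pair distance only once."""
    cache = {}    # (name, name) pair in sorted order -> distance
    computed = 0  # number of distances actually calculated
    similar = {name: [] for name in order}
    for name in order:
        for other in candidates[name]:
            key = tuple(sorted((name, other)))
            if key not in cache:
                cache[key] = hamming_distance(codes[name], codes[other])
                computed += 1
            if cache[key] < preset_distance:
                similar[name].append(other)
    return similar, computed


codes = {"A": "110011", "B": "111011", "C": "000001", "D": "111111"}
candidates = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}

similar, computed = similar_with_reuse(["A", "B", "C", "D"], candidates, codes, 2)
print(similar)
print(computed)
```

Six candidate comparisons are requested in this run, but only three distances (A-B, A-C and B-D) are calculated; the remaining three are read from the cache.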

205. Count the amount of sample data corresponding to each platform user, and select the maximum sample data amount from these amounts.

In the embodiment of the present invention, whether each sample attribute data and its corresponding second sample attribute data have been polluted can be determined from the amount of sample data corresponding to the second sample attribute data similar to that sample attribute data. The judgment needs to take into account how the sample attribute data were actually collected. For example, the collected training set includes 1000 sample attribute data from 200 persons, with 5 sample attribute data collected from each person. The training set therefore covers 200 platform users, each corresponding to 5 pieces of sample data, so the maximum sample data amount is 5. As another example, the collected training set includes 200 sample attribute data from 50 persons, of whom 48 persons each contributed 4 sample attribute data, one person contributed only 1 sample attribute data, and another person contributed 7 sample attribute data, so the maximum sample data amount is 7. In this way, the maximum sample data amount can be determined according to the actual situation and then used to determine whether the sample attribute data is polluted data.
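The counting in the second example can be checked with a small sketch. The user identifiers are made up for illustration, and `max_samples_per_user` is a hypothetical helper name.

```python
from collections import Counter


def max_samples_per_user(user_ids):
    """Return the largest number of samples contributed by any single user."""
    counts = Counter(user_ids)
    return max(counts.values())


# One user id per collected sample: 48 users with 4 samples each,
# one user with a single sample and one user with 7 samples.
user_ids = [f"user{i}" for i in range(48) for _ in range(4)]
user_ids += ["user48"]
user_ids += ["user49"] * 7

print(len(user_ids))
print(max_samples_per_user(user_ids))
```

The list holds 200 samples in total and the maximum sample data amount is 7, as stated in the example.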

206. Determine whether each sample attribute data is contaminated sample data according to the maximum sample data amount and the sample data amount corresponding to the second sample attribute data.

The polluted sample data is sample attribute data that an attacker has polluted by copying, subtle modification, compounding or other means. In the embodiment of the present invention, in order to determine whether each sample attribute data is polluted data, step 206 specifically includes: subtracting the maximum sample data amount from the sample data amount corresponding to the second sample attribute data to obtain the sample amount difference corresponding to each sample attribute data; and, if the sample amount difference corresponding to a target sample attribute data among the sample attribute data is greater than a preset sample amount difference, determining that the target sample attribute data and its corresponding second sample attribute data are polluted sample data. The target sample attribute data is any sample attribute data in the training set, and the preset sample amount difference can be set according to actual business requirements.

For example, suppose that the maximum sample data amount is determined to be 100, the preset sample amount difference is 50, the sample data amount corresponding to the second sample attribute data similar to sample attribute data A is 200, and the sample data amount corresponding to the second sample attribute data similar to sample attribute data B is 110. The sample amount difference corresponding to sample attribute data A is then 200-100=100. Since this difference is greater than the preset sample amount difference 50, it can be determined that sample attribute data A and its corresponding second sample attribute data have been polluted. That is, an attacker may have maliciously created polluted sample attribute data similar to sample attribute data A by copying, transforming or similar means and added them to the training set, so that the amount of second sample attribute data, 200, far exceeds the normal sample data amount of 100. Sample attribute data A and all of its corresponding second sample attribute data therefore need to be removed from the training set in order to ensure the security of the training set. Similarly, the sample amount difference corresponding to sample attribute data B is 110-100=10. Since this difference is smaller than the preset sample amount difference 50, the sample data amount of the second sample attribute data similar to sample attribute data B is within the normal range, that is, sample attribute data B and its corresponding second sample attribute data are not polluted. In this way, whether each sample attribute data in the training set of the abnormal behavior detection model is polluted data can be determined in turn.
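The decision rule in this example reduces to a single comparison, sketched below with the numbers from the example. The function name `is_polluted` and its parameter names are hypothetical.

```python
def is_polluted(similar_count: int, max_sample_count: int,
                preset_difference: int) -> bool:
    """Flag a sample when its similar-sample count exceeds the maximum
    per-user sample count by more than the preset difference."""
    return similar_count - max_sample_count > preset_difference


max_sample_count = 100
preset_difference = 50

result_a = is_polluted(200, max_sample_count, preset_difference)  # difference 100
result_b = is_polluted(110, max_sample_count, preset_difference)  # difference 10
print(result_a, result_b)
```

Sample A is flagged as polluted and sample B is not. A difference exactly equal to the preset value is treated as not polluted in this sketch, since the description requires the difference to be greater than the preset value.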

The embodiment of the present invention provides another method for detecting polluted sample data used for model training. Compared with the current approach of detecting polluted sample data through sample duplication, this method acquires the sample attribute data corresponding to each platform user to be detected, the sample attribute data at least including the device attribute data, risk control data and business data corresponding to each platform user; uses a preset locality-sensitive hash algorithm to hash each sample attribute data into the corresponding hash buckets in different hash tables, wherein any hash table includes multiple hash buckets; determines the data located in the same hash bucket as each sample attribute data in the different hash tables as the first sample attribute data; screens out, from the first sample attribute data, the second sample attribute data similar to each sample attribute data; and finally determines, based on the amount of sample data corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data. This improves the detection accuracy of polluted sample data, ensures the security of the sample attribute data, and in turn improves the detection accuracy of the abnormal behavior detection model.

Further, as a specific implementation of FIG. 1, an embodiment of the present invention provides a device for detecting polluted sample data used for model training. As shown in FIG. 4, the device includes: an acquisition unit 31, a hash unit 32, a determining unit 33, a screening unit 34 and a judging unit 35.

The acquisition unit 31 may be configured to acquire sample attribute data corresponding to each platform user to be detected, the sample attribute data at least including device attribute data, risk control data and business data corresponding to each platform user.

The hash unit 32 may be configured to hash each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hash algorithm, wherein any hash table includes multiple hash buckets.

The determining unit 33 may be configured to determine the data in the same hash bucket as the respective sample attribute data in the different hash tables as the first sample attribute data.

The screening unit 34 may be configured to screen out second sample attribute data similar to the respective sample attribute data from the first sample attribute data.

The judging unit 35 may be configured to determine whether each sample attribute data is polluted sample data based on the amount of sample data corresponding to the second sample attribute data.

In a specific application scenario, in order to hash the respective sample attribute data into corresponding hash buckets in different hash tables, as shown in FIG. 5, the hash unit 32 includes: a first calculation module 321 and a hash module 322.

The first calculation module 321 may be configured to use a preset locality-sensitive hash algorithm to separately calculate the hash values of the respective sample attribute data in the different hash tables.

The hash module 322 may be configured to hash the respective sample attribute data into corresponding hash buckets in the different hash tables based on the hash value.

Further, in order to calculate the hash values of the respective sample attribute data in the different hash tables, the first calculation module 321 includes: a determination submodule and an extraction submodule.

The determining submodule can be used to determine the data dimension and coordinate value corresponding to each sample attribute data.

The determining submodule may also determine the Hamming code corresponding to each sample attribute data based on the data dimension and the coordinate value.

The extracting submodule may be configured to extract the code at the corresponding positions in the Hamming code by using the hash functions corresponding to the different hash tables, and to determine the extracted code as the hash value of each sample attribute data in the different hash tables.

Further, in order to filter out second sample attribute data similar to the respective sample attribute data from the first sample attribute data, the screening unit 34 includes: a second calculation module 341 and a determination module 342 .

The second calculation module 341 may be configured to respectively calculate sample distances between the respective sample attribute data and the corresponding first sample attribute data.

The determining module 342 may be configured to determine the first sample attribute data whose sample distance is less than a preset distance as the second sample attribute data similar to the respective sample attribute data.

In a specific application scenario, the sample distance is a Hamming distance, and the second calculation module 341 includes: a comparison submodule and a determination submodule.

The comparison submodule may be configured to compare the Hamming code corresponding to each sample attribute data with the Hamming code corresponding to its first sample attribute data, and to determine the number of code bits in which each sample attribute data and the first sample attribute data differ.

The determining submodule may be configured to determine the number of digits as a Hamming distance between each sample attribute data and its corresponding first sample attribute data.

In a specific application scenario, in order to determine whether each sample attribute data is polluted sample data, the judging unit 35 includes: a statistics module 351 and a determination module 352.

The statistical module 351 can be used to count the amount of sample data corresponding to each platform user, and filter out the largest amount of sample data from each amount of sample data.

The determining module 352 may be configured to determine whether each sample attribute data is contaminated sample data according to the maximum sample data amount and the sample data amount corresponding to the second sample attribute data.

Further, the determination module 352 includes: a subtraction submodule and a determination submodule.

The subtraction submodule may be configured to subtract the maximum sample data amount from the sample data amount corresponding to the second sample attribute data to obtain the sample amount difference corresponding to each sample attribute data.

The determining submodule may be configured to determine that the target sample attribute data and its corresponding second sample attribute data are polluted sample data if the sample amount difference corresponding to the target sample attribute data among the sample attribute data is greater than the preset sample amount difference.

It should be noted that, for other descriptions of the functional modules involved in the device for detecting polluted sample data used for model training provided by the embodiment of the present invention, reference may be made to the corresponding description of the method shown in FIG. 1, which will not be repeated here.

Based on the method shown in FIG. 1 above, an embodiment of the present invention correspondingly also provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the following steps are implemented: acquiring sample attribute data corresponding to each platform user to be detected, the sample attribute data at least including device attribute data, risk control data and business data corresponding to each platform user; hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hash algorithm, wherein any hash table includes multiple hash buckets; determining the data located in the same hash bucket as the respective sample attribute data in the different hash tables as first sample attribute data; screening out second sample attribute data similar to the respective sample attribute data from the first sample attribute data; and determining, based on the amount of sample data corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data.

Based on the embodiments of the method shown in FIG. 1 and the device shown in FIG. 4, an embodiment of the present invention also provides a schematic diagram of the physical structure of a computer device. As shown in FIG. 6, the computer device includes: a processor 41, a memory 42, and a computer program stored on the memory 42 and executable on the processor 41, wherein the memory 42 and the processor 41 are both arranged on a bus 43, and the processor 41 implements the following steps when executing the program: acquiring sample attribute data corresponding to each platform user to be detected, the sample attribute data at least including device attribute data, risk control data and business data corresponding to each platform user; hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hash algorithm, wherein any hash table includes multiple hash buckets; determining the data located in the same hash bucket as the respective sample attribute data in the different hash tables as first sample attribute data; screening out second sample attribute data similar to the respective sample attribute data from the first sample attribute data; and determining, based on the amount of sample data corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data.

Through the technical solution of the present invention, the sample attribute data corresponding to each platform user to be detected is acquired, the sample attribute data at least including the device attribute data, risk control data and business data corresponding to each platform user; each sample attribute data is hashed into corresponding hash buckets in different hash tables by using a preset locality-sensitive hash algorithm, wherein any hash table includes multiple hash buckets; the data located in the same hash bucket as the respective sample attribute data in the different hash tables is determined as the first sample attribute data; the second sample attribute data similar to the respective sample attribute data is screened out from the first sample attribute data; and finally, based on the amount of sample data corresponding to the second sample attribute data, it is determined whether each sample attribute data is polluted sample data. This improves the detection accuracy of polluted sample data and ensures the security of the sample attribute data, which in turn can improve the detection accuracy of the abnormal behavior detection model.

Obviously, those skilled in the art should understand that each module or each step of the present invention described above can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Alternatively, they may be implemented in program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; in some cases the steps shown or described may be carried out in an order different from that given here, or the modules or steps may be fabricated separately as individual integrated circuit modules, or several of them may be fabricated as a single integrated circuit module. As such, the present invention is not limited to any specific combination of hardware and software.

The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims

A method for detecting polluted sample data for model training, characterized in that it includes:

Acquiring sample attribute data corresponding to each platform user to be detected, the sample attribute data at least including device attribute data, risk control data, and business data corresponding to each platform user;

Using a preset locality-sensitive hash algorithm, hashing each sample attribute data into corresponding hash buckets in different hash tables, wherein any hash table includes multiple hash buckets;

determining the data in the same hash bucket as the respective sample attribute data in the different hash tables as the first sample attribute data;

Screening out second sample attribute data similar to the respective sample attribute data from the first sample attribute data;

Based on the amount of sample data corresponding to the second sample attribute data, it is determined whether each of the sample attribute data is contaminated sample data.
The method according to claim 1, wherein said using a preset locality-sensitive hash algorithm to hash each sample attribute data into corresponding hash buckets in different hash tables comprises:

calculating the hash values of the respective sample attribute data in the different hash tables by using a preset locality-sensitive hash algorithm;

Based on the hash value, hash the respective sample attribute data into corresponding hash buckets in the different hash tables.
The method according to claim 2, wherein said calculating the hash values of said respective sample attribute data in said different hash tables by using a preset locality-sensitive hash algorithm comprises:

Determine the data dimension and coordinate value corresponding to each sample attribute data;

Based on the data dimension and the coordinate value, determine the Hamming code corresponding to each sample attribute data;

Using the hash functions corresponding to the different hash tables, extract the code at the corresponding positions in the Hamming code, and determine the extracted code as the hash value of each sample attribute data in the different hash tables.
The method according to claim 3, wherein the filtering out the second sample attribute data similar to the respective sample attribute data from the first sample attribute data comprises:

Calculating sample distances between the respective sample attribute data and the corresponding first sample attribute data;

The first sample attribute data whose sample distance is smaller than a preset distance is determined as the second sample attribute data similar to the respective sample attribute data.
The method according to claim 4, wherein the sample distance is a Hamming distance, and the respective calculation of the sample distance between each sample attribute data and the corresponding first sample attribute data includes:

Comparing the Hamming code corresponding to each sample attribute data with the Hamming code corresponding to the first sample attribute data respectively, and determining the number of code bits in which each sample attribute data and the first sample attribute data differ;

The number of digits is determined as a Hamming distance between each sample attribute data and its corresponding first sample attribute data.
The method according to any one of claims 1-5, wherein, based on the amount of sample data corresponding to the second sample attribute data, respectively determining whether each sample attribute data is polluted sample data includes:

Counting the amount of sample data corresponding to each platform user, and selecting the largest amount of sample data from each sample data amount;

According to the maximum sample data volume and the sample data volume corresponding to the second sample attribute data, it is determined whether each sample attribute data is polluted sample data.
The method according to claim 6, wherein, according to the maximum amount of sample data and the amount of sample data corresponding to the second sample attribute data, determining whether each sample attribute data is polluted sample data includes:

Subtracting the maximum sample data volume from the sample data volume corresponding to the second sample attribute data to obtain the sample size difference corresponding to each sample attribute data;

If the sample size difference corresponding to the target sample attribute data in each sample attribute data is greater than the preset sample size difference, it is determined that the target sample attribute data and its corresponding second sample attribute data are contaminated sample data.
A detection device for contaminated sample data for model training, characterized in that it includes:

An acquisition unit, configured to acquire sample attribute data corresponding to each platform user to be detected, the sample attribute data at least including device attribute data, risk control data, and business data corresponding to each platform user;

A hash unit, configured to hash each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hash algorithm, wherein any hash table includes multiple hash buckets;

A determining unit, configured to determine the data in the same hash bucket as the respective sample attribute data in the different hash tables as the first sample attribute data;

a screening unit, configured to screen out second sample attribute data similar to the respective sample attribute data from the first sample attribute data;

A judging unit, configured to respectively judge whether each sample attribute data is polluted sample data based on the amount of sample data corresponding to the second sample attribute data.
A computer device, comprising a memory, a processor, and a computer program stored on the memory and operable on the processor, characterized in that, when the computer program is executed by the processor, the steps of the method according to any one of claims 1 to 7 are implemented.
A computer-readable storage medium, on which a computer program is stored, wherein, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 7 are realized.