CN113495886A

CN113495886A - Method and device for detecting pollution sample data for model training

Info

Publication number: CN113495886A
Application number: CN202111041760.3A
Authority: CN
Inventors: 刘胜; 魏国富; 夏玉明; 周晓勇; 马影; 殷钱安; 梁淑云; 余贤喆; 陶景龙; 王启凡; 徐�明
Original assignee: Information and Data Security Solutions Co Ltd
Current assignee: Information and Data Security Solutions Co Ltd
Priority date: 2021-09-07
Filing date: 2021-09-07
Publication date: 2021-10-12
Also published as: WO2023035362A1

Abstract

The invention discloses a method and a device for detecting contaminated sample data for model training, relates to the field of information technology, and mainly aims to improve the detection accuracy for contaminated sample data so as to safeguard the detection accuracy of an abnormal behavior detection model. The method comprises: acquiring sample attribute data corresponding to each platform user to be detected; hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality sensitive hash algorithm; determining the data in the same hash bucket as the sample attribute data in the different hash tables as first sample attribute data; screening out, from the first sample attribute data, second sample attribute data similar to the sample attribute data; and judging whether each sample attribute data is contaminated sample data based on the amount of sample data corresponding to the second sample attribute data. The invention is suitable for contamination detection of sample attribute data.

Description

Method and device for detecting pollution sample data for model training

Technical Field

The invention relates to the technical field of information, in particular to a method and a device for detecting pollution sample data for model training.

Background

With the growth of the internet, more and more people shop online, and e-commerce platforms frequently launch promotional activities to attract visitors. These promotions attract not only normal users but also various lawbreakers, so identifying the abnormal behavior of such lawbreakers is important for the security of a network platform. With the development of artificial intelligence, a model for detecting abnormal behavior can be trained on sample data from a large number of users. Because user sample data is the foundation of such a model, if an attacker contaminates the sample data so that the artificial intelligence algorithm learns wrong data characteristics, the classification boundary of the model is changed and the performance of the abnormal behavior detection model is seriously degraded. Contamination detection of user sample data is therefore necessary before model training.

At present, contamination of the training set of an abnormal behavior detection model is usually detected by checking the sample data for duplicates. However, sample data can be contaminated by copying, subtle transformation, synthesis, and similar means, and a simple duplicate check cannot detect transformed or synthesized contaminated sample data. As a result, the detection accuracy for contaminated sample data is low, the safety of the sample data cannot be ensured, and the detection accuracy of the abnormal behavior detection model is in turn affected.

Disclosure of Invention

The invention provides a method and a device for detecting polluted sample data for model training, which mainly aim to improve the detection precision of the polluted sample data, ensure the safety of the sample data and further improve the detection precision of an abnormal behavior detection model.

According to a first aspect of the present invention, there is provided a method for detecting contaminated sample data for model training, comprising:

acquiring sample attribute data corresponding to each platform user to be detected, wherein the sample attribute data at least comprises equipment attribute data, risk control data and service data corresponding to each platform user, the equipment attribute data comprises an equipment identifier and an application program identifier, the risk control data comprises request information and personal information of the platform user, and the service data comprises order information and order return information of the platform user;

respectively hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality sensitive hash algorithm, wherein any one hash table comprises a plurality of hash buckets;

determining data in the same hash bucket as the sample attribute data in the different hash tables as first sample attribute data;

screening out second sample attribute data similar to the sample attribute data from the first sample attribute data;

and respectively judging whether each sample attribute data is pollution sample data or not based on the sample data size corresponding to the second sample attribute data.

According to a second aspect of the present invention, there is provided an apparatus for detecting contaminated sample data for model training, comprising:

an acquisition unit, configured to acquire sample attribute data corresponding to each platform user to be detected, wherein the sample attribute data at least comprises equipment attribute data, risk control data and service data corresponding to each platform user, the equipment attribute data comprises an equipment identifier and an application program identifier, the risk control data comprises request information and personal information of the platform user, and the service data comprises order information and order return information of the platform user;

the hash unit is used for respectively hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality sensitive hash algorithm, wherein any one hash table comprises a plurality of hash buckets;

a determining unit, configured to determine, as first sample attribute data, data in the same hash bucket as the sample attribute data in the different hash tables;

a screening unit, configured to screen out second sample attribute data similar to the respective sample attribute data from the first sample attribute data;

and the judging unit is used for respectively judging whether each sample attribute data is the pollution sample data or not based on the sample data size corresponding to the second sample attribute data.

According to a third aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method according to the first aspect.

According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method according to the first aspect when executing the program.

Compared with the current approach of detecting contaminated sample data by checking samples for duplicates, the method and the device for detecting contaminated sample data for model training provided by the invention acquire sample attribute data corresponding to each platform user to be detected, wherein the sample attribute data at least comprises equipment attribute data, risk control data and service data corresponding to each platform user; hash each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality sensitive hash algorithm, wherein any one hash table comprises a plurality of hash buckets; determine the data in the same hash bucket as the sample attribute data in the different hash tables as first sample attribute data; screen out, from the first sample attribute data, second sample attribute data similar to the sample attribute data; and finally judge whether each sample attribute data is contaminated sample data based on the amount of sample data corresponding to the second sample attribute data. In this way the detection accuracy for contaminated sample data can be improved, the safety of the sample attribute data is ensured, and the detection accuracy of the abnormal behavior detection model can in turn be improved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow chart of a method for detecting contaminated sample data for model training according to an embodiment of the present invention;

FIG. 2 is a flow chart of another method for detecting contaminated sample data used for model training according to an embodiment of the present invention;

FIG. 3 illustrates a hash representation provided by an embodiment of the invention;

FIG. 4 is a schematic structural diagram illustrating an apparatus for detecting contaminated sample data for model training according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of another detection apparatus for contaminated sample data used for model training according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of the physical structure of a computer device according to an embodiment of the present invention.

Detailed Description

The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

Currently, contamination of sample data is usually detected by checking the sample data for duplicates. However, sample data can be contaminated by copying, transformation, synthesis, and similar means, and a simple duplicate check cannot detect transformed or synthesized contaminated sample data. As a result, the detection accuracy for contaminated sample data is low, the safety of the sample data cannot be ensured, and the detection accuracy of the abnormal behavior detection model is in turn affected.

In order to solve the above problem, an embodiment of the present invention provides a method for detecting contaminated sample data for model training, as shown in fig. 1, the method includes:

101. Acquiring sample attribute data corresponding to each platform user to be detected.

The sample attribute data at least includes equipment attribute data, risk control data and service data corresponding to each platform user. The equipment attribute data includes an equipment identifier and an application program identifier, the risk control data includes request information and personal information of the platform user, and the service data includes order information and order return information of the platform user. The equipment identifier may specifically be a device ID and a device model, and the application program identifier may specifically be an APP name and an APP version number. When a user operates in an e-commerce APP, business personnel set tracking points at key steps of the operation flow, and each time the user triggers a preset tracking point, a corresponding piece of equipment attribute data, risk control data or service data is generated. The equipment attribute data specifically includes the platform user, device ID, device model, APP name, APP version number and the like. The risk control data includes all request information and personal information of the user, where the request information includes the platform user's requests to participate in promotional activities, order placement requests, refund requests and the like, and the personal information includes a user name, a mobile phone number and the like. The service data includes all orders, refunds, order details and the like of the platform user.

In order to overcome the defects in the prior art that the detection accuracy for contaminated sample data is low and the detection accuracy for abnormal user behavior is affected, the embodiment of the invention hashes the sample attribute data of each platform user into corresponding hash buckets in different hash tables, and screens out second sample attribute data similar to the sample attribute data from the first sample attribute data located in the same hash bucket as the sample attribute data, so that whether each sample attribute data is contaminated sample data can be judged based on the amount of sample data corresponding to the second sample attribute data. The embodiment of the invention is mainly applied to scenarios in which contamination detection is performed on sample attribute data. The execution subject of the embodiment of the present invention is a device or an apparatus capable of performing contamination detection on sample attribute data, and may specifically be deployed on the client side or the server side.

For the embodiment of the invention, a large amount of equipment attribute data, risk control data and service data of platform users is collected in advance, and the collected data is used as the sample attribute data of the users. Because the sample attribute data may have been maliciously contaminated, the contaminated data in the sample attribute data needs to be detected in order to ensure the detection accuracy of the subsequent abnormal behavior detection model. For example, sample attribute data of 200 platform users is collected, one piece being collected each time a platform user operates on the e-commerce platform. If 5 pieces of sample attribute data are collected for each user, 1000 pieces are collected in total, and these 1000 pieces are used as the training data of the abnormal behavior detection model. Before formal training, it is first necessary to detect whether contaminated data exists among the 1000 pieces of sample attribute data, so as to prevent a contaminated sample attribute data set from degrading the training effect of the abnormal behavior detection model.
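The three groups of fields described for step 101 can be pictured as a single record type. The following is only an illustrative sketch: the class name `SampleAttributeData`, its field names, and the helper `build_training_set` are hypothetical and do not come from the patent, and the generated values are placeholders rather than real data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SampleAttributeData:
    """One record generated when a platform user triggers a tracking point.

    Hypothetical layout; the patent only names the kinds of information.
    """
    user_id: str
    # Equipment attribute data
    device_id: str = ""
    device_model: str = ""
    app_name: str = ""
    app_version: str = ""
    # Risk control data: request information and personal information
    requests: List[str] = field(default_factory=list)
    user_name: str = ""
    phone_number: str = ""
    # Service data: orders and order returns
    orders: List[str] = field(default_factory=list)
    order_returns: List[str] = field(default_factory=list)


def build_training_set(num_users: int, records_per_user: int) -> List[SampleAttributeData]:
    """Create placeholder records, one per simulated user operation."""
    records = []
    for u in range(num_users):
        for _ in range(records_per_user):
            records.append(
                SampleAttributeData(user_id=f"user-{u}", device_id=f"device-{u}")
            )
    return records


training_set = build_training_set(num_users=200, records_per_user=5)
print(len(training_set))  # 200 users x 5 records each
```

In practice each record would be converted to a numeric vector before hashing; that conversion is not specified in the text and is left out here.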

102. Hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality sensitive hashing algorithm.

Any hash table comprises a plurality of hash buckets, and the specific number of hash buckets can be set according to actual service requirements. For the embodiment of the invention, contamination detection of sample attribute data requires counting the amount of similar sample data through sample matching. Because the training set contains a large amount of sample attribute data, calculating the similarity between every pair of sample attribute data would involve a huge amount of computation. To reduce the computation required for matching, the embodiment of the invention uses a preset locality sensitive hash algorithm to hash each sample attribute data into corresponding hash buckets in different hash tables. Because sample attribute data in the same hash bucket are similar with high probability, the second sample attribute data similar to a given sample attribute data can be determined from the first sample attribute data located in the same hash bucket, which greatly narrows the matching range within the mass of data and reduces the amount of computation on the sample attribute data.

Specifically, to hash each sample attribute data into a corresponding hash bucket in different hash tables by using a preset locality sensitive hashing algorithm, a hash function satisfying the following conditions needs to be found:

if $d(x, y) \le R$, the probability that $h(x) = h(y)$ is not less than $P_1$;

if $d(x, y) \ge cR$, the probability that $h(x) = h(y)$ is not greater than $P_2$;

where $d(x, y)$ is the distance between any two sample attribute data $x$ and $y$ in the high-dimensional space, $R$ is a distance threshold, $h$ is a hash function, $h(x)$ and $h(y)$ are the hash transformations of the sample attribute data $x$ and $y$, and $c$ is a constant, with $c > 1$ and $P_1 > P_2$ required. The probability values $P_1$ and $P_2$ are preset according to the accuracy required of the similar-sample search: the higher the required accuracy, the larger $P_1$ and the smaller $P_2$; conversely, the lower the required accuracy, the smaller $P_1$ and the larger $P_2$, while $P_1 > P_2$ is always guaranteed.
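As a concrete illustration of these conditions, consider the bit-sampling hash family for Hamming distance, in which a hash function returns one randomly chosen bit of a binary code. This family is assumed here for illustration only. For two codes of length n at Hamming distance d, a randomly chosen position agrees with probability 1 - d/n, so for a chosen distance threshold (called `R` in the comments, an assumed name) close pairs collide more often than distant ones. The function names are hypothetical.

```python
from fractions import Fraction


def hamming_distance(x: str, y: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(x) != len(y):
        raise ValueError("codes must have the same length")
    return sum(a != b for a, b in zip(x, y))


def collision_probability(x: str, y: str) -> Fraction:
    """Probability that one uniformly sampled bit position agrees: 1 - d/n."""
    n = len(x)
    return Fraction(n - hamming_distance(x, y), n)


# Illustrative 8-bit codes (assumed inputs).
near = collision_probability("10001000", "11001000")  # distance 1
far = collision_probability("10001000", "11111100")   # distance 4

# With R = 1 and c = 4, P1 may be taken as 7/8 and P2 as 1/2, so P1 > P2.
print(near, far)
```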

In the embodiment of the present invention, the number of hash tables and the number of hash functions corresponding to each hash table need to be determined according to the accuracy required when searching for similar sample attribute data. It should be noted that the greater the number of hash tables and the greater the number of hash functions corresponding to each hash table, the higher the accuracy of the search for similar sample attribute data, but also the greater the amount of computation on the sample attribute data. The number of hash tables and the number of hash functions per hash table in the embodiment of the present invention are therefore determined by weighing search accuracy against the amount of computation. For example, the number of hash tables is set to 3 and each hash table corresponds to two hash functions: hash table 1 corresponds to hash functions h₁ and h₂, hash table 2 corresponds to hash functions h₃ and h₄, and hash table 3 corresponds to hash functions h₅ and h₆.
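The trade-off can be made quantitative under a common simplifying assumption that is not stated in the patent: the hash functions are independent and any single function maps two given samples to the same value with probability p. Two samples then share a bucket in one table with probability p^k (k functions per table) and share a bucket in at least one of L tables with probability 1 - (1 - p^k)^L. The function name below is hypothetical.

```python
def candidate_probability(p: float, k: int, num_tables: int) -> float:
    """Probability that two samples share a bucket in at least one table.

    p          -- per-function collision probability for the pair (assumed)
    k          -- hash functions per table
    num_tables -- number of hash tables
    """
    same_bucket_in_one_table = p ** k
    return 1.0 - (1.0 - same_bucket_in_one_table) ** num_tables


# Configuration from the example: 3 tables, 2 hash functions each.
print(candidate_probability(0.875, k=2, num_tables=3))  # close pair, about 0.987
print(candidate_probability(0.5, k=2, num_tables=3))    # distant pair, 0.578125
```

Under this model, adding tables raises the chance that truly similar samples become candidates, while adding functions per table narrows each bucket and removes dissimilar candidates; both increase the hashing cost.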

Specifically, after the number of hash tables and the hash function corresponding to each hash table are determined, the hash function corresponding to each hash table is used to calculate the hash value of each sample attribute data in different hash tables, and then each sample attribute data is hashed into a corresponding hash bucket in each hash table according to the hash value of each sample attribute data in each hash table, wherein each hash table corresponds to a plurality of hash buckets, and the hash values corresponding to the hash buckets are different.

For example, hash tables 1, 2 and 3 each include 4 hash buckets. In all three hash tables, the hash value corresponding to the first hash bucket is 00, the hash value corresponding to the second hash bucket is 01, the hash value corresponding to the third hash bucket is 10, and the hash value corresponding to the fourth hash bucket is 11. If the hash value of sample attribute data A calculated with the hash functions h₁ and h₂ corresponding to hash table 1 is 00, then, because this is the same as the hash value of the first hash bucket, sample attribute data A is hashed into the first hash bucket in hash table 1. If the hash value of sample attribute data A calculated with the hash functions h₃ and h₄ corresponding to hash table 2 is 01, then, because this is the same as the hash value of the second hash bucket, sample attribute data A is hashed into the second hash bucket in hash table 2. If the hash value of sample attribute data A calculated with the hash functions h₅ and h₆ corresponding to hash table 3 is 10, then, because this is the same as the hash value of the third hash bucket, sample attribute data A is hashed into the third hash bucket in hash table 3. In this way, the preset locality sensitive hashing algorithm can be used to hash each sample attribute data into the corresponding hash buckets in different hash tables.
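The bucket assignment can be sketched as follows, assuming each sample has already been encoded as a bit string and each hash function simply reads one bit position, so that two functions per table yield a two-bit bucket value. The bit positions in `TABLE_POSITIONS`, the sample codes, and the function names are illustrative assumptions, not values taken from the patent.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Hypothetical bit positions read by (h1, h2), (h3, h4) and (h5, h6).
TABLE_POSITIONS: List[Tuple[int, int]] = [(0, 4), (1, 5), (2, 6)]


def bucket_value(code: str, positions: Sequence[int]) -> str:
    """Concatenate the bits selected by one table's hash functions."""
    return "".join(code[i] for i in positions)


def build_tables(
    codes: Dict[str, str],
    table_positions: Sequence[Sequence[int]],
) -> List[Dict[str, List[str]]]:
    """Hash every sample into one bucket per table; buckets are keyed by value."""
    tables = [defaultdict(list) for _ in table_positions]
    for name, code in codes.items():
        for table, positions in zip(tables, table_positions):
            table[bucket_value(code, positions)].append(name)
    return [dict(table) for table in tables]


# Illustrative 8-bit codes for three samples (assumed inputs).
codes = {"A": "10001000", "B": "11001000", "E": "11111100"}
tables = build_tables(codes, TABLE_POSITIONS)
for number, table in enumerate(tables, start=1):
    print(f"hash table {number}: {table}")
```

Each table has at most four buckets (00, 01, 10, 11) because each bucket value is two bits long.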

103. Determining the data in the same hash bucket as the sample attribute data in the different hash tables as the first sample attribute data.

For the embodiment of the present invention, in step 102, each sample attribute data is hashed into a corresponding hash bucket in different hash tables, and since sample attribute data in the same hash bucket are similar with a high probability, first sample attribute data in the same hash bucket with each sample attribute data in different hash tables can be respectively determined, and then second sample attribute data similar to each sample attribute data is screened from the first sample attribute data corresponding to each sample attribute data.

For example, sample attribute data A is located in the first hash bucket of hash table 1, the third hash bucket of hash table 2 and the fourth hash bucket of hash table 3. For sample attribute data A, all sample attribute data in the first hash bucket of hash table 1, all sample attribute data in the third hash bucket of hash table 2 and all sample attribute data in the fourth hash bucket of hash table 3 may therefore be obtained, and the obtained sample attribute data is taken as the first sample attribute data corresponding to sample attribute data A, so that second sample attribute data similar to sample attribute data A can be screened from it. As another example, sample attribute data B is located in the third hash bucket of hash table 1, the fourth hash bucket of hash table 2 and the first hash bucket of hash table 3. For sample attribute data B, all sample attribute data in the third hash bucket of hash table 1, all sample attribute data in the fourth hash bucket of hash table 2 and all sample attribute data in the first hash bucket of hash table 3 may therefore be obtained, and the obtained sample attribute data is taken as the first sample attribute data corresponding to sample attribute data B, so that second sample attribute data similar to sample attribute data B can be screened from it. In this manner, the first sample attribute data located in the same hash bucket as each sample attribute data can be obtained, which greatly narrows the matching range of the sample attribute data.
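Collecting the first sample attribute data for one sample amounts to taking the union of the buckets that contain it, one bucket per table. The sketch below assumes each table is a mapping from bucket value to a list of sample names. The bucket contents (including samples C, D, E and F) are hypothetical and chosen only to mirror the bucket positions in the example; `first_sample_candidates` is an assumed name.

```python
from typing import Dict, List, Set


def first_sample_candidates(sample: str, tables: List[Dict[str, List[str]]]) -> Set[str]:
    """Union of all samples sharing a bucket with `sample` in any table."""
    candidates: Set[str] = set()
    for table in tables:
        for members in table.values():
            if sample in members:
                candidates.update(members)
    candidates.discard(sample)  # a sample is not its own candidate
    return candidates


# Bucket values 00, 01, 10, 11 stand for the first to fourth bucket.
tables = [
    {"00": ["A", "C"], "10": ["B"]},       # hash table 1
    {"10": ["A", "D"], "11": ["B", "E"]},  # hash table 2
    {"11": ["A"], "00": ["B", "F"]},       # hash table 3
]

print(sorted(first_sample_candidates("A", tables)))
print(sorted(first_sample_candidates("B", tables)))
```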

104. Screening out second sample attribute data similar to the sample attribute data from the first sample attribute data.

For the embodiment of the present invention, after the matching range (first sample attribute data) corresponding to each sample attribute data is determined, since the preset locality-sensitive hash algorithm can only ensure that the sample attribute data in the same hash bucket are similar with a high probability, but cannot ensure that the sample attribute data in the same hash bucket are similar to each other, in order to further improve the detection accuracy of the contaminated sample data, second sample attribute data that is truly similar to each sample attribute data may be screened from the first sample attribute data corresponding to each sample attribute data.

Specifically, the sample distance between each sample attribute data and each of its corresponding first sample attribute data may be calculated, and the second sample attribute data similar to the sample attribute data may be screened from the first sample attribute data according to the sample distance. The sample distance may be calculated as, for example, a Euclidean distance, a cosine distance, or a Hamming distance.

For example, the sample attribute data located in the same hash bucket as sample attribute data A in the different hash tables includes sample attribute data B, sample attribute data C and sample attribute data D; that is, the first sample attribute data corresponding to sample attribute data A includes sample attribute data B, C and D. The sample distances between sample attribute data A and each of sample attribute data B, C and D are calculated. If comparison shows that the sample distance between A and B is less than a preset distance and the sample distance between A and C is less than the preset distance, it can be determined that sample attribute data B and C are similar to sample attribute data A; that is, the second sample attribute data similar to sample attribute data A are sample attribute data B and sample attribute data C.

For another example, the sample attribute data located in the same hash bucket as sample attribute data B in the different hash tables includes sample attribute data A, sample attribute data E and sample attribute data F; that is, the first sample attribute data corresponding to sample attribute data B includes sample attribute data A, E and F. Because sample attribute data A is ordered before sample attribute data B, the sample distance between A and B has already been calculated during the distance calculation for A, and A and B have already been determined to be similar. For sample attribute data B, therefore, only the sample distances between B and E and between B and F need to be calculated. If comparison shows that the sample distance between B and E is less than the preset distance, it can be determined that sample attribute data A and E are similar to sample attribute data B; that is, the second sample attribute data similar to sample attribute data B are sample attribute data A and sample attribute data E. In this manner, the second sample attribute data similar to each sample attribute data can be determined.
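The screening step, including the reuse of distances that were already calculated for an earlier sample, might look like the sketch below. It assumes Euclidean distance and made-up two-dimensional coordinates chosen so that the outcome mirrors the A and B examples; the function name, the coordinates and the preset distance of 1.5 are all assumptions.

```python
import math
from typing import Dict, List, Set, Tuple

Point = Tuple[float, ...]


def screen_similar(
    points: Dict[str, Point],
    candidates: Dict[str, List[str]],
    max_distance: float,
) -> Tuple[Dict[str, Set[str]], int]:
    """Return each sample's similar set and the number of distances evaluated."""
    distances: Dict[frozenset, float] = {}
    similar: Dict[str, Set[str]] = {name: set() for name in points}
    for name in sorted(candidates):  # process samples in their sort order
        for other in candidates[name]:
            pair = frozenset((name, other))
            if pair not in distances:  # skip pairs handled for an earlier sample
                distances[pair] = math.dist(points[name], points[other])
            if distances[pair] < max_distance:
                similar[name].add(other)
                similar[other].add(name)
    return similar, len(distances)


# Hypothetical coordinates and candidate lists (first sample attribute data).
points = {"A": (0, 0), "B": (1, 0), "C": (0, 1), "D": (3, 3), "E": (2, 0), "F": (5, 0)}
candidates = {"A": ["B", "C", "D"], "B": ["A", "E", "F"]}

similar, evaluations = screen_similar(points, candidates, max_distance=1.5)
print(sorted(similar["A"]), sorted(similar["B"]), evaluations)
```

The A to B distance is evaluated once, while processing A, so six candidate entries require only five distance evaluations.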

105. Judging whether each sample attribute data is contaminated sample data based on the amount of sample data corresponding to the second sample attribute data.

For the embodiment of the invention, after second sample attribute data similar to each sample attribute data is respectively determined, the sample attribute data volume corresponding to the second sample attribute data needs to be counted, and whether the sample attribute data volume is normal or not is judged by combining the actual acquisition condition of the sample attribute data, and if the sample attribute data volume does not exceed the normal range, the sample attribute data are not polluted; if the sample attribute data amount exceeds the normal range, the sample attribute data is polluted, and the data needs to be excluded from the training set.

For example, the training set consists of 1000 pieces of sample attribute data collected from 200 platform users, 5 pieces per platform user. Normally, the 5 pieces of sample attribute data of the same platform user should be similar to one another. If the amount of second sample attribute data similar to sample attribute data A is 50, which is far beyond the normal range of 5, the second sample attribute data similar to sample attribute data A can be regarded as contaminated: an attacker has probably created data similar to sample attribute data A by copying, subtle transformation or similar means and added the created data to the training set. To keep the training data free of contamination and ensure the detection accuracy of the abnormal behavior detection model, sample attribute data A and the second sample attribute data similar to it need to be excluded from the training set. In this manner, by counting the amount of second sample attribute data corresponding to each sample attribute data, it can be determined whether the sample attribute data and its corresponding second sample attribute data are contaminated.
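The judgment in this example reduces to comparing the size of each similar set with the normal range. The sketch below assumes a single fixed threshold (`normal_max`, set to 5 as in the example) and flags both the sample and its similar samples; the function name and the sample identifiers are hypothetical.

```python
from typing import Dict, Set


def flag_contaminated(similar: Dict[str, Set[str]], normal_max: int) -> Set[str]:
    """Return samples whose similar set exceeds the normal range, plus that set."""
    contaminated: Set[str] = set()
    for name, neighbours in similar.items():
        if len(neighbours) > normal_max:
            contaminated.add(name)
            contaminated.update(neighbours)
    return contaminated


# Sample A has 50 similar samples (hypothetical identifiers); B has only one.
similar = {
    "A": {f"copy-{i}" for i in range(50)},
    "B": {"C"},
}
to_exclude = flag_contaminated(similar, normal_max=5)
print(len(to_exclude))           # A plus its 50 similar samples
print("B" in to_exclude)         # B stays in the training set
```

A real deployment would derive the normal range from how the data was actually collected, as the text notes, rather than from a hard-coded constant.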

Compared with the current approach of detecting contaminated sample data by checking samples for duplicates, the method for detecting contaminated sample data for model training provided by the embodiment of the invention acquires sample attribute data corresponding to each platform user to be detected, wherein the sample attribute data at least comprises equipment attribute data, risk control data and service data corresponding to each platform user; hashes each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality sensitive hash algorithm, wherein any one hash table comprises a plurality of hash buckets; determines the data in the same hash bucket as the sample attribute data in the different hash tables as first sample attribute data; screens out, from the first sample attribute data, second sample attribute data similar to the sample attribute data; and finally judges whether each sample attribute data is contaminated sample data based on the amount of sample data corresponding to the second sample attribute data. In this way the detection accuracy for contaminated sample data can be improved, the safety of the sample attribute data is ensured, and the detection accuracy of the abnormal behavior detection model can in turn be improved.

Further, in order to better describe the detection process of the contaminated sample data, as a refinement and an extension to the above embodiment, an embodiment of the present invention provides another detection method of contaminated sample data for model training, as shown in fig. 2, where the method includes:

201. Acquiring sample attribute data corresponding to each platform user to be detected.

For the embodiment of the present invention, in order to ensure the safety of the training set, each sample attribute data in the training set needs to be acquired for contamination detection, so as to determine whether the sample attribute data is contaminated data, and the specific process for acquiring the sample attribute data is completely the same as that in step 101, and is not described herein again.

202. Hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality sensitive hashing algorithm.

Any hash table comprises a plurality of hash buckets, and the specific number of the hash buckets can be set according to actual service requirements. For the embodiment of the present invention, in order to hash each sample attribute data into the corresponding hash bucket in different hash tables, step 202 specifically includes: respectively calculating hash values of the sample attribute data in the different hash tables by using a preset locality sensitive hash algorithm; and hashing the sample attribute data into corresponding hash buckets in different hash tables based on the hash values. Further, the calculating the hash values of the sample attribute data in the different hash tables by using a preset locality sensitive hash algorithm includes: determining data dimensions and coordinate values corresponding to the sample attribute data; determining Hamming codes corresponding to the sample attribute data based on the data dimension and the coordinate values; and extracting codes at corresponding positions in the Hamming codes by using hash functions corresponding to the different hash tables, and determining the extracted codes as hash values of the sample attribute data in the different hash tables.

Specifically, the preset locality sensitive hash algorithm adopted in the embodiment of the present invention is mainly a locality sensitive hash algorithm under the Hamming distance. It includes: determining the data dimension and the coordinate values at different positions corresponding to each sample attribute data; determining, from these coordinate values, the maximum coordinate value over all sample attribute data; multiplying the maximum coordinate value by the data dimension to obtain the number of Hamming code bits corresponding to each sample attribute data; and performing Hamming coding on each sample attribute data based on that number of bits. Further, the codes at the corresponding positions in the Hamming codes corresponding to the sample attribute data are extracted by using the hash functions corresponding to the different hash tables, and the extracted codes are determined as the hash values corresponding to the sample attribute data.

As shown in fig. 3, the training set includes 6 sample attribute data, specifically sample attribute data A = (1, 1), sample attribute data B = (2, 1), sample attribute data C = (1, 2), sample attribute data D = (2, 2), sample attribute data E = (4, 2), and sample attribute data F = (4, 3). According to the coordinate values of the sample attribute data at different positions, the maximum coordinate value can be determined to be 4, and the data dimension corresponding to the sample attribute data is 2, so the number of Hamming code bits can be determined to be 4 × 2 = 8. 8-bit Hamming coding is then performed on each sample attribute data, where the specific formula of the Hamming coding is as follows:

$$v(p) = \mathrm{Unary}_C(x_1)\,\mathrm{Unary}_C(x_2)\cdots\mathrm{Unary}_C(x_n)$$

wherein $v(p)$ represents the Hamming code corresponding to each sample attribute data $p = (x_1, x_2, \ldots, x_n)$; $x_1, \ldots, x_n$ represent the coordinate values corresponding to the sample attribute data; $n$ is the data dimension corresponding to the sample attribute data; $\mathrm{Unary}_C(x)$ is a binary code of length $C$; and $C$ is the maximum coordinate value of the sample attribute data. The codes corresponding to the individual coordinate values of a sample attribute data are determined and then spliced to obtain the Hamming code $v(p)$ corresponding to that sample attribute data. $\mathrm{Unary}_C(x)$ means that, in the code of length $C$, the first $x$ bits are 1 and the remaining $C - x$ bits are 0.

In the above example, for sample attribute data A,

$$v(A) = \mathrm{Unary}_4(1)\,\mathrm{Unary}_4(1)$$

wherein $x_1 = 1$, $x_2 = 1$, and $C$ is the maximum coordinate value 4 of all sample attribute data, so the 8-bit Hamming code corresponding to sample attribute data A can be determined to be 10001000.

In the same way, the 8-bit Hamming code corresponding to sample attribute data B can be determined to be 11001000.

The 8-bit Hamming code corresponding to sample attribute data C is 10001100.

The 8-bit Hamming code corresponding to sample attribute data D is 11001100.

The 8-bit Hamming code corresponding to sample attribute data E is 11111100.

The 8-bit Hamming code corresponding to sample attribute data F is 11111110.
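The encoding described above can be sketched in a few lines of Python. This is only an illustrative sketch under the definitions given in this embodiment; the function names `unary` and `hamming_code` and the dictionary `samples` are hypothetical and do not appear in the original text.

```python
from typing import Dict, Sequence, Tuple


def unary(x: int, c: int) -> str:
    """Return a c-bit string whose first x bits are 1 and whose remaining bits are 0."""
    if not 0 <= x <= c:
        raise ValueError("coordinate value must lie in [0, c]")
    return "1" * x + "0" * (c - x)


def hamming_code(point: Sequence[int], c: int) -> str:
    """Splice the unary codes of all coordinates of one sample into one code."""
    return "".join(unary(x, c) for x in point)


# The six samples of the fig. 3 example (names A..F as in the text).
samples: Dict[str, Tuple[int, int]] = {
    "A": (1, 1), "B": (2, 1), "C": (1, 2),
    "D": (2, 2), "E": (4, 2), "F": (4, 3),
}

# C is the maximum coordinate value over all samples (4 in this example).
c_max = max(x for point in samples.values() for x in point)

# Each code has c_max * dimension = 4 * 2 = 8 bits.
codes = {name: hamming_code(point, c_max) for name, point in samples.items()}
```

With this sketch, `codes["A"]` evaluates to `"10001000"`, which agrees with the hash values 00, 10 and 00 stated for sample attribute data A in the three hash tables.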

Furthermore, 3 hash tables are set according to the required accuracy of the similar sample attribute data lookup, and each hash table has two hash functions. Specifically, the first hash table is composed of hash functions h₁ and h₂, which extract the 2nd bit and the 4th bit of the Hamming code, respectively; the second hash table is composed of hash functions h₃ and h₄, which extract the 1st bit and the 6th bit of the Hamming code, respectively; and the third hash table is composed of hash functions h₅ and h₆, which extract the 3rd bit and the 8th bit of the Hamming code, respectively. The codes at the corresponding positions in the Hamming code corresponding to each sample attribute data are then extracted by using these hash functions, so as to obtain the hash values of the sample attribute data in the different hash tables. For example, the hash value of sample attribute data A is 00 in the first hash table, 10 in the second hash table, and 00 in the third hash table.

Further, since the extracted 2-bit code has only 4 possible values, namely 00, 01, 10, and 11, each hash table may be set to include 4 hash buckets whose hash values are 00, 01, 10, and 11, respectively. According to the hash values of sample attribute data A in the different hash tables, it can be determined that sample attribute data A is hashed into the first hash bucket of the first hash table, the third hash bucket of the second hash table, and the first hash bucket of the third hash table. Similarly, the other sample attribute data can be hashed into the corresponding hash buckets of the different hash tables, as shown in fig. 3.
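A minimal sketch of this bucketing step is given below, assuming the 8-bit codes that follow from the unary encoding described above and the bit positions named for h₁ to h₆. The names `CODES`, `TABLE_BITS`, `hash_value` and `build_tables` are hypothetical helpers introduced only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# 8-bit codes of samples A..F, derived from the unary encoding with C = 4.
CODES: Dict[str, str] = {
    "A": "10001000", "B": "11001000", "C": "10001100",
    "D": "11001100", "E": "11111100", "F": "11111110",
}

# 1-based bit positions sampled by each table: (h1, h2), (h3, h4), (h5, h6).
TABLE_BITS: List[Tuple[int, int]] = [(2, 4), (1, 6), (3, 8)]


def hash_value(code: str, positions: Sequence[int]) -> str:
    """Extract the bits at the given 1-based positions and join them."""
    return "".join(code[p - 1] for p in positions)


def build_tables(codes: Dict[str, str],
                 table_bits: Sequence[Sequence[int]]) -> List[Dict[str, List[str]]]:
    """Build one bucket map (hash value -> sample names) per hash table."""
    tables = []
    for positions in table_bits:
        buckets = defaultdict(list)
        for name, code in codes.items():
            buckets[hash_value(code, positions)].append(name)
        tables.append(dict(buckets))
    return tables


tables = build_tables(CODES, TABLE_BITS)
a_hashes = [hash_value(CODES["A"], bits) for bits in TABLE_BITS]
```

Here `a_hashes` is `["00", "10", "00"]`, matching the hash values stated for sample attribute data A. Only non-empty buckets are stored in this sketch; an implementation that pre-allocates all four buckets per table would behave the same for lookups.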

203. Determining the data in the same hash bucket as each sample attribute data in the different hash tables as first sample attribute data.

For the embodiment of the present invention, the manner of extracting the first sample attribute data in the same hash bucket as the sample attribute data in different hash tables is completely the same as that in step 103, and is not described herein again.
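As an illustration of step 203, the sketch below collects the first sample attribute data of a given sample. It assumes that the candidates are the union of the sample's bucket-mates over all hash tables, which is one plausible reading of the step; the bucket contents in `TABLES` are those obtained from the six-sample example with bit positions (2, 4), (1, 6) and (3, 8), and the function name is hypothetical.

```python
from typing import Dict, List

# Bucket maps (hash value -> sample names) for the three hash tables of the
# six-sample example; listed here as literals so the sketch is self-contained.
TABLES: List[Dict[str, List[str]]] = [
    {"00": ["A", "C"], "10": ["B", "D"], "11": ["E", "F"]},
    {"10": ["A", "B"], "11": ["C", "D", "E", "F"]},
    {"00": ["A", "B", "C", "D"], "10": ["E", "F"]},
]


def first_sample_candidates(tables: List[Dict[str, List[str]]], name: str) -> List[str]:
    """Return every sample sharing a bucket with `name` in at least one table.

    Assumption: candidates from different tables are merged by set union.
    """
    candidates = set()
    for buckets in tables:
        for members in buckets.values():
            if name in members:
                candidates.update(members)
    candidates.discard(name)  # a sample is not its own candidate
    return sorted(candidates)


candidates_of_a = first_sample_candidates(TABLES, "A")
```

For sample attribute data A this yields B, C and D, so only three of the five other samples need an exact distance computation in step 204.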

204. Screening out second sample attribute data similar to each sample attribute data from the first sample attribute data.

For the embodiment of the present invention, in order to determine the second sample attribute data similar to each sample attribute data, step 204 specifically includes: respectively calculating the sample distance between each sample attribute data and its corresponding first sample attribute data; and determining the first sample attribute data whose sample distance is smaller than a preset distance as second sample attribute data similar to that sample attribute data. The sample distance may specifically be a Hamming distance. The process of calculating the Hamming distance between a sample attribute data and its corresponding first sample attribute data includes: comparing the Hamming code corresponding to the sample attribute data with the Hamming code corresponding to the first sample attribute data, and determining the number of bit positions at which the two codes differ; and determining that number of bits as the Hamming distance between the sample attribute data and its corresponding first sample attribute data. The preset distance can be set according to actual service requirements.

For example, suppose the Hamming code corresponding to sample attribute data A is 110011, the Hamming code corresponding to sample attribute data C is 000001, and the preset Hamming distance is 2. When the second sample attribute data similar to sample attribute data A is determined, the Hamming distance between sample attribute data A and sample attribute data B is calculated: the bit values in the Hamming code corresponding to sample attribute data A are compared position by position with the bit values in the Hamming code corresponding to sample attribute data B, and the number of positions with different bit values is determined. The comparison shows that the two codes differ in one bit, so the Hamming distance between sample attribute data A and sample attribute data B is 1. Since this distance is smaller than the preset Hamming distance 2, sample attribute data A is determined to be similar to sample attribute data B. Similarly, the Hamming distance between sample attribute data A and sample attribute data C is determined to be 3 (110011 and 000001 differ in the 1st, 2nd and 5th bits); since this distance is greater than the preset Hamming distance 2, sample attribute data C is determined not to be similar to sample attribute data A. That is, among the first sample attribute data, the second sample attribute data similar to sample attribute data A is sample attribute data B. The second sample attribute data similar to each sample attribute data can be determined in the same manner.
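The screening in step 204 can be sketched as follows. The 8-bit codes used as input are illustrative values taken from the unary encoding of the six-sample example rather than from the 6-bit example above, and the function names are hypothetical.

```python
from typing import Dict, List


def hamming_distance(a: str, b: str) -> int:
    """Count the bit positions at which two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def similar_samples(target: str, candidates: List[str],
                    codes: Dict[str, str], preset_distance: int) -> List[str]:
    """Keep the candidates whose distance to the target is strictly smaller
    than the preset distance, as required by step 204."""
    return [c for c in candidates
            if hamming_distance(codes[target], codes[c]) < preset_distance]


# Illustrative codes (assumption: unary encoding with C = 4, two dimensions).
codes = {"A": "10001000", "B": "11001000", "C": "10001100", "D": "11001100"}

second_of_a = similar_samples("A", ["B", "C", "D"], codes, preset_distance=2)
```

With these inputs, B and C are each at distance 1 from A and are kept, while D is at distance 2 and is rejected because the comparison is strict.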

Further, in the embodiment of the present invention, the sample attribute data may be sorted by position in advance; for example, the sorting positions are, in order, sample attribute data A, sample attribute data B, sample attribute data C, and sample attribute data D. The first sample attribute data corresponding to each sample attribute data may then be determined in turn according to the sorting positions, and the sample distances calculated. For example, the first sample attribute data corresponding to sample attribute data A is determined first, and the sample distance between sample attribute data A and its corresponding first sample attribute data is calculated; then the first sample attribute data corresponding to sample attribute data B is determined, and the sample distance between sample attribute data B and its corresponding first sample attribute data is calculated.

In a specific application scenario, in order to further reduce the calculation amount of the sample distance, before calculating the sample distance between the sample attribute data and the corresponding first sample attribute data, it needs to be determined whether the sorting position corresponding to the first sample attribute data is before the sorting position of the corresponding sample attribute data, and if the sorting position corresponding to the first sample attribute data is before the sorting position of the corresponding sample attribute data, it indicates that the sample distance between the first sample attribute data and the sample attribute data has been calculated, and the calculation does not need to be repeated; if the sorting position corresponding to the first sample attribute data is behind the sorting position of the sample attribute data corresponding thereto, the sample distance between the sample attribute data and the first sample attribute data corresponding thereto needs to be calculated.

For example, the training set includes sample attribute data A, sample attribute data B, sample attribute data C, and sample attribute data D; the first sample attribute data corresponding to sample attribute data A includes sample attribute data B and sample attribute data C, and the first sample attribute data corresponding to sample attribute data B includes sample attribute data A and sample attribute data D. If the previous sample distance calculation has already determined that the second sample attribute data similar to sample attribute data A is sample attribute data B, then, when the second sample attribute data similar to sample attribute data B is determined, the sample distance between sample attribute data B and its first sample attribute data A does not need to be calculated again. The previous calculation result can be called directly to determine that sample attribute data B is similar to the first sample attribute data A, and only the sample distance between sample attribute data B and its first sample attribute data D needs to be calculated to determine whether the two are similar. Therefore, the calculation amount of the sample distance can be further reduced, and the detection efficiency of the polluted sample attribute data is improved.
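One way to realize this reuse of earlier results is a cache keyed by the ordered pair of samples, as in the sketch below. It is an assumption-laden illustration rather than the patented procedure itself: the candidate lists for C and D, the 8-bit codes and all function names are hypothetical.

```python
from typing import Dict, List, Tuple


def hamming_distance(a: str, b: str) -> int:
    """Count the bit positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def similar_with_cache(order: List[str], candidates: Dict[str, List[str]],
                       codes: Dict[str, str],
                       preset_distance: int) -> Tuple[Dict[str, List[str]], int]:
    """Process samples in sorting order; each pair's distance is computed once.

    Returns the similar samples per sample and the number of distance
    computations actually performed.
    """
    position = {name: i for i, name in enumerate(order)}
    cache: Dict[Tuple[str, str], int] = {}
    computed = 0
    result: Dict[str, List[str]] = {}
    for name in order:
        similar = []
        for other in candidates[name]:
            # Key the pair by sorting position so (A, B) and (B, A) coincide.
            key = (name, other) if position[name] < position[other] else (other, name)
            if key not in cache:
                cache[key] = hamming_distance(codes[name], codes[other])
                computed += 1
            if cache[key] < preset_distance:
                similar.append(other)
        result[name] = similar
    return result, computed


order = ["A", "B", "C", "D"]
candidates = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}  # hypothetical
codes = {"A": "10001000", "B": "11001000", "C": "10001100", "D": "11001100"}

similar, n_computed = similar_with_cache(order, candidates, codes, preset_distance=2)
```

In this example six candidate checks are made but only three distances (A with B, A with C, B with D) are computed; the other three are served from the cache.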

205. Counting the sample data size corresponding to each platform user, and screening out the maximum sample data size from the sample data sizes.

For the embodiment of the present invention, whether each sample attribute data and its corresponding second sample attribute data are contaminated may be determined from the sample data size corresponding to the second sample attribute data similar to that sample attribute data. In determining the sample data size, the actual acquisition condition of the sample attribute data needs to be taken into account. For example, the acquired sample attribute data set includes 1000 sample attribute data of 200 persons, with 5 sample attribute data acquired from each person; it follows that the training set includes 200 platform users and that the sample data size corresponding to each platform user is 5, so the maximum sample data size can be determined to be 5. For another example, the acquired training set includes 200 sample attribute data of 50 persons, where 4 sample attribute data are acquired from each of 48 persons, only 1 sample attribute data is acquired from one person, and 7 sample attribute data are acquired from another person, so the maximum sample data size can be determined to be 7. The maximum sample data size can thus be determined in this manner in combination with the actual situation, so that whether the sample attribute data is polluted data can be judged according to the maximum sample data size.
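The counting in step 205 can be sketched with the second example above (50 users, 200 samples). The user identifiers and the function name `max_samples_per_user` are hypothetical; the text does not prescribe how users are identified.

```python
from collections import Counter
from typing import Iterable


def max_samples_per_user(user_ids: Iterable[str]) -> int:
    """Count samples per platform user and return the largest count."""
    counts = Counter(user_ids)
    return max(counts.values())


# One entry per sample: 48 users with 4 samples each, one user with a single
# sample, and one user with 7 samples (200 samples in total).
user_ids = [f"user{i}" for i in range(48) for _ in range(4)]
user_ids += ["user48"]
user_ids += ["user49"] * 7

max_size = max_samples_per_user(user_ids)
```

Here `max_size` is 7, the value used as the maximum sample data size in the subsequent judgment.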

206. Judging whether each sample attribute data is pollution sample data according to the maximum sample data size and the sample data size corresponding to the second sample attribute data.

The pollution sample data is sample attribute data that an attacker has polluted by means such as copying, subtle modification, or compounding. For the embodiment of the present invention, in order to determine whether each sample attribute data is pollution sample data, step 206 specifically includes: subtracting the maximum sample data size from the sample data size corresponding to the second sample attribute data to obtain the sample number difference corresponding to each sample attribute data; and if the sample number difference corresponding to target sample attribute data among the sample attribute data is larger than a preset sample number difference, determining that the target sample attribute data and its corresponding second sample attribute data are pollution sample data. The target sample attribute data is any sample attribute data in the training set, and the preset sample number difference can be set according to actual service requirements.

For example, suppose the maximum sample data size is 100, the preset sample number difference is 50, the sample data size corresponding to the second sample attribute data similar to sample attribute data A is 200, and the sample data size corresponding to the second sample attribute data similar to sample attribute data B is 110. The sample number difference corresponding to sample attribute data A is then 200 - 100 = 100; since this difference is greater than the preset sample number difference 50, sample attribute data A and its corresponding second sample attribute data are determined to be pollution sample data, so as to ensure the safety of the training set. Similarly, the sample number difference corresponding to sample attribute data B is 110 - 100 = 10; since this difference is smaller than the preset sample number difference 50, sample attribute data B and its corresponding second sample attribute data are determined not to be pollution sample data. In this way, whether each sample attribute data in the training set of the abnormal behavior detection model is pollution data can be judged in turn.
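The judgment rule of step 206 reduces to a single comparison, sketched below with the numbers of the example above. The function and parameter names are hypothetical.

```python
def is_polluted(similar_count: int, max_per_user: int, preset_difference: int) -> bool:
    """Judge a sample as polluted when the number of samples similar to it
    exceeds the maximum per-user sample count by more than the preset
    sample number difference."""
    sample_number_difference = similar_count - max_per_user
    return sample_number_difference > preset_difference


# Example values: maximum sample data size 100, preset difference 50.
a_polluted = is_polluted(similar_count=200, max_per_user=100, preset_difference=50)
b_polluted = is_polluted(similar_count=110, max_per_user=100, preset_difference=50)
```

Sample attribute data A (difference 100) is judged polluted and sample attribute data B (difference 10) is not. The intuition is that a single user cannot legitimately contribute more near-identical samples than the largest per-user sample count, so a much larger cluster of similar samples suggests copied or slightly altered data.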

Compared with the current mode of detecting whether the polluted sample data exists in the sample data by sample duplicate checking, the method for detecting the polluted sample data for model training provided by the embodiment of the invention can obtain the sample attribute data corresponding to each platform user to be detected, wherein the sample attribute data at least comprises the equipment attribute data, the wind control data and the service data corresponding to each platform user; respectively hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality sensitive hash algorithm, wherein any one hash table comprises a plurality of hash buckets; meanwhile, determining data in the same hash bucket as the sample attribute data in the different hash tables as first sample attribute data; screening out second sample attribute data similar to the sample attribute data from the first sample attribute data; and finally, respectively judging whether each sample attribute data is the polluted sample data or not based on the sample data size corresponding to the second sample attribute data, so that the detection precision of the polluted sample data can be improved, the safety of the sample attribute data is ensured, and the detection precision of the abnormal behavior detection model can be improved.

Further, as a specific implementation of fig. 1, an embodiment of the present invention provides an apparatus for detecting contaminated sample data for model training, as shown in fig. 4, the apparatus includes: an obtaining unit 31, a hash unit 32, a determining unit 33, a screening unit 34, and a judging unit 35.

The obtaining unit 31 may be configured to obtain sample attribute data corresponding to each platform user to be detected, where the sample attribute data at least includes device attribute data, wind control data, and service data corresponding to each platform user, the device attribute data includes a device identifier and an application identifier, the wind control data includes request information and personal information of the platform user, and the service data includes order information and order return information of the platform user.

The hash unit 32 may be configured to hash each sample attribute data into corresponding hash buckets in different hash tables respectively by using a preset locality-sensitive hash algorithm, where any one hash table includes a plurality of hash buckets.

The determining unit 33 may be configured to determine, as the first sample attribute data, data in the same hash bucket as the respective sample attribute data in the different hash tables.

The screening unit 34 may be configured to screen out second sample attribute data similar to the respective sample attribute data from the first sample attribute data.

The judging unit 35 may be configured to judge whether each sample attribute data is contaminated sample data based on the sample data size corresponding to the second sample attribute data.

In a specific application scenario, in order to hash the sample attribute data into corresponding hash buckets in different hash tables, as shown in fig. 5, the hash unit 32 includes: a first calculating module 321 and a hash module 322.

The first calculating module 321 may be configured to calculate hash values of the sample attribute data in the different hash tables respectively by using a preset locality sensitive hash algorithm.

The hash module 322 may be configured to hash the sample attribute data into corresponding hash buckets in the different hash tables based on the hash values.

Further, in order to calculate the hash values of each sample attribute data in the different hash tables, the first calculating module 321 includes: a determining submodule and an extracting submodule.

The determining submodule may be configured to determine a data dimension and a coordinate value corresponding to each sample attribute data.

The determining submodule may be further configured to determine, based on the data dimension and the coordinate values, the Hamming code corresponding to each sample attribute data.

The extracting sub-module may be configured to extract, by using hash functions corresponding to the different hash tables, codes at corresponding positions in the hamming codes, and determine the extracted codes as hash values of the sample attribute data in the different hash tables.

Further, in order to screen out second sample attribute data similar to the respective sample attribute data from the first sample attribute data, the screening unit 34 includes: a second calculating module 341 and a determining module 342.

The second calculating module 341 may be configured to calculate sample distances between the respective sample attribute data and the corresponding first sample attribute data.

The determining module 342 may be configured to determine the first sample attribute data with the sample distance smaller than the preset distance as the second sample attribute data similar to the respective sample attribute data.

In a specific application scenario, the sample distance is a hamming distance, and the second calculating module 341 includes: a comparison submodule and a determination submodule.

The comparison submodule may be configured to compare the Hamming code corresponding to each sample attribute data with the Hamming code corresponding to the first sample attribute data, and determine the number of bit positions at which the two codes differ.

The determining submodule may be configured to determine the number of bits as a hamming distance between each sample attribute data and the corresponding first sample attribute data.

In a specific application scenario, in order to judge whether each sample attribute data is contaminated sample data, the judging unit 35 includes: a statistics module 351 and a judging module 352.

The statistics module 351 may be configured to count the sample data size corresponding to each platform user, and screen out the maximum sample data size from each sample data size.

The judging module 352 may be configured to judge whether each sample attribute data is contaminated sample data according to the maximum sample data size and the sample data size corresponding to the second sample attribute data.

Further, the judging module 352 includes: a subtraction submodule and a judging submodule.

The subtraction submodule may be configured to subtract the maximum sample data size from the sample data size corresponding to the second sample attribute data, so as to obtain a sample number difference corresponding to each sample attribute data.

The judging submodule may be configured to determine that the target sample attribute data and the second sample attribute data corresponding to the target sample attribute data are contaminated sample data if the sample number difference corresponding to the target sample attribute data in the sample attribute data is greater than a preset sample number difference.

It should be noted that other corresponding descriptions of the functional modules involved in the detection apparatus for contaminated sample data for model training according to the embodiment of the present invention may refer to the corresponding description of the method shown in fig. 1, and are not described herein again.

Based on the method shown in fig. 1, correspondingly, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following steps: acquiring sample attribute data corresponding to each platform user to be detected, wherein the sample attribute data at least comprises equipment attribute data, wind control data and service data corresponding to each platform user, the equipment attribute data comprises equipment identification and application program identification, the wind control data comprises request information and personal information of the platform user, and the service data comprises order information and order return information of the platform user; respectively hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality sensitive hash algorithm, wherein any one hash table comprises a plurality of hash buckets; determining data in the same hash bucket as the sample attribute data in the different hash tables as first sample attribute data; screening out second sample attribute data similar to the sample attribute data from the first sample attribute data; and respectively judging whether each sample attribute data is pollution sample data or not based on the sample data size corresponding to the second sample attribute data.

Based on the above embodiments of the method shown in fig. 1 and the apparatus shown in fig. 4, an embodiment of the present invention further provides an entity structure diagram of a computer device, as shown in fig. 6, where the computer device includes: a processor 41, a memory 42, and a computer program stored on the memory 42 and executable on the processor, wherein the memory 42 and the processor 41 are both arranged on a bus 43 such that when the processor 41 executes the program, the following steps are performed: acquiring sample attribute data corresponding to each platform user to be detected, wherein the sample attribute data at least comprises equipment attribute data, wind control data and service data corresponding to each platform user, the equipment attribute data comprises equipment identification and application program identification, the wind control data comprises request information and personal information of the platform user, and the service data comprises order information and order return information of the platform user; respectively hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality sensitive hash algorithm, wherein any one hash table comprises a plurality of hash buckets; determining data in the same hash bucket as the sample attribute data in the different hash tables as first sample attribute data; screening out second sample attribute data similar to the sample attribute data from the first sample attribute data; and respectively judging whether each sample attribute data is pollution sample data or not based on the sample data size corresponding to the second sample attribute data.

According to the technical scheme, sample attribute data corresponding to each platform user to be detected is obtained, wherein the sample attribute data at least comprises equipment attribute data, wind control data and service data corresponding to each platform user; respectively hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality sensitive hash algorithm, wherein any one hash table comprises a plurality of hash buckets; meanwhile, determining data in the same hash bucket as the sample attribute data in the different hash tables as first sample attribute data; screening out second sample attribute data similar to the sample attribute data from the first sample attribute data; and finally, respectively judging whether each sample attribute data is the polluted sample data or not based on the sample data size corresponding to the second sample attribute data, so that the detection precision of the polluted sample data can be improved, the safety of the sample attribute data is ensured, and the detection precision of the abnormal behavior detection model can be improved.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for detecting polluted sample data used for model training is characterized by comprising the following steps:

2. The method according to claim 1, wherein the hashing each sample attribute data into a corresponding hash bucket in different hash tables respectively by using a predetermined locality sensitive hashing algorithm comprises:

respectively calculating hash values of the sample attribute data in the different hash tables by using a preset locality sensitive hash algorithm;

and hashing the sample attribute data into corresponding hash buckets in different hash tables based on the hash values.

3. The method according to claim 2, wherein the separately calculating the hash value of each sample attribute data in the different hash tables by using a predetermined locality-sensitive hash algorithm comprises:

determining data dimensions and coordinate values corresponding to the sample attribute data;

determining Hamming codes corresponding to the sample attribute data based on the data dimension and the coordinate values;

and extracting codes at corresponding positions in the Hamming codes by using hash functions corresponding to the different hash tables, and determining the extracted codes as hash values of the sample attribute data in the different hash tables.

4. The method of claim 3, wherein the screening out second sample attribute data from the first sample attribute data that is similar to the respective sample attribute data comprises:

respectively calculating the sample distance between each sample attribute data and the corresponding first sample attribute data;

and determining the first sample attribute data with the sample distance smaller than the preset distance as second sample attribute data similar to each sample attribute data.

5. The method according to claim 4, wherein the sample distance is a Hamming distance, and the calculating the sample distance between each sample attribute data and the corresponding first sample attribute data comprises:

comparing the Hamming code corresponding to each sample attribute data with the Hamming code corresponding to the first sample attribute data respectively, and determining the number of bit positions at which the two codes differ;

determining the number of bits as a hamming distance between the respective sample attribute data and its corresponding first sample attribute data.

6. The method according to any one of claims 1 to 5, wherein the determining whether each sample attribute data is a contaminated sample data based on the sample data size corresponding to the second sample attribute data comprises:

counting sample data size corresponding to each platform user, and screening out the maximum sample data size from each sample data size;

and judging whether each sample attribute data is pollution sample data or not according to the maximum sample data size and the sample data size corresponding to the second sample attribute data.

7. The method according to claim 6, wherein said judging whether each sample attribute data is pollution sample data according to the maximum sample data size and the sample data size corresponding to the second sample attribute data comprises:

subtracting the maximum sample data size from the sample data size corresponding to the second sample attribute data to obtain a sample size difference corresponding to each sample attribute data;

and if the sample number difference corresponding to the target sample attribute data in the sample attribute data is larger than a preset sample number difference, determining that the target sample attribute data and the second sample attribute data corresponding to the target sample attribute data are pollution sample data.

8. An apparatus for detecting contaminated sample data for model training, comprising:

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 7 when executed by the processor.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.