WO2023035362A1 - Polluted sample data detecting method and apparatus for model training - Google Patents

Polluted sample data detecting method and apparatus for model training

Info

Publication number
WO2023035362A1
Authority
WO
WIPO (PCT)
Prior art keywords
attribute data
sample
sample attribute
data
hash
Prior art date
Application number
PCT/CN2021/124044
Other languages
French (fr)
Chinese (zh)
Inventor
刘胜
魏国富
夏玉明
周晓勇
马影
殷钱安
梁淑云
余贤喆
陶景龙
王启凡
徐�明
Original Assignee
上海观安信息技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海观安信息技术股份有限公司
Publication of WO2023035362A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries

Definitions

  • the invention relates to the field of information technology, in particular to a method and device for detecting polluted sample data used for model training.
  • the invention provides a method and device for detecting polluted sample data for model training, which mainly aims to improve the detection accuracy of polluted sample data, ensure the safety of sample data, and thereby improve the detection accuracy of abnormal behavior detection models.
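  • a method for detecting contaminated sample data for model training, including:
  • acquiring sample attribute data corresponding to each platform user to be detected, the sample attribute data at least including device attribute data, risk control data, and business data corresponding to each platform user;
  • hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hash algorithm, wherein any hash table includes multiple hash buckets;
  • determining the data located in the same hash bucket as the respective sample attribute data in the different hash tables as first sample attribute data;
  • screening out, from the first sample attribute data, second sample attribute data similar to the respective sample attribute data;
  • judging, based on the amount of sample data corresponding to the second sample attribute data, whether each of the sample attribute data is contaminated sample data.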
  • a detection device for contaminated sample data for model training, comprising:
  • an acquisition unit configured to acquire sample attribute data corresponding to each platform user to be detected, the sample attribute data at least including device attribute data, risk control data, and business data corresponding to each platform user;
  • a hash unit configured to hash each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hash algorithm, wherein any hash table includes multiple hash buckets;
  • a determining unit configured to determine the data located in the same hash bucket as the respective sample attribute data in the different hash tables as the first sample attribute data;
  • a screening unit configured to screen out, from the first sample attribute data, second sample attribute data similar to the respective sample attribute data;
  • a judging unit configured to judge, based on the amount of sample data corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data.
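  • a computer-readable storage medium on which a computer program is stored, wherein the following steps are implemented when the program is executed by a processor:
  • acquiring sample attribute data corresponding to each platform user to be detected, the sample attribute data at least including device attribute data, risk control data, and business data corresponding to each platform user;
  • hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hash algorithm, wherein any hash table includes multiple hash buckets;
  • determining the data located in the same hash bucket as the respective sample attribute data in the different hash tables as first sample attribute data, and screening out, from the first sample attribute data, second sample attribute data similar to the respective sample attribute data;
  • judging, based on the amount of sample data corresponding to the second sample attribute data, whether each of the sample attribute data is contaminated sample data.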
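  • a computer device including a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein the processor implements the following steps when executing the program:
  • acquiring sample attribute data corresponding to each platform user to be detected, the sample attribute data at least including device attribute data, risk control data, and business data corresponding to each platform user;
  • hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hash algorithm, wherein any hash table includes multiple hash buckets;
  • determining the data located in the same hash bucket as the respective sample attribute data in the different hash tables as first sample attribute data, and screening out, from the first sample attribute data, second sample attribute data similar to the respective sample attribute data;
  • judging, based on the amount of sample data corresponding to the second sample attribute data, whether each of the sample attribute data is contaminated sample data.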
  • the invention provides a method and device for detecting contaminated sample data for model training. Compared with the current approach of detecting contaminated sample data by checking for duplicate samples, the present invention acquires the sample attribute data corresponding to each platform user to be detected, the sample attribute data at least including the device attribute data, risk control data, and business data corresponding to each platform user;
  • using a preset locality-sensitive hash algorithm, each sample attribute data is hashed into corresponding hash buckets in different hash tables, wherein any hash table includes multiple hash buckets;
  • the data located in the same hash bucket as the respective sample attribute data in the different hash tables is determined as the first sample attribute data, and the second sample attribute data similar to the respective sample attribute data is screened out from the first sample attribute data; finally, based on the amount of sample data corresponding to the second sample attribute data, it is judged whether each of the sample attribute data is polluted sample data, thereby improving the detection accuracy of polluted sample data, ensuring the security of the sample attribute data, and improving the detection accuracy of the abnormal behavior detection model.
  • FIG. 1 shows a flow chart of a method for detecting contaminated sample data used for model training provided by an embodiment of the present invention;
  • FIG. 2 shows a flow chart of another method for detecting contaminated sample data used for model training provided by an embodiment of the present invention;
  • FIG. 3 shows a schematic diagram of a hash table provided by an embodiment of the present invention;
  • FIG. 4 shows a schematic structural diagram of a detection device for contaminated sample data used for model training provided by an embodiment of the present invention;
  • FIG. 5 shows a schematic structural diagram of another detection device for contaminated sample data used for model training provided by an embodiment of the present invention;
  • FIG. 6 shows a schematic diagram of a physical structure of a computer device provided by an embodiment of the present invention.
  • an embodiment of the present invention provides a method for detecting contaminated sample data for model training, as shown in FIG. 1 , the method includes:
  • the sample attribute data includes at least the device attribute data, risk control data, and business data corresponding to the users of each platform.
  • when the user operates the e-commerce app, business personnel instrument the important checkpoints in the operation process with tracking points. After the user triggers a preset tracking point, the corresponding device attribute information is generated.
  • the device attribute information specifically includes: device ID, device model, APP name, APP version number, etc.
  • the risk control data includes all request information and personal information of the user.
  • the personal information includes the user name, mobile phone number, etc.
  • the business data includes information such as all orders, refunds, and order details of the user.
  • the embodiment of the present invention hashes the sample attribute data of each platform user into corresponding hash buckets in different hash tables, screens out, from the first sample attribute data located in the same hash bucket as each sample attribute data, the second sample attribute data similar to it, and judges whether each sample attribute data is polluted sample data based on the amount of sample data corresponding to the second sample attribute data. Compared with the simple duplicate checking method in the prior art, the embodiment of the present invention detects polluted sample data with higher accuracy, thereby improving the detection accuracy of the abnormal behavior detection model.
  • the embodiment of the present invention is mainly applied to the scenario of performing pollution detection on sample attribute data.
  • the executor of the embodiment of the present invention is an apparatus or device capable of performing pollution detection on sample attribute data, which can be deployed on the client side or the server side.
  • the device attribute data, risk control data, and business data of a large number of platform users are collected in advance, and the collected data is used as the users' sample attribute data. Since the sample attribute data may have been maliciously polluted, the polluted data in the sample attribute data needs to be detected in order to ensure the detection accuracy of the subsequent abnormal behavior detection model. For example, sample attribute data is collected from 200 platform users: whenever a platform user operates on the e-commerce platform, that user's sample attribute data is collected. If 5 pieces of sample attribute data are collected for each user, a total of 1,000 pieces are collected, and these 1,000 pieces of data are used as the training data for the abnormal behavior detection model.
  • any hash table includes multiple hash buckets, and the specific number of hash buckets can be set according to actual business requirements.
  • in the embodiment of the present invention, performing pollution detection on sample attribute data requires counting similar sample data through sample matching. Since the training set contains a large number of sample attribute data, calculating the distance between every pair of sample attribute data would require a very large amount of computation. To reduce the amount of calculation in the matching process, the embodiment of the present invention uses a preset locality-sensitive hash algorithm to hash each sample attribute data into corresponding hash buckets in different hash tables.
  • since the sample attribute data in the same hash bucket are similar with high probability, the second sample attribute data similar to a given sample attribute data can be determined from the first sample attribute data located in the same hash bucket as it, which greatly narrows the scope of data matching within massive data and reduces the amount of calculation on the sample attribute data.
  • d(x, y) is the distance between any two sample attribute data x and y in the high-dimensional space;
  • h is the hash function;
  • h(x) and h(y) are the hash transformations of the sample attribute data x and y;
  • c is a constant satisfying c > 1, and P1 > P2 must hold. The probability values P1 and P2 are preset according to the required accuracy with which similar samples are found: the higher the required accuracy, the larger P1 and the smaller P2; conversely, the lower the required accuracy, the smaller P1 and the larger P2, while P1 > P2 must always be guaranteed.
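  • The formula these symbols belong to is not present in this text. For reference, the standard definition of a locality-sensitive hash family can be written with the same symbols as sketched below. The distance threshold R does not appear in the surviving text and is an assumption made here for illustration.

```latex
\[
\begin{aligned}
d(x, y) \le R \;&\Longrightarrow\; \Pr\bigl[h(x) = h(y)\bigr] \ge P_1,\\
d(x, y) \ge cR \;&\Longrightarrow\; \Pr\bigl[h(x) = h(y)\bigr] \le P_2,
\end{aligned}
\qquad c > 1,\quad P_1 > P_2 .
\]
```

  • Read this way, close samples collide in a bucket with probability at least P1 and distant samples collide with probability at most P2, which is why a larger gap between P1 and P2 makes the bucket contents a more reliable set of candidates.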
  • the number of hash tables and the number of hash functions corresponding to each hash table need to be determined according to the required accuracy of searching for similar sample attribute data. It should be noted that the more hash tables there are, and the more hash functions each hash table has, the higher the accuracy of finding similar sample attribute data; however, the amount of calculation on the sample attribute data also increases with the number of hash tables and hash functions. Therefore, the search accuracy and the amount of calculation are weighed together to determine the number of hash tables in the embodiment of the present invention and the number of hash functions corresponding to each hash table.
  • each hash table corresponds to two hash functions:
  • hash table 1 corresponds to hash functions h1 and h2;
  • hash table 2 corresponds to hash functions h3 and h4;
  • hash table 3 corresponds to hash functions h5 and h6.
  • each sample attribute data is hashed into the corresponding hash bucket in each hash table, wherein each hash table corresponds to multiple hash buckets, each corresponding to a different hash value.
  • each of hash table 1, hash table 2, and hash table 3 includes 4 hash buckets. In all three hash tables, the hash value corresponding to the first hash bucket is 00, to the second hash bucket 01, to the third hash bucket 10, and to the fourth hash bucket 11. If the hash functions h1 and h2 corresponding to hash table 1 yield the hash value 00 for sample attribute data A, then sample attribute data A is hashed into the first hash bucket of hash table 1.
  • in this way, each sample attribute data can be hashed into corresponding hash buckets in the different hash tables.
  • step 102 has already hashed each sample attribute data into corresponding hash buckets in different hash tables. Since the sample attribute data in the same hash bucket are similar with high probability, the first sample attribute data located in the same hash bucket as each sample attribute data in the different hash tables can be determined, and the second sample attribute data similar to each sample attribute data can then be screened from the first sample attribute data corresponding to it.
  • for example, sample attribute data A is in the first hash bucket of hash table 1, the third hash bucket of hash table 2, and the fourth hash bucket of hash table 3. Therefore, for sample attribute data A, all sample attribute data in the first hash bucket of hash table 1, in the third hash bucket of hash table 2, and in the fourth hash bucket of hash table 3 can be obtained, and the obtained sample attribute data is used as the first sample attribute data corresponding to sample attribute data A, so that the second sample attribute data similar to sample attribute data A can be screened from it.
  • for another example, sample attribute data B is in the third hash bucket of hash table 1, the fourth hash bucket of hash table 2, and the first hash bucket of hash table 3. Therefore, for sample attribute data B, all sample attribute data in the third hash bucket of hash table 1, in the fourth hash bucket of hash table 2, and in the first hash bucket of hash table 3 can be obtained and used as the first sample attribute data corresponding to sample attribute data B, so that the second sample attribute data similar to sample attribute data B can be screened from it.
  • in this way, the first sample attribute data located in the same hash bucket as each sample attribute data can be obtained, thereby greatly reducing the matching range of the sample attribute data.
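  • The candidate lookup described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `build_tables` and `first_sample_data` and the bucket assignments for samples C, D, and E are hypothetical; only the bucket positions of A and B follow the example in the text.

```python
from collections import defaultdict

def build_tables(bucket_keys):
    """Build one bucket index per hash table.

    bucket_keys maps a sample id to its list of bucket hash values,
    one value per hash table (hypothetical input format).
    """
    num_tables = len(next(iter(bucket_keys.values())))
    tables = [defaultdict(set) for _ in range(num_tables)]
    for sample_id, keys in bucket_keys.items():
        for table, key in zip(tables, keys):
            table[key].add(sample_id)
    return tables

def first_sample_data(sample_id, bucket_keys, tables):
    """Return every other sample sharing a bucket with sample_id in any table."""
    candidates = set()
    for table, key in zip(tables, bucket_keys[sample_id]):
        candidates |= table[key]
    candidates.discard(sample_id)  # a sample is not its own candidate
    return candidates

# A: buckets 1, 3, 4 (00, 10, 11); B: buckets 3, 4, 1 (10, 11, 00), as in the text.
# C, D, E are made-up samples added only to give the lookup something to find.
bucket_keys = {
    "A": ["00", "10", "11"],
    "B": ["10", "11", "00"],
    "C": ["00", "01", "01"],  # shares the table 1 bucket with A
    "D": ["01", "11", "10"],  # shares the table 2 bucket with B
    "E": ["11", "00", "10"],  # shares no bucket with A or B
}
tables = build_tables(bucket_keys)
print(first_sample_data("A", bucket_keys, tables))  # {'C'}
print(first_sample_data("B", bucket_keys, tables))  # {'D'}
```

  • Taking the union over all tables means a pair only needs to collide once to become a candidate, which is what lets additional tables raise the chance of finding truly similar samples.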
  • the preset locality-sensitive hash algorithm can only ensure that the sample attribute data in the same hash bucket are similar with high probability; it cannot guarantee that they are actually similar. Therefore, to further improve the detection accuracy of contaminated sample data, the second sample attribute data that is truly similar to each sample attribute data can be screened from the first sample attribute data corresponding to it.
  • the sample distance between each sample attribute data and its corresponding first sample attribute data can be calculated, and the second sample attribute data similar to the sample attribute data is screened from the first sample attribute data according to the sample distance.
  • the sample distance can be calculated as a Euclidean distance, cosine distance, Hamming distance, or by other methods, which are not specifically limited in the embodiment of the present invention. The larger the sample distance, the smaller the similarity between sample attribute data; conversely, the smaller the sample distance, the greater the similarity. Therefore, the first sample attribute data whose sample distance is smaller than the preset sample distance can be determined as the second sample attribute data similar to each sample attribute data.
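  • As a rough sketch of distance-based screening, the snippet below implements two of the distances named above and keeps the candidates whose distance is below a preset value. The function names, the numeric vectors, and the threshold are all assumptions for illustration; the text does not fix a particular distance or threshold.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

def screen_similar(target, candidates, distance, preset_distance):
    """Keep candidate ids whose distance to target is below the preset distance."""
    return [cid for cid, vec in candidates.items()
            if distance(target, vec) < preset_distance]

# Hypothetical feature vectors for sample A and two candidates from its buckets.
a = (1.0, 1.0)
candidates = {"B": (1.0, 1.5), "C": (4.0, 5.0)}
print(screen_similar(a, candidates, euclidean_distance, 1.0))  # ['B']
```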
  • for example, the sample attribute data located in the same hash bucket as sample attribute data A in the different hash tables includes sample attribute data B, sample attribute data C, and sample attribute data D; that is, the first sample attribute data corresponding to sample attribute data A includes sample attribute data B, sample attribute data C, and sample attribute data D. The sample distances between sample attribute data A and each of sample attribute data B, sample attribute data C, and sample attribute data D are calculated. By comparison, the sample distance between sample attribute data A and sample attribute data B is less than the preset distance, and the sample distance between sample attribute data A and sample attribute data C is less than the preset distance, so it can be determined that sample attribute data B and sample attribute data C are similar to sample attribute data A; that is, the second sample attribute data similar to sample attribute data A are sample attribute data B and sample attribute data C.
  • for another example, the sample attribute data located in the same hash bucket as sample attribute data B in the different hash tables includes sample attribute data A, sample attribute data E, and sample attribute data F; that is, the first sample attribute data corresponding to sample attribute data B includes sample attribute data A, sample attribute data E, and sample attribute data F.
  • since the sorting position of sample attribute data A is before that of sample attribute data B, the sample distance between sample attribute data A and sample attribute data B was already calculated while processing sample attribute data A, and sample attribute data A was determined to be similar to sample attribute data B. Therefore, for sample attribute data B, only the sample distances between sample attribute data B and sample attribute data E and between sample attribute data B and sample attribute data F need to be calculated. By comparison, the sample distance between sample attribute data B and sample attribute data E is less than the preset distance, so it can be determined that sample attribute data A and sample attribute data E are similar to sample attribute data B; that is, the second sample attribute data similar to sample attribute data B are sample attribute data A and sample attribute data E.
  • the second sample attribute data similar to the respective sample attribute data can be determined respectively.
  • it is judged whether the amount of sample data corresponding to the second sample attribute data is within the normal range. If the amount does not exceed the normal range, the sample attribute data is not polluted; if the amount exceeds the normal range, the sample attribute data has been polluted, and these data are excluded from the training set.
  • for example, the training set consists of 1,000 sample attribute data collected from 200 platform users, 5 per platform user. Under normal circumstances, the 5 sample attribute data of the same platform user should be similar. If the amount of sample data corresponding to the second sample attribute data similar to sample attribute data A is 50, far exceeding the normal range of 5, it can be seen that the second sample attribute data similar to sample attribute data A has been polluted: an attacker has likely created data similar to sample attribute data A by copying, subtle transformation, or similar means, and added it to the training set. To avoid contamination of the training data and ensure the detection accuracy of the abnormal behavior detection model, sample attribute data A and the second sample attribute data similar to it need to be excluded from the training set. Therefore, by counting the amount of second sample attribute data corresponding to each sample attribute data in the above manner, it can be determined whether the sample attribute data and its corresponding second sample attribute data have been polluted.
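  • The counting rule in this example can be sketched as below. The function name `flag_polluted`, the input layout, and the sample ids are hypothetical; the threshold of 5 and the count of 50 are taken from the example above.

```python
def flag_polluted(similar, max_normal):
    """Return the ids judged polluted.

    similar maps each sample id to the set of ids of its second sample
    attribute data (the samples found to be truly similar to it).
    A sample whose similar set is larger than max_normal is flagged
    together with everything in that set.
    """
    polluted = set()
    for sample_id, neighbours in similar.items():
        if len(neighbours) > max_normal:
            polluted.add(sample_id)
            polluted |= neighbours
    return polluted

# Sample A has 50 similar samples (as in the text); sample U has a normal 4.
similar = {
    "A": {f"A_copy_{i}" for i in range(50)},
    "U": {"U1", "U2", "U3", "U4"},
}
polluted = flag_polluted(similar, max_normal=5)
print(len(polluted))      # 51: A plus its 50 near-copies
print("U" in polluted)    # False
```

  • Everything returned by `flag_polluted` would then be dropped from the training set before the abnormal behavior detection model is trained.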
  • the embodiment of the present invention provides a method for detecting contaminated sample data used for model training. Compared with the current approach of detecting contaminated sample data by checking for duplicate samples, the present invention acquires the sample attribute data corresponding to each platform user to be detected, the sample attribute data at least including the device attribute data, risk control data, and business data corresponding to each platform user;
  • using a preset locality-sensitive hash algorithm, each sample attribute data is hashed into corresponding hash buckets in different hash tables, wherein any hash table includes multiple hash buckets;
  • the data located in the same hash bucket as the respective sample attribute data in the different hash tables is determined as the first sample attribute data, and the second sample attribute data similar to the respective sample attribute data is screened out from the first sample attribute data; finally, based on the amount of sample data corresponding to the second sample attribute data, it is judged whether each of the sample attribute data is polluted sample data, thereby improving the detection accuracy of polluted sample data, ensuring the security of the sample attribute data, and improving the detection accuracy of the abnormal behavior detection model.
  • the embodiment of the present invention provides another method for detecting contaminated sample data for model training; as shown in FIG. 2, the method includes:
  • as in step 101, in order to ensure the safety of the training set, it is necessary to obtain each sample attribute data in the training set and perform pollution detection on it to determine whether it is contaminated data.
  • the specific process for obtaining the sample attribute data is exactly the same as step 101 and is not repeated here.
  • any hash table includes multiple hash buckets, and the specific number of hash buckets can be set according to actual business requirements.
  • step 202 specifically includes: using a preset locality-sensitive hash algorithm to calculate the hash values of the respective sample attribute data in the different hash tables; and, based on the hash values, hashing the respective sample attribute data into corresponding hash buckets in the different hash tables.
  • using the preset locality-sensitive hash algorithm to calculate the hash values of the respective sample attribute data in the different hash tables includes: determining the data dimension and coordinate values corresponding to the respective sample attribute data; based on the data dimension and the coordinate values, determining the Hamming code corresponding to each sample attribute data; and using the hash functions corresponding to the different hash tables to extract the codes at the corresponding positions in the Hamming code, and determining the extracted codes as the hash values of each sample attribute data in the different hash tables.
  • the preset locality-sensitive hash algorithm adopted in the embodiment of the present invention is mainly a locality-sensitive hash algorithm under Hamming distance.
  • the data dimension corresponding to each sample attribute data and the coordinate values at its different positions are determined.
  • the maximum coordinate value corresponding to each sample attribute data is determined and multiplied by the data dimension to obtain the number of Hamming code bits corresponding to each sample attribute data, and each sample attribute data is Hamming encoded based on that number of bits.
  • the codes at the corresponding positions in the Hamming code of each sample attribute data are extracted, and the extracted codes are determined as the hash values corresponding to each sample attribute data.
  • the specific formula of Hamming coding is as follows:
  • $v(p) = \mathrm{Unary}_C(x_1)\,\mathrm{Unary}_C(x_2)\cdots\mathrm{Unary}_C(x_n)$
  • v(p) represents the Hamming code corresponding to each sample attribute data;
  • x_1, x_2, ..., x_n are the coordinate values corresponding to each sample attribute data;
  • n is the data dimension corresponding to the sample attribute data;
  • Unary_C(x) is a binary code string of length C in which the first x bits are 1 and the remaining bits are 0;
  • C is the maximum coordinate value of the sample attribute data. After the code corresponding to each coordinate value in the sample attribute data is determined, the codes are spliced to obtain the Hamming code v(p) corresponding to the sample attribute data.
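  • A minimal sketch of this encoding is given below, assuming the coordinate values are non-negative integers no larger than C and that the same C is used for every sample so the codes are comparable. The function names `unary` and `hamming_code` are hypothetical.

```python
def unary(x, c):
    """Unary_C(x): x ones followed by (c - x) zeros, as a string of length c."""
    if not 0 <= x <= c:
        raise ValueError("coordinate must lie in [0, C]")
    return "1" * x + "0" * (c - x)

def hamming_code(point, c):
    """v(p): the unary codes of all coordinates spliced together (length C * n)."""
    return "".join(unary(x, c) for x in point)

# A 3-dimensional sample with maximum coordinate value C = 3 gives 3 * 3 = 9 bits.
code = hamming_code((2, 0, 3), c=3)
print(code)       # 110000111
print(len(code))  # 9
```

  • With this encoding, the number of differing bits between two codes equals the sum of the absolute coordinate differences, which is why a Hamming-distance hash can be applied to the result.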
  • each hash table includes 4 hash buckets, and the hash values corresponding to the hash buckets are 00, 01, 10, and 11 respectively.
  • the sample attribute data A is hashed into the first hash bucket of the first hash table, the third hash bucket of the second hash table, and the first hash bucket of the third hash table. Similarly, the other sample attribute data can be hashed into the corresponding hash buckets of the different hash tables, as shown in FIG. 3.
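  • The bucket assignment can be sketched with bit sampling, where each hash function extracts one bit of the Hamming code and two functions per table yield a 2-bit bucket value (00, 01, 10, or 11). The bit positions below are fixed by hand purely so the output is reproducible; in practice they would be chosen by the hash functions of each table. The codes reuse the values given for A, B, and C later in the text, and the function names are hypothetical.

```python
from collections import defaultdict

def bucket_key(code, positions):
    """Concatenate the bits of `code` at the sampled positions."""
    return "".join(code[i] for i in positions)

def hash_into_tables(codes, table_positions):
    """Return one {bucket value: [sample ids]} mapping per hash table."""
    tables = [defaultdict(list) for _ in table_positions]
    for sample_id, code in codes.items():
        for table, positions in zip(tables, table_positions):
            table[bucket_key(code, positions)].append(sample_id)
    return tables

codes = {"A": "110011", "B": "111011", "C": "000001"}
# Hypothetical bit positions: two hash functions (bits) per table, three tables.
table_positions = [(0, 1), (2, 3), (4, 5)]
tables = hash_into_tables(codes, table_positions)
for number, table in enumerate(tables, start=1):
    print(number, dict(table))
# 1 {'11': ['A', 'B'], '00': ['C']}
# 2 {'00': ['A', 'C'], '10': ['B']}
# 3 {'11': ['A', 'B'], '01': ['C']}
```

  • Under these assumed positions both B and C share at least one bucket with A, so both become first sample attribute data for A even though only one of them is close to A.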
  • the method of extracting the first sample attribute data in the same hash bucket as the sample attribute data in different hash tables is completely the same as that of step 103, and will not be repeated here.
  • step 204 specifically includes: calculating the sample distance between each sample attribute data and its corresponding first sample attribute data; and determining the first sample attribute data whose sample distance is smaller than the preset distance as the second sample attribute data similar to the respective sample attribute data.
  • the sample distance may specifically be a Hamming distance.
  • the method includes: comparing the Hamming code corresponding to each sample attribute data with the Hamming code corresponding to its first sample attribute data, and determining the number of bit positions at which the two codes differ; and determining that number as the Hamming distance between each sample attribute data and the corresponding first sample attribute data.
  • the preset distance may be set according to actual service requirements.
  • for example, the Hamming code corresponding to sample attribute data A is 110011, and the first sample attribute data corresponding to sample attribute data A includes sample attribute data B and sample attribute data C. The Hamming code corresponding to sample attribute data B is 111011, the Hamming code corresponding to sample attribute data C is 000001, and the preset Hamming distance is 2.
  • sample attribute data A and sample attribute data B differ in one bit, so the Hamming distance between them is 1. Because this is less than the preset Hamming distance of 2, it can be determined that sample attribute data A is similar to sample attribute data B. Similarly, the Hamming distance between sample attribute data A and sample attribute data C can be determined to be 3.
  • because the Hamming distance between sample attribute data A and sample attribute data C is greater than the preset Hamming distance of 2, it can be determined that sample attribute data C is not similar to sample attribute data A; that is, among the first sample attribute data, the second sample attribute data similar to sample attribute data A is sample attribute data B. The second sample attribute data similar to the respective sample attribute data can be determined in the manner described above.
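  • The worked example above can be checked with a few lines of code. This is an illustrative sketch; the function names are hypothetical, while the codes and the preset distance of 2 come from the example.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("codes must have equal length")
    return sum(x != y for x, y in zip(a, b))

def similar_samples(target, candidates, codes, preset_distance):
    """Candidates whose Hamming distance to target is below the preset distance."""
    return [c for c in candidates
            if hamming_distance(codes[target], codes[c]) < preset_distance]

codes = {"A": "110011", "B": "111011", "C": "000001"}
print(hamming_distance(codes["A"], codes["B"]))      # 1
print(hamming_distance(codes["A"], codes["C"]))      # 3
print(similar_samples("A", ["B", "C"], codes, 2))    # ['B']
```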
  • the sample attribute data can be sorted in advance; for example, the sorting order is sample attribute data A, sample attribute data B, sample attribute data C, and sample attribute data D. Then, following the sorting position of each sample attribute data, the first sample attribute data corresponding to each sample attribute data can be determined in turn and the sample distances calculated.
  • for example, the first sample attribute data corresponding to sample attribute data A is determined first, and the sample distance between sample attribute data A and its corresponding first sample attribute data is calculated; then the first sample attribute data corresponding to sample attribute data B is determined, and the sample distance between sample attribute data B and its corresponding first sample attribute data is calculated.
  • in this process, it is judged whether the sorting position of the first sample attribute data is before the sorting position of the corresponding sample attribute data. If it is before, the sample distance between the first sample attribute data and the sample attribute data has already been calculated and does not need to be calculated again; if it is after, the sample distance between the sample attribute data and the corresponding first sample attribute data needs to be calculated.
  • the training set includes sample attribute data A, sample attribute data B, sample attribute data C, and sample attribute data D
  • the first sample attribute data corresponding to sample attribute data A includes sample attribute data B and sample attribute data C
  • the first sample attribute data corresponding to sample attribute data B includes sample attribute data A and sample attribute data D.
  • the second sample attribute data similar to sample attribute data A has been determined to be sample attribute data B, then in the process of determining the second sample attribute data similar to sample attribute data B, it is not necessary to recalculate the sample distance between sample attribute data B and its corresponding first sample attribute data A. The previous calculation result can be used directly to determine that sample attribute data B is similar to first sample attribute data A, so only the sample distance between sample attribute data B and its corresponding first sample attribute data D needs to be calculated to determine whether the two are similar. In this way, the amount of sample distance calculation can be further reduced, and the detection efficiency for polluted sample attribute data can be improved.
  • whether each sample attribute data and its corresponding second sample attribute data have been polluted can be determined from the amount of sample data corresponding to the second sample attribute data similar to each sample attribute data.
  • the judgment needs to take into account how the sample attribute data were actually collected.
  • the collected sample attribute data set includes 1000 pieces of sample attribute data from 200 individuals, with 5 pieces collected for each person. This training set therefore includes a total of 200 platform users, each corresponding to 5 pieces of sample data, so the maximum sample data volume can be determined to be 5.
  • the collected training set includes 200 pieces of sample attribute data from 50 individuals: 4 pieces were collected for each of 48 individuals, only 1 piece for one person, and 7 pieces for another, so the maximum sample data volume can be determined to be 7. In this way, the maximum sample data volume can be determined from the actual situation and then used to determine whether sample attribute data is polluted data.
  • step 206 specifically includes: subtracting the maximum sample data volume from the sample data volume corresponding to the second sample attribute data to obtain the sample size difference corresponding to each sample attribute data; if the sample size difference corresponding to target sample attribute data among the sample attribute data is greater than the preset sample size difference, determining that the target sample attribute data and its corresponding second sample attribute data are polluted sample data.
  • the target sample attribute data is any sample attribute data in the training set, and the preset sample size difference can be set according to actual business requirements.
  • the maximum sample data volume is determined to be 100
  • the preset sample size difference is 50
  • the sample size corresponding to the second sample attribute data similar to sample attribute data A is 200
  • sample size difference is greater than the preset sample size difference of 50, it can be determined that sample attribute data A and its corresponding second sample attribute data have been polluted; that is, an attacker may have maliciously created polluted sample attribute data similar to sample attribute data A by copying, transforming, and so on, and added it to the training set, so that the sample size of the second sample attribute data, 200, is far more than the normal sample data amount of 100. Therefore, sample attribute data A and all of its corresponding second sample attribute data need to be removed from the training set to keep the training set safe.
  • sample data volume is smaller than the preset sample data volume, it can be determined that the sample data volume of the second sample attribute data similar to sample attribute data B is within the normal range; that is, sample attribute data B and its corresponding second sample attribute data are not polluted. In this way, whether each sample attribute data in the training set of the abnormal behavior detection model is polluted data can be determined in turn in the above manner.
  • the embodiment of the present invention provides another method for detecting polluted sample data used for model training. Compared with the current approach of detecting polluted sample data through duplicate checking, this method acquires the sample attribute data corresponding to each platform user to be detected, the sample attribute data including at least the device attribute data, risk control data and business data corresponding to each platform user; hashes each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hashing algorithm, wherein each hash table includes multiple hash buckets; determines the data located in the same hash bucket as each sample attribute data in the different hash tables as the first sample attribute data; screens out, from the first sample attribute data, the second sample attribute data similar to each sample attribute data; and finally determines, based on the amount of sample data corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data, thereby improving the detection accuracy for polluted sample data, ensuring the security of the sample attribute data, and in turn improving the detection accuracy of the abnormal behavior detection model.
  • an embodiment of the present invention provides a detection device for contaminated sample data used for model training.
  • the device includes: an acquisition unit 31, a hash unit 32, a determining unit 33, a screening unit 34 and a judging unit 35.
  • the acquiring unit 31 may be configured to acquire sample attribute data corresponding to each platform user to be detected, the sample attribute data at least including device attribute data, risk control data and service data corresponding to each platform user.
  • the hash unit 32 can be used to hash each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hashing algorithm, wherein each hash table includes multiple hash buckets.
  • the determining unit 33 may be configured to determine the data in the same hash bucket as the respective sample attribute data in the different hash tables as the first sample attribute data.
  • the screening unit 34 may be configured to screen out second sample attribute data similar to the respective sample attribute data from the first sample attribute data.
  • the judging unit 35 may be configured to determine whether each sample attribute data is polluted sample data based on the amount of sample data corresponding to the second sample attribute data.
  • the hash unit 32 includes: a first calculation module 321 and a hash module 322.
  • the first calculation module 321 may be configured to use a preset locality-sensitive hash algorithm to separately calculate the hash values of the respective sample attribute data in the different hash tables.
  • the hash module 322 may be configured to hash the respective sample attribute data into corresponding hash buckets in the different hash tables based on the hash value.
  • the first calculation module 321 includes: a determination submodule and an extraction submodule.
  • the determining submodule can be used to determine the data dimension and coordinate value corresponding to each sample attribute data.
  • the determining submodule may also determine the Hamming code corresponding to each sample attribute data based on the data dimension and the coordinate value.
  • the extracting submodule can be used to extract the code at the corresponding positions in the Hamming code by using the hash functions corresponding to the different hash tables, and determine the extracted code as the hash value of each sample attribute data in the different hash tables.
  • the screening unit 34 includes: a second calculation module 341 and a determination module 342 .
  • the second calculation module 341 may be configured to respectively calculate sample distances between the respective sample attribute data and the corresponding first sample attribute data.
  • the determining module 342 may be configured to determine the first sample attribute data whose sample distance is less than a preset distance as the second sample attribute data similar to the respective sample attribute data.
  • the sample distance is a Hamming distance
  • the second calculation module 341 includes: a comparison submodule and a determination submodule.
  • the comparison sub-module can be used to compare the Hamming code corresponding to each sample attribute data with the Hamming code corresponding to its first sample attribute data, and determine the number of code positions at which each sample attribute data and its first sample attribute data differ.
  • the determining submodule may be configured to determine the number of digits as a Hamming distance between each sample attribute data and its corresponding first sample attribute data.
  • in order to determine whether each sample attribute data is polluted sample data, the judging unit 35 includes: a statistics module 351 and a determination module 352.
  • the statistics module 351 can be used to count the amount of sample data corresponding to each platform user and select the maximum sample data amount from among these amounts.
  • the determining module 352 may be configured to determine whether each sample attribute data is contaminated sample data according to the maximum sample data amount and the sample data amount corresponding to the second sample attribute data.
  • the determination module 352 includes: a subtraction submodule and a determination submodule.
  • the subtraction sub-module may be configured to subtract the maximum sample data amount from the sample data amount corresponding to the second sample attribute data to obtain the sample size difference corresponding to each sample attribute data.
  • the determining submodule may be configured to determine that the target sample attribute data and its corresponding second sample attribute data are polluted sample data if the sample size difference corresponding to the target sample attribute data among the sample attribute data is greater than the preset sample size difference.
  • an embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the following steps are implemented: acquiring sample attribute data corresponding to each platform user to be detected, the sample attribute data including at least device attribute data, risk control data, and business data corresponding to each platform user; hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hashing algorithm, wherein each hash table includes multiple hash buckets; determining the data located in the same hash bucket as each sample attribute data in the different hash tables as the first sample attribute data; screening out, from the first sample attribute data, the second sample attribute data similar to each sample attribute data; and determining, based on the amount of sample data corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data.
  • the embodiment of the present invention also provides a physical structure diagram of a computer device.
  • the computer device includes: a processor 41, a memory 42, and a computer program that is stored on the memory 42 and can run on the processor 41, wherein the memory 42 and the processor 41 are both arranged on a bus 43, and the following steps are implemented when the processor 41 executes the program: acquiring sample attribute data corresponding to each platform user to be detected, the sample attribute data including at least device attribute data, risk control data and business data corresponding to each platform user; hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hashing algorithm, wherein each hash table includes multiple hash buckets; determining the data located in the same hash bucket as each sample attribute data in the different hash tables as the first sample attribute data; screening out, from the first sample attribute data, the second sample attribute data similar to each sample attribute data; and determining, based on the amount of sample data corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data.
  • the sample attribute data corresponding to each platform user to be detected is acquired, the sample attribute data including at least the device attribute data, risk control data and business data corresponding to each platform user; each sample attribute data is hashed into corresponding hash buckets in different hash tables by using a preset locality-sensitive hashing algorithm, wherein each hash table includes multiple hash buckets; the data located in the same hash bucket as each sample attribute data in the different hash tables is determined as the first sample attribute data; the second sample attribute data similar to each sample attribute data is screened out from the first sample attribute data; finally, whether each sample attribute data is polluted sample data is determined based on the amount of sample data corresponding to the second sample attribute data, thereby improving the detection accuracy for polluted sample data and ensuring the security of the sample attribute data, which in turn can improve the detection accuracy of the abnormal behavior detection model.
  • each module or each step of the present invention described above can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Alternatively, they may be implemented in program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device, and in some cases the steps shown or described may be carried out in an order different from that shown here; or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module.
  • the present invention is not limited to any specific combination of hardware and software.
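The screening and judgment steps described in the list above (comparing Hamming codes against a preset distance, reusing a pair distance that has already been calculated, and testing the sample size difference) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the use of tuples of bits as Hamming codes, and the strict `<` comparison against the preset distance are choices made here. The sample size difference is taken as the similar-sample count minus the maximum sample amount, which is an assumption chosen to agree with the worked example of 200 similar samples against a normal amount of 100.

```python
def hamming_distance(code_a, code_b):
    """Number of positions at which two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))


def screen_similar(codes, candidates, preset_distance):
    """Return (similar, pairs_computed).

    codes:      {sample_id: code}, iterated in sorting-position order
    candidates: {sample_id: ids sharing a hash bucket} (the "first" data)
    similar:    {sample_id: candidates closer than preset_distance}
                (the "second" data)
    """
    distance_cache = {}  # one entry per unordered pair, reused on the reverse lookup
    similar = {sample_id: set() for sample_id in codes}
    for sample_id in codes:
        for other_id in candidates.get(sample_id, ()):
            pair = frozenset((sample_id, other_id))
            if pair not in distance_cache:
                distance_cache[pair] = hamming_distance(
                    codes[sample_id], codes[other_id]
                )
            if distance_cache[pair] < preset_distance:
                similar[sample_id].add(other_id)
    return similar, len(distance_cache)


def find_polluted(similar, max_sample_amount, preset_difference):
    """Flag a sample and its similar samples when the count is abnormally large."""
    polluted = set()
    for sample_id, neighbours in similar.items():
        # Assumed sign convention: similar-sample count minus maximum sample amount.
        if len(neighbours) - max_sample_amount > preset_difference:
            polluted.add(sample_id)
            polluted.update(neighbours)
    return polluted


# Hypothetical codes for the A/B/C/D example; the bit patterns are made up.
codes = {
    "A": (1, 0, 0, 1, 1, 0),
    "B": (1, 0, 0, 1, 1, 1),
    "C": (1, 1, 1, 0, 0, 0),
    "D": (1, 0, 0, 1, 0, 1),
}
candidates = {"A": {"B", "C"}, "B": {"A", "D"}}
similar, pairs_computed = screen_similar(codes, candidates, preset_distance=2)
# Four candidate lookups need only three distance calculations,
# because the pair (A, B) is reused when B is processed.
```

Keying the cache on the unordered pair gives the same saving as the sorting-position check described above, without having to compare positions explicitly.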

Abstract

Provided are a polluted sample data detecting method and apparatus for model training. The method comprises: acquiring sample attribute data corresponding to platform users to be detected (101); respectively hashing the sample attribute data into corresponding hash buckets in different hash tables by means of a preset locality-sensitive hashing algorithm (102); determining data, which is located in the same hash buckets as the sample attribute data, in the different hash tables as first sample attribute data (103); selecting second sample attribute data similar to the sample attribute data from the first sample attribute data (104); and separately determining, on the basis of the amount of sample data corresponding to the second sample attribute data, whether the sample attribute data is polluted sample data (105).

Description

Method and device for detecting polluted sample data used for model training
Technical field
The present invention relates to the field of information technology, and in particular to a method and device for detecting polluted sample data used for model training.
Background
As the Internet becomes ever more developed, people increasingly shop online, so e-commerce platforms often launch promotions to attract visitors. These promotions attract normal users but also draw the attention of wrongdoers, and identifying the abnormal behavior of wrongdoers is important for the security of network platforms. With the development of artificial intelligence, sample data from a large number of users can be used to train models for abnormal behavior detection. Because user sample data is the foundation of artificial intelligence, an attacker who pollutes the sample data so that the algorithm learns wrong data features will change the classification boundary of the model and seriously degrade the performance of the abnormal behavior detection model. It is therefore necessary to perform pollution detection on user sample data before model training.
At present, pollution detection on the training set of an abnormal behavior detection model is usually done by checking the sample data for duplicates. However, sample data can be polluted by copying, subtle transformation, synthesis, and other means, and simple duplicate checking cannot detect polluted sample data that has been transformed or synthesized. As a result, the detection accuracy for polluted sample data is low, the security of the sample data cannot be guaranteed, and the detection accuracy of the abnormal behavior detection model suffers.
Summary of the invention
The present invention provides a method and device for detecting polluted sample data used for model training, mainly to improve the detection accuracy for polluted sample data and ensure the security of sample data, thereby improving the detection accuracy of the abnormal behavior detection model.
According to a first aspect of the present invention, a method for detecting polluted sample data used for model training is provided, comprising:
acquiring sample attribute data corresponding to each platform user to be detected, the sample attribute data including at least device attribute data, risk control data, and business data corresponding to each platform user;
hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hashing algorithm, wherein each hash table includes multiple hash buckets;
determining the data located in the same hash bucket as each sample attribute data in the different hash tables as first sample attribute data;
screening out, from the first sample attribute data, second sample attribute data similar to each sample attribute data;
determining, based on the amount of sample data corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data.
According to a second aspect of the present invention, a device for detecting polluted sample data used for model training is provided, comprising:
an acquisition unit, configured to acquire sample attribute data corresponding to each platform user to be detected, the sample attribute data including at least device attribute data, risk control data, and business data corresponding to each platform user;
a hash unit, configured to hash each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hashing algorithm, wherein each hash table includes multiple hash buckets;
a determining unit, configured to determine the data located in the same hash bucket as each sample attribute data in the different hash tables as first sample attribute data;
a screening unit, configured to screen out, from the first sample attribute data, second sample attribute data similar to each sample attribute data;
a judging unit, configured to determine, based on the amount of sample data corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data.
According to a third aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, and when the program is executed by a processor, the following steps are implemented:
acquiring sample attribute data corresponding to each platform user to be detected, the sample attribute data including at least device attribute data, risk control data, and business data corresponding to each platform user;
hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hashing algorithm, wherein each hash table includes multiple hash buckets;
determining the data located in the same hash bucket as each sample attribute data in the different hash tables as first sample attribute data;
screening out, from the first sample attribute data, second sample attribute data similar to each sample attribute data;
determining, based on the amount of sample data corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data.
According to a fourth aspect of the present invention, a computer device is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the following steps when executing the program:
acquiring sample attribute data corresponding to each platform user to be detected, the sample attribute data including at least device attribute data, risk control data, and business data corresponding to each platform user;
hashing each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hashing algorithm, wherein each hash table includes multiple hash buckets;
determining the data located in the same hash bucket as each sample attribute data in the different hash tables as first sample attribute data;
screening out, from the first sample attribute data, second sample attribute data similar to each sample attribute data;
determining, based on the amount of sample data corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data.
Compared with the current approach of detecting polluted sample data through duplicate checking, the method and device for detecting polluted sample data used for model training provided by the present invention acquire the sample attribute data corresponding to each platform user to be detected, the sample attribute data including at least the device attribute data, risk control data, and business data corresponding to each platform user; hash each sample attribute data into corresponding hash buckets in different hash tables by using a preset locality-sensitive hashing algorithm, wherein each hash table includes multiple hash buckets; determine the data located in the same hash bucket as each sample attribute data in the different hash tables as first sample attribute data; screen out, from the first sample attribute data, second sample attribute data similar to each sample attribute data; and finally determine, based on the amount of sample data corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data. This improves the detection accuracy for polluted sample data and ensures the security of the sample attribute data, which in turn improves the detection accuracy of the abnormal behavior detection model.
Description of the drawings
The drawings described here are provided for further understanding of the present invention and form a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
FIG. 1 shows a flow chart of a method for detecting polluted sample data used for model training provided by an embodiment of the present invention;
FIG. 2 shows a flow chart of another method for detecting polluted sample data used for model training provided by an embodiment of the present invention;
FIG. 3 shows a schematic diagram of a hash table provided by an embodiment of the present invention;
FIG. 4 shows a schematic structural diagram of a device for detecting polluted sample data used for model training provided by an embodiment of the present invention;
FIG. 5 shows a schematic structural diagram of another device for detecting polluted sample data used for model training provided by an embodiment of the present invention;
FIG. 6 shows a schematic diagram of the physical structure of a computer device provided by an embodiment of the present invention.
Detailed description
Hereinafter, the present invention will be described in detail with reference to the drawings and in combination with the embodiments. It should be noted that, where there is no conflict, the embodiments in this application and the features in the embodiments can be combined with each other.
At present, pollution detection on sample data is usually done by checking the sample data for duplicates. However, sample data can be polluted by copying, transformation, synthesis, and other means, and simple duplicate checking cannot detect polluted sample data that has been transformed or synthesized. As a result, the detection accuracy for polluted sample data is low, the security of the sample data cannot be guaranteed, and the detection accuracy of the abnormal behavior detection model suffers.
To solve the above problem, an embodiment of the present invention provides a method for detecting polluted sample data used for model training. As shown in FIG. 1, the method includes:
101、获取待检测的各个平台用户对应的样本属性数据。101. Obtain sample attribute data corresponding to each platform user to be detected.
其中,所述样本属性数据至少包括所述各个平台用户对应的设备属性数据、风控数据和业务数据,用户在电商APP进行操作时,业务人员对操作流程中的重要关卡进行埋点处理,用户触发一次预设点之后就会产生一条相应的设备属性信息,该设备属性信息具体包括:设备ID,设备型号、APP名称、APP版本号等,风控数据包括用户的所有请求信息和个人信息,该个人信息包括用户名、手机号等,业务数据包括用户的所有订单、退单和订单详情等信息。为了克服现有技术中污染样本数据检测精度较低,进而影响用户异常行为检测精度的缺陷,本发明实施例通过将各个平台用户的样本属性数据哈希到不同哈希表中相应的哈希桶内,并从与各个样本属性数据在同一哈希桶内的第一样本属性数据中筛选出与各个样本属性数据相似的第二样本属性数据,能够基于该第二样本属性数据对应的样本数据量,判定各个样本属性数据是否为污染样本数据,相比于现有技术中简单的查重方式,本发明实施例对污染样本数据检测的精度更高,从而能够提高异常行为检测模型的检测精度。本发明实施例主要应用于对样本属性数据进行污染检测的场景。本发明实施例的执行主体为能够对样本属性数据进行污染检测的装置或者设备,具体可以设置在客户端或者服务器一侧。Wherein, the sample attribute data includes at least the device attribute data, risk control data, and business data corresponding to the users of each platform. When the user operates the e-commerce APP, the business personnel will bury the important checkpoints in the operation process, After the user triggers a preset point, a corresponding device attribute information will be generated. The device attribute information specifically includes: device ID, device model, APP name, APP version number, etc., and the risk control data includes all request information and personal information of the user. , the personal information includes the user name, mobile phone number, etc., and the business data includes information such as all orders, refunds, and order details of the user. 
In order to overcome the defects in the prior art that the detection accuracy of polluted sample data is low, which in turn affects the detection accuracy of user abnormal behaviors, the embodiment of the present invention hashes the sample attribute data of users on each platform to the corresponding hash buckets in different hash tables and filter out the second sample attribute data similar to each sample attribute data from the first sample attribute data in the same hash bucket as each sample attribute data, based on the sample data corresponding to the second sample attribute data Quantity, to determine whether each sample attribute data is polluted sample data, compared with the simple duplicate checking method in the prior art, the embodiment of the present invention has a higher detection accuracy for polluted sample data, thereby improving the detection accuracy of the abnormal behavior detection model . The embodiment of the present invention is mainly applied to the scenario of performing pollution detection on sample attribute data. The executor of the embodiment of the present invention is a device or device capable of performing contamination detection on sample attribute data, which can be specifically set on the client side or the server side.
For the embodiment of the present invention, the device attribute data, risk control data, and business data of a large number of platform users are collected in advance and used as the users' sample attribute data. Because this sample attribute data may have been maliciously polluted, the polluted data in it must be detected in order to preserve the detection accuracy of the subsequent abnormal behavior detection model. For example, suppose the sample attribute data of 200 platform users is collected, with data being captured whenever a platform user operates on the e-commerce platform. If 5 pieces of sample attribute data are collected per user, 1000 pieces are collected in total, and these 1000 pieces serve as the training data for the abnormal behavior detection model. Before formal training, these 1000 pieces of sample attribute data must first be checked for polluted data, so that a polluted data set does not impair the training of the abnormal behavior detection model.
102. Using a preset locality-sensitive hashing algorithm, hash each sample attribute data into the corresponding hash buckets in different hash tables.
Each hash table includes multiple hash buckets, and the specific number of hash buckets can be set according to actual business requirements. For the embodiment of the present invention, pollution detection requires matching samples against each other and counting how many similar samples exist. Because the training set contains a large amount of sample attribute data, computing the similarity between every pair of sample attribute data would be prohibitively expensive. To reduce the computation needed for matching, the embodiment uses a preset locality-sensitive hashing algorithm to hash each sample attribute data into the corresponding hash buckets in different hash tables. Since sample attribute data in the same hash bucket is very likely to be similar, the second sample attribute data similar to a given sample attribute data can be determined from the first sample attribute data located in the same hash bucket. This greatly narrows the matching range within a massive data set and reduces the amount of computation.
Specifically, when the preset locality-sensitive hashing algorithm is used to hash each sample attribute data into the corresponding hash buckets in different hash tables, a hash function satisfying the following conditions must be found:
If $d(x,y) \le R$, then the probability that $h(x) = h(y)$ is not less than $P_1$;
If $d(x,y) \ge cR$, then the probability that $h(x) = h(y)$ is not greater than $P_2$;
Here, $d(x,y)$ is the distance between any two sample attribute data $x$ and $y$ in the high-dimensional space, $h$ is the hash function, $h(x)$ and $h(y)$ are the hash transformations of $x$ and $y$, and $c$ is a constant. The conditions $c > 1$ and $P_1 > P_2$ must both hold. The probability values $P_1$ and $P_2$ are preset according to the required accuracy with which similar samples are found: the higher the required accuracy, the larger $P_1$ and the smaller $P_2$; conversely, the lower the required accuracy, the smaller $P_1$ and the larger $P_2$, always subject to $P_1 > P_2$.
In the embodiment of the present invention, the number of hash tables and the number of hash functions per hash table are determined according to the required accuracy of the similar-sample search. The more hash tables there are, and the more hash functions each hash table has, the higher the accuracy with which similar sample attribute data is found; however, more hash tables and hash functions also increase the amount of computation. The number of hash tables and the number of hash functions per table are therefore chosen by weighing search accuracy against computation cost. For example, the number of hash tables is set to 3 and each hash table has two hash functions: hash table 1 uses hash functions $h_1$ and $h_2$, hash table 2 uses $h_3$ and $h_4$, and hash table 3 uses $h_5$ and $h_6$.
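The trade-off between the number of hash tables and the number of hash functions per table can be illustrated with the usual LSH amplification estimate. This estimate is not stated in the source; it is a sketch that assumes the hash functions are independent and that a single hash function agrees on a pair of samples with some probability `p`. The values 0.9 and 0.3 below are hypothetical.

```python
def candidate_probability(p: float, k: int, num_tables: int) -> float:
    """Chance that two samples share a bucket in at least one hash table.

    p          -- assumed probability that one hash function agrees on both samples
    k          -- hash functions combined per table (assumed independent)
    num_tables -- number of hash tables (assumed independent)
    """
    same_bucket_in_one_table = p ** k
    return 1.0 - (1.0 - same_bucket_in_one_table) ** num_tables


# Configuration from the text: 3 hash tables, 2 hash functions per table.
near = candidate_probability(0.9, k=2, num_tables=3)  # hypothetical similar pair
far = candidate_probability(0.3, k=2, num_tables=3)   # hypothetical dissimilar pair
print(f"similar pair: {near:.6f}, dissimilar pair: {far:.6f}")
```

Under these assumptions, adding tables raises the chance that a similar pair meets in some bucket, while adding hash functions per table makes each bucket more selective.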
Specifically, once the number of hash tables and the hash functions of each hash table have been determined, the hash functions of each hash table are used to compute the hash value of each sample attribute data in that hash table. Each sample attribute data is then hashed into the corresponding hash bucket of each hash table according to its hash value in that table. Each hash table has multiple hash buckets, and each hash bucket corresponds to a different hash value.
For example, hash table 1, hash table 2, and hash table 3 each include 4 hash buckets. In all three tables, the first hash bucket corresponds to the hash value 00, the second to 01, the third to 10, and the fourth to 11. If the hash functions $h_1$ and $h_2$ of hash table 1 give sample attribute data A the hash value 00, which equals the hash value of the first hash bucket, A is hashed into the first hash bucket of hash table 1. If the hash functions $h_3$ and $h_4$ of hash table 2 give A the hash value 01, which equals the hash value of the second hash bucket, A is hashed into the second hash bucket of hash table 2. If the hash functions $h_5$ and $h_6$ of hash table 3 give A the hash value 10, which equals the hash value of the third hash bucket, A is hashed into the third hash bucket of hash table 3. In this way, the preset locality-sensitive hashing algorithm hashes each sample attribute data into the corresponding hash buckets in different hash tables.
103. Determine the data located in the same hash bucket as each sample attribute data in the different hash tables as the first sample attribute data.
For the embodiment of the present invention, step 102 has already hashed each sample attribute data into the corresponding hash buckets in different hash tables. Since sample attribute data in the same hash bucket is very likely to be similar, the first sample attribute data that shares a hash bucket with each sample attribute data can be determined in each hash table, and the second sample attribute data similar to each sample attribute data can then be filtered from its first sample attribute data.
For example, suppose sample attribute data A is in the first hash bucket of hash table 1, the third hash bucket of hash table 2, and the fourth hash bucket of hash table 3. For A, all sample attribute data in the first hash bucket of hash table 1, in the third hash bucket of hash table 2, and in the fourth hash bucket of hash table 3 is retrieved and taken as the first sample attribute data of A, from which the second sample attribute data similar to A is later filtered. As another example, suppose sample attribute data B is in the third hash bucket of hash table 1, the fourth hash bucket of hash table 2, and the first hash bucket of hash table 3. For B, all sample attribute data in the third hash bucket of hash table 1, in the fourth hash bucket of hash table 2, and in the first hash bucket of hash table 3 is retrieved and taken as the first sample attribute data of B, from which the second sample attribute data similar to B is later filtered. In this way, the first sample attribute data sharing a hash bucket with each sample attribute data can be obtained, which greatly narrows the matching range.
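The retrieval of first sample attribute data described in this step can be sketched as a union over the buckets a sample occupies. The bucket positions of A follow the example in the text (first bucket = 00, third = 10, fourth = 11); the other bucket contents, the `HashTable` alias, and the function name `first_sample_set` are hypothetical.

```python
from typing import Dict, List, Set

# One dict per hash table: bucket hash value -> names of samples in that bucket.
HashTable = Dict[str, List[str]]


def first_sample_set(sample: str, keys: List[str], tables: List[HashTable]) -> Set[str]:
    """Collect every sample that shares a bucket with `sample` in any hash table."""
    candidates: Set[str] = set()
    for key, table in zip(keys, tables):
        candidates.update(table.get(key, []))
    candidates.discard(sample)  # a sample is not its own candidate
    return candidates


# Hypothetical bucket contents; only A's positions are taken from the text.
tables: List[HashTable] = [
    {"00": ["A", "C"], "10": ["B"]},       # hash table 1: A in the first bucket
    {"10": ["A", "D"], "11": ["B"]},       # hash table 2: A in the third bucket
    {"11": ["A", "C", "E"], "00": ["B"]},  # hash table 3: A in the fourth bucket
]
keys_of_a = ["00", "10", "11"]
candidates_of_a = first_sample_set("A", keys_of_a, tables)
print(sorted(candidates_of_a))
```

Only these candidates need to be compared with A in the next step, rather than the whole training set.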
104. Filter out, from the first sample attribute data, the second sample attribute data similar to each sample attribute data.
For the embodiment of the present invention, after the matching range (the first sample attribute data) of each sample attribute data has been determined, a further filtering step is needed. The preset locality-sensitive hashing algorithm only guarantees that sample attribute data in the same hash bucket is very likely to be similar; it does not guarantee that such data is actually similar. To further improve the detection accuracy for polluted sample data, the second sample attribute data that is truly similar to each sample attribute data is therefore filtered from its first sample attribute data.
Specifically, the sample distance between each sample attribute data and each of its first sample attribute data can be computed, and the second sample attribute data similar to the sample attribute data can be filtered from the first sample attribute data according to that distance. The sample distance may be computed as a Euclidean distance, cosine distance, Hamming distance, or the like, and is not specifically limited in the embodiment of the present invention. The larger the sample distance, the lower the similarity between two sample attribute data; the smaller the sample distance, the higher the similarity. Accordingly, the first sample attribute data whose sample distance is smaller than a preset sample distance is determined as the second sample attribute data similar to the sample attribute data in question.
For example, the sample attribute data located in the same hash bucket as sample attribute data A across the different hash tables includes sample attribute data B, C, and D; that is, the first sample attribute data of A consists of B, C, and D. The sample distances between A and each of B, C, and D are computed. Comparison shows that the distance between A and B is smaller than the preset distance and that the distance between A and C is smaller than the preset distance, so B and C are determined to be similar to A; that is, the second sample attribute data similar to A is B and C.
As another example, the sample attribute data located in the same hash bucket as sample attribute data B across the different hash tables includes sample attribute data A, E, and F; that is, the first sample attribute data of B consists of A, E, and F. Because A precedes B in the processing order, the distance between A and B was already computed when A was processed, and A was already determined to be similar to B. For B, therefore, only the distances between B and E and between B and F need to be computed. Comparison shows that the distance between B and E is smaller than the preset distance, so A and E are determined to be similar to B; that is, the second sample attribute data similar to B is A and E. In this way, the second sample attribute data similar to each sample attribute data can be determined.
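The filtering step, including the reuse of a distance that was already computed for an earlier sample, can be sketched as follows. Euclidean distance is used here only because the text lists it as one option; the two-dimensional vectors, the threshold 1.5, and the function name `similar_samples` are hypothetical values chosen to reproduce the A and B examples.

```python
import math
from typing import Dict, FrozenSet, Set, Tuple

Vector = Tuple[float, ...]


def similar_samples(
    vectors: Dict[str, Vector],
    candidates: Dict[str, Set[str]],
    max_distance: float,
) -> Tuple[Dict[str, Set[str]], int]:
    """Return the similar set of each sample and how many distances were computed."""
    cache: Dict[FrozenSet[str], float] = {}  # distance is symmetric, so key by unordered pair
    result: Dict[str, Set[str]] = {name: set() for name in candidates}
    for name, others in candidates.items():  # samples are processed in insertion order
        for other in others:
            pair = frozenset((name, other))
            if pair not in cache:
                cache[pair] = math.dist(vectors[name], vectors[other])
            if cache[pair] < max_distance:
                result[name].add(other)
    return result, len(cache)


# Hypothetical coordinates for the samples named in the example.
vectors = {"A": (0, 0), "B": (1, 0), "C": (0, 1), "D": (5, 5), "E": (2, 0), "F": (9, 9)}
candidates = {"A": {"B", "C", "D"}, "B": {"A", "E", "F"}}
similar, computed = similar_samples(vectors, candidates, max_distance=1.5)
print(similar, computed)
```

Six pairs are examined but only five distances are computed, because the pair (A, B) is looked up from the cache when B is processed.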
105. Based on the sample data amount of the second sample attribute data, determine whether each sample attribute data is polluted sample data.
For the embodiment of the present invention, after the second sample attribute data similar to each sample attribute data has been determined, the amount of that second sample attribute data is counted and, taking into account how the sample attribute data was actually collected, it is judged whether this amount is normal. If the amount does not exceed the normal range, the sample attribute data has not been polluted. If the amount exceeds the normal range, the sample attribute data has been polluted and must be excluded from the training set.
For example, the training set consists of 1000 pieces of sample attribute data collected from 200 platform users, 5 pieces per user. Under normal circumstances, the 5 pieces of sample attribute data of the same platform user should be similar to one another. If the second sample attribute data similar to sample attribute data A amounts to 50 pieces, far beyond the normal range of 5, then the second sample attribute data similar to A has evidently been polluted: an attacker has most likely created data similar to A by copying it or applying slight transformations and added it to the training set. To keep the training data from being polluted and to preserve the detection accuracy of the abnormal behavior detection model, sample attribute data A and the second sample attribute data similar to A must be excluded from the training set. In this way, by counting the amount of second sample attribute data for each sample attribute data, it can be determined whether that sample attribute data and its second sample attribute data have been polluted.
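The decision rule in this step can be sketched as a simple count check. The threshold of 5 mirrors the "5 pieces per user" figure in the example, but treating it as a hard cut-off is an assumption; the sample names and the function name `polluted_samples` are hypothetical.

```python
from typing import Dict, Set


def polluted_samples(similar: Dict[str, Set[str]], max_normal_count: int) -> Set[str]:
    """Flag a sample and all of its similar samples when the similar count is abnormal."""
    polluted: Set[str] = set()
    for sample, neighbours in similar.items():
        if len(neighbours) > max_normal_count:
            polluted.add(sample)
            polluted.update(neighbours)
    return polluted


# Hypothetical input: A has 50 similar samples, U has 4.
similar = {
    "A": {f"A_copy_{i}" for i in range(50)},
    "U": {"U_1", "U_2", "U_3", "U_4"},
}
to_exclude = polluted_samples(similar, max_normal_count=5)
print(len(to_exclude))
```

Everything in `to_exclude` would be removed from the training set before the abnormal behavior detection model is trained.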
The embodiment of the present invention provides a method for detecting polluted sample data used for model training. Compared with the current approach of detecting polluted sample data through duplicate checking, the present invention acquires the sample attribute data of each platform user to be detected, the sample attribute data including at least the device attribute data, risk control data, and business data of each platform user; uses a preset locality-sensitive hashing algorithm to hash each sample attribute data into the corresponding hash buckets in different hash tables, each hash table including multiple hash buckets; determines the data located in the same hash bucket as each sample attribute data in the different hash tables as the first sample attribute data; filters out, from the first sample attribute data, the second sample attribute data similar to each sample attribute data; and finally determines, based on the sample data amount of the second sample attribute data, whether each sample attribute data is polluted sample data. This improves the detection accuracy for polluted sample data, protects the sample attribute data, and in turn improves the detection accuracy of the abnormal behavior detection model.
Further, to better explain the above detection process, and as a refinement and extension of the above embodiment, the embodiment of the present invention provides another method for detecting polluted sample data used for model training. As shown in Figure 2, the method includes:
201. Acquire the sample attribute data of each platform user to be detected.
For the embodiment of the present invention, to keep the training set safe, each sample attribute data in the training set is acquired and subjected to pollution detection to determine whether it is polluted data. The acquisition process is identical to step 101 and is not repeated here.
202. Using a preset locality-sensitive hashing algorithm, hash each sample attribute data into the corresponding hash buckets in different hash tables.
Each hash table includes multiple hash buckets, and the specific number of hash buckets can be set according to actual business requirements. For the embodiment of the present invention, to hash each sample attribute data into the corresponding hash buckets in different hash tables, step 202 specifically includes: using the preset locality-sensitive hashing algorithm to compute the hash value of each sample attribute data in each of the different hash tables; and, based on the hash values, hashing each sample attribute data into the corresponding hash buckets in the different hash tables. Further, computing the hash values of each sample attribute data in the different hash tables with the preset locality-sensitive hashing algorithm includes: determining the data dimension and coordinate values of each sample attribute data; determining the Hamming code of each sample attribute data based on the data dimension and the coordinate values; and using the hash functions of the different hash tables to extract the bits at the corresponding positions of the Hamming code, the extracted bits being the hash value of the sample attribute data in the respective hash table.
Specifically, the preset locality-sensitive hashing algorithm used in the embodiment of the present invention is mainly a locality-sensitive hashing algorithm under Hamming distance. First, the data dimension of each sample attribute data and its coordinate values at the different positions are determined. The maximum coordinate value over all sample attribute data is then determined from these coordinate values and multiplied by the data dimension to obtain the number of bits in the Hamming code, and each sample attribute data is Hamming-encoded with that number of bits. Further, the hash functions of the different hash tables are used to extract the bits at the corresponding positions of each sample attribute data's Hamming code, and the extracted bits are taken as the hash value of that sample attribute data.
As shown in Figure 3, the training set includes 6 sample attribute data: A = (1,1), B = (2,1), C = (1,2), D = (2,2), E = (4,2), and F = (4,3). From the coordinate values of these sample attribute data, the maximum coordinate value is 4, and the data dimension is 2, so the number of bits in the Hamming code is 4 × 2 = 8. Each sample attribute data is then encoded as an 8-bit Hamming code according to the following formula:
$v(p) = \mathrm{Unary}_C(x_1)\,\mathrm{Unary}_C(x_2)\cdots\mathrm{Unary}_C(x_n)$
Here, $v(p)$ is the Hamming code of a sample attribute data, $x_1, x_2, \ldots, x_n$ are its coordinate values, and $n$ is its data dimension. $\mathrm{Unary}_C(x)$ is a binary code of length $C$, where $C$ is the maximum coordinate value over the sample attribute data: its first $x$ bits are 1 and its remaining $C - x$ bits are 0. Once the code of each coordinate value has been determined, the codes are concatenated to obtain the Hamming code $v(p)$ of the sample attribute data.
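The encoding defined above can be sketched as follows, using the six samples from the Figure 3 example. The function names `unary` and `encode` are hypothetical, and non-negative integer coordinates are assumed.

```python
from typing import Dict, Sequence, Tuple


def unary(x: int, c: int) -> str:
    """Length-c code whose first x bits are 1 and remaining c - x bits are 0."""
    if not 0 <= x <= c:
        raise ValueError(f"coordinate {x} is outside 0..{c}")
    return "1" * x + "0" * (c - x)


def encode(point: Sequence[int], c: int) -> str:
    """Concatenate the unary code of every coordinate of the point."""
    return "".join(unary(x, c) for x in point)


samples: Dict[str, Tuple[int, int]] = {
    "A": (1, 1), "B": (2, 1), "C": (1, 2),
    "D": (2, 2), "E": (4, 2), "F": (4, 3),
}
max_coordinate = max(max(point) for point in samples.values())  # C in the formula
codes = {name: encode(point, max_coordinate) for name, point in samples.items()}
for name, code in codes.items():
    print(name, code)
```

With this encoding, the number of differing bits between two codes equals the sum of the absolute coordinate differences of the two points, which is why Hamming distance on the codes is a meaningful sample distance.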
In the example above, $v(A) = \mathrm{Unary}_C(x_1)\,\mathrm{Unary}_C(x_2)$ with $x_1 = 1$, $x_2 = 1$, and $C = 4$, the maximum coordinate value over all sample attribute data, so the 8-bit Hamming code of sample attribute data A is $v(A) = 10001000$. Likewise, $v(B) = 11001000$, $v(C) = 10001100$, $v(D) = 11001100$, $v(E) = 11111100$, and $v(F) = 11111110$. Further, according to the required accuracy of the similar-sample search, 3 hash tables are set up, each with two hash functions. The first hash table is formed by hash functions $h_1$ and $h_2$, which extract the 2nd and 4th bits of the Hamming code respectively; the second hash table is formed by $h_3$ and $h_4$, which extract the 1st and 6th bits; and the third hash table is formed by $h_5$ and $h_6$, which extract the 3rd and 8th bits. Applying these hash functions to the Hamming code of each sample attribute data yields its hash value in each hash table. Sample attribute data A, for instance, has the hash value 00 in the first hash table, 10 in the second hash table, and 00 in the third hash table.
Further, since the extracted bits can take only 4 possible values, namely 00, 01, 10, and 11, each hash table can be set to include 4 hash buckets whose hash values are 00, 01, 10, and 11 respectively. From the hash values of sample attribute data A in the different hash tables, A is hashed into the first hash bucket of the first hash table, the third hash bucket of the second hash table, and the first hash bucket of the third hash table. The other sample attribute data are hashed into the corresponding hash buckets of the different hash tables in the same way, as shown in Figure 3.
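The bit extraction and bucket placement just described can be sketched as follows. The 8-bit codes and the sampled bit positions are taken from the example; the names `TABLE_POSITIONS`, `bucket_key`, and `build_tables` are hypothetical, and each hash table is modelled as a plain dictionary.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# 8-bit codes of the six samples, as given in the example.
codes: Dict[str, str] = {
    "A": "10001000", "B": "11001000", "C": "10001100",
    "D": "11001100", "E": "11111100", "F": "11111110",
}

# 1-based bit positions read by (h1, h2), (h3, h4), and (h5, h6).
TABLE_POSITIONS: List[Tuple[int, int]] = [(2, 4), (1, 6), (3, 8)]


def bucket_key(code: str, positions: Sequence[int]) -> str:
    """Hash value of a code in one table: the bits at the sampled positions."""
    return "".join(code[p - 1] for p in positions)


def build_tables(
    codes: Dict[str, str], table_positions: Sequence[Sequence[int]]
) -> List[Dict[str, List[str]]]:
    """Place every sample into one bucket per hash table."""
    tables: List[Dict[str, List[str]]] = []
    for positions in table_positions:
        table: Dict[str, List[str]] = defaultdict(list)
        for name, code in codes.items():
            table[bucket_key(code, positions)].append(name)
        tables.append(dict(table))
    return tables


tables = build_tables(codes, TABLE_POSITIONS)
keys_of_a = [bucket_key(codes["A"], positions) for positions in TABLE_POSITIONS]
print(keys_of_a)
for index, table in enumerate(tables, start=1):
    print(f"hash table {index}: {table}")
```

For sample A this yields the hash values 00, 10, and 00, which agrees with the placement stated in the text. Buckets that receive no sample are simply absent from the dictionaries.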
203. Determine the data located in the same hash bucket as each sample attribute data in the different hash tables as the first sample attribute data.
For the embodiment of the present invention, the first sample attribute data located in the same hash bucket as a sample attribute data in the different hash tables is extracted in exactly the same way as in step 103, which is not repeated here.
204. Filter out, from the first sample attribute data, the second sample attribute data similar to each sample attribute data.
For the embodiment of the present invention, to determine the second sample attribute data similar to each sample attribute data, step 204 specifically includes: computing the sample distance between each sample attribute data and each of its first sample attribute data; and determining the first sample attribute data whose sample distance is smaller than a preset distance as the second sample attribute data similar to that sample attribute data. The sample distance may specifically be a Hamming distance. To compute the Hamming distance between a sample attribute data and its first sample attribute data, the method includes: comparing the Hamming code of the sample attribute data with the Hamming code of the first sample attribute data to determine the number of bit positions at which the two codes differ; and taking that number as the Hamming distance between the sample attribute data and the first sample attribute data. The preset distance can be set according to actual business requirements.
For example, the Hamming code corresponding to sample attribute data A is 110011, and the first sample attribute data corresponding to sample attribute data A includes sample attribute data B and sample attribute data C, whose Hamming codes are 111011 and 000001 respectively. The preset Hamming distance is 2. To determine the second sample attribute data similar to sample attribute data A, the Hamming distance between A and B is calculated first: each bit of the Hamming code of A is compared with the bit at the same position in the Hamming code of B, and the number of positions with different bit values is counted. The comparison shows that A and B differ in exactly one bit, so the Hamming distance between them is 1. Because this is smaller than the preset Hamming distance of 2, A and B are determined to be similar. In the same way, the Hamming distance between A and C is found to be 3; because this is greater than the preset Hamming distance of 2, C is determined not to be similar to A. The second sample attribute data similar to A within the first sample attribute data is therefore sample attribute data B. The second sample attribute data similar to each sample attribute data can be determined in this manner.
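The worked example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `hamming_distance` and `filter_similar` are hypothetical, and the Hamming codes are assumed to be available as equal-length bit strings.

```python
def hamming_distance(code_a: str, code_b: str) -> int:
    """Count the bit positions at which two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("Hamming codes must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


def filter_similar(code: str, first_samples: dict, preset_distance: int) -> list:
    """Keep the first sample attribute data whose distance is below the preset distance."""
    return [
        name
        for name, candidate_code in first_samples.items()
        if hamming_distance(code, candidate_code) < preset_distance
    ]


# Values taken from the example: A = 110011, B = 111011, C = 000001, preset distance 2.
code_a = "110011"
first_samples_of_a = {"B": "111011", "C": "000001"}
second_samples_of_a = filter_similar(code_a, first_samples_of_a, preset_distance=2)
print(second_samples_of_a)  # ['B']
```

The comparison is strict (`<`), matching the wording "smaller than the preset distance" in step 204.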
Further, in this embodiment of the present invention, the sample attribute data may be sorted by position in advance; for example, the sorting positions are sample attribute data A, sample attribute data B, sample attribute data C and sample attribute data D. The first sample attribute data corresponding to each sample attribute data is then determined in order of sorting position, and the sample distances are calculated. For instance, the first sample attribute data corresponding to A is determined first and the sample distances between A and that data are calculated; then the first sample attribute data corresponding to B is determined and the sample distances between B and that data are calculated.
In a specific application scenario, to further reduce the amount of sample distance calculation, before the sample distance between sample attribute data and its corresponding first sample attribute data is calculated, it is determined whether the sorting position of the first sample attribute data precedes that of the sample attribute data. If it does, the sample distance between the two has already been calculated and need not be calculated again; if the sorting position of the first sample attribute data follows that of the sample attribute data, the sample distance between them must be calculated.
For example, the training set includes sample attribute data A, sample attribute data B, sample attribute data C and sample attribute data D. The first sample attribute data corresponding to A includes B and C, and the first sample attribute data corresponding to B includes A and D. If the earlier sample distance calculation has already determined that B is second sample attribute data similar to A, then when the second sample attribute data similar to B is determined, there is no need to recalculate the sample distance between B and its first sample attribute data A; the earlier result can be reused directly to determine that B is similar to A. Only the sample distance between B and its first sample attribute data D then needs to be calculated to decide whether the two are similar. This further reduces the amount of sample distance calculation and improves the efficiency of detecting polluted sample attribute data.
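The reuse of earlier distance results can be sketched as follows. This is an illustrative sketch rather than the patented implementation: the function name `find_similar`, the result cache keyed by unordered pairs, and the Hamming code chosen for sample attribute data D are all assumptions made for the example.

```python
def hamming_distance(code_a: str, code_b: str) -> int:
    """Count the bit positions at which two equal-length codes differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


def find_similar(codes: dict, first_samples: dict, preset_distance: int):
    """Process samples in sorting order and reuse distances already calculated."""
    order = {name: position for position, name in enumerate(codes)}  # sorting positions
    cache = {}      # unordered pair -> distance calculated earlier
    computed = 0    # number of distances actually calculated
    similar = {name: [] for name in codes}
    for name in codes:
        for other in first_samples.get(name, []):
            pair = frozenset((name, other))
            if order[other] < order[name] and pair in cache:
                distance = cache[pair]  # earlier position: reuse the stored result
            else:
                distance = hamming_distance(codes[name], codes[other])
                cache[pair] = distance
                computed += 1
            if distance < preset_distance:
                similar[name].append(other)
    return similar, computed


# A, B and C use the codes from the earlier example; the code for D is hypothetical.
codes = {"A": "110011", "B": "111011", "C": "000001", "D": "111111"}
first_samples = {"A": ["B", "C"], "B": ["A", "D"]}
similar, computed = find_similar(codes, first_samples, preset_distance=2)
print(similar)   # {'A': ['B'], 'B': ['A', 'D'], 'C': [], 'D': []}
print(computed)  # 3
```

Four pairs are looked up but only three distances are calculated, because the distance between B and A is taken from the result stored while processing A.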
205. Count the sample data amount corresponding to each platform user, and select the maximum sample data amount from the sample data amounts.
In this embodiment of the present invention, whether each sample attribute data and its corresponding second sample attribute data have been polluted can be judged from the sample data amount corresponding to the second sample attribute data similar to that sample attribute data. Judging the sample data amount requires taking into account how the sample attribute data was actually collected. For example, the collected sample attribute data set includes 1000 pieces of sample attribute data from 200 people, with 5 pieces collected per person. The training set therefore includes 200 platform users, each with a sample data amount of 5, so the maximum sample data amount is 5. As another example, the collected training set includes 200 pieces of sample attribute data from 50 people, where 4 pieces are collected from each of 48 people, only 1 piece from one person, and 7 pieces from another person, so the maximum sample data amount is 7. In this way the maximum sample data amount can be determined according to the actual situation and then used to judge whether sample attribute data is polluted data.
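Counting the samples per platform user and taking the maximum might look like the sketch below. It assumes each piece of sample attribute data carries a user identifier; the function name `max_sample_amount` and the identifiers `user0` to `user49` are hypothetical.

```python
from collections import Counter


def max_sample_amount(user_ids: list) -> int:
    """Return the largest number of samples collected for any single platform user."""
    amounts = Counter(user_ids)  # user id -> sample data amount
    return max(amounts.values())


# Second example from the text: 48 users with 4 samples each, one with 1, one with 7.
user_ids = []
for index in range(48):
    user_ids += [f"user{index}"] * 4
user_ids += ["user48"] * 1
user_ids += ["user49"] * 7

print(len(user_ids))                # 200
print(max_sample_amount(user_ids))  # 7
```

An empty training set would make `max` raise `ValueError`, so a caller would need to handle that case separately.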
206. Judge, according to the maximum sample data amount and the sample data amount corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data.
Polluted sample data is sample attribute data polluted by an attacker through copying, slight modification, compounding or similar means. In this embodiment of the present invention, in order to judge whether each sample attribute data is polluted data, step 206 specifically includes: subtracting the maximum sample data amount from the sample data amount corresponding to the second sample attribute data to obtain the sample quantity difference corresponding to each sample attribute data; and, if the sample quantity difference corresponding to target sample attribute data among the sample attribute data is greater than a preset sample quantity difference, judging that the target sample attribute data and its corresponding second sample attribute data are polluted sample data. The target sample attribute data is any sample attribute data in the training set, and the preset sample quantity difference may be set according to actual business requirements.
For example, the maximum sample data amount is determined to be 100, the preset sample quantity difference is 50, the sample data amount corresponding to the second sample attribute data similar to sample attribute data A is 200, and the sample data amount corresponding to the second sample attribute data similar to sample attribute data B is 110. The sample quantity difference corresponding to A is therefore 200-100=100. Because this is greater than the preset sample quantity difference of 50, it can be determined that A and its corresponding second sample attribute data are polluted; that is, an attacker may have maliciously created polluted sample attribute data similar to A through copying, transformation or similar means and added it to the training set, so that the sample data amount of 200 for the second sample attribute data similar to A far exceeds the normal sample data amount of 100. Sample attribute data A and all of its corresponding second sample attribute data therefore need to be excluded from the training set to keep the training set secure. Likewise, the sample quantity difference corresponding to B is 110-100=10. Because this sample quantity difference is smaller than the preset sample quantity difference of 50, the sample data amount of the second sample attribute data similar to B is within the normal range; that is, B and its corresponding second sample attribute data have not been polluted. In this way, each sample attribute data in the training set of the abnormal behavior detection model can be judged in turn.
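The judgment in step 206 reduces to one subtraction and one comparison. The sketch below uses the numbers from the example; the function name `is_polluted` and the dictionary layout are assumptions for illustration.

```python
def is_polluted(similar_amount: int, max_amount: int, preset_difference: int) -> bool:
    """Judge pollution from the sample quantity difference (similar amount minus maximum)."""
    sample_quantity_difference = similar_amount - max_amount
    return sample_quantity_difference > preset_difference


max_amount = 100          # maximum sample data amount
preset_difference = 50    # preset sample quantity difference
similar_amounts = {"A": 200, "B": 110}  # amount of similar second sample attribute data

verdicts = {
    name: is_polluted(amount, max_amount, preset_difference)
    for name, amount in similar_amounts.items()
}
print(verdicts)  # {'A': True, 'B': False}

# Samples judged polluted would be excluded from the training set together with
# their similar second sample attribute data.
```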
Compared with the current approach of detecting polluted sample data through sample duplicate checking, the other method for detecting polluted sample data for model training provided by this embodiment of the present invention acquires sample attribute data corresponding to each platform user to be detected, the sample attribute data including at least device attribute data, risk control data and business data corresponding to each platform user; hashes each sample attribute data into the corresponding hash buckets of different hash tables by using a preset locality-sensitive hashing algorithm, where every hash table includes a plurality of hash buckets; determines the data located in the same hash bucket as each sample attribute data in the different hash tables as first sample attribute data; filters out, from the first sample attribute data, second sample attribute data similar to each sample attribute data; and finally judges, based on the sample data amount corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data. This improves the detection accuracy for polluted sample data, ensures the security of the sample attribute data, and in turn improves the detection accuracy of the abnormal behavior detection model.
Further, as a specific implementation of FIG. 1, an embodiment of the present invention provides an apparatus for detecting polluted sample data for model training. As shown in FIG. 4, the apparatus includes: an acquiring unit 31, a hash unit 32, a determining unit 33, a screening unit 34 and a judging unit 35.
The acquiring unit 31 may be configured to acquire sample attribute data corresponding to each platform user to be detected, the sample attribute data including at least device attribute data, risk control data and business data corresponding to each platform user.
The hash unit 32 may be configured to hash each sample attribute data into the corresponding hash buckets of different hash tables by using a preset locality-sensitive hashing algorithm, where every hash table includes a plurality of hash buckets.
The determining unit 33 may be configured to determine the data located in the same hash bucket as each sample attribute data in the different hash tables as first sample attribute data.
The screening unit 34 may be configured to filter out, from the first sample attribute data, second sample attribute data similar to each sample attribute data.
The judging unit 35 may be configured to judge, based on the sample data amount corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data.
In a specific application scenario, in order to hash each sample attribute data into the corresponding hash buckets of different hash tables, as shown in FIG. 5, the hash unit 32 includes: a first calculation module 321 and a hash module 322.
The first calculation module 321 may be configured to calculate, by using a preset locality-sensitive hashing algorithm, the hash value of each sample attribute data in the different hash tables.
The hash module 322 may be configured to hash, based on the hash values, each sample attribute data into the corresponding hash buckets of the different hash tables.
Further, in order to calculate the hash value of each sample attribute data in the different hash tables, the first calculation module 321 includes: a determining submodule and an extraction submodule.
The determining submodule may be configured to determine the data dimension and coordinate values corresponding to each sample attribute data.
The determining submodule may be further configured to determine, based on the data dimension and the coordinate values, the Hamming code corresponding to each sample attribute data.
The extraction submodule may be configured to extract, by using the hash functions corresponding to the different hash tables, the code at the corresponding positions of the Hamming code, and to determine the extracted code as the hash value of the sample attribute data in the different hash tables.
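The extraction of bits at fixed positions, the bucketing, and the lookup of bucket neighbours can be sketched as below. This is an illustrative sketch only: the Hamming codes are assumed to be already computed, the bit positions assigned to each hash table are hypothetical, and the names `lsh_hash`, `build_tables` and `first_samples_for` are not taken from the source.

```python
from collections import defaultdict


def lsh_hash(code: str, positions: tuple) -> str:
    """Hash value for one table: the bits of the Hamming code at the given positions."""
    return "".join(code[position] for position in positions)


def build_tables(codes: dict, table_positions: list) -> list:
    """Place every sample into one bucket per hash table, keyed by its hash value."""
    tables = []
    for positions in table_positions:
        buckets = defaultdict(list)
        for name, code in codes.items():
            buckets[lsh_hash(code, positions)].append(name)
        tables.append(buckets)
    return tables


def first_samples_for(name: str, codes: dict, tables: list, table_positions: list) -> set:
    """Collect the samples that share a bucket with `name` in any hash table."""
    found = set()
    for positions, buckets in zip(table_positions, tables):
        found.update(buckets[lsh_hash(codes[name], positions)])
    found.discard(name)  # a sample is not its own candidate
    return found


codes = {"A": "110011", "B": "111011", "C": "000001"}
table_positions = [(0, 1, 4), (2, 3, 5)]  # hypothetical 0-based bit positions per table
tables = build_tables(codes, table_positions)
print(sorted(first_samples_for("A", codes, tables, table_positions)))  # ['B', 'C']
```

With these assumed positions, A shares a bucket with B in the first table and with C in the second, which matches the earlier example where the first sample attribute data of A consists of B and C.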
Further, in order to filter out, from the first sample attribute data, second sample attribute data similar to each sample attribute data, the screening unit 34 includes: a second calculation module 341 and a determining module 342.
The second calculation module 341 may be configured to separately calculate the sample distance between each sample attribute data and its corresponding first sample attribute data.
The determining module 342 may be configured to determine the first sample attribute data whose sample distance is smaller than a preset distance as the second sample attribute data similar to each sample attribute data.
In a specific application scenario, the sample distance is a Hamming distance, and the second calculation module 341 includes: a comparison submodule and a determining submodule.
The comparison submodule may be configured to compare the Hamming code corresponding to each sample attribute data with the Hamming code corresponding to the first sample attribute data, and to determine the number of bit positions at which the two codes differ.
The determining submodule may be configured to determine that number as the Hamming distance between each sample attribute data and its corresponding first sample attribute data.
In a specific application scenario, in order to judge whether each sample attribute data is polluted sample data, the judging unit 35 includes: a statistics module 351 and a judging module 352.
The statistics module 351 may be configured to count the sample data amount corresponding to each platform user, and to select the maximum sample data amount from the sample data amounts.
The judging module 352 may be configured to judge, according to the maximum sample data amount and the sample data amount corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data.
Further, the judging module 352 includes: a subtraction submodule and a judging submodule.
The subtraction submodule may be configured to subtract the maximum sample data amount from the sample data amount corresponding to the second sample attribute data to obtain the sample quantity difference corresponding to each sample attribute data.
The judging submodule may be configured to judge, if the sample quantity difference corresponding to target sample attribute data among the sample attribute data is greater than a preset sample quantity difference, that the target sample attribute data and its corresponding second sample attribute data are polluted sample data.
It should be noted that, for other descriptions of the functional modules of the apparatus for detecting polluted sample data for model training provided by this embodiment of the present invention, reference may be made to the corresponding description of the method shown in FIG. 1; the details are not repeated here.
Based on the method shown in FIG. 1, an embodiment of the present invention correspondingly further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps: acquiring sample attribute data corresponding to each platform user to be detected, the sample attribute data including at least device attribute data, risk control data and business data corresponding to each platform user; hashing each sample attribute data into the corresponding hash buckets of different hash tables by using a preset locality-sensitive hashing algorithm, where every hash table includes a plurality of hash buckets; determining the data located in the same hash bucket as each sample attribute data in the different hash tables as first sample attribute data; filtering out, from the first sample attribute data, second sample attribute data similar to each sample attribute data; and judging, based on the sample data amount corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data.
Based on the embodiments of the method shown in FIG. 1 and the apparatus shown in FIG. 4, an embodiment of the present invention further provides a physical structure diagram of a computer device. As shown in FIG. 6, the computer device includes: a processor 41, a memory 42, and a computer program stored in the memory 42 and executable on the processor, where the memory 42 and the processor 41 are both arranged on a bus 43. When executing the program, the processor 41 implements the following steps: acquiring sample attribute data corresponding to each platform user to be detected, the sample attribute data including at least device attribute data, risk control data and business data corresponding to each platform user; hashing each sample attribute data into the corresponding hash buckets of different hash tables by using a preset locality-sensitive hashing algorithm, where every hash table includes a plurality of hash buckets; determining the data located in the same hash bucket as each sample attribute data in the different hash tables as first sample attribute data; filtering out, from the first sample attribute data, second sample attribute data similar to each sample attribute data; and judging, based on the sample data amount corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data.
With the technical solution of the present invention, sample attribute data corresponding to each platform user to be detected is acquired, the sample attribute data including at least device attribute data, risk control data and business data corresponding to each platform user; each sample attribute data is hashed into the corresponding hash buckets of different hash tables by using a preset locality-sensitive hashing algorithm, where every hash table includes a plurality of hash buckets; the data located in the same hash bucket as each sample attribute data in the different hash tables is determined as first sample attribute data; second sample attribute data similar to each sample attribute data is filtered out from the first sample attribute data; and finally, based on the sample data amount corresponding to the second sample attribute data, it is judged whether each sample attribute data is polluted sample data. This improves the detection accuracy for polluted sample data, ensures the security of the sample attribute data, and in turn improves the detection accuracy of the abnormal behavior detection model.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented by a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described can be performed in an order different from the one given here, or the modules or steps can be fabricated as individual integrated circuit modules, or several of them can be fabricated as a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.
The above descriptions are merely preferred embodiments of the present invention and are not intended to limit it. Those skilled in the art may make various modifications and changes to the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

  1. A method for detecting polluted sample data for model training, characterized by comprising:
    acquiring sample attribute data corresponding to each platform user to be detected, the sample attribute data including at least device attribute data, risk control data and business data corresponding to each platform user;
    hashing each sample attribute data into the corresponding hash buckets of different hash tables by using a preset locality-sensitive hashing algorithm, wherein every hash table includes a plurality of hash buckets;
    determining the data located in the same hash bucket as each sample attribute data in the different hash tables as first sample attribute data;
    filtering out, from the first sample attribute data, second sample attribute data similar to each sample attribute data;
    judging, based on the sample data amount corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data.
  2. The method according to claim 1, characterized in that the hashing each sample attribute data into the corresponding hash buckets of different hash tables by using a preset locality-sensitive hashing algorithm comprises:
    calculating, by using the preset locality-sensitive hashing algorithm, the hash value of each sample attribute data in the different hash tables;
    hashing, based on the hash values, each sample attribute data into the corresponding hash buckets of the different hash tables.
  3. The method according to claim 2, characterized in that the calculating, by using the preset locality-sensitive hashing algorithm, the hash value of each sample attribute data in the different hash tables comprises:
    determining the data dimension and coordinate values corresponding to each sample attribute data;
    determining, based on the data dimension and the coordinate values, the Hamming code corresponding to each sample attribute data;
    extracting, by using the hash functions corresponding to the different hash tables, the code at the corresponding positions of the Hamming code, and determining the extracted code as the hash value of the sample attribute data in the different hash tables.
  4. The method according to claim 3, characterized in that the filtering out, from the first sample attribute data, second sample attribute data similar to each sample attribute data comprises:
    separately calculating the sample distance between each sample attribute data and its corresponding first sample attribute data;
    determining the first sample attribute data whose sample distance is smaller than a preset distance as the second sample attribute data similar to each sample attribute data.
  5. The method according to claim 4, characterized in that the sample distance is a Hamming distance, and the separately calculating the sample distance between each sample attribute data and its corresponding first sample attribute data comprises:
    comparing the Hamming code corresponding to each sample attribute data with the Hamming code corresponding to the first sample attribute data, and determining the number of bit positions at which the two codes differ;
    determining that number as the Hamming distance between each sample attribute data and its corresponding first sample attribute data.
  6. The method according to any one of claims 1 to 5, wherein determining, based on the sample data volume corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data comprises:
    counting the sample data volume corresponding to each platform user, and selecting the maximum sample data volume from the counted sample data volumes;
    determining, according to the maximum sample data volume and the sample data volume corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data.
  7. The method according to claim 6, wherein determining, according to the maximum sample data volume and the sample data volume corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data comprises:
    computing the difference between the sample data volume corresponding to the second sample attribute data and the maximum sample data volume, to obtain the sample count difference corresponding to each sample attribute data;
    if the sample count difference corresponding to target sample attribute data among the respective sample attribute data is greater than a preset sample count difference, determining that the target sample attribute data and its corresponding second sample attribute data are polluted sample data.
  8. An apparatus for detecting polluted sample data for model training, comprising:
    an acquisition unit, configured to acquire sample attribute data corresponding to each platform user to be detected, the sample attribute data comprising at least device attribute data, risk control data, and business data corresponding to each platform user;
    a hashing unit, configured to hash, by using a preset locality-sensitive hashing algorithm, each sample attribute data into the corresponding hash buckets in different hash tables, wherein every hash table comprises a plurality of hash buckets;
    a determining unit, configured to determine the data located in the same hash bucket as the respective sample attribute data in the different hash tables as first sample attribute data;
    a screening unit, configured to screen out, from the first sample attribute data, second sample attribute data similar to the respective sample attribute data;
    a judging unit, configured to determine, based on the sample data volume corresponding to the second sample attribute data, whether each sample attribute data is polluted sample data.
  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
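
The steps recited in claims 3 to 7 can be illustrated with a minimal Python sketch. It is not taken from the application and is only one possible reading. The unary encoding used to build the Hamming code, the direction of the subtraction in claim 7 (similar-sample count minus the per-user maximum), every function and parameter name (hamming_code, bucket_key, detect_polluted, max_value, max_dist, max_diff), and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def hamming_code(point, max_value):
    """Assumed encoding: each coordinate v becomes v ones then (max_value - v) zeros."""
    bits = []
    for v in point:
        bits.extend([1] * v + [0] * (max_value - v))
    return tuple(bits)

def bucket_key(code, positions):
    """Hash value in one table: the bits of the Hamming code at that table's positions."""
    return tuple(code[i] for i in positions)

def hamming_distance(a, b):
    """Number of bit positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))

def detect_polluted(samples, user_ids, tables, max_value, max_dist, max_diff):
    codes = [hamming_code(p, max_value) for p in samples]
    # Largest number of samples contributed by any single platform user (claim 6).
    max_per_user = max(Counter(user_ids).values())

    # One bucket map per hash table; each table samples different bit positions.
    buckets = [defaultdict(list) for _ in tables]
    for idx, code in enumerate(codes):
        for t, positions in enumerate(tables):
            buckets[t][bucket_key(code, positions)].append(idx)

    polluted = set()
    for idx, code in enumerate(codes):
        # "First" data: everything sharing a bucket with this sample in any table.
        candidates = set()
        for t, positions in enumerate(tables):
            candidates.update(buckets[t][bucket_key(code, positions)])
        candidates.discard(idx)
        # "Second" data: candidates within the preset Hamming distance (claims 4, 5).
        similar = {j for j in candidates
                   if hamming_distance(code, codes[j]) < max_dist}
        # Assumed direction of the difference in claim 7.
        if len(similar) - max_per_user > max_diff:
            polluted.add(idx)
            polluted.update(similar)
    return polluted

# Toy data: five near-identical samples from five different users plus two outliers.
samples = [(3, 3), (3, 3), (3, 4), (3, 3), (3, 3), (0, 7), (7, 0)]
user_ids = ["u1", "u2", "u3", "u4", "u5", "u6", "u7"]
tables = [(0, 2, 9), (1, 8, 12)]  # bit positions sampled by each hash table
result = detect_polluted(samples, user_ids, tables,
                         max_value=7, max_dist=2, max_diff=2)
print(sorted(result))  # [0, 1, 2, 3, 4]
```

In this toy run the five clustered samples fall into the same buckets, each has four neighbours within Hamming distance 2, and no user contributes more than one sample, so the difference of 3 exceeds the assumed threshold of 2 and the whole cluster is flagged. Real thresholds and bit positions would have to be chosen for the data at hand.
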
PCT/CN2021/124044 2021-09-07 2021-10-15 Polluted sample data detecting method and apparatus for model training WO2023035362A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111041760.3A CN113495886A (en) 2021-09-07 2021-09-07 Method and device for detecting pollution sample data for model training
CN202111041760.3 2021-09-07

Publications (1)

Publication Number Publication Date
WO2023035362A1 (en) 2023-03-16

Family

ID=77996132

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/124044 WO2023035362A1 (en) 2021-09-07 2021-10-15 Polluted sample data detecting method and apparatus for model training

Country Status (2)

Country Link
CN (1) CN113495886A (en)
WO (1) WO2023035362A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113495886A (en) * 2021-09-07 2021-10-12 上海观安信息技术股份有限公司 Method and device for detecting pollution sample data for model training

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160072833A1 (en) * 2014-09-04 2016-03-10 Electronics And Telecommunications Research Institute Apparatus and method for searching for similar malicious code based on malicious code feature information
CN106649715A (en) * 2016-12-21 2017-05-10 中国人民解放军国防科学技术大学 Cross-media retrieval method based on local sensitive hash algorithm and neural network
CN107358075A (en) * 2017-07-07 2017-11-17 四川大学 A kind of fictitious users detection method based on hierarchical clustering
CN110610084A (en) * 2018-06-15 2019-12-24 武汉安天信息技术有限责任公司 Dex file-based sample maliciousness determination method and related device
US10778707B1 (en) * 2016-05-12 2020-09-15 Amazon Technologies, Inc. Outlier detection for streaming data using locality sensitive hashing
CN112733140A (en) * 2020-12-28 2021-04-30 上海观安信息技术股份有限公司 Detection method and system for model tilt attack
CN112989334A (en) * 2019-12-12 2021-06-18 华为技术有限公司 Data detection method for machine learning and related equipment
CN113495886A (en) * 2021-09-07 2021-10-12 上海观安信息技术股份有限公司 Method and device for detecting pollution sample data for model training

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866287B (en) * 2019-10-31 2021-12-17 大连理工大学 Point attack method for generating countercheck sample based on weight spectrum

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116662853A (en) * 2023-05-29 2023-08-29 新禾数字科技(无锡)有限公司 Method and system for automatically identifying analysis result of pollution source
CN116662853B (en) * 2023-05-29 2024-04-30 新禾数字科技(无锡)有限公司 Method and system for automatically identifying analysis result of pollution source

Also Published As

Publication number Publication date
CN113495886A (en) 2021-10-12

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 21956535; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: PCT application non-entry in European phase. Ref document number: 21956535; Country of ref document: EP; Kind code of ref document: A1