WO2022033396A1 - Method and device for training a credit threshold, and method and device for detecting an IP address - Google Patents

Method and device for training a credit threshold, and method and device for detecting an IP address

Info

Publication number
WO2022033396A1
WO2022033396A1 PCT/CN2021/111096 CN2021111096W WO2022033396A1 WO 2022033396 A1 WO2022033396 A1 WO 2022033396A1 CN 2021111096 W CN2021111096 W CN 2021111096W WO 2022033396 A1 WO2022033396 A1 WO 2022033396A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
feature
addresses
value
credit
Prior art date
Application number
PCT/CN2021/111096
Other languages
English (en)
French (fr)
Inventor
王相
钟清华
Original Assignee
百果园技术(新加坡)有限公司
王相
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百果园技术(新加坡)有限公司, 王相
Priority to US18/041,275 (US20230328087A1)
Priority to EP21855449.1A (EP4199421A1)
Publication of WO2022033396A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1466 Active attacks involving interception, injection, modification, spoofing of data unit addresses, e.g. hijacking, packet injection or TCP sequence number attacks

Definitions

  • The present application relates to the technical field of operation monitoring, for example, to a method and device for training a credit threshold, and a method and device for detecting an Internet Protocol (IP) address.
  • When users register, log in, and so on, a website usually sends verification codes by text message, email, or similar channels, so that users can register and log in by entering the verification codes.
  • Websites generally monitor the user's IP address. There are two main monitoring methods:
  • This application proposes a method and device for training a credit threshold, and a method and device for detecting an IP address, to solve the problem that monitoring the behavior of an IP address through a single threshold, such as frequency or verification rate, is prone to false interception and to circumvention of the controls.
  • This application provides a method for training a credit threshold, including:
  • counting various service features from historical data of business operations triggered by multiple IP addresses;
  • hierarchically calculating at least two correlation coefficients for each service feature, wherein each correlation coefficient represents the correlation between that service feature and the legitimacy of an IP address;
  • for each IP address, generating a credit value representing the legitimacy of that IP address according to the correlation coefficients corresponding to its service features;
  • generating an evaluation index for the multiple IP addresses, wherein the evaluation index is an index for evaluating the prediction of the legitimacy of the multiple IP addresses from their service features;
  • when the evaluation index meets a target condition, determining the credit value corresponding to the evaluation index as the credit threshold, wherein the credit threshold is used to classify the legitimacy of an IP address.
  • The application also provides a method for detecting an IP address, including:
  • counting various service features from real-time data of business operations triggered by an IP address;
  • querying the correlation coefficient corresponding to each service feature, wherein the correlation coefficient represents the correlation between that service feature and the legitimacy of the IP address;
  • generating a credit value representing the legitimacy of the IP address according to the correlation coefficient corresponding to each service feature;
  • comparing the credit value with a preset credit threshold to determine the legitimacy of the IP address, so that the legitimacy of the IP address is predicted from its service features.
  • The application also provides a credit threshold training device, including:
  • a historical business feature statistics module, configured to count various business features from historical data of business operations triggered by multiple IP addresses;
  • a correlation coefficient calculation module, configured to hierarchically calculate at least two correlation coefficients for each type of service feature, wherein each correlation coefficient represents the correlation between that type of service feature and the legitimacy of an IP address;
  • a credit value calculation module, configured to generate, for each IP address, a credit value representing the legitimacy of that IP address according to the correlation coefficients corresponding to its service features;
  • an evaluation index generation module, configured to generate an evaluation index for the multiple IP addresses, wherein the evaluation index is an index for evaluating the prediction of the legitimacy of the multiple IP addresses from their service features;
  • a credit threshold determination module, configured to determine, when the evaluation index meets a target condition, the credit value corresponding to the evaluation index as the credit threshold, wherein the credit threshold is used to classify the legitimacy of an IP address.
  • The application also provides an IP address detection device, including:
  • a real-time business feature statistics module, configured to count various business features from real-time data of business operations triggered by an IP address;
  • a correlation coefficient query module, configured to query the correlation coefficient corresponding to each service feature, wherein the correlation coefficient represents the correlation between that service feature and the legitimacy of the IP address;
  • a credit value generating module, configured to generate a credit value representing the legitimacy of the IP address according to the correlation coefficient corresponding to each service feature;
  • a legitimacy determination module, configured to compare the credit value with a preset credit threshold to determine the legitimacy of the IP address, so that the legitimacy of the IP address is predicted from its service features.
  • The application also provides a computer device, including:
  • at least one processor;
  • a memory configured to store at least one program;
  • when the at least one program is executed by the at least one processor, the at least one processor implements the above method for training a credit threshold or the above method for detecting an IP address.
  • The present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above method for training a credit threshold or the above method for detecting an IP address.
  • FIG. 1 is a flowchart of a method for training a credit threshold provided in Embodiment 1 of the present application;
  • FIG. 2 is a flowchart of a method for detecting an IP address provided in Embodiment 2 of the present application;
  • FIG. 3 is a schematic diagram of a business operation provided in Embodiment 2 of the present application.
  • FIG. 4 is a schematic structural diagram of a credit threshold training device provided in Embodiment 3 of the present application.
  • FIG. 5 is a schematic structural diagram of a device for detecting an IP address according to Embodiment 4 of the present application.
  • FIG. 6 is a schematic structural diagram of a computer device according to Embodiment 5 of the present application.
  • The anomalies reflected by IP addresses also differ.
  • The anomalies reflected by an IP address may include the following:
  • The attacker targets a phone number, cyclically calls the interfaces used for registration, login, and similar behaviors on different websites, and frequently sends SMS messages carrying verification codes to that phone number, thereby attacking the phone number.
  • The attacker targets a website, constantly changes interface parameters such as the phone number and IP address, and cyclically calls the website's interfaces used for registration, login, and similar behaviors to send SMS verification codes, which greatly increases the cost the website pays for sending SMS messages, thereby attacking the website.
  • The attacker constantly changes interface parameters such as the phone number and IP address, and cyclically calls the website's interfaces used for registration, login, and similar behaviors to frequently send SMS messages carrying verification codes to different phone numbers, in order to obtain electronic coupons, physical gifts, and other rewards for newly registered users, or to buy large quantities of products at low prices, and so on, for profit.
  • For websites, it is inefficient to classify IP addresses as abnormal or normal solely according to the relationship between some behavioral data (such as frequency or verification rate) and thresholds.
  • The following behaviors of an IP address requesting SMS messages are taken as examples to illustrate how different behaviors affect the determination of whether the IP address is abnormal.
  • The dimensions considered are: the number of times the IP address requests SMS messages (number of SMS requests), the number of times the verification code in an SMS message is verified (number of verifications), the success rate of verifying the verification code (verification success rate), the number of accounts logged in at the IP address (number of accounts), the total number of phone numbers that receive verification codes (total number of phone numbers), and the number of phone numbers that receive verification codes across countries or regions (number of cross-country or cross-region phone numbers).
  • The first 3 rows of each table are historical behavior data from before the previous day, and the 4th row is the previous day's behavior data.
  • In one case, the number of SMS requests by the IP address is increasing. If monitoring relies on a threshold on the number of requests and the threshold is low, the IP address may be intercepted by mistake. Viewed along other dimensions, however, the verification success rate and the number of cross-country or cross-region phone numbers are unchanged. The increase in requests comes from an increase in the number of phone numbers, which brings growth to the business and is normal behavior.
  • In another case, the number of SMS requests by the IP address is increasing and, along other dimensions, the verification success rate and the number of accounts have changed greatly; that is, the number of SMS requests has increased but the number of verifications has not. When requests from an IP address surge and a large number of them are invalid, the IP address may be attacking the website. If monitoring relies on a threshold on the number of requests and the threshold is high, the IP address may be missed.
  • In a third case, the number of SMS requests by the IP address is increasing and, along other dimensions, the verification success rate and the number of cross-country or cross-region phone numbers have changed greatly; that is, across a large number of requests the verification success rate drops, and requests suddenly appear for a large number of phone numbers across countries or regions. The IP address may be in use as a proxy to attack the website. If monitoring relies on a threshold on the number of requests and the threshold is high, the IP address may be missed.
  • FIG. 1 is a flowchart of a method for training a credit threshold provided in Embodiment 1 of the present application. This embodiment is applicable to learning from historical data how strongly different features influence anomaly detection, and thereby constructing an effective credit scoring mechanism.
  • The method can be performed by a credit threshold training device, which can be implemented in software and/or hardware and can be configured in a computer device such as a server, a workstation, or a personal computer (Personal Computer, PC). The method includes the following steps:
  • Step 101 Count various service features from historical data of service operations triggered based on multiple IP addresses.
  • The client requests services such as registration, login, password recovery, and payment from the server, thereby triggering corresponding business operations, such as sending a short message containing a verification code to a designated phone number or sending an email containing a verification link to a specified email address. The client then uses the verification code, verification link, or other information for verification.
  • The IP address of each client is recorded on the server side, and the data generated when each client performs business operations is recorded as historical data.
  • The computer device acquires the historical data and labels each IP address as a positive or negative sample according to its legitimacy, where positive samples are abnormal IP addresses and negative samples are normal IP addresses.
  • The historical data of an abnormal IP address generally consists entirely of abnormal behavior, with no historical data of normal IP addresses mixed in, to avoid interference during training.
  • Preprocessing such as data cleaning, missing value processing, and outlier processing can be performed, so as to convert the historical data into formatted data that can be used for training.
  • Data cleaning can be used to remove junk samples, for example, fake IP addresses that use emulators or Virtual Private Network (VPN) proxies.
  • Missing value processing can refer to finding and filtering out historical data whose IP address is empty, so that this historical data does not participate in training.
  • Outlier processing can refer to finding and filtering out anomalous IP addresses, for example, fake IP addresses that use emulators or VPN proxies, so that their historical data does not participate in training.
  • The ratio between the number of positive samples and the number of negative samples should be within a preset range; the two sample sizes must not differ too much. The proportions can therefore be balanced by downsampling the negative samples.
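  • As an illustration of the balancing step, the following Python sketch randomly downsamples the negative (normal) samples until their count is at most a preset multiple of the positive count. The function name, the `(ip, label)` sample layout, and the `max_ratio` bound are assumptions made for illustration; the application does not specify a particular range or sampling procedure.

```python
import random


def downsample_negatives(samples, max_ratio=3.0, seed=0):
    """Keep all positives; randomly keep at most max_ratio * (#positives) negatives.

    `samples` is a list of (ip, label) pairs, with label 1 = abnormal (positive)
    and label 0 = normal (negative). `max_ratio` is a hypothetical preset bound.
    """
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    limit = int(max_ratio * len(positives))
    if len(negatives) > limit:
        rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
        negatives = rng.sample(negatives, limit)
    return positives + negatives
```

  • A fixed seed is used only so that repeated runs select the same subset; a production pipeline might prefer stratified or time-aware sampling.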
  • For the historical data of business operations, the IP address can be used as the statistical dimension to count various business features.
  • The business operation triggered based on the multiple IP addresses is a registration operation, wherein the registration operation includes sending a short message containing a verification code and verifying the verification code.
  • The service features can then include: the number of SMS requests from the IP address, the number of times the verification code is verified, the success rate of verifying the verification code, the number of accounts logged in at the IP address, the total number of phone numbers that receive verification codes, and the number of phone numbers that receive verification codes across countries or regions.
  • The client can verify the same verification code multiple times in one verification operation, and the number of verifications can refer to the cumulative number of verifications.
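  • The per-IP statistics described in Step 101 could be gathered roughly as in the Python sketch below. The event record layout (`ip`, `kind`, `phone`, `phone_region`, `ip_region`, `success`, `account`) is hypothetical, and treating a phone number as "cross-region" when its region differs from the region of the IP address is an assumption, since the text does not define that feature precisely.

```python
from collections import defaultdict


def count_service_features(events):
    """Aggregate per-IP service features from registration event records.

    Each event is a dict with hypothetical keys: "ip", "kind" ("sms_request",
    "verify" or "login"), plus the kind-specific fields used below.
    """
    raw = defaultdict(lambda: {"requests": 0, "verifications": 0, "successes": 0,
                               "accounts": set(), "phones": set(),
                               "cross_region_phones": set()})
    for e in events:
        r = raw[e["ip"]]
        if e["kind"] == "sms_request":
            r["requests"] += 1
            r["phones"].add(e["phone"])
            # Assumption: cross-region means phone region != IP region.
            if e["phone_region"] != e["ip_region"]:
                r["cross_region_phones"].add(e["phone"])
        elif e["kind"] == "verify":
            r["verifications"] += 1  # cumulative: repeated attempts all count
            r["successes"] += int(e["success"])
        elif e["kind"] == "login":
            r["accounts"].add(e["account"])

    features = {}
    for ip, r in raw.items():
        rate = r["successes"] / r["verifications"] if r["verifications"] else 0.0
        features[ip] = {
            "sms_requests": r["requests"],
            "verifications": r["verifications"],
            "verify_success_rate": rate,
            "accounts": len(r["accounts"]),
            "phones": len(r["phones"]),
            "cross_region_phones": len(r["cross_region_phones"]),
        }
    return features
```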
  • Step 102 Calculate at least two correlation coefficients for each type of service feature hierarchically.
  • Different business features differ in how important they are for describing the legitimacy of IP addresses. For example, for registration operations, the verification success rate and the number of phone numbers that receive verification codes across countries or regions are strongly correlated business features, that is, they have a greater impact on legitimacy, whereas the number of SMS requests and the number of accounts logged in at the IP address are weakly correlated business features, that is, they have a smaller impact on legitimacy.
  • For each type of service feature, the correlation coefficient corresponding to each of its value ranges can be learned adaptively, and the correlation coefficient is used to represent the correlation between this type of service feature and the legitimacy of IP addresses.
  • The correlation coefficient can be calculated using the weight of evidence (WOE) method.
  • WOE is a way of encoding a continuous variable after it has been converted into a discrete variable, and it can be used to describe the degree of influence of different business features on legitimacy.
  • In this embodiment, step 102 may include the following steps:
  • Step 1021 Set multiple feature ranges for each service feature.
  • The business feature is a continuous variable. Each business feature is divided into multiple contiguous feature ranges within its numerical range, so that the business feature is converted into a discrete variable.
  • Step 1022 Divide the plurality of IP addresses into feature subsets corresponding to different feature ranges according to the value of each service feature.
  • For each IP address, the value of its service feature can be compared with the corresponding feature ranges one by one. If the value of the service feature falls within a feature range, the IP address is placed in the feature subset corresponding to that feature range.
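  • Steps 1021 and 1022 can be sketched in Python as follows. The boundary list `edges` and the half-open range convention are assumptions for illustration; the application does not say how range boundaries are chosen or which side is inclusive.

```python
from bisect import bisect_right
from collections import defaultdict


def range_index(value, edges):
    """Return the index of the feature range that contains `value`.

    `edges` is a sorted list of boundaries, giving len(edges) + 1 ranges:
    (-inf, e0), [e0, e1), ..., [e_last, +inf).
    """
    return bisect_right(edges, value)


def split_into_subsets(values_by_ip, edges):
    """Group IP addresses into feature subsets keyed by feature-range index."""
    subsets = defaultdict(list)
    for ip, value in values_by_ip.items():
        subsets[range_index(value, edges)].append(ip)
    return dict(subsets)
```

  • For example, with hypothetical boundaries `[10, 50, 100]`, a value of 10 falls into range 1 (that is, `[10, 50)`), and a value of 500 falls into range 3.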
  • Step 1023 Calculate the evidence weight of each feature range according to the IP addresses in the feature subset corresponding to each feature range, and use the evidence weight as the correlation coefficient of the service feature within each feature range.
  • For each feature range of a service feature, the IP addresses in the corresponding feature subset can be used to calculate the evidence weight for that feature range, and the evidence weight is taken as the correlation coefficient of the service feature within that feature range.
  • Each IP address is marked with a first state indicating its legitimacy. The first state is the true state of the IP address and is either normal or abnormal.
  • For a feature range of a service feature, the ratio between the number of IP addresses whose first state is abnormal in the feature subset corresponding to the feature range and the number of all IP addresses whose first state is abnormal for this service feature is taken as the first ratio. The ratio between the number of IP addresses whose first state is normal in the feature subset corresponding to the feature range and the number of all IP addresses whose first state is normal for this service feature is taken as the second ratio.
  • The logarithm of the ratio between the first ratio and the second ratio is taken as the weight of evidence of the service feature within the feature range.
  • The evidence weight is expressed as follows:
  • $$WOE_i = \ln\left(\frac{y_i / y_a}{n_i / n_a}\right)$$
  • where $WOE_i$ represents the evidence weight corresponding to the i-th feature range; $y_i$ represents the number of IP addresses whose first state is abnormal in the feature subset corresponding to the i-th feature range; $y_a$ represents the number of all IP addresses whose first state is abnormal for this service feature, so $y_i / y_a$ is the first ratio; $n_i$ represents the number of IP addresses whose first state is normal in the feature subset corresponding to the i-th feature range; $n_a$ represents the number of all IP addresses whose first state is normal for this service feature, so $n_i / n_a$ is the second ratio; and $\ln$ is the logarithm with the natural number e as the base.
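  • The evidence weight of Step 1023 (the natural logarithm of the first ratio divided by the second ratio) can be computed as in the Python sketch below. The `smoothing` term is an assumption added here so that a feature range with no abnormal or no normal IP addresses does not cause a logarithm of zero; it is not part of the application, and setting it to 0 gives the plain formula.

```python
import math


def weight_of_evidence(counts, smoothing=0.5):
    """Return the WOE of each feature range.

    `counts` is a list of (abnormal_count, normal_count) pairs, one per
    feature range. `smoothing` is added to every count (an assumption, not
    from the source); with smoothing=0 an empty count raises ValueError
    or ZeroDivisionError.
    """
    k = len(counts)
    y_a = sum(y for y, _ in counts) + smoothing * k  # all abnormal IP addresses
    n_a = sum(n for _, n in counts) + smoothing * k  # all normal IP addresses
    woe = []
    for y_i, n_i in counts:
        first_ratio = (y_i + smoothing) / y_a
        second_ratio = (n_i + smoothing) / n_a
        woe.append(math.log(first_ratio / second_ratio))
    return woe
```

  • With counts `[(10, 90), (30, 10)]` and no smoothing, the first range has a negative WOE (mostly normal IP addresses) and the second has a positive WOE of ln(7.5).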
  • For example, the number of SMS requests is counted as a service feature, and it is a continuous variable. After discretization it is divided into 6 feature ranges, and the WOE value of each feature range can be obtained, as shown in Table 4.
  • The number of positive samples in a feature range is positively correlated with the value of WOE. From Table 4, it can be seen that the more abnormal IP addresses there are in the feature subset corresponding to a feature range, the greater the WOE. Therefore, the WOE can represent the correlation between business features and legitimacy.
  • WOE describes the direction and size of the influence of a business feature on legitimacy within the current feature range.
  • When the WOE is positive, the business feature has a positive impact on the judgment of the individual within the current feature range, and when the WOE is negative, the business feature has a negative impact on the judgment of the individual within the current feature range.
  • The magnitude of the WOE value reflects the size of this impact.
  • The above method of calculating the correlation coefficient is only an example.
  • Other methods of calculating the correlation coefficient may be set according to the actual situation of the business operation, for example, information value (Information Value, IV), receiver operating characteristic curve (Receiver Operating Characteristic Curve, ROC), information entropy, and so on, which the embodiments of the present application do not limit.
  • Step 103 For each IP address, generate a credit value representing the legitimacy of each IP address according to the correlation coefficient corresponding to the service feature.
  • Each IP address is traversed. The business operations of the IP address are analyzed by combining all of its business features, and the degree to which those business features influence legitimacy, as given by their correlation coefficients, is quantified as a credit value, which reflects how credible the business operations triggered by the IP address are with respect to legitimacy.
  • step 103 may include the following steps:
  • Step 1031 For each service feature of each IP address, query the correlation coefficient associated with the feature range where the value of each service feature is located.
  • Each feature range is associated with a correlation coefficient.
  • The value of the service feature is compared with its corresponding feature ranges to determine the feature range in which the value lies, and the correlation coefficient associated with that feature range is extracted.
  • Step 1032 Find the feature weights trained for each service feature.
  • A feature weight can be trained for each service feature in advance, and the feature weight indicates how important the service feature is for predicting the legitimacy of an IP address.
  • The feature weight is a model parameter of the classification model, and the classification model is used to predict, from the service features, the second state (normal or abnormal) indicating the legitimacy of the IP address.
  • The feature weights for each business feature are obtained when the classification model is trained.
  • Step 1033 Calculate the candidate value of each service feature based on the correlation coefficient and feature weight.
  • The correlation coefficient and the feature weight can be used as variables to calculate the candidate value of each service feature, so that the candidate value is positively correlated with both; that is, the larger the correlation coefficient, the larger the candidate value, and the larger the feature weight, the larger the candidate value.
  • For example, the first product between the correlation coefficient and the feature weight is calculated, and the first sum between the first product and the sub-regression intercept is calculated, wherein the sub-regression intercept is the ratio between the regression intercept and the number of types of business features corresponding to the IP address, and the regression intercept is used to predict the legitimacy of IP addresses.
  • The second product between the first sum and the scale factor is calculated, and the second sum between the second product and the sub-offset is calculated as the candidate value, wherein the sub-offset is the ratio between the offset and the number of types of service features corresponding to the IP address.
  • The candidate value is expressed as follows:
  • $$Score_i = \left(woe_i \cdot w_i + \frac{a}{n}\right) \cdot factor + \frac{offset}{n}$$
  • where $Score_i$ represents the candidate value of the i-th service feature, $woe_i$ represents the correlation coefficient (such as the WOE value) of the i-th service feature, $w_i$ represents the feature weight of the i-th service feature, $a$ represents the regression intercept, $n$ represents the number of types of service features corresponding to the IP address, $factor$ represents the scale factor, and $offset$ represents the offset.
  • The scale factor can be set according to risk preference.
  • The above method of calculating the candidate value is only an example.
  • Other methods of calculating the candidate value may be set according to the actual situation of the business operation, for example, linearly fusing the correlation coefficient and the feature weight, which this embodiment does not limit.
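  • A minimal Python sketch of the example candidate value in Step 1033 follows. It assumes the reading in which the product of correlation coefficient and feature weight is added to the regression intercept divided by the number of feature types, the result is multiplied by the scale factor, and the offset divided by the number of feature types is added. The function and parameter names, and the numbers in the usage line, are illustrative only.

```python
def candidate_value(woe, weight, intercept, n_features, factor, offset):
    """Candidate value of one service feature.

    Assumed form: (woe * weight + intercept / n) * factor + offset / n,
    where n is the number of service feature types for the IP address.
    """
    sub_intercept = intercept / n_features  # regression intercept shared across features
    sub_offset = offset / n_features        # offset shared across features
    return (woe * weight + sub_intercept) * factor + sub_offset


# Illustrative numbers: (0.5 * 2.0 + (-1.0) / 4) * 20 + 600 / 4
example = candidate_value(woe=0.5, weight=2.0, intercept=-1.0,
                          n_features=4, factor=20.0, offset=600.0)
```

  • Splitting the intercept and the offset evenly across features means that summing the candidate values over all features restores the full intercept and the full offset exactly once.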
  • Step 1034 Sum up all the candidate values, and use the summation result as a credit value representing the validity of each IP address.
  • the sum of the candidate values of all service characteristics can be calculated as the credit value of the IP address regarding the legitimacy.
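  • The credit value is expressed as follows:
  • $$Score = \sum_{i=1}^{n} Score_i$$
  • where the sum runs over all $n$ service features of the IP address and $Score_i$ is the candidate value of the i-th service feature.
  • The credit value is calculated jointly from the correlation coefficient and the feature weight. On the one hand, it takes into account the degree of influence of each business feature on legitimacy locally, within a feature range; on the other hand, it takes into account the importance of each business feature to legitimacy globally. This can improve the accuracy of the credit value.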
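  • Steps 1031 to 1034 can be combined as in the following Python sketch, which looks up the correlation coefficient for the feature range containing each value, applies the feature weight, and sums the candidate values. The `model` dictionary layout, the default `factor` and `offset`, and the candidate-value form (product plus shared intercept, scaled by the factor, plus shared offset) are assumptions for illustration.

```python
from bisect import bisect_right


def credit_value(features, model, factor=20.0, offset=600.0):
    """Sum per-feature candidate values into one credit value for an IP address.

    `features` maps feature name -> value. `model` is a hypothetical structure:
    {"intercept": a,
     "features": {name: {"edges": [...], "woe": [...], "weight": w}}}
    where "woe" has len(edges) + 1 entries, one per feature range.
    """
    specs = model["features"]
    n = len(specs)
    total = 0.0
    for name, spec in specs.items():
        # Step 1031: correlation coefficient of the range containing the value.
        woe = spec["woe"][bisect_right(spec["edges"], features[name])]
        # Step 1032: feature weight trained for this service feature.
        weight = spec["weight"]
        # Steps 1033 and 1034: candidate value, accumulated into the credit value.
        total += (woe * weight + model["intercept"] / n) * factor + offset / n
    return total
```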
  • Step 104 Generate evaluation indicators for multiple IP addresses.
  • The service features of multiple IP addresses can be used to predict the legitimacy of those IP addresses, and an evaluation index can be generated for this prediction; that is, the evaluation index is an index for evaluating the prediction of the legitimacy of IP addresses from their service features.
  • step 104 may include the following steps:
  • Step 1041 Input the service feature corresponding to each IP address into a classification model, and predict a second state representing the validity of each IP address through the classification model.
  • A classification model can be pre-trained. The classification model is a binary classification model and can be used to predict, from the service features of an IP address, the second state (normal or abnormal) indicating the legitimacy of the IP address.
  • The classification model may be a machine learning model such as a Support Vector Machine (SVM), Logistic Regression (LR), or Random Forest (RF), or a deep learning model such as a Convolutional Neural Network (CNN). The type of the classification model is not limited in this embodiment.
  • The feature weights of the business features can be set among the model parameters of the classification model, and the feature weights can be used to calculate the credit value. Therefore, a classification model with a simple structure (that is, few model parameters) that requires few training samples can be chosen, and the feature weights of the business features are trained together with the classification model.
  • For example, LR is expressed as follows:
  • $$p = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
  • where $w^{T}$ is the (transposed) feature weight vector, $x$ is the service features corresponding to the IP address, $b$ is the regression intercept, and $p$ is the second state indicating the legitimacy of the IP address.
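  • As a small illustration of a logistic regression prediction from feature weights, service features, and a regression intercept, the Python sketch below computes the standard logistic function of w·x + b. The function name and the plain-list representation of `x` and `w` are assumptions; the branch on the sign of `z` only avoids floating-point overflow and does not change the result.

```python
import math


def predict_abnormal_probability(x, w, b):
    """Logistic regression output p = 1 / (1 + exp(-(w.x + b))).

    `x` is the list of service feature values, `w` the feature weights and
    `b` the regression intercept. A numerically stable form is used so that
    very negative scores do not overflow math.exp.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

  • The second state can then be obtained by thresholding `p`, for example treating `p >= 0.5` as abnormal; that cutoff is a common convention rather than something the text specifies.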
  • During training, the business features can be input into the classification model to predict the second state representing the legitimacy of the IP address, and the loss value LOSS between the first state and the second state can be calculated using a preset loss function.
  • The loss value LOSS reflects the degree of inconsistency between the first state and the second state, that is, the degree to which an abnormal IP address is predicted to be normal, or a normal IP address is predicted to be abnormal.
  • the loss function F(w) is expressed as:
  • where N is the number of samples, n indexes the samples (1 ≤ n ≤ N), p is the second state, that is, the predicted value, and $y_n$ represents the n-th first state, that is, the true value.
  • Each time the loss value is calculated in an iteration of training, it can be determined whether the loss value is less than or equal to a preset threshold.
  • If the loss value is less than or equal to the preset threshold, training of the classification model is determined to be complete, and the structure of the classification model and its model parameters can be stored.
  • If the loss value is greater than the preset threshold, the model parameters of the classification model are updated, for example by stochastic gradient descent, and the process returns to inputting the service features into the classification model to predict the second state, thereby entering the next iteration of training.
  • When classifying an IP address, the classification model can be started and its model parameters loaded. The service features of the IP address are input into the classification model, which outputs a second state indicating the legitimacy of the IP address, thereby predicting whether the IP address is normal or abnormal.
  • In one case, the model parameters of the classification model include a feature weight set for each service feature, and the optimal feature weights are found through multiple iterations of training.
  • In another case, for classification models such as the LR model, the model parameters also include a regression intercept, and the optimal regression intercept is found through multiple iterations of training.
  • For the LR model, therefore, the feature weights and regression intercept trained for the service features can be looked up and loaded into the logistic regression model. When loading is complete, the service features corresponding to the IP address are input into the logistic regression model, which predicts a second state representing the legitimacy of the IP address.
  • For other classification models, other model parameters may be loaded to predict the second state representing the legitimacy of the IP address, which is not limited in this embodiment.
  • Step 1042: Set multiple credit value ranges.
  • In this embodiment, multiple credit value ranges may be set over the overall range of credit values of the IP addresses sampled this time. For example, if the overall range of credit values of the IP addresses sampled this time is [3, 35], the credit value ranges can be set to [3, 5], [3, 10], [3, 20], [3, 35].
  • Multiple credit value ranges can also be preset before the credit value of each IP address is calculated, and can then be used to divide the credit values of the IP addresses of any sampling. For example, the credit value ranges can be set to (-∞, 5], (-∞, 10], (-∞, 20], (-∞, 40], where the credit value 40 may be a maximum credit value determined empirically.
  • This embodiment does not limit the manner of setting the credit value ranges, their number, or the upper and lower limit credit values of each range.
  • Step 1043: Divide each IP address into the IP address subsets of the corresponding credit value ranges according to its credit value.
  • For each IP address, the credit value can be compared with each of the credit value ranges; if the credit value falls within a credit value range, the IP address is assigned to the IP address subset corresponding to that range.
  • Step 1044: Calculate the evaluation index corresponding to each IP address subset according to the first state and the second state of each IP address in that subset.
  • The number of IP address subsets into which the multiple IP addresses are divided depends on the credit value ranges that were set, and each IP address subset corresponds to one evaluation index; the number of evaluation indexes therefore also depends on the number of credit value ranges.
  • Each IP address is labeled with a first state indicating its legitimacy. For each IP address subset, the first state and the second state of each IP address are compared, and an evaluation index for this prediction operation is generated from the agreement and disagreement between the two states.
  • In one example, each evaluation index includes an accuracy rate and a recall rate. For each IP address subset, a first value TP, a second value FN, and a third value TN are counted, where TP is the number of IP addresses in the subset whose first state is abnormal and whose second state is abnormal, FN is the number whose first state is abnormal and whose second state is normal, and TN is the number whose first state is normal and whose second state is normal.
  • The accuracy rate acc and the recall rate rec are calculated from these values; by the standard definition, the recall rate is rec = TP / (TP + FN).
  • The above evaluation indexes are only examples. When implementing the embodiments of the present application, other evaluation indexes, such as precision or the F1 value, may be set according to the actual situation of the service operation, which is not limited in the embodiments of the present application.
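Steps 1042 to 1044 can be illustrated with a minimal Python sketch. The `Sample` record, the function name `subset_metrics`, and the use of a fourth count FP in the accuracy formula are assumptions made for illustration; the text above only names TP, FN, and TN, so the standard accuracy definition is used here rather than a formula taken from the application.

```python
from dataclasses import dataclass

INF = float("inf")

@dataclass
class Sample:
    credit: float             # credit value of the IP address
    actual_abnormal: bool     # first state (labeled truth)
    predicted_abnormal: bool  # second state (classifier output)

def subset_metrics(samples, ranges):
    """Return {(low, high): (accuracy, recall)} for each closed credit value range.

    Ranges may be nested, as in the examples of step 1042, so one IP address
    can belong to several subsets.
    """
    results = {}
    for low, high in ranges:
        subset = [s for s in samples if low <= s.credit <= high]
        tp = sum(s.actual_abnormal and s.predicted_abnormal for s in subset)
        fn = sum(s.actual_abnormal and not s.predicted_abnormal for s in subset)
        tn = sum(not s.actual_abnormal and not s.predicted_abnormal for s in subset)
        fp = sum(not s.actual_abnormal and s.predicted_abnormal for s in subset)  # assumed count
        total = tp + fn + tn + fp
        accuracy = (tp + tn) / total if total else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        results[(low, high)] = (accuracy, recall)
    return results

samples = [
    Sample(4, True, True),
    Sample(8, True, False),
    Sample(15, False, False),
    Sample(30, False, True),
]
ranges = [(-INF, 5), (-INF, 10), (-INF, 20), (-INF, 40)]
metrics = subset_metrics(samples, ranges)
```

Each range yields one (accuracy, recall) pair, which matches the statement that the number of evaluation indexes follows the number of credit value ranges.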
  • In this embodiment, the calculation of the credit value of an IP address and the calculation of the evaluation index share some parameters (such as the service features and the regression intercept). These shared parameters need to be trained only once, which reduces the cost of training and strengthens the correlation between the credit value and the evaluation index; this helps when a credit value is later used as the credit threshold to divide legitimacy.
  • Step 105: If an evaluation index meets the target condition, determine a credit value corresponding to that evaluation index as the credit threshold.
  • In this embodiment, the evaluation indexes that evaluate the prediction of IP address legitimacy can be used as a reference and compared with a preset target condition. If an evaluation index meets the target condition, the upper-limit credit value of the credit value range corresponding to that evaluation index can be set as the credit threshold. An IP address whose credit value is less than the credit threshold can be considered abnormal, and an IP address whose credit value is greater than or equal to the credit threshold can be considered normal.
  • In one example, the accuracy rates of the different IP address subsets are compared to find the highest accuracy rate.
  • The upper-limit credit value of the credit value range corresponding to the IP address subset with the highest accuracy rate is determined as the credit threshold.
  • the accuracy rate with the highest numerical value corresponds to at least two subsets of IP addresses
  • the upper credit value of the credit value range corresponding to the set is the credit threshold.
  • the IP address is an abnormal IP address.
  • The above way of setting the credit threshold is only an example. When implementing the embodiments of the present application, other ways of setting the credit threshold may be adopted according to the actual situation of the service operation, for example taking the credit value corresponding to the highest F1 value as the credit threshold; this is not limited in the embodiments of the present application.
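The selection in step 105 can be sketched as follows, assuming the target condition is "highest accuracy rate". The function name `pick_credit_threshold` is hypothetical, and the tie-break (preferring the smaller upper limit when several ranges share the highest accuracy) is an assumption for illustration, not a rule stated in the text.

```python
def pick_credit_threshold(accuracy_by_range):
    """accuracy_by_range maps (low, high) credit value ranges to an accuracy rate.

    Returns the upper-limit credit value of the range with the highest accuracy.
    Ties are broken in favor of the smaller upper limit (an assumed rule).
    """
    best_range = max(
        accuracy_by_range,
        key=lambda r: (accuracy_by_range[r], -r[1]),
    )
    return best_range[1]

NEG_INF = float("-inf")
accuracy_by_range = {
    (NEG_INF, 5): 0.6,
    (NEG_INF, 10): 0.9,
    (NEG_INF, 20): 0.9,
    (NEG_INF, 40): 0.7,
}
threshold = pick_credit_threshold(accuracy_by_range)
```

Swapping the accuracy values for F1 values gives the alternative mentioned above without changing the function.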
  • In this embodiment, multiple service features are counted from historical data of service operations triggered from multiple IP addresses, and at least two correlation coefficients are calculated by level for each service feature, where a correlation coefficient represents the correlation between that service feature and the legitimacy of an IP address. For each IP address, a credit value representing its legitimacy is generated according to the correlation coefficients corresponding to its service features, and evaluation indexes are generated for the multiple IP addresses, where an evaluation index evaluates the prediction of the legitimacy of the multiple IP addresses from their service features. If an evaluation index meets the target condition, a credit value corresponding to that evaluation index is determined as the credit threshold, which is used to divide the legitimacy states of IP addresses.
  • Training the credit threshold with the aid of evaluation indexes helps ensure that the threshold divides legitimacy effectively. The credit value of an IP address is evaluated from multiple service features of the service operations, giving a multi-dimensional judgment instead of dividing legitimacy states by a single-dimension threshold. This reduces the risk of false interception when a single dimension fails, reduces the risk of illegitimate users evading control by getting around a single-dimension threshold (that is, the risk of missed detections), and improves the security of website operation.
  • FIG. 2 is a flowchart of a method for detecting an IP address according to Embodiment 2 of the present application. This embodiment is applicable to detecting the legitimacy of an IP address from its credit score.
  • The method may be performed by an IP address detection apparatus, which may be implemented in software and/or hardware and configured in a computer device such as a server, workstation, or PC. The method includes the following steps:
  • Step 201: Count multiple service features from real-time data of service operations triggered from the IP address.
  • In this embodiment, the client 311 can call the service operation interface 321 in real time to request services such as registration, login, password recovery, and payment from the server 331, thereby triggering the corresponding service operations, such as sending an SMS message containing a verification code to a specified phone number or sending an email containing a verification link to a specified email address. The client then verifies using the verification code, verification link, or similar information.
  • The server 331 records the IP address of each client and the data generated when each client performs service operations, as real-time data. The firewall 332 obtains this real-time data and checks legitimacy. The real-time data can be preprocessed, for example by data cleaning, missing-value handling, and outlier handling, to convert it into formatted data. Multiple service features can then be counted from the real-time data using the IP address as the statistical dimension. The correlation between service features and legitimacy differs across service operations, so the selected service features differ as well; this embodiment does not limit the selection of service features.
  • In one example, the service operation triggered from the IP address is a registration operation, where the registration operation includes sending an SMS message containing a verification code and verifying that code.
  • In this example, the service features may include: the number of SMS requests from the IP address, the number of verification-code verifications, the success rate of verification-code verification, the number of accounts logged in at the IP address, the total number of phone numbers receiving verification codes, and the number of phone numbers receiving verification codes across countries or regions.
  • Step 202: Query the correlation coefficient corresponding to each service feature.
  • The credit threshold training method provided in any embodiment of the present application can be used to train correlation coefficients for each service feature, where a correlation coefficient represents the correlation between that service feature and the legitimacy of the IP address.
  • For each service feature, the multiple feature ranges set for it can be queried; each feature range is associated with a correlation coefficient.
  • The value of the service feature is compared with the corresponding feature ranges, and if the value falls within a feature range, the correlation coefficient associated with that feature range is extracted.
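The range lookup of step 202 can be sketched with Python's standard `bisect` module. The boundary values and coefficients below are invented for illustration, and the convention that each feature range is closed on its upper bound is an assumption.

```python
from bisect import bisect_left

# Hypothetical table for one service feature (number of SMS requests).
# Upper bounds define the ranges (-inf, 10], (10, 50], (50, 100], (100, +inf).
SMS_REQUEST_BOUNDS = [10, 50, 100]
SMS_REQUEST_COEFFS = [-0.8, -0.2, 0.5, 1.3]  # one coefficient per range

def lookup_coefficient(value, upper_bounds, coefficients):
    """Return the correlation coefficient of the feature range containing value."""
    # bisect_left places a value equal to a bound into the range ending at that bound.
    index = bisect_left(upper_bounds, value)
    return coefficients[index]

coefficient = lookup_coefficient(42, SMS_REQUEST_BOUNDS, SMS_REQUEST_COEFFS)
```

Keeping the bounds sorted makes each lookup logarithmic in the number of feature ranges, which may matter when the check runs on every request.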
  • Step 203: Generate a credit value representing the legitimacy of the IP address according to the correlation coefficient corresponding to each service feature.
  • In one embodiment, step 203 may include the following steps:
  • Step 2031: Look up the feature weight trained for each service feature.
  • The feature weights looked up are those trained for each service feature when training the classification model that predicts, from the service features, a second state representing the legitimacy of an IP address.
  • Step 2032: Calculate a candidate value for each service feature based on the correlation coefficient and the feature weight.
  • The candidate value is positively correlated with both the correlation coefficient and the feature weight.
  • In one example, a first product of the correlation coefficient and the feature weight is calculated, and a first sum of the first product and a sub-regression intercept is calculated, where the sub-regression intercept is the ratio of the regression intercept to the number of kinds of service features corresponding to the IP address. A second product of the first sum and a preset scale factor is then calculated, and a second sum of the second product and a sub-offset is calculated and used as the candidate value, where the sub-offset is the ratio of the offset to the number of kinds of service features corresponding to the IP address.
  • Step 2033: Sum all the candidate values and use the result as the credit value representing the legitimacy of the IP address.
  • Since step 203 is applied in basically the same way as step 103, it is described briefly here; for details, refer to the description of step 103.
  • Step 204: Compare the credit value with a preset credit threshold to determine the legitimacy of the IP address, thereby predicting the legitimacy of the IP address from its service features.
  • The credit threshold training method provided by any embodiment of the present application can be used to train the credit threshold. When this credit threshold is used to divide legitimacy states, the evaluation index that evaluates the prediction of IP address legitimacy from service features meets the target condition.
  • The credit value of the current IP address can be compared with the credit threshold, and the legitimacy of the current IP address determined from the comparison. If the credit value is less than the preset credit threshold, the legitimacy of the IP address is determined to be abnormal, and service operations for the IP address are prohibited. If the credit value is greater than or equal to the preset credit threshold, the legitimacy of the IP address is determined to be normal, and service operations for the IP address are allowed.
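The comparison in step 204 is small enough to show directly. The `Decision` record and the function name `decide` are hypothetical; only the rule itself (below the threshold is abnormal and blocked, otherwise normal and allowed) comes from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    state: str   # "normal" or "abnormal"
    allow: bool  # whether service operations are permitted for the IP address

def decide(credit_value: float, credit_threshold: float) -> Decision:
    """Apply the step 204 rule to one IP address."""
    if credit_value < credit_threshold:
        return Decision(state="abnormal", allow=False)
    return Decision(state="normal", allow=True)

blocked = decide(9.9, 10.0)
allowed = decide(10.0, 10.0)
```

A credit value exactly equal to the threshold is treated as normal, matching the "greater than or equal to" wording above.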
  • The communication interface 322 can be invoked to request the telecommunication operator 312 to send an SMS message containing a verification code to the mobile communication terminal 313 where the specified phone number is located.
  • The client 311 may be installed in the mobile communication terminal 313 or in an electronic device other than the mobile communication terminal 313; the installation location of the client 311 is not limited in this embodiment.
  • In this embodiment, multiple service features are counted from real-time data of service operations triggered from an IP address, and the correlation coefficient corresponding to each service feature is queried, where the correlation coefficient represents the correlation between that service feature and the legitimacy of the IP address. A credit value representing the legitimacy of the IP address is generated according to these correlation coefficients and compared with the preset credit threshold to determine the legitimacy of the IP address, so that the legitimacy of the IP address is predicted from its service features. Because the credit threshold was trained with the prediction of legitimacy from service features as a constraint, dividing legitimacy by the credit threshold remains effective.
  • A historical service feature statistics module 401 is configured to count multiple service features from historical data of service operations triggered from multiple IP addresses.
  • A correlation coefficient calculation module 402 is configured to calculate, by level, at least two correlation coefficients for each service feature, where a correlation coefficient represents the correlation between that service feature and the legitimacy of an IP address.
  • A credit value calculation module 403 is configured to generate, for each IP address, a credit value representing the legitimacy of that IP address according to the correlation coefficients corresponding to its service features.
  • An evaluation index generation module 404 is configured to generate evaluation indexes for the multiple IP addresses, where an evaluation index evaluates the prediction of the legitimacy of the multiple IP addresses from their service features.
  • The credit threshold training apparatus provided by the embodiment of the present application can execute the credit threshold training method provided by any embodiment of the present application, and has the functional modules and effects corresponding to the method.
  • FIG. 5 is a structural block diagram of an IP address detection apparatus provided in Embodiment 4 of the present application, which may include the following modules: a real-time service feature statistics module 501, configured to count multiple service features from real-time data of service operations triggered from an IP address; a correlation coefficient query module 502, configured to query the correlation coefficient corresponding to each service feature, where the correlation coefficient represents the correlation between that service feature and the legitimacy of the IP address; a credit value generation module 503, configured to generate a credit value representing the legitimacy of the IP address according to the correlation coefficient corresponding to each service feature; and a legitimacy determination module 504, configured to compare the credit value with a preset credit threshold to determine the legitimacy of the IP address, thereby predicting the legitimacy of the IP address from its service features.
  • The IP address detection apparatus provided in the embodiment of the present application can execute the IP address detection method provided by any embodiment of the present application, and has the functional modules and effects corresponding to the method.
  • FIG. 6 is a schematic structural diagram of a computer device according to Embodiment 5 of the present application.
  • Figure 6 shows a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present application.
  • computer device 12 takes the form of a general-purpose computing device.
  • Components of computer device 12 may include, but are not limited to, one or more processors or processing units 16 , system memory 28 , and a bus 18 connecting various system components including system memory 28 and processing unit 16 .
  • System memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32 .
  • Storage system 34 may be configured to read and write to non-removable, non-volatile magnetic media (not shown in FIG. 6, commonly referred to as a "hard disk drive").
  • a program/utility 40 having a set (at least one) of program modules 42 may be stored in memory 28, for example.
  • Computer device 12 may also communicate with one or more external devices 14 (eg, keyboard, pointing device, display 24, etc.). Such communication may take place through an input/output (I/O) interface 22 .
  • the computer device 12 may also communicate with one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through the network adapter 20.
  • network adapter 20 communicates with other modules of computer device 12 via bus 18 .
  • The processing unit 16 executes various functional applications and data processing by running the programs stored in the system memory 28, for example implementing the credit threshold training method and the IP address detection method provided by the embodiments of the present application.
  • Embodiment 6 of the present application further provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the processes of the above credit threshold training method and IP address detection method are implemented with the same technical effects; to avoid repetition, the details are not repeated here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

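The present application discloses a credit threshold training method and apparatus, and an IP address detection method and apparatus. The credit threshold training method includes: counting multiple service features from historical data of service operations triggered from multiple IP addresses; calculating, by level, at least two correlation coefficients for each service feature, where a correlation coefficient represents the correlation between that service feature and the legitimacy of an IP address; for each IP address, generating a credit value representing the legitimacy of that IP address according to the correlation coefficients corresponding to its service features; generating evaluation indexes for the multiple IP addresses, where an evaluation index evaluates the prediction of the legitimacy of the multiple IP addresses from their service features; and, where an evaluation index meets a target condition, determining a credit value corresponding to that evaluation index as the credit threshold, where the credit threshold is used to divide the legitimacy states of IP addresses.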

Description

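Credit threshold training method and apparatus, and IP address detection method and apparatus
This application claims priority to Chinese patent application No. 202010813912.6, filed with the China Patent Office on August 13, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of operation monitoring, for example to a credit threshold training method and apparatus and an Internet Protocol (IP) address detection method and apparatus.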
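Background
When users register, log in, and so on, a website usually issues a verification code by SMS, email, or similar means, so that the user completes registration or login by entering the code.
Identity verification through codes carried in SMS messages, emails, and similar messages has been widely adopted because it is simple to operate, secure, and timely. However, because such messages are easy to obtain and subject to few restrictions, they are easily exploited by wrongdoers for message bombing; malicious SMS requests in particular incur large fees and cause heavy losses to enterprises or individuals.
Websites therefore monitor user registration, login, and similar behavior in daily operation to distinguish normal behavior from abnormal behavior. A website generally monitors users' IP addresses, mainly in the following two ways:
1. Interception based on frequency
If the number of requests from an IP address within a period of time reaches a threshold, the behavior is considered abnormal.
With this approach, on the one hand, normal behavior from an IP address that inherently has a large request volume is wrongly intercepted because the number of requests reaches the threshold. On the other hand, illegitimate users can obtain a large number of different IP addresses from IP proxy providers; once they know that a threshold exists, they can easily probe its value and spread requests across many different IP addresses to lower the request frequency of each one, thereby evading control.
2. Interception based on a low verification rate
If the verification rate of an IP address within a period of time is below a threshold, the behavior is considered abnormal.
However, there are many code-receiving platforms that automatically receive verification codes and submit them for verification, so an illegitimate user who verifies through such a platform may have a high verification rate and thereby evade control.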
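Summary
The present application proposes a credit threshold training method and apparatus and an IP address detection method and apparatus, to solve the problem that monitoring the behavior of IP addresses with a single threshold such as frequency or verification rate easily leads to false interception and to evasion of control.
The present application provides a credit threshold training method, including:
counting multiple service features from historical data of service operations triggered from multiple IP addresses;
calculating, by level, at least two correlation coefficients for each service feature, where the correlation coefficients represent the correlation between each service feature and the legitimacy of an IP address;
for each IP address, generating a credit value representing the legitimacy of that IP address according to the correlation coefficients corresponding to the service features;
generating evaluation indexes for the multiple IP addresses, where an evaluation index is an index for evaluating the prediction of the legitimacy of the multiple IP addresses from the service features of the multiple IP addresses;
where an evaluation index meets a target condition, determining a credit value corresponding to the evaluation index as the credit threshold, where the credit threshold is used to divide the legitimacy states of IP addresses.
The present application also provides an IP address detection method, including:
counting multiple service features from real-time data of service operations triggered from an IP address;
querying the correlation coefficient corresponding to each service feature, where the correlation coefficient represents the correlation between that service feature and the legitimacy of the IP address;
generating a credit value representing the legitimacy of the IP address according to the correlation coefficient corresponding to each service feature;
comparing the credit value with a preset credit threshold to determine the legitimacy of the IP address, thereby predicting the legitimacy of the IP address from the service features of the IP address.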
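The present application also provides a credit threshold training apparatus, including:
a historical service feature statistics module, configured to count multiple service features from historical data of service operations triggered from multiple IP addresses;
a correlation coefficient calculation module, configured to calculate, by level, at least two correlation coefficients for each service feature, where the correlation coefficients represent the correlation between each service feature and the legitimacy of an IP address;
a credit value calculation module, configured to generate, for each IP address, a credit value representing the legitimacy of that IP address according to the correlation coefficients corresponding to the service features;
an evaluation index generation module, configured to generate evaluation indexes for the multiple IP addresses, where an evaluation index is an index for evaluating the prediction of the legitimacy of the multiple IP addresses from the service features of the multiple IP addresses;
a credit threshold determination module, configured to determine, where an evaluation index meets a target condition, a credit value corresponding to the evaluation index as the credit threshold, where the credit threshold is used to divide the legitimacy states of IP addresses.
The present application also provides an IP address detection apparatus, including:
a real-time service feature statistics module, configured to count multiple service features from real-time data of service operations triggered from an IP address;
a correlation coefficient query module, configured to query the correlation coefficient corresponding to each service feature, where the correlation coefficient represents the correlation between that service feature and the legitimacy of the IP address;
a credit value generation module, configured to generate a credit value representing the legitimacy of the IP address according to the correlation coefficient corresponding to each service feature;
a legitimacy determination module, configured to compare the credit value with a preset credit threshold to determine the legitimacy of the IP address, thereby predicting the legitimacy of the IP address from the service features of the IP address.
The present application also provides a computer device, including:
at least one processor;
a memory, configured to store at least one program;
when the at least one program is executed by the at least one processor, the at least one processor implements the above credit threshold training method or the above IP address detection method.
The present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above credit threshold training method or the above IP address detection method.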
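Brief Description of the Drawings
FIG. 1 is a flowchart of a credit threshold training method provided in Embodiment 1 of the present application;
FIG. 2 is a flowchart of an IP address detection method provided in Embodiment 2 of the present application;
FIG. 3 is a schematic diagram of a service operation provided in Embodiment 2 of the present application;
FIG. 4 is a schematic structural diagram of a credit threshold training apparatus provided in Embodiment 3 of the present application;
FIG. 5 is a schematic structural diagram of an IP address detection apparatus provided in Embodiment 4 of the present application;
FIG. 6 is a schematic structural diagram of a computer device provided in Embodiment 5 of the present application.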
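Detailed Description
The present application is described below with reference to the drawings and embodiments.
In practical applications, the anomalies exhibited by IP addresses differ across service operations in different business domains. Taking SMS as an example, the anomalies exhibited by an IP address may include the following:
1. Requesting SMS messages in order to attack a phone number
The attacker targets one phone number and cyclically calls the interfaces used for registration, login, and similar behavior on different websites, so that SMS messages carrying verification codes are sent to that phone number frequently, thereby attacking the phone number.
2. Requesting SMS messages in order to consume a website's fees
The attacker targets one website and keeps changing multiple interface parameters, such as the phone number and IP address, cyclically calling the website's interfaces for registration, login, and similar behavior so that SMS messages carrying verification codes are sent frequently to different phone numbers. This greatly increases the SMS fees paid by the website, thereby attacking the website.
3. Requesting SMS messages for profit
When a website runs a campaign for purposes such as registering new users or promoting products, the attacker keeps changing multiple interface parameters, such as the phone number and IP address, cyclically calling the website's interfaces for registration, login, and similar behavior so that SMS messages carrying verification codes are sent frequently to different phone numbers. The aim is to obtain the rewards for registering new users, such as electronic coupons and physical gifts, or to buy large quantities of products at low prices, and so on, thereby making a profit.
For a website, dividing IP addresses into abnormal and normal ones according to the relationship between some behavioral data (such as frequency or verification rate) and a threshold is inefficient.
Taking SMS as an example, the following cases of IP addresses requesting SMS messages illustrate how different behaviors contribute to determining whether an IP address is abnormal.
Tables 1 to 3 below each record: the IP address, the number of SMS requests from the IP address (number of SMS requests), the number of times the verification code in the SMS was verified (number of verifications), the success rate of verifying the verification code (verification success rate), the number of accounts logged in at the IP address (number of accounts), the total number of phone numbers receiving verification codes (total number of phone numbers), and the number of phone numbers receiving verification codes across countries or regions (number of cross-country or cross-region phone numbers).
In each table, the first three rows are historical behavior data from before the previous day, and the fourth row is the behavior data of the previous day.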
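Table 1: first case
Figure PCTCN2021111096-appb-000001
Figure PCTCN2021111096-appb-000002
In the first case, the number of SMS requests from the IP address keeps increasing. If monitoring relies on a threshold on the number of requests and the threshold is low, this IP address may be wrongly intercepted. Seen from other dimensions, however, the verification success rate and the number of cross-country or cross-region phone numbers do not change. The number of requests from the IP address increases because the number of phone numbers increases, which brings business growth and is normal behavior.
Table 2: second case
Figure PCTCN2021111096-appb-000003
In the second case, the number of SMS requests from the IP address keeps increasing. Taken together with other dimensions, the verification success rate and the number of accounts both change considerably: the number of SMS requests increases while the number of verifications does not. The IP address shows a sudden surge of invalid requests, possibly because it is attacking the website. If monitoring relies on a threshold on the number of requests and the threshold is high, this IP address may be missed.
Table 3: third case
Figure PCTCN2021111096-appb-000004
In the third case, the number of SMS requests from the IP address keeps increasing. Taken together with other dimensions, the verification success rate and the number of cross-country or cross-region phone numbers both change considerably: despite the large number of requests, the verification success rate falls, and requests for cross-country or cross-region phone numbers surge. The IP address may be being used by a proxy provider to attack the website. If monitoring relies on a threshold on the number of requests and the threshold is high, this IP address may be missed.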
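Embodiment 1
FIG. 1 is a flowchart of a credit threshold training method provided in Embodiment 1 of the present application. This embodiment is applicable to learning from historical data how strongly different features affect anomaly detection, so as to build an effective credit scoring mechanism. The method may be performed by a credit threshold training apparatus, which may be implemented in software and/or hardware and configured in a computer device such as a server, workstation, or personal computer (PC). The method includes the following steps:
Step 101: Count multiple service features from historical data of service operations triggered from multiple IP addresses.
In this embodiment, a client requests services such as registration, login, password recovery, and payment from a server, thereby triggering the corresponding service operations, such as sending an SMS message containing a verification code to a specified phone number or sending an email containing a verification link to a specified email address. The client then verifies using the verification code, verification link, or similar information.
The server records the IP address of each client and the data generated when each client performs service operations, as historical data.
The computer device obtains the historical data and labels positive and negative samples according to the legitimacy of each IP address: positive samples are abnormal IP addresses and negative samples are normal IP addresses.
To ensure training quality, the historical data of an abnormal IP address generally consists entirely of abnormal behavior and is not mixed with historical data of normal IP addresses, to avoid interference during training.
The historical data can be preprocessed, for example by data cleaning, missing-value handling, and outlier handling, to convert it into formatted data usable for training.
Data cleaning can remove junk samples, for example fake IP addresses that use an emulator or a Virtual Private Network (VPN) proxy.
When clients report tracking data, a small number of records have an empty IP address. Missing-value handling can mean finding and filtering out historical data whose IP address is empty, so that this data does not take part in training.
Outlier handling can find and filter out abnormal IP addresses, for example fake IP addresses that use an emulator or a VPN proxy, so that this data does not take part in training.
In addition, to ensure training quality, the ratio between the number of positive samples and the number of negative samples is kept within a preset range; the two sample sizes must not differ too much. The negative samples can therefore be downsampled to balance the ratio.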
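A minimal sketch of negative-sample downsampling, assuming a hypothetical `max_ratio` parameter that caps the number of negative samples at a multiple of the number of positive samples (the text only says the ratio lies within a preset range), could look like this:

```python
import random

def downsample_negatives(positives, negatives, max_ratio=3.0, seed=0):
    """Randomly keep at most max_ratio * len(positives) negative samples.

    max_ratio and seed are illustrative parameters, not values from the source.
    """
    limit = int(len(positives) * max_ratio)
    if len(negatives) <= limit:
        return list(negatives)
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    return rng.sample(negatives, limit)

positive_ips = [f"10.0.0.{i}" for i in range(10)]      # abnormal IP addresses
negative_ips = [f"192.168.1.{i}" for i in range(100)]  # normal IP addresses
kept_negatives = downsample_negatives(positive_ips, negative_ips)
```

With 10 positive and 100 negative samples, the sketch keeps 30 negatives, giving a 1:3 ratio.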
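For the historical data of service operations, multiple service features can be counted using the IP address as the statistical dimension.
Because service features have many dimensions and some of them have no correlation with legitimacy, using ineffective service features adds complexity without improving results. Statistical feature analysis such as Pearson statistics, or machine learning algorithms, can therefore be used to learn the correlation of each service feature with legitimacy, so as to retain effective service features and discard ineffective ones.
The correlation between service features and legitimacy differs across service operations, so the selected service features differ as well; this embodiment does not limit the selection of service features.
In one example, the service operation triggered from the multiple IP addresses is determined to be a registration operation, where the registration operation includes sending an SMS message containing a verification code and verifying that code.
In this example, several of the following may be counted from the historical data of the registration operation as service features:
the number of SMS requests from the IP address, the number of verification-code verifications, the success rate of verification-code verification, the number of accounts logged in at the IP address, the total number of phone numbers receiving verification codes, and the number of phone numbers receiving verification codes across countries or regions.
A client may verify the same verification code several times in one verification operation; the number of verification-code verifications can refer to the cumulative number of verifications.
The above service features are only examples. When implementing the embodiments of the present application, other service features may be set according to the actual situation of the service operation; the embodiments of the present application do not limit the types of service features.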
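Step 102: Calculate, by level, at least two correlation coefficients for each service feature.
Different service features matter to different degrees in characterizing the legitimacy of an IP address. For a registration operation, for example, the verification success rate and the number of phone numbers receiving verification codes across countries or regions are strongly correlated service features, that is, they have a large influence on legitimacy; the number of SMS requests and the number of accounts logged in at the IP address are weakly correlated service features, that is, they have a small influence on legitimacy.
In implementation, where a service feature takes different values (that is, levels), the correlation coefficient of that service feature for each value can be learned adaptively; the correlation coefficient represents the correlation between that service feature and the legitimacy of an IP address.
In one embodiment of the present application, the correlation coefficients can be calculated using Weight of Evidence (WOE). WOE is an algorithm that transforms a continuous variable into a discrete variable and can characterize how strongly different service features affect legitimacy. In this embodiment, step 102 may include the following steps:
Step 1021: Set multiple feature ranges for each service feature.
In this embodiment, a service feature is a continuous variable. The range of its values is divided into multiple contiguous feature ranges, which converts the service feature into a discrete variable.
Step 1022: According to the value of each service feature, divide the multiple IP addresses into the feature subsets corresponding to different feature ranges.
For an IP address, the value of its service feature can be compared one by one with the corresponding feature ranges; if the value falls within a feature range, the IP address is assigned to the feature subset corresponding to that feature range.
Step 1023: According to the IP addresses in the feature subset corresponding to each feature range, calculate the weight of evidence of that feature range, and use the weight of evidence as the correlation coefficient of the service feature within that feature range.
In this embodiment, for the feature subset corresponding to each feature range, the IP addresses whose service feature falls in that subset can be used to calculate the weight of evidence of the feature range, which is used as the correlation coefficient of the service feature within that feature range.
In implementation, each IP address is labeled with a first state representing its legitimacy; the first state is the true state of the IP address and is either normal or abnormal.
For one feature range of one service feature, the ratio of the number of IP addresses in the corresponding feature subset whose first state is abnormal to the number of all IP addresses for that service feature whose first state is abnormal is counted as the first proportion; the ratio of the number of IP addresses in the corresponding feature subset whose first state is normal to the number of all IP addresses for that service feature whose first state is normal is counted as the second proportion. The logarithm of the ratio of the first proportion to the second proportion is taken as the weight of evidence of that service feature within the feature range.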
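For one service feature, the weight of evidence is expressed as follows:
$$WOE_i = \ln\left(\frac{y_i / y_a}{n_i / n_a}\right)$$
where $WOE_i$ denotes the weight of evidence of the i-th feature range, $y_i$ denotes the number of IP addresses in the feature subset of the i-th feature range whose first state is abnormal, and $y_a$ denotes the number of all IP addresses for that service feature whose first state is abnormal, so that $y_i / y_a$ is the first proportion; $n_i$ denotes the number of IP addresses in the feature subset of the i-th feature range whose first state is normal, and $n_a$ denotes the number of all IP addresses for that service feature whose first state is normal, so that $n_i / n_a$ is the second proportion; $\ln$ denotes the logarithm to the base e.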
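Taking the registration operation as an example, the number of SMS requests is counted as a service feature. It is a continuous variable; after discretization it is divided into 6 feature ranges, and the WOE value of each feature range is shown in Table 4 below:
Table 4
Figure PCTCN2021111096-appb-000008
The number of positive samples in a feature range is positively correlated with the WOE value. As Table 4 shows, the more abnormal IP addresses there are in the feature subset corresponding to a feature range, the larger the WOE; the WOE can therefore represent the correlation between the service feature and legitimacy.
WOE describes the direction and magnitude of the influence that a service feature has on legitimacy within the current feature range. When WOE is positive, the service feature has a positive influence on the judgment of the individual within the current feature range; when WOE is negative, it has a negative influence. The magnitude of the WOE value reflects the size of the influence.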
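The weight-of-evidence calculation of step 1023 can be sketched in a few lines of Python. The function name and the sample counts are illustrative, and the sketch assumes that every feature range contains at least one abnormal and one normal IP address (a zero count would need smoothing, which the text does not discuss).

```python
import math

def weight_of_evidence(abnormal_counts, normal_counts):
    """Return the WOE of each feature range.

    abnormal_counts[i] is y_i and normal_counts[i] is n_i for the i-th range;
    y_a and n_a are the totals over all ranges of the service feature.
    """
    y_a = sum(abnormal_counts)
    n_a = sum(normal_counts)
    woe = []
    for y_i, n_i in zip(abnormal_counts, normal_counts):
        first_proportion = y_i / y_a    # share of abnormal IPs in this range
        second_proportion = n_i / n_a   # share of normal IPs in this range
        woe.append(math.log(first_proportion / second_proportion))
    return woe

# Illustrative counts for three feature ranges of one service feature.
woe_values = weight_of_evidence([10, 30, 60], [60, 30, 10])
```

In this made-up data the first range is dominated by normal IP addresses and gets a negative WOE, the balanced range gets zero, and the range dominated by abnormal IP addresses gets a positive WOE, which matches the sign interpretation given above.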
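The above way of calculating correlation coefficients is only an example. When implementing the embodiments of the present application, other ways of calculating correlation coefficients may be set according to the actual situation of the service operation, for example using Information Value (IV), which measures the predictive power of an independent variable, the Receiver Operating Characteristic (ROC) curve, information entropy, and so on; the embodiments of the present application do not limit this.
Step 103: For each IP address, generate a credit value representing the legitimacy of that IP address according to the correlation coefficients corresponding to the service features.
In implementation, each IP address is traversed. The service operations occurring at the IP address are analyzed by combining its service features, and, according to the correlation coefficients by which these service features affect legitimacy, the degree of their influence on legitimacy is quantified as a credit value. The credit value reflects how trustworthy, in terms of legitimacy, the service operations triggered from the IP address are.
In one embodiment of the present application, step 103 may include the following steps:
Step 1031: For each service feature of each IP address, query the correlation coefficient associated with the feature range in which the value of that service feature falls.
In this embodiment, for each service feature of an IP address, the multiple feature ranges previously set for that service feature can be queried; each feature range is associated with a correlation coefficient.
The value of the service feature is compared with the corresponding feature ranges to determine the feature range in which it falls, and the correlation coefficient associated with that feature range is extracted.
Step 1032: Look up the feature weight trained for each service feature.
In this embodiment, a feature weight can be trained in advance for each service feature; the feature weight represents how important that service feature is for predicting the legitimacy of an IP address.
In one approach, the feature weight is a model parameter of a classification model that predicts, from service features, a second state (normal or abnormal) representing the legitimacy of an IP address; the feature weight trained for each service feature when training the classification model can therefore be looked up.
Besides using model parameters of the classification model as the feature weights of service features, the feature weights can also be set in other ways, for example by operations staff setting a feature weight directly for each service feature; this embodiment does not limit the way feature weights are set.
Step 1033: Calculate the candidate value of each service feature based on the correlation coefficient and the feature weight.
In this embodiment, the candidate value of each service feature can be calculated with the correlation coefficient and the feature weight as variables, such that the candidate value is positively correlated with both: the larger the correlation coefficient, the larger the candidate value, and the smaller the correlation coefficient, the smaller the candidate value; likewise, the larger the feature weight, the larger the candidate value, and the smaller the feature weight, the smaller the candidate value.
In one example, a first product of the correlation coefficient and the feature weight is calculated, and a first sum of the first product and a sub-regression intercept is calculated, where the sub-regression intercept is the ratio of the regression intercept to the number of kinds of service features corresponding to the IP address, and the regression intercept is used for predicting the legitimacy of the IP address.
A second product of the first sum and a preset scale factor is calculated, and a second sum of the second product and a sub-offset is calculated and used as the candidate value, where the sub-offset is the ratio of the offset to the number of kinds of service features corresponding to the IP address.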
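In this example, the candidate value is expressed as follows:
$$Score_i = \left(woe_i \cdot w_i + \frac{a}{n}\right) \cdot factor + \frac{offset}{n}$$
where the IP address has n kinds of service features, $Score_i$ denotes the candidate value of the i-th service feature, $woe_i$ denotes the correlation coefficient of the i-th service feature (such as its WOE value), $w_i$ denotes the feature weight of the i-th service feature, and $a$ denotes the regression intercept, so that $a/n$ is the sub-regression intercept; $factor$ denotes the scale factor, which can be set according to risk preference, and $offset$ denotes the offset, so that $offset/n$ is the sub-offset.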
上述计算候选值的方式只是作为示例,在实施本申请实施例时,可以根据业务操作的实际情况设置其他计算候选值的方式,例如,将相关系数与特征权重进行线性融合,等等,本申请实施例对此不加以限制。
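The example calculation of step 1033 (first product, first sum, second product, second sum) can be written out directly. The sketch below is illustrative; the function and parameter names are chosen for the example and are not taken from the application.

```python
def candidate_value(woe: float, weight: float, intercept: float,
                    n_features: int, factor: float, offset: float) -> float:
    """Candidate value of one service feature for one IP address."""
    first_product = woe * weight                        # coefficient x feature weight
    first_sum = first_product + intercept / n_features  # add the sub-regression intercept
    second_product = first_sum * factor                 # apply the preset scale factor
    return second_product + offset / n_features         # add the sub-offset
```

With `woe=0.5`, `weight=2.0`, `intercept=1.0`, `n_features=4`, `factor=20.0` and `offset=8.0`, the intermediate values are 1.0, 1.25 and 25.0, and the candidate value is 27.0.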
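Step 1034: sum all candidate values and take the sum as the credit value representing the legitimacy of each IP address.
For the service features of the same IP address, the sum of the candidate values of all service features may be calculated as the credit value of the IP address with respect to legitimacy.
In one example, the credit value is expressed as follows:
$$\mathrm{Score} = \sum_{i=1}^{n} \mathrm{Score}_i$$
In this embodiment, the credit value is calculated from the correlation coefficient together with the feature weight. It takes into account both the local influence of a service feature on legitimacy and the global importance of the service feature for legitimacy, which can improve the accuracy of the credit value.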
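Summing the candidate values of step 1033 over all n service features gives the credit value of step 1034. A minimal sketch follows, with hypothetical names; it also shows that, because the sub-regression intercept and sub-offset are each divided by n, the sum reduces to factor × (Σ woe_i·w_i + a) + offset.

```python
from typing import Sequence


def credit_value(woes: Sequence[float], weights: Sequence[float],
                 intercept: float, factor: float, offset: float) -> float:
    """Credit value of one IP address: the sum of its per-feature candidate values."""
    if len(woes) != len(weights) or not woes:
        raise ValueError("need one coefficient and one weight per service feature")
    n = len(woes)
    return sum((woe * w + intercept / n) * factor + offset / n
               for woe, w in zip(woes, weights))


def credit_value_closed_form(woes: Sequence[float], weights: Sequence[float],
                             intercept: float, factor: float, offset: float) -> float:
    """Algebraically equivalent form of the same sum."""
    return factor * (sum(woe * w for woe, w in zip(woes, weights)) + intercept) + offset
```

The closed form makes clear that the intercept and offset each enter the credit value exactly once, regardless of how many service features an IP address has.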
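Step 104: generate evaluation metrics for the multiple IP addresses.
In this embodiment, the service features of multiple IP addresses may be used to predict the legitimacy of those IP addresses, and an evaluation metric is generated for this prediction; that is, the evaluation metric evaluates the prediction of IP address legitimacy from service features.
In an embodiment of the present application, step 104 may include the following steps:
Step 1041: input the service features corresponding to each IP address into a classification model, and predict through the classification model a second state representing the legitimacy of each IP address.
In this embodiment, a classification model may be trained in advance. It is a binary classification model that predicts, from the service features of an IP address, a second state (normal or abnormal) representing the legitimacy of the IP address.
The classification model may be a machine learning model such as a support vector machine (SVM), logistic regression (LR) or random forest (RF), or a deep learning model such as a convolutional neural network (CNN); this embodiment does not limit the type of classification model.
To reduce training complexity, the feature weights of the service features may be set among the model parameters of the classification model, and these feature weights can be used to calculate the credit value. A classification model with a simple structure (few model parameters) that needs few training samples can therefore be chosen, and the feature weights of the service features are trained together with the classification model.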
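Taking LR as an example, LR is expressed as follows:
$$p = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
where $w^{T}$ denotes the feature weights, $x$ denotes the service features corresponding to the IP address, $b$ is the regression intercept, and $p$ is the second state representing the legitimacy of the IP address.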
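During training, the service features may be input into the classification model to predict the second state representing the legitimacy of the IP address, and a preset loss function is used to calculate the loss value LOSS between the first state and the second state. LOSS reflects the degree of inconsistency between the first state and the second state, that is, the degree to which abnormal IP addresses are predicted as normal, or normal IP addresses are predicted as abnormal.
In one example, the loss function F(w) is expressed as follows:
Figure PCTCN2021111096-appb-000014
where N is the number of samples, n∈N, and p is the second state, i.e. the predicted value; when the LR model is trained on the service features,
Figure PCTCN2021111096-appb-000015
and $y_n$ denotes the n-th first state, i.e. the true value.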
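Each time a loss value is calculated in a training iteration, it may be determined whether the loss value is less than or equal to a preset threshold.
If the loss value is less than or equal to the preset threshold, training of the classification model is determined to be complete, and the structure of the classification model and its model parameters may be stored.
If the loss value is greater than the preset threshold, the model parameters of the classification model are updated, for example by stochastic gradient descent, and execution returns to inputting the service features into the classification model to predict the second state representing the legitimacy of the IP address, thereby entering the next training iteration.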
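The iterate-until-the-loss-is-small loop described above can be sketched for the LR case as follows. Several choices here are assumptions, not statements about the application: the loss is taken to be mean binary cross-entropy (the loss formula itself is only available as a figure), full-batch gradient descent stands in for the "stochastic gradient" update, labels use 1 for abnormal and 0 for normal, and all names and default values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(samples: Sequence[Sequence[float]],
                              labels: Sequence[int],
                              loss_threshold: float = 0.3,
                              learning_rate: float = 0.5,
                              max_iterations: int = 10_000
                              ) -> Tuple[List[float], float, float]:
    """Return (feature weights, regression intercept, final loss)."""
    n_samples, n_features = len(samples), len(samples[0])
    weights, intercept, loss = [0.0] * n_features, 0.0, float("inf")
    eps = 1e-12  # keeps log() away from zero
    for _ in range(max_iterations):
        # Predict the second state (probability of "abnormal") for every sample.
        preds = [sigmoid(sum(w * x for w, x in zip(weights, row)) + intercept)
                 for row in samples]
        # Assumed loss: mean binary cross-entropy between first and second states.
        loss = -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                    for y, p in zip(labels, preds)) / n_samples
        if loss <= loss_threshold:
            break  # training is considered complete
        # Otherwise update the model parameters and run another iteration.
        errors = [p - y for p, y in zip(preds, labels)]
        for j in range(n_features):
            grad = sum(e * row[j] for e, row in zip(errors, samples)) / n_samples
            weights[j] -= learning_rate * grad
        intercept -= learning_rate * sum(errors) / n_samples
    return weights, intercept, loss
```

The returned weights and intercept play the roles of the feature weights and regression intercept that are reused when the credit value is calculated.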
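When classifying an IP address, the classification model may be started and its model parameters loaded. The service features of the IP address are input into the classification model for processing, and the classification model outputs the second state representing the legitimacy of the IP address, thereby predicting whether the IP address is normal or abnormal.
In one case, the model parameters of the classification model include the feature weight set for each service feature, and the optimal feature weights are found through multiple training iterations.
In another case, for classification models such as the LR model, the model parameters also include the regression intercept, and the optimal regression intercept is found through multiple training iterations.
Therefore, for the LR model, the feature weights and regression intercept trained for each service feature may be looked up and loaded into the logistic regression model; once loading is complete, the service features corresponding to the IP address are input into the logistic regression model to predict the second state representing the legitimacy of the IP address.
For other classification models, other model parameters may be loaded to predict the second state representing the legitimacy of the IP address; this embodiment does not limit this.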
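Step 1042: set multiple credit value ranges.
In this embodiment, multiple credit value ranges may be set according to the overall range of the credit values of the IP addresses sampled this time. For example, if the overall range is [3, 35], the credit value ranges may be set to [3, 5], [3, 10], [3, 20] and [3, 35]. Alternatively, the credit value ranges may be set before the credit value of each IP address is calculated, so that they can be used to partition the credit values of the IP addresses of any sampling, for example (-∞, 5], (-∞, 10], (-∞, 20] and (-∞, 40], where the credit value 40 may be a maximum credit value determined from experience.
This embodiment does not limit the way the credit value ranges are set, their number, or the upper and lower limits of each credit value range.
Step 1043: according to the credit value of each IP address, assign each IP address to the IP address subset of the corresponding credit value range.
For each IP address, its credit value may be compared with each of the credit value ranges; if the credit value lies in a credit value range, the IP address is assigned to the IP address subset corresponding to that range.
Performing this comparison and assignment for the credit value of every IP address assigns the multiple IP addresses to their corresponding IP address subsets.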
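Steps 1042 and 1043 can be sketched as follows, assuming the cumulative ranges of the form (-∞, upper] used in the example. Because such ranges overlap, an IP address with a low credit value lands in several subsets. The function name and the dictionary-based data layout are hypothetical.

```python
from typing import Dict, List, Sequence


def partition_by_credit_range(credit_values: Dict[str, float],
                              upper_bounds: Sequence[float]
                              ) -> Dict[float, List[str]]:
    """Map each upper bound to the IP addresses whose credit value is <= that bound."""
    subsets: Dict[float, List[str]] = {upper: [] for upper in upper_bounds}
    for ip, value in credit_values.items():
        for upper in upper_bounds:
            if value <= upper:  # the credit value lies in (-inf, upper]
                subsets[upper].append(ip)
    return subsets


# Usage with addresses from the documentation-only IP blocks.
scores = {"192.0.2.1": 4.0, "192.0.2.2": 8.0, "198.51.100.3": 15.0, "203.0.113.4": 38.0}
subsets = partition_by_credit_range(scores, [5, 10, 20, 40])
```

Here `subsets[5]` contains only the first address, while `subsets[40]` contains all four.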
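Step 1044: calculate the evaluation metric corresponding to each IP address subset according to the first state and second state of each IP address in the subset.
In this embodiment, the number of IP address subsets into which the IP addresses are divided depends on the credit value ranges that were set, and each IP address subset corresponds to one evaluation metric; the number of evaluation metrics therefore depends on the number of credit value ranges.
Each IP address is labeled with a first state representing its legitimacy. The first state and second state of each IP address in each subset are compared, and from their agreement or disagreement a metric evaluating this prediction is generated for the IP addresses of each subset, as the evaluation metric.
In one example, each evaluation metric includes accuracy and recall. For each IP address subset, a first value TP, a second value FN and a third value TN may be counted, where TP is the number of IP addresses in the subset whose first state is abnormal and whose second state is abnormal, FN is the number of IP addresses in the subset whose first state is abnormal and whose second state is normal, and TN is the number of IP addresses in the subset whose first state is normal and whose second state is normal.
The ratio of a fourth value to the total number of IP addresses in the subset, total, is calculated as the accuracy, where the fourth value is the sum of TP and TN. The accuracy acc can be expressed as:
$$acc = \frac{TP + TN}{total}$$
The ratio of TP to a fifth value is calculated as the recall, where the fifth value is the sum of TP and FN. The recall rec can be expressed as:
$$rec = \frac{TP}{TP + FN}$$
The above evaluation metrics are only examples. When implementing the embodiments of the present application, other evaluation metrics, such as precision or the F1 score, may be set according to the actual situation of the service operation; the embodiments of the present application do not limit this.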
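The accuracy and recall of one IP address subset follow directly from the TP, FN and TN counts. The sketch below is illustrative; representing each state as a boolean (`True` meaning abnormal) and returning 0.0 for an empty denominator are assumptions of the example.

```python
from typing import Sequence, Tuple


def accuracy_and_recall(first_states: Sequence[bool],
                        second_states: Sequence[bool]) -> Tuple[float, float]:
    """Return (accuracy, recall) for one subset; True means 'abnormal'."""
    pairs = list(zip(first_states, second_states))
    tp = sum(1 for first, second in pairs if first and second)          # abnormal, predicted abnormal
    fn = sum(1 for first, second in pairs if first and not second)      # abnormal, predicted normal
    tn = sum(1 for first, second in pairs if not first and not second)  # normal, predicted normal
    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall
```

For labels `[True, True, True, False, False]` and predictions `[True, True, False, False, True]`, TP = 2, FN = 1 and TN = 1, so the accuracy is 3/5 and the recall is 2/3.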
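In this embodiment, calculating the credit value of an IP address and calculating the evaluation metric share some parameters (such as the service features and the regression intercept), so the shared parameters need to be trained only once. This reduces the cost of training parameters and strengthens the association between the credit value and the evaluation metric, which helps when a credit value is later applied as the credit threshold for dividing legitimacy.
Step 105: if an evaluation metric meets the target condition, determine a credit value corresponding to the evaluation metric as the credit threshold.
In this embodiment, the evaluation metrics of the different IP address subsets, which evaluate the prediction of IP address legitimacy, are taken as a reference and compared with a preset target condition. If the evaluation metric of an IP address subset satisfies the target condition, the upper-limit credit value of the credit value range corresponding to that evaluation metric may be set as the credit threshold. The credit threshold is used to divide the states of legitimacy: an IP address whose credit value is less than the credit threshold is considered abnormal, and an IP address whose credit value is greater than or equal to the credit threshold is considered normal.
In one implementation, if abnormal IP addresses are to be handled by blocking or the like, accuracy may be given the highest priority and recall the next; the accuracies of the different IP address subsets are then compared to find the highest accuracy.
If the highest accuracy corresponds to one IP address subset, the upper-limit credit value of the credit value range corresponding to that subset is determined as the credit threshold.
If the highest accuracy corresponds to at least two IP address subsets, the recalls of those subsets are compared to find the highest recall, and the upper-limit credit value of the credit value range corresponding to the subset with the highest recall is determined as the credit threshold.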
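The selection rule above (highest accuracy first, highest recall as the tie-breaker) maps naturally onto lexicographic tuple comparison. A minimal sketch, assuming each credit value range is identified by its upper-limit credit value; the function name is hypothetical.

```python
from typing import Dict, Tuple


def select_credit_threshold(metrics: Dict[float, Tuple[float, float]]) -> float:
    """Pick the upper-limit credit value whose (accuracy, recall) pair is best.

    `metrics` maps the upper limit of each credit value range to the
    (accuracy, recall) of the corresponding IP address subset.
    """
    if not metrics:
        raise ValueError("no evaluation metrics to compare")
    # Tuples compare element by element: accuracy decides, recall breaks ties.
    return max(metrics, key=lambda upper: metrics[upper])
```

Note that this compares accuracies for exact equality; a real deployment might prefer to treat accuracies within a small tolerance as tied.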
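In one experiment, 1000 positive samples (abnormal IP addresses, whose first state is abnormal) and 2000 negative samples (normal IP addresses, whose first state is normal) were collected. A credit value and a second state were calculated for each sample, and the credit value of each sample was compared with 4 preset credit value ranges ((-∞, 5], (-∞, 10], (-∞, 20], (-∞, 40]); when the credit value of a sample lay in one of the ranges, the sample was assigned to the sample subset corresponding to that range. Table 5 shows the result of assigning the 3000 samples to the sample subsets of the 4 credit value ranges.
For the sample subset of each credit value range, the accuracy and recall of that range were calculated from the first state and second state of each sample in the subset; the results are shown in Table 5.
Table 5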
Figure PCTCN2021111096-appb-000018
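As Table 5 shows, the lower the upper-limit credit value of a credit value range, the higher the accuracy, while the recall decreases steadily. The upper-limit credit value of 5 may be taken as the credit threshold, that is, an IP address whose credit value is below 5 is an abnormal IP address.
The above way of setting the credit threshold is only an example. When implementing the embodiments of the present application, other ways of setting the credit threshold may be adopted according to the actual situation of the service operation, for example taking the credit value corresponding to the highest F1 score as the credit threshold; the embodiments of the present application do not limit this.
In this embodiment, multiple kinds of service features are counted from historical data of service operations triggered based on multiple IP addresses, and at least two correlation coefficients are calculated by level for each kind of service feature, the correlation coefficients representing the correlation between each kind of service feature and the legitimacy of an IP address. For each IP address, a credit value representing its legitimacy is generated according to the correlation coefficients corresponding to its service features. Evaluation metrics are generated for the multiple IP addresses; they evaluate the prediction of the legitimacy of the IP addresses from their service features. When an evaluation metric meets the target condition, a credit value corresponding to that evaluation metric is determined as the credit threshold, which is used to divide the states of legitimacy of IP addresses. Training the credit threshold with the help of the evaluation metrics ensures that the credit threshold divides legitimacy effectively. Evaluating the credit value of an IP address from multiple service features of the service operation gives a multi-dimensional, comprehensive judgment and avoids dividing legitimacy states by a threshold on a single dimension. This reduces the risk of false interception caused by a failure in a single dimension, and also reduces the risk that malicious users evade control by circumventing a single-dimension threshold, that is, the risk of missed detections, improving the overall security of website operation.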
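Embodiment 2
FIG. 2 is a flowchart of a method for detecting an IP address provided in Embodiment 2 of the present application. This embodiment is applicable to detecting the legitimacy of an IP address through its credit score. The method may be performed by an apparatus for detecting an IP address, which may be implemented in software and/or hardware and configured in a computer device such as a server, a workstation or a PC. The method includes the following steps:
Step 201: count multiple kinds of service features from real-time data of service operations triggered based on an IP address.
In this embodiment, as shown in FIG. 3, the client 311 may call the service operation interface 321 in real time to request services such as registration, login, password recovery and payment from the server 331, thereby triggering the corresponding service operation, such as sending a short message containing a verification code to a specified telephone number or sending an email containing a verification link to a specified mailbox; the client then performs verification using the verification code, verification link or similar information.
The server 331 records the IP address of each client and the data generated by each client when performing service operations, as real-time data. The firewall 332 obtains the real-time data and performs legitimacy detection. The real-time data may be preprocessed, for example by data cleaning, missing-value handling and outlier handling, to convert it into formatted data usable for training. With the IP address as the statistical dimension, multiple kinds of service features may be counted from the real-time data of the service operations. The correlation between service features and legitimacy differs between service operations, so the service features chosen also differ; this embodiment does not limit the selection of service features.
In one example, the service operation triggered based on the IP address is determined to be a registration operation, where the registration operation includes sending a short message containing a verification code and verifying the verification code.
In this example, several of the following may be counted from the real-time data of the registration operation as service features:
the number of short message requests from the IP address, the number of verification-code verifications, the success rate of verification-code verification, the number of accounts logged in from the IP address, the total number of telephone numbers receiving verification codes, and the number of telephone numbers receiving verification codes across countries or regions.
The above service features are only examples. When implementing the embodiments of the present application, other service features may be set according to the actual situation of the service operation; the embodiments of the present application do not limit the types of service features.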
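Counting service features with the IP address as the statistical dimension amounts to a group-by over the operation log. The sketch below covers four of the listed features. The event layout (keys `ip`, `type`, `phone`, `success`) and the function name are hypothetical, since the application does not specify a log format.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def extract_features(events: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Aggregate registration events into per-IP service features."""
    stats: Dict[str, Dict[str, Any]] = defaultdict(
        lambda: {"sms_requests": 0, "verify_attempts": 0,
                 "verify_successes": 0, "phones": set()})
    for event in events:
        s = stats[event["ip"]]
        if event["type"] == "sms_request":
            s["sms_requests"] += 1
            s["phones"].add(event["phone"])  # distinct numbers receiving codes
        elif event["type"] == "verify":
            s["verify_attempts"] += 1
            s["verify_successes"] += 1 if event["success"] else 0
    features: Dict[str, Dict[str, float]] = {}
    for ip, s in stats.items():
        attempts = s["verify_attempts"]
        features[ip] = {
            "sms_requests": s["sms_requests"],
            "verify_attempts": attempts,
            # Assumption: an IP with no verification attempts gets a rate of 0.0.
            "verify_success_rate": s["verify_successes"] / attempts if attempts else 0.0,
            "phone_count": len(s["phones"]),
        }
    return features
```

The resulting per-IP dictionaries are the values that would then be matched against the feature ranges to look up correlation coefficients.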
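Step 202: query the correlation coefficient corresponding to each kind of service feature.
The correlation coefficients may be trained for each kind of service feature using the method for training a credit threshold provided in any embodiment of the present application, where the correlation coefficients represent the correlation between each kind of service feature and the legitimacy of the IP address.
In one implementation, the multiple feature ranges set for each kind of service feature may be queried; each feature range is associated with a correlation coefficient.
The value of the service feature is compared with the corresponding feature ranges; if the value lies in a feature range, the correlation coefficient corresponding to that feature range is extracted.
Step 203: generate a credit value representing the legitimacy of the IP address according to the correlation coefficient corresponding to each kind of service feature.
The service features of the IP address are evaluated. Based on the correlation coefficients by which the service features of the IP address affect legitimacy, the degree of that influence is quantified as a credit value.
In an embodiment of the present application, step 203 may include the following steps:
Step 2031: look up the feature weight trained for each kind of service feature.
In one example, the feature weight trained for each kind of service feature when training the classification model is looked up, where the classification model predicts, from the service features, the second state representing the legitimacy of the IP address.
Step 2032: calculate a candidate value for each kind of service feature based on the correlation coefficient and the feature weight.
The candidate value is positively correlated with both the correlation coefficient and the feature weight. In one example, a first product of the correlation coefficient and the feature weight is calculated; a first sum of the first product and a sub-regression intercept is calculated, the sub-regression intercept being the ratio of the regression intercept to the number of kinds of service features corresponding to the IP address, and the regression intercept being used for predicting the legitimacy of the IP address; a second product of the first sum and a preset scale factor is calculated; and a second sum of the second product and a sub-offset is calculated as the candidate value, the sub-offset being the ratio of the offset to the number of kinds of service features corresponding to the IP address.
Step 2033: sum all candidate values and take the sum as the credit value representing the legitimacy of the IP address.
Since step 203 is applied in essentially the same way as step 103, it is described only briefly here; for related details, refer to the description of step 103.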
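Step 204: compare the credit value with a preset credit threshold to determine the legitimacy of the IP address, so that the legitimacy of the IP address is predicted from its service features.
The credit threshold may be trained using the method for training a credit threshold provided in any embodiment of the present application. When this credit threshold is used to divide the states of legitimacy, the evaluation metric of the prediction of IP address legitimacy from service features meets the target condition.
In this embodiment, the credit value of the current IP address may be compared with the credit threshold, and the legitimacy of the current IP address is determined from the result. If the credit value is less than the preset credit threshold, the legitimacy of the IP address is determined to be abnormal, and service operations for the IP address are prohibited. If the credit value is greater than or equal to the preset credit threshold, the legitimacy of the IP address is determined to be normal, and service operations for the IP address are allowed.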
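The decision in step 204 is a single comparison. A minimal sketch with hypothetical function names; the threshold value in the usage lines is only an example.

```python
def classify_ip(credit_value: float, credit_threshold: float) -> str:
    """Return 'abnormal' when the credit value is below the threshold, else 'normal'."""
    return "abnormal" if credit_value < credit_threshold else "normal"


def is_operation_allowed(credit_value: float, credit_threshold: float) -> bool:
    """Service operations are allowed only for IP addresses judged normal."""
    return classify_ip(credit_value, credit_threshold) == "normal"


# Usage: with a threshold of 5, a credit value of exactly 5 is still normal.
blocked = not is_operation_allowed(3.2, 5.0)
allowed = is_operation_allowed(5.0, 5.0)
```

The boundary case matters: an IP address whose credit value equals the threshold is treated as normal, matching the "greater than or equal to" condition in the text.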
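Taking the registration operation as an example, as shown in FIG. 3, the communication interface 322 may be called to request the telecom operator 312 to send a short message containing the verification code to the mobile communication terminal 313 of the specified telephone number. The client 311 may be installed in the mobile communication terminal 313 or in an electronic device other than the mobile communication terminal 313; this embodiment does not limit where the client 311 is installed.
In this embodiment, multiple kinds of service features are counted from real-time data of service operations triggered based on an IP address, and the correlation coefficient corresponding to each kind of service feature is queried, the correlation coefficient representing the correlation between that service feature and the legitimacy of the IP address. A credit value representing the legitimacy of the IP address is generated according to the correlation coefficients, and the credit value is compared with a preset credit threshold to determine the legitimacy of the IP address, so that the legitimacy of the IP address is predicted from its service features. Because the prediction of legitimacy from service features serves as the constraint, dividing legitimacy by the credit threshold remains effective. Evaluating the credit value of an IP address from multiple service features of the service operation gives a multi-dimensional, comprehensive judgment and avoids dividing legitimacy states by a threshold on a single dimension. This reduces the risk of false interception caused by a failure in a single dimension, and also reduces the risk that malicious users evade control by circumventing a single-dimension threshold, that is, the risk of missed detections, improving the overall security of website operation.
For simplicity of description, the method embodiments are each presented as a series of combined actions, but the embodiments of the present application are not limited by the described order of actions, because according to the embodiments of the present application some steps may be performed in another order or simultaneously. Furthermore, the actions involved in the described embodiments are not necessarily all required by the embodiments of the present application.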
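Embodiment 3
FIG. 4 is a structural block diagram of an apparatus for training a credit threshold provided in Embodiment 3 of the present application, which may include the following modules: a historical service feature counting module 401, configured to count multiple kinds of service features from historical data of service operations triggered based on multiple IP addresses; a correlation coefficient calculation module 402, configured to calculate at least two correlation coefficients by level for each kind of service feature, where the correlation coefficients represent the correlation between each kind of service feature and the legitimacy of an IP address; a credit value calculation module 403, configured to generate, for each IP address, a credit value representing the legitimacy of that IP address according to the correlation coefficients corresponding to the service features; an evaluation metric generation module 404, configured to generate evaluation metrics for the multiple IP addresses, where the evaluation metrics evaluate the prediction of the legitimacy of the multiple IP addresses from their service features; and a credit threshold determination module 405, configured to determine, when an evaluation metric meets the target condition, a credit value corresponding to the evaluation metric as the credit threshold, where the credit threshold is used to divide the states of legitimacy of IP addresses. The apparatus for training a credit threshold provided in the embodiments of the present application can perform the method for training a credit threshold provided in any embodiment of the present application, and has the functional modules and effects corresponding to performing the method.
Embodiment 4
FIG. 5 is a structural block diagram of an apparatus for detecting an IP address provided in Embodiment 4 of the present application, which may include the following modules: a real-time service feature counting module 501, configured to count multiple kinds of service features from real-time data of service operations triggered based on an IP address; a correlation coefficient query module 502, configured to query the correlation coefficient corresponding to each kind of service feature, where the correlation coefficient represents the correlation between that service feature and the legitimacy of the IP address; a credit value generation module 503, configured to generate a credit value representing the legitimacy of the IP address according to the correlation coefficient corresponding to each kind of service feature; and a legitimacy determination module 504, configured to compare the credit value with a preset credit threshold to determine the legitimacy of the IP address, so that the legitimacy of the IP address is predicted from its service features. The apparatus for detecting an IP address provided in the embodiments of the present application can perform the method for detecting an IP address provided in any embodiment of the present application, and has the functional modules and effects corresponding to performing the method.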
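Embodiment 5
FIG. 6 is a schematic structural diagram of a computer device provided in Embodiment 5 of the present application. FIG. 6 shows a block diagram of an exemplary computer device 12 suitable for implementing the embodiments of the present application.
As shown in FIG. 6, the computer device 12 takes the form of a general-purpose computing device. The components of the computer device 12 may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the different system components (including the system memory 28 and the processing unit 16).
The system memory 28 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32. The storage system 34 may be configured to read from and write to a non-removable, non-volatile magnetic medium (not shown in FIG. 6, commonly called a "hard disk drive"). A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory 28. The computer device 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device or a display 24). Such communication may take place through an input/output (I/O) interface 22. The computer device 12 may also communicate with one or more networks, such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet, through a network adapter 20. As shown in the figure, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18. The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example implementing the method for training a credit threshold and the method for detecting an IP address provided in the embodiments of the present application.
Embodiment 6
Embodiment 6 of the present application further provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the processes of the above method for training a credit threshold and method for detecting an IP address and achieves the same technical effects; to avoid repetition, details are not repeated here.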

Claims (20)

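  1. A method for training a credit threshold, comprising:
    counting multiple kinds of service features from historical data of service operations triggered based on multiple Internet Protocol (IP) addresses;
    calculating at least two correlation coefficients by level for each kind of service feature, wherein the correlation coefficients represent the correlation between each kind of service feature and the legitimacy of an IP address;
    for each IP address, generating a credit value representing the legitimacy of that IP address according to the correlation coefficients corresponding to the service features;
    generating evaluation metrics for the multiple IP addresses, wherein the evaluation metrics evaluate the prediction of the legitimacy of the multiple IP addresses from the service features of the multiple IP addresses;
    when an evaluation metric meets a target condition, determining a credit value corresponding to the evaluation metric as the credit threshold, wherein the credit threshold is used to divide the states of legitimacy of IP addresses.
  2. The method according to claim 1, wherein counting multiple kinds of service features from historical data of service operations triggered based on multiple IP addresses comprises:
    determining that the service operation triggered based on the multiple IP addresses is a registration operation, wherein the registration operation comprises sending a short message containing a verification code and verifying the verification code;
    counting, from the historical data of the registration operation, several of the following service features:
    the number of short message requests, the number of verification-code verifications, the success rate of verification-code verification, the number of accounts logged in, the total number of telephone numbers receiving verification codes, and the number of telephone numbers receiving verification codes across countries or regions.
  3. The method according to claim 1, wherein calculating at least two correlation coefficients by level for each kind of service feature comprises:
    setting multiple feature ranges for each kind of service feature;
    assigning the multiple IP addresses to the feature subsets corresponding to different feature ranges according to the value of each kind of service feature;
    calculating the weight of evidence of each feature range according to the IP addresses in the feature subset corresponding to that feature range, and taking the weight of evidence as the correlation coefficient of the service feature within that feature range.
  4. The method according to claim 3, wherein the IP address is labeled with a first state representing the legitimacy of the IP address;
    calculating the weight of evidence of each feature range according to the IP addresses in the feature subset corresponding to that feature range comprises:
    taking, as a first ratio, the ratio of the number of IP addresses in the feature subset whose first state is abnormal to the number of all IP addresses whose first state is abnormal for the service feature;
    taking, as a second ratio, the ratio of the number of IP addresses in the feature subset whose first state is normal to the number of all IP addresses whose first state is normal for the service feature;
    taking the logarithm of the ratio of the first ratio to the second ratio, and taking the logarithm as the weight of evidence of the service feature within the feature range.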
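  5. The method according to claim 1, wherein generating, for each IP address, a credit value representing the legitimacy of that IP address according to the correlation coefficients corresponding to the service features comprises:
    for each kind of service feature of each IP address, querying the correlation coefficient associated with the feature range in which the value of the service feature lies;
    looking up the feature weight trained for each kind of service feature;
    calculating a candidate value for each kind of service feature based on the correlation coefficient and the feature weight, wherein the candidate value is positively correlated with both the correlation coefficient and the feature weight;
    summing all candidate values and taking the sum as the credit value representing the legitimacy of each IP address.
  6. The method according to claim 5, wherein looking up the feature weight trained for each kind of service feature comprises:
    looking up the feature weight trained for each kind of service feature of each IP address when training a classification model, wherein the classification model predicts, from each kind of service feature of each IP address, a second state representing the legitimacy of that IP address.
  7. The method according to claim 5, wherein calculating a candidate value for each kind of service feature based on the correlation coefficient and the feature weight comprises:
    calculating a first product of the correlation coefficient and the feature weight;
    calculating a first sum of the first product and a sub-regression intercept, wherein the sub-regression intercept is the ratio of a regression intercept to the number of kinds of service features corresponding to the IP address, and the regression intercept is used for predicting the legitimacy of the IP address;
    calculating a second product of the first sum and a preset scale factor;
    calculating a second sum of the second product and a sub-offset, and taking the second sum as the candidate value, wherein the sub-offset is the ratio of an offset to the number of kinds of service features corresponding to the IP address.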
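  8. The method according to claim 1, wherein the IP address is labeled with a first state representing the legitimacy of the IP address, and the number of evaluation metrics is at least one;
    generating evaluation metrics for the multiple IP addresses comprises:
    inputting the service features corresponding to each IP address into a classification model, and predicting through the classification model a second state representing the legitimacy of each IP address;
    setting multiple credit value ranges;
    assigning each IP address to the IP address subset of the corresponding credit value range according to the credit value of that IP address;
    calculating the evaluation metric corresponding to each IP address subset according to the first state and second state of each IP address in the subset.
  9. The method according to claim 8, wherein the classification model is a logistic regression model;
    inputting the service features corresponding to each IP address into a classification model, and predicting through the classification model a second state representing the legitimacy of each IP address comprises:
    looking up the feature weights and regression intercept trained for each kind of service feature;
    loading the feature weights and regression intercept found into the logistic regression model;
    when the logistic regression model has finished loading the feature weights and regression intercept found, inputting the service features corresponding to each IP address into the logistic regression model, and predicting through the logistic regression model the second state representing the legitimacy of each IP address.
  10. The method according to claim 8, wherein each of the at least one evaluation metric comprises accuracy and recall;
    calculating the evaluation metric corresponding to each IP address subset according to the first state and second state of each IP address in the subset comprises:
    counting a first value TP, a second value FN and a third value TN, wherein the first value TP is the number of IP addresses in the IP address subset whose first state is abnormal and whose second state is abnormal, the second value FN is the number of IP addresses in the IP address subset whose first state is abnormal and whose second state is normal, and the third value TN is the number of IP addresses in the IP address subset whose first state is normal and whose second state is normal;
    calculating the ratio of a fourth value to the total number of IP addresses in the IP address subset as the accuracy, wherein the fourth value is the sum of the first value TP and the third value TN;
    calculating the ratio of the first value TP to a fifth value as the recall, wherein the fifth value is the sum of the first value TP and the second value FN.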
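  11. The method according to claim 8, further comprising:
    inputting the service features corresponding to each IP address into the classification model, and predicting through the classification model the second state representing the legitimacy of each IP address;
    calculating a loss value between the first state and the second state of each IP address;
    determining whether the loss value is less than or equal to a preset threshold;
    in response to the loss value being less than or equal to the preset threshold, determining that training of the classification model is complete;
    in response to the loss value being greater than the preset threshold, updating the model parameters of the classification model and returning to inputting the service features corresponding to each IP address into the classification model and predicting through the classification model the second state representing the legitimacy of each IP address, wherein the model parameters comprise at least one of the feature weight set for each kind of service feature and the regression intercept.
  12. The method according to claim 10, wherein determining, when an evaluation metric meets the target condition, a credit value corresponding to the evaluation metric as the credit threshold comprises:
    comparing the accuracies in the at least one evaluation metric;
    when the highest accuracy corresponds to one IP address subset, determining the upper-limit credit value of the credit value range corresponding to that IP address subset as the credit threshold;
    when the highest accuracy corresponds to at least two IP address subsets, comparing the recalls corresponding to the at least two IP address subsets, and determining the upper-limit credit value of the credit value range corresponding to the IP address subset with the highest recall as the credit threshold.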
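  13. A method for detecting an Internet Protocol (IP) address, comprising:
    counting multiple kinds of service features from real-time data of service operations triggered based on an IP address;
    querying the correlation coefficient corresponding to each kind of service feature, wherein the correlation coefficient represents the correlation between that service feature and the legitimacy of the IP address;
    generating a credit value representing the legitimacy of the IP address according to the correlation coefficient corresponding to each kind of service feature;
    comparing the credit value with a preset credit threshold to determine the legitimacy of the IP address, so that the legitimacy of the IP address is predicted from the service features of the IP address.
  14. The method according to claim 13, wherein querying the correlation coefficient corresponding to each kind of service feature comprises:
    querying the multiple feature ranges set for each kind of service feature;
    when the value of a service feature lies in one of the multiple feature ranges, extracting the correlation coefficient corresponding to that feature range.
  15. The method according to claim 13, wherein comparing the credit value with a preset credit threshold to determine the legitimacy of the IP address comprises:
    when the credit value is less than the preset credit threshold, determining that the legitimacy of the IP address is abnormal;
    when the credit value is greater than or equal to the preset credit threshold, determining that the legitimacy of the IP address is normal.
  16. The method according to any one of claims 13 to 15, further comprising:
    when the legitimacy of the IP address is abnormal, prohibiting service operations for the IP address.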
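  17. An apparatus for training a credit threshold, comprising:
    a historical service feature counting module, configured to count multiple kinds of service features from historical data of service operations triggered based on multiple Internet Protocol (IP) addresses;
    a correlation coefficient calculation module, configured to calculate at least two correlation coefficients by level for each kind of service feature, wherein the correlation coefficients represent the correlation between each kind of service feature and the legitimacy of an IP address;
    a credit value calculation module, configured to generate, for each IP address, a credit value representing the legitimacy of that IP address according to the correlation coefficients corresponding to the service features;
    an evaluation metric generation module, configured to generate evaluation metrics for the multiple IP addresses, wherein the evaluation metrics evaluate the prediction of the legitimacy of the multiple IP addresses from the service features of the multiple IP addresses;
    a credit threshold determination module, configured to determine, when an evaluation metric meets a target condition, a credit value corresponding to the evaluation metric as the credit threshold, wherein the credit threshold is used to divide the states of legitimacy of IP addresses.
  18. An apparatus for detecting an Internet Protocol (IP) address, comprising:
    a real-time service feature counting module, configured to count multiple kinds of service features from real-time data of service operations triggered based on an IP address;
    a correlation coefficient query module, configured to query the correlation coefficient corresponding to each kind of service feature, wherein the correlation coefficient represents the correlation between that service feature and the legitimacy of the IP address;
    a credit value generation module, configured to generate a credit value representing the legitimacy of the IP address according to the correlation coefficient corresponding to each kind of service feature;
    a legitimacy determination module, configured to compare the credit value with a preset credit threshold to determine the legitimacy of the IP address, so that the legitimacy of the IP address is predicted from the service features of the IP address.
  19. A computer device, comprising:
    at least one processor;
    a memory, configured to store at least one program;
    wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method for training a credit threshold according to any one of claims 1 to 12 or the method for detecting an Internet Protocol (IP) address according to any one of claims 13 to 16.
  20. A computer-readable storage medium, configured to store a computer program which, when executed by a processor, implements the method for training a credit threshold according to any one of claims 1 to 12 or the method for detecting an Internet Protocol (IP) address according to any one of claims 13 to 16.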
PCT/CN2021/111096 2020-08-13 2021-08-06 信用阈值的训练方法及装置、ip地址的检测方法及装置 WO2022033396A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US18/041,275 US20230328087A1 (en) 2020-08-13 2021-08-06 Method for training credit threshold, method for detecting ip address, computer device and storage medium
EP21855449.1A EP4199421A1 (en) 2020-08-13 2021-08-06 Credit threshold training method and apparatus, and ip address detection method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010813912.6 2020-08-13
CN202010813912.6A CN112003846B (zh) 2020-08-13 2020-08-13 一种信用阈值的训练、ip地址的检测方法及相关装置

Publications (1)

Publication Number Publication Date
WO2022033396A1 true WO2022033396A1 (zh) 2022-02-17

Family

ID=73472791

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/111096 WO2022033396A1 (zh) 2020-08-13 2021-08-06 信用阈值的训练方法及装置、ip地址的检测方法及装置

Country Status (4)

Country Link
US (1) US20230328087A1 (zh)
EP (1) EP4199421A1 (zh)
CN (1) CN112003846B (zh)
WO (1) WO2022033396A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114900356A (zh) * 2022-05-06 2022-08-12 联云(山东)大数据有限公司 恶意用户行为检测方法、装置及电子设备

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112003846B (zh) * 2020-08-13 2023-02-03 广州市百果园信息技术有限公司 一种信用阈值的训练、ip地址的检测方法及相关装置
CN113329034B (zh) * 2021-06-25 2021-12-07 广州华资软件技术有限公司 基于人工智能的大数据业务优化方法、服务器及存储介质
CN116131928B (zh) * 2023-01-30 2023-10-03 讯芸电子科技(中山)有限公司 一种光传输线路调整方法、装置、设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105323210A (zh) * 2014-06-10 2016-02-10 腾讯科技(深圳)有限公司 一种检测网站安全的方法、装置及云服务器
CN108667828A (zh) * 2018-04-25 2018-10-16 咪咕文化科技有限公司 一种风险控制方法、装置及存储介质
US20190158526A1 (en) * 2016-09-30 2019-05-23 Oath Inc. Computerized system and method for automatically determining malicious ip clusters using network activity data
CN112003846A (zh) * 2020-08-13 2020-11-27 广州市百果园信息技术有限公司 一种信用阈值的训练、ip地址的检测方法及相关装置

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9386031B2 (en) * 2014-09-12 2016-07-05 AO Kaspersky Lab System and method for detection of targeted attacks
WO2019178753A1 (zh) * 2018-03-20 2019-09-26 深圳蓝贝科技有限公司 支付方法、装置和系统
US11899763B2 (en) * 2018-09-17 2024-02-13 Microsoft Technology Licensing, Llc Supervised learning system for identity compromise risk computation
CN110349038A (zh) * 2019-06-13 2019-10-18 中国平安人寿保险股份有限公司 风险评估模型训练方法和风险评估方法
CN111080397A (zh) * 2019-11-18 2020-04-28 支付宝(杭州)信息技术有限公司 信用评估方法、装置及电子设备
CN111049822B (zh) * 2019-12-10 2022-04-22 北京达佳互联信息技术有限公司 短信验证码发送方法、装置、短信服务器及存储介质

Also Published As

Publication number Publication date
CN112003846A (zh) 2020-11-27
US20230328087A1 (en) 2023-10-12
CN112003846B (zh) 2023-02-03
EP4199421A1 (en) 2023-06-21

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021855449

Country of ref document: EP

Effective date: 20230313