WO2018177167A1 - Method for analyzing ip address, system, computer readable storage medium, and computer device - Google Patents

Method for analyzing ip address, system, computer readable storage medium, and computer device

Info

Publication number
WO2018177167A1
WO2018177167A1 PCT/CN2018/079732 CN2018079732W WO2018177167A1 WO 2018177167 A1 WO2018177167 A1 WO 2018177167A1 CN 2018079732 W CN2018079732 W CN 2018079732W WO 2018177167 A1 WO2018177167 A1 WO 2018177167A1
Authority
WO
WIPO (PCT)
Prior art keywords
period
score
address
probability
rest
Prior art date
Application number
PCT/CN2018/079732
Other languages
French (fr)
Chinese (zh)
Inventor
刘鑫琪
童剑
Original Assignee
贵州白山云科技有限公司
Priority date
Filing date
Publication date
Application filed by 贵州白山云科技有限公司
Publication of WO2018177167A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/02 Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227 Filtering policies
    • H04L63/0236 Filtering by address, protocol, port number or service, e.g. IP-address or URL

Definitions

  • Embodiments of the present invention relate to, but are not limited to, the field of Internet user profiling, and in particular to an IP address analysis method and system, a computer-readable storage medium, and a computer device.
  • A network-layer firewall can be thought of as an IP packet filter that operates on the underlying TCP/IP protocol stack. In enumeration (allow-list) mode, only packets that match specific rules are allowed to pass, and all others are prohibited from crossing the firewall (viruses are the exception: a firewall cannot prevent virus intrusion). These rules can usually be defined or modified by the administrator, although some firewall devices can only apply built-in rules.
  • Firewall rules can also be formulated in a more relaxed way, so that a packet passes as long as it does not match any of the "negative rules". Most operating systems and network devices have built-in firewall functions.
  • Newer firewalls can filter on various attributes of a packet, such as source IP address, source port number, destination IP address or port number, and service type (such as HTTP or FTP). They can also filter on attributes such as communication protocol, TTL value, and source domain name or network segment....
  • Existing interception schemes intercept traffic according to a single fixed rule. The granularity and dimensions of access-request analysis are insufficient and the identity of the visitor is not recognized, which easily leads to mistaken interception.
  • Embodiments of the present invention are directed to solving the problems described above. It is an object of embodiments of the present invention to provide a solution to any of the above problems. Specifically, embodiments of the present invention provide the capability.
  • an IP address analysis method including:
  • the historical data of the IP address is analyzed to generate credit data of the IP address.
  • the step of collecting historical data of the IP address includes:
  • the predefined indicator including at least one or more of the following information:
  • the predefined indicators corresponding to the respective IP addresses are stored, and each IP address corresponds to one or more predefined indicators.
  • the step of collecting historical data of the IP address further includes:
  • third-party IP libraries and/or third-party IP blacklists are obtained from third-party platforms one or more times.
  • the step of analyzing the historical data of the IP address to generate the credit data of the IP address includes:
  • the indicator is pre-processed and normalized to obtain an intermediate value of each rest day, where the second period includes a plurality of first periods corresponding to the working days and a plurality of first periods corresponding to the rest days;
  • the temporary specific indicators of the current second period are calculated from one or more working-day weighted means and one or more rest-day weighted means, and the temporary specific indicators of the current second period include:
  • the probability that the IP is an office exit IP in the current period, the probability that it is a household exit IP in the current period, the probability that it is a real person in the current period, the activity score of the current period, and the headcount group of the current period.
  • the final specific indicator of the second period serves as credit data for the IP address.
  • the steps of pre-processing and normalizing the predefined indicators of the first periods corresponding to working days in the second period to obtain an intermediate value for each working day, and pre-processing and normalizing the predefined indicators of the first periods corresponding to rest days in the second period to obtain an intermediate value for each rest day, include:
  • the weighted mean of the scores is taken as the rest-day intermediate value of the household exit IP probability;
  • the step of calculating the temporary specific indicators of the current second period from one or more working-day weighted means and one or more rest-day weighted means includes:
  • the weighted average of the intermediate values of the office exit IP probability is taken as the probability that the IP is an office exit IP in the current period;
  • the weighted average of the working-day intermediate value and the rest-day intermediate value of the household exit IP probability is taken as the probability that the IP is a household exit IP in the current period;
  • the weighted average of the working-day intermediate value and the rest-day intermediate value of the real-person probability is taken as the real-person probability of the current period;
  • the weighted average of the working-day activity intermediate value and the rest-day activity intermediate value is taken as the activity score of the current period;
  • the maximum among the working-day mobile UserAgent count, the rest-day mobile UserAgent count, the working-day PC-side UserAgent count, and the rest-day PC-side UserAgent count is grouped to give the headcount group of the current period.
  • the final specific indicators include at least one or more of the following:
  • IP, IPInt, update ID, the number of IP updates, the final headcount group, the final sum of office exit IP probabilities, the final sum of household exit IP probabilities, the final sum of real-person probabilities, and the final sum of activity scores;
  • the IPInt is the long integer corresponding to the IP address;
  • the update ID is the number of times the final specific indicators of the second period have been updated;
  • the number of IP updates is the number of times the final specific indicators of the second period have been updated for this IP address.
  • the step of adjusting the temporary specific indicators of the current second period according to the final specific indicators of the previous second period and the third-party IP library and/or the third-party IP blacklist to obtain the final specific indicators of the current second period includes:
  • the final specific indicators of the current second period are obtained by the following calculation:
  • if the headcount group of the current period and the final headcount group of the previous second period are adjacent, the larger group is selected as the final headcount group of the current second period; otherwise the group of the current period is selected;
  • the office exit IP probability in the temporary specific indicators of the current second period plus the final sum of office exit IP probabilities in the final specific indicators of the previous second period is the final sum of office exit IP probabilities of the current second period;
  • the household exit IP probability in the temporary specific indicators of the current second period plus the final sum of household exit IP probabilities in the final specific indicators of the previous second period is the final sum of household exit IP probabilities of the current second period;
  • the real-person probability in the temporary specific indicators of the current second period plus the final sum of real-person probabilities in the final specific indicators of the previous second period is the final sum of real-person probabilities of the current second period;
  • the final specific indicators of the current second period overwrite the final specific indicators of the previous second period, and the number of times the final specific indicators of the second period have been updated and the number of updates for the corresponding IP address are recorded.
  • the step of adjusting the temporary specific indicators of the current second period according to the final specific indicators of the previous second period and the third-party IP library and/or the third-party IP blacklist to obtain the final specific indicators of the current second period further includes:
  • according to the additional information about the IP address contained in the third-party IP library, adjusting the office exit IP probability, the household exit IP probability, and the real-person probability in the temporary specific indicators of the current second period;
  • the method further comprises:
  • receiving an IP verification request for the IP address issued by a third party, looking up the credit data corresponding to the IP address, performing a credit rating evaluation of the IP address according to the credit data, and returning the evaluation result to the third party.
  • an IP address analysis system including a big data platform and an offline computing platform;
  • the big data platform is configured to store original logs, calculate original logs, and collect and store historical data of IP addresses;
  • the offline computing platform is configured to analyze historical data of an IP address collected by the big data platform to generate credit data of an IP address.
  • a computer-readable storage medium having a computer program stored thereon, the program implementing the steps of the above method when executed by a processor.
  • a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the program.
  • the method and system of the embodiments of the present invention collect historical data of an IP address and analyze it to generate credit data for the IP address. This achieves fine-grained, accurate analysis of IP addresses: big data is used to determine the attributes of an IP address and to gain a comprehensive and accurate understanding of its credit status. The approach can be applied to scenarios such as IP address legitimacy verification and IP address interception, and it addresses the problem of misjudging IP address security, effectively preventing legitimacy misjudgments and mistaken interception of IP addresses.
  • FIG. 1 exemplarily shows a flow of an IP address analysis method according to Embodiment 1 of the present invention
  • FIG. 2 exemplarily shows an application principle of the technical solution provided by the embodiment of the present invention
  • FIG. 3 exemplarily shows an architecture of an IP address analysis system provided by Embodiment 2 of the present invention.
  • Existing interception schemes intercept traffic according to a single fixed rule. The granularity and dimensions of access-request analysis are insufficient and the identity of the visitor is not recognized, which easily leads to mistaken interception.
  • an embodiment of the present invention provides an IP address analysis method, and Embodiment 1 of the present invention will be described below with reference to the accompanying drawings.
  • This embodiment of the present invention provides an IP address analysis method that analyzes an IP address in detail, based on the historical data of the IP address and related additional information, to obtain credit data for the IP address.
  • The IP address is evaluated according to its credit data, achieving fine-grained, accurate analysis and judgment of IP addresses.
  • Evaluating the security level of the IP address through the credit data effectively prevents misjudgment and mistaken interception.
  • the specific process is as shown in FIG. 1 and includes:
  • Step 101: Collect historical data of the IP address.
  • In this step, original logs, such as CDN server logs, are collected.
  • The information in the CDN server logs is formatted to obtain the predefined indicators.
  • The CDN server logs are specifically CDN nginx logs; other network access logs can also be used for the calculation.
  • the predefined indicator includes at least one or more of the following information:
  • the "rest period request number" refers to the number of requests issued by the IP address during the rest period;
  • the "sleep period request number" refers to the number of requests issued by the IP address during the sleep period;
  • the "working period request file size" refers to the sum of the file sizes requested by the IP address during the working period;
  • the "rest period request file size" refers to the sum of the file sizes requested by the IP address during the rest period;
  • the "sleep period request file size" refers to the sum of the file sizes requested by the IP address during the sleep period;
  • the "number of UserAgents in the working period" refers to the number of UserAgents that appear under the IP address during the working period;
  • the "number of UserAgents in the rest period" refers to the number of UserAgents that appear under the IP address during the rest period;
  • the "number of UserAgents in the sleep period" refers to the number of UserAgents that appear under the IP address during the sleep period, and the "number of mobile UserAgents" refers to the number of UserAgents under the IP address that access from mobile terminals;
  • the "number of PC-side UserAgents" refers to the number of UserAgents under the IP address that access from PCs;
  • the "number of access sources" refers to the number of access sources of the IP address in the first period;
  • the "number of access domain names" refers to the number of domain names accessed by the IP address in the first period, and the "hours of occurrence" refers to the number of hours in which the IP address appears in the first period (each hour in which the IP makes at least one access counts as 1 hour; for example, if the IP makes accesses at 2 o'clock and at 4 o'clock, the value is 2 hours).
  • the predefined indicators corresponding to the respective IP addresses are stored, and each IP address corresponds to one or more predefined indicators.
  • the predefined indicator may be stored in a hive table.
  • the first period involved in this step is preferably 1 natural day (24 hours).
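  • As a minimal sketch of how per-IP predefined indicators for one first period (one day) might be aggregated from formatted log records: the record layout, the function names, and the hour boundaries of the working, rest and sleep periods are all assumptions (the text only implies period lengths of 12, 4 and 8 hours through the divisors used later).

```python
from collections import defaultdict

def period_of(hour: int) -> str:
    """Map an hour of the day to a period. Boundaries are hypothetical:
    the source only implies 12 h working, 4 h rest and 8 h sleep."""
    if 0 <= hour < 8:
        return "sleep"
    if 8 <= hour < 20:
        return "work"
    return "rest"

def daily_indicators(records):
    """Aggregate one day of records per IP.

    records: iterable of (ip, hour, user_agent, domain, size_bytes) tuples.
    Returns {ip: {indicator_name: value}}.
    """
    stats = defaultdict(lambda: {
        "requests": defaultdict(int),
        "bytes": defaultdict(int),
        "user_agents": defaultdict(set),
        "domains": set(),
        "hours": set(),
    })
    for ip, hour, user_agent, domain, size_bytes in records:
        period = period_of(hour)
        s = stats[ip]
        s["requests"][period] += 1
        s["bytes"][period] += size_bytes
        s["user_agents"][period].add(user_agent)
        s["domains"].add(domain)
        s["hours"].add(hour)  # distinct hours in which the IP appears

    result = {}
    for ip, s in stats.items():
        row = {}
        for period in ("work", "rest", "sleep"):
            row[f"{period}_requests"] = s["requests"][period]
            row[f"{period}_bytes"] = s["bytes"][period]
            row[f"{period}_user_agents"] = len(s["user_agents"][period])
        row["domains"] = len(s["domains"])
        row["hours_of_occurrence"] = len(s["hours"])
        result[ip] = row
    return result
```

  • With this sketch, an IP that only makes accesses at 2 o'clock and 4 o'clock gets an hours-of-occurrence value of 2, matching the example in the text.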
  • The historical data of the IP address can also be obtained from a third-party platform, for example by obtaining a third-party IP library and/or a third-party IP blacklist from the third-party platform one or more times in the second period. The third-party IP library usually contains additional information about the IP address (such as the distribution of the IP address or IP address segment, that is, the country, province, city, and operator corresponding to the IP segment; it may also indicate the name of a company, or mark the address as a data center).
  • Third-party platform data is updated irregularly, so it can be obtained after the third-party platform updates its data, or before the calculation for the second period is about to start.
  • Step 102: Analyze the historical data of the IP address and generate the credit data of the IP address.
  • The predefined indicators of the first periods corresponding to working days in the second period are pre-processed and normalized to obtain an intermediate value for each working day, and the predefined indicators of the first periods corresponding to rest days in the second period are pre-processed and normalized to obtain an intermediate value for each rest day, wherein the second period includes a plurality of first periods corresponding to working days and a plurality of first periods corresponding to rest days;
  • The intermediate values of the working days are combined by weighted averaging (arithmetic or geometric) to obtain the working-day weighted means, and the intermediate values of the rest days are combined by weighted averaging or by taking the maximum to obtain the rest-day weighted means or maxima;
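  • The two weighted averages named here can be sketched as follows. The function names are illustrative, the weights are left to the caller because the text does not specify them, and the geometric variant assumes strictly positive values.

```python
import math

def weighted_arithmetic_mean(values, weights):
    """Sum of w_i * v_i divided by the sum of the weights."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

def weighted_geometric_mean(values, weights):
    """exp of the weighted mean of log(v_i); requires every v_i > 0."""
    total = sum(weights)
    return math.exp(sum(w * math.log(v) for v, w in zip(values, weights)) / total)
```

  • The geometric form is pulled down more strongly by a single low intermediate value than the arithmetic form, which may matter when one atypical day should dominate the result.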
  • The temporary specific indicators of the current second period are calculated from one or more working-day weighted means and one or more rest-day weighted means, and include: the probability that the IP is an office exit IP in the current period, the probability that it is a household exit IP in the current period
  • the second period is an integer multiple of the first period; preferably, when the first period is day, the second period is month or week.
  • In this embodiment the first period is 1 day and the second period is 1 month. The normalization algorithm used in the embodiment is a modified sigmoid function, 1.0/(1.0+math.exp(-numerator/denominator+4.0)); because the input values are all greater than or equal to 0, 4.0 is added in the exponent.
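  • A minimal sketch of the modified sigmoid quoted above. The function name and the handling of a zero denominator are assumptions; the text does not say how that case is treated.

```python
import math

def normalize(numerator: float, denominator: float) -> float:
    """Modified sigmoid from the text: 1 / (1 + exp(-n/d + 4)).

    Assumption: a zero denominator is treated as a ratio of 0.
    """
    ratio = numerator / denominator if denominator else 0.0
    return 1.0 / (1.0 + math.exp(-ratio + 4.0))
```

  • Because the ratio is never negative, an unshifted sigmoid could only return values in [0.5, 1). Adding 4.0 moves the midpoint to a ratio of 4, so a ratio of 0 maps to roughly 0.018 and the output covers almost the whole (0, 1) range.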
  • With the IP as the dimension, the number of mobile UserAgents, the number of PC-side UserAgents, and the number of requests are calculated for each first period corresponding to a working day.
  • Number of requests: the number of working-period requests plus the number of rest-period requests plus the number of sleep-period requests.
  • Working-day intermediate value of the household exit IP probability:
  • Hours-of-occurrence score: normalization algorithm; the numerator is 6 and the denominator is the number of hours of occurrence;
  • Mobile UserAgent score: normalization algorithm; the numerator is the average number of mobile UserAgents per IP over a period of time (for example 10, updated irregularly), and the denominator is the number of mobile UserAgents;
  • PC-side UserAgent score: normalization algorithm; the numerator is the average number of PC-side UserAgents per IP over a period of time (for example 5, updated irregularly), and the denominator is the number of PC-side UserAgents;
  • Rest-period vs working-period request score: normalization algorithm; the numerator is the number of rest-period requests divided by 4, and the denominator is the number of working-period requests divided by 12;
  • Rest-period vs sleep-period request score: normalization algorithm; the numerator is the number of rest-period requests divided by 4, and the denominator is the number of sleep-period requests divided by 8;
  • Access-domain-name score: normalization algorithm; the numerator is the average number of domain names accessed per IP per day (updated irregularly), and the denominator is the number of domain names;
  • The weighted mean of the above hours-of-occurrence score, mobile UserAgent score, PC-side UserAgent score, rest-period vs working-period request score, rest-period vs sleep-period request score, and access-domain-name score is the working-day intermediate value of the household exit IP probability.
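  • The six scores above can be sketched as follows for one working day. This is only an illustration under stated assumptions: the dictionary keys, the equal default weights, the default population averages (10 and 5 follow the examples in the text, 3 for domain names is invented), and the zero-denominator handling are not from the source.

```python
import math

def normalize(numerator, denominator):
    """Modified sigmoid from the text; zero denominator treated as ratio 0 (assumption)."""
    ratio = numerator / denominator if denominator else 0.0
    return 1.0 / (1.0 + math.exp(-ratio + 4.0))

def household_intermediate(day, avg_mobile_ua=10.0, avg_pc_ua=5.0,
                           avg_domains=3.0, weights=None):
    """Working-day intermediate value of the household exit IP probability.

    day: dict with keys hours, mobile_ua, pc_ua, work_requests,
    rest_requests, sleep_requests, domains (hypothetical names).
    """
    scores = [
        normalize(6, day["hours"]),                      # hours of occurrence
        normalize(avg_mobile_ua, day["mobile_ua"]),      # mobile UserAgents
        normalize(avg_pc_ua, day["pc_ua"]),              # PC-side UserAgents
        normalize(day["rest_requests"] / 4, day["work_requests"] / 12),
        normalize(day["rest_requests"] / 4, day["sleep_requests"] / 8),
        normalize(avg_domains, day["domains"]),          # access domain names
    ]
    weights = weights or [1.0] * len(scores)  # equal weights are an assumption
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
```

  • An IP seen for a few evening hours with one or two UserAgents scores noticeably higher under this sketch than an IP seen all day with dozens of UserAgents and mostly working-period traffic.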
  • Working-period request score: normalization algorithm; the numerator is the number of working-period requests divided by 12, and the denominator is the average hourly number of working-period requests per IP over a period of time (updated irregularly);
  • Working-period vs rest-period request score: pre-processing and normalization algorithm; the numerator is the number of working-period requests divided by 12, and the denominator is the number of rest-period requests divided by 4;
  • Rest-period vs sleep-period request score: pre-processing and normalization algorithm; the numerator is the number of rest-period requests divided by 4, and the denominator is the number of sleep-period requests divided by 8;
  • PC-side UserAgent score: pre-processing and normalization algorithm; the numerator is the average number of PC-side UserAgents per IP over a period of time (for example 10, updated irregularly), and the denominator is the number of PC-side UserAgents;
  • Working-period vs rest-period UserAgent count score: pre-processing and normalization algorithm; the numerator is the number of UserAgents in the working period, and the denominator is the number of UserAgents in the rest period;
  • Request distribution score: pre-processing and normalization algorithm; the numerator is the standard deviation of the number of working-period requests divided by 12, the number of rest-period requests divided by 4, and the number of sleep-period requests divided by 8, and the denominator is 1;
  • UserAgent distribution score: normalization algorithm; the numerator is the standard deviation of the number of UserAgents in the working period, the number of UserAgents in the rest period, and the number of UserAgents in the sleep period, and the denominator is 1;
  • Hours-of-occurrence score: normalization algorithm; the numerator is 6 and the denominator is the number of hours of occurrence;
  • Domain name vs source count score: normalization algorithm; the numerator is the number of sources, and the denominator is the number of domain names;
  • Mobile UserAgent vs PC-side UserAgent count score: normalization algorithm; the numerator is the number of mobile UserAgents, and the denominator is the number of PC-side UserAgents;
  • Access-domain-name count score: normalization algorithm; the numerator is the number of domain names, and the denominator is 10;
  • Working-period request score: normalization algorithm; the numerator is the number of working-period requests divided by 12, and the denominator is the average hourly number of working-period requests per IP over a period of time (updated irregularly);
  • Rest-period request score: normalization algorithm; the numerator is the number of rest-period requests divided by 4, and the denominator is the average hourly number of rest-period requests per IP over a period of time (updated irregularly);
  • Sleep-period request score: normalization algorithm; the numerator is the number of sleep-period requests divided by 8, and the denominator is the average hourly number of sleep-period requests per IP over a period of time (updated irregularly);
  • Hours-of-occurrence score: normalization algorithm; the numerator is the number of hours of occurrence, and the denominator is 6;
  • Request source score: normalization algorithm; the numerator is the number of request sources, and the denominator is the mean of the per-IP average daily request counts (updated irregularly).
  • With the IP as the dimension, the following are calculated over the first periods corresponding to working days: the weighted mean of the number of mobile UserAgents, the weighted mean of the number of PC-side UserAgents, the weighted mean of the number of requests, the working-day weighted mean of the household exit IP probability, the working-day weighted mean of the office exit IP probability, the working-day weighted mean of the real-person probability, and the working-day weighted mean of the activity.
  • With the IP as the dimension, the number of mobile UserAgents, the number of PC-side UserAgents, and the number of requests are calculated for each first period corresponding to a rest day.
  • Number of requests: the number of working-period requests plus the number of rest-period requests plus the number of sleep-period requests.
  • Rest-day intermediate value of the household exit IP probability: calculated in a way similar to the working-day algorithm.
  • Rest-day intermediate value of the real-person probability: calculated in a way similar to the working-day algorithm.
  • With the IP as the dimension, the following are calculated over the first periods corresponding to rest days:
  • the rest-day weighted mean of the household exit IP probability;
  • the rest-day weighted mean of the real-person probability;
  • the rest-day weighted mean of the activity.
  • IPInt: the IP converted to the corresponding long integer.
  • Headcount group of the current period. Working-day PC-side UserAgent count: the weighted mean of the number of PC-side UserAgents over the first periods corresponding to working days.
  • Working-day mobile UserAgent count: the weighted mean of the number of mobile UserAgents over the first periods corresponding to working days.
  • Rest-day PC-side UserAgent count: the weighted mean of the number of PC-side UserAgents over the first periods corresponding to rest days.
  • Rest-day mobile UserAgent count: the weighted mean of the number of mobile UserAgents over the first periods corresponding to rest days.
  • The counts are grouped as follows (group: count range): 1: 0-1, 2: 2-5, 3: 6-10, 4: 11-30, 5: 31-50, 6: 51-100, 7: 101-500, 8: 501-2000, 9: >2000.
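  • The grouping table can be sketched as a lookup over the upper bound of each range. The function name is illustrative, and sending fractional counts (which can arise from weighted means) to the next group up is an assumption.

```python
from bisect import bisect_left

# Upper bounds of groups 1..8 from the table; anything above 2000 is group 9.
_UPPER_BOUNDS = [1, 5, 10, 30, 50, 100, 500, 2000]

def headcount_group(count: float) -> int:
    """Return the group number (1..9) for a UserAgent count."""
    return bisect_left(_UPPER_BOUNDS, count) + 1
```

  • For example, a count of 5 falls in group 2 and a count of 6 in group 3.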
  • PC-side UserAgent count score: normalization algorithm; the numerator is the number of PC-side UserAgents on working days, and the denominator is the number of PC-side UserAgents on rest days;
  • Working-day vs rest-day request score: normalization algorithm; the numerator is the number of working-day requests, and the denominator is the number of rest-day requests;
  • Current-period office exit IP probability: the weighted average of the weighted means of the above three scores and the working-day intermediate value of the office exit IP probability.
  • Current-period household exit IP probability: the weighted average of the working-day intermediate value and the rest-day intermediate value of the household exit IP probability.
  • Current-period real-person probability: the weighted average of the working-day intermediate value and the rest-day intermediate value of the real-person probability.
  • Current-period activity score: the weighted average of the working-day activity intermediate value and the rest-day activity intermediate value.
  • Current-period headcount group: the group of the maximum among the working-day mobile UserAgent count, the rest-day mobile UserAgent count, the working-day PC-side UserAgent count, and the rest-day PC-side UserAgent count.
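  • A minimal sketch of how a working-day value and a rest-day value might be combined into a current-period indicator. The 5:2 default weights are hypothetical (the text gives no weights), and taking the maximum of the four UserAgent counts is one reading of the headcount description, not a confirmed rule.

```python
def combine(workday_value: float, restday_value: float,
            workday_weight: float = 5.0, restday_weight: float = 2.0) -> float:
    """Weighted average of a working-day and a rest-day intermediate value.
    The default weights are illustrative only."""
    total = workday_weight + restday_weight
    return (workday_value * workday_weight + restday_value * restday_weight) / total

def period_headcount(workday_mobile: float, restday_mobile: float,
                     workday_pc: float, restday_pc: float) -> float:
    """Headcount estimate before grouping: the largest of the four counts
    (an interpretation of the text)."""
    return max(workday_mobile, restday_mobile, workday_pc, restday_pc)
```

  • The same `combine` helper would apply to the household exit IP probability, the real-person probability and the activity score, each with its own working-day and rest-day inputs.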
  • The temporary specific indicators of the current second period are stored in a MySQL temporary data table, and the process enters the adjustment phase.
  • The temporary specific indicators of the current second period stored in MySQL, and the final specific indicators of the previous second period that were stored in MySQL during the previous second period, are the inputs to this phase.
  • Third-party IP library information: the additional information contained in the third-party IP library is checked for the following sensitive strings, and the corresponding adjustment indexes are returned: "Company", "Data Center", "GSM/TD-SCDMA/LTE".
  • The adjustment indexes are three: one for the real-person probability, one for the office exit IP probability, and one for the household exit IP probability. If no sensitive string is found, all three adjustment indexes are 1.
  • The three probabilities are multiplied by their respective adjustment indexes, and the results must lie in the range [0.05, 0.95]: a value below 0.05 is returned as 0.05, and a value above 0.95 is returned as 0.95.
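  • The adjustment step can be sketched as below. The three sensitive strings and the [0.05, 0.95] clamp come from the text; the numeric adjustment indexes, the key names, and the first-match rule are hypothetical placeholders, since the source does not give them.

```python
# Hypothetical adjustment indexes per sensitive string (values are NOT from the source).
SENSITIVE_ADJUSTMENTS = {
    "Company": {"real_person": 1.0, "office": 1.2, "household": 0.8},
    "Data Center": {"real_person": 0.5, "office": 0.8, "household": 0.5},
    "GSM/TD-SCDMA/LTE": {"real_person": 1.2, "office": 0.8, "household": 0.8},
}
NEUTRAL = {"real_person": 1.0, "office": 1.0, "household": 1.0}

def adjustment_for(extra_info: str) -> dict:
    """Return the adjustment indexes for the first sensitive string found,
    or all ones when none is present."""
    for needle, factors in SENSITIVE_ADJUSTMENTS.items():
        if needle in extra_info:
            return factors
    return NEUTRAL

def clamp(p: float, low: float = 0.05, high: float = 0.95) -> float:
    """Keep a probability inside [0.05, 0.95] as required by the text."""
    return max(low, min(high, p))

def adjust(probabilities: dict, extra_info: str) -> dict:
    """Multiply each probability by its adjustment index, then clamp."""
    factors = adjustment_for(extra_info)
    return {name: clamp(p * factors[name]) for name, p in probabilities.items()}
```

  • Clamping keeps any single period from pinning a probability at 0 or 1, so later periods can still move the accumulated sums.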
  • The way the final specific indicators of the current second period are generated for an IP address differs from case to case, as follows:
  • Update ID: the number of times the second-period data has been updated.
  • IPInt: the IP converted to the corresponding long integer.
  • Number of IP updates: the number of IP updates in the previous second period plus one.
  • Final headcount group: if the headcount group in the temporary specific indicators of the current second period and the final headcount group in the final specific indicators of the previous second period are adjacent groups, the larger group is used; otherwise the group from the temporary data is used.
  • Final sum of office exit IP probabilities: the office exit IP probability in the temporary specific indicators of the current second period plus the final sum of office exit IP probabilities in the final specific indicators of the previous second period.
  • Final sum of household exit IP probabilities: the household exit IP probability in the temporary specific indicators of the current second period plus the final sum of household exit IP probabilities in the final specific indicators of the previous second period.
  • Final sum of real-person probabilities: the real-person probability in the temporary specific indicators of the current second period plus the final sum of real-person probabilities in the final specific indicators of the previous second period.
  • Final sum of activity scores: the activity score in the temporary specific indicators of the current second period plus the final activity score of the previous second period, divided by the number of updates recorded by the update ID, multiplied by the number of IP updates.
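  • The merge rules above can be sketched as follows. Field names and the function name are hypothetical. The activity formula is ambiguous in the text; this sketch reads it as (temporary + previous) / update ID × number of IP updates, which is only one possible interpretation.

```python
def merge_final(temp: dict, previous: dict, update_id: int) -> dict:
    """Combine the current period's temporary indicators with the previous
    period's final indicators for one IP.

    temp keys: group, office, household, real, activity
    previous keys: ip_updates, group, office_sum, household_sum, real_sum, activity_sum
    update_id: number of times the second-period data has been updated so far
    """
    ip_updates = previous["ip_updates"] + 1

    # Adjacent groups: keep the larger one; otherwise trust the temporary data.
    if abs(temp["group"] - previous["group"]) == 1:
        group = max(temp["group"], previous["group"])
    else:
        group = temp["group"]

    return {
        "update_id": update_id,
        "ip_updates": ip_updates,
        "group": group,
        "office_sum": temp["office"] + previous["office_sum"],
        "household_sum": temp["household"] + previous["household_sum"],
        "real_sum": temp["real"] + previous["real_sum"],
        # One reading of the activity rule (see lead-in).
        "activity_sum": (temp["activity"] + previous["activity_sum"]) / update_id * ip_updates,
    }
```

  • Dividing the accumulated sums by the number of IP updates at query time would give an average probability per observed period, although the text does not state that this is how the sums are consumed.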
  • IPInt: the IP converted to the corresponding long integer.
  • Update ID: the number of times the second-period data has been updated.
  • Final sum of office exit IP probabilities: the office exit IP probability in the temporary specific indicators of the current second period.
  • Final sum of household exit IP probabilities: the household exit IP probability in the temporary specific indicators of the current second period.
  • Final sum of real-person probabilities: the real-person probability in the temporary specific indicators of the current second period.
  • Final sum of activity scores: the activity score in the temporary specific indicators of the current second period divided by the update ID, multiplied by the number of IP updates.
  • IP credit stain data may be generated according to the third-party IP blacklist and added to the final specific indicators of the current second period.
  • The information in a blacklist generally includes blacklisted IP addresses or IP address segments.
  • The credit stain data is preferably expressed as a credit stain score: for example, the more third-party blacklists an IP address appears in, the higher its credit stain score, and the score is zero if the IP address appears in no blacklist.
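  • One simple way to realize such a score is to count the blacklists that contain the address, either directly or through an address segment. This is a sketch under that assumption; the text only requires that the score grow with the number of blacklists and be zero when there are none.

```python
import ipaddress

def stain_score(ip: str, blacklists) -> int:
    """Count how many blacklists contain the IP.

    blacklists: iterable of lists whose entries are single addresses
    ("198.51.100.7") or segments in CIDR form ("203.0.113.0/24").
    """
    addr = ipaddress.ip_address(ip)
    score = 0
    for entries in blacklists:
        # A single address parses as a /32 (or /128) network.
        if any(addr in ipaddress.ip_network(entry, strict=False) for entry in entries):
            score += 1
    return score
```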
  • The final specific indicators of the current second period overwrite the final specific indicators of the previous second period, and the number of times the final specific indicators of the second period have been updated and the number of updates for the corresponding IP address are recorded. All of the above data is updated in MySQL.
  • The IP can be used as the index; alternatively, the IP can be converted to the corresponding long integer (IPInt) and indexed by that integer, with the IPInt values divided into 256 parts and stored in separate sub-tables, which makes queries easier and faster.
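  • A minimal sketch of the IPInt conversion and a 256-way split, limited to IPv4. The text does not say how the 256 parts are formed; taking the value modulo 256 is an assumption (splitting on the first octet would be another option), and the table name prefix is hypothetical.

```python
import ipaddress

def ip_to_int(ip: str) -> int:
    """Convert a dotted IPv4 address to its integer form (IPInt)."""
    return int(ipaddress.IPv4Address(ip))

def shard_index(ip_int: int, shards: int = 256) -> int:
    """Pick one of 256 sub-tables; modulo sharding is an assumption."""
    return ip_int % shards

def shard_table(ip_int: int, prefix: str = "ip_credit_") -> str:
    """Hypothetical sub-table name for the given IPInt."""
    return f"{prefix}{shard_index(ip_int):03d}"
```

  • With this sketch, a lookup first computes the IPInt, then queries only the one sub-table it maps to.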
  • the final specific indicator corresponding to the IP address is used as the credit data of the IP address, and the IP address is evaluated according to the credit data.
  • An interface may be provided to third parties through which the credit data of the IP address can be accessed; alternatively, an IP verification request for the IP address issued by a third party is received, the credit data corresponding to the IP address is looked up, a credit rating evaluation of the IP address is performed according to the credit data, and the evaluation result is returned to the third party.
  • The firewall can determine the legitimacy of the IP address based on the credit data of the IP address. An independent IP credit rating platform can also be built to provide IP verification results to the firewall. The platform can further serve as a secondary verification mechanism without affecting existing firewall functions: when the firewall judges an IP address to be suspicious, the IP credit rating platform performs a secondary verification based on the credit data, which further improves the accuracy of firewall interception and prevents mistaken interception.
  • the IP address analysis method provided by the embodiments of the present invention can be combined with the existing Internet architecture. As shown in FIG. 2, user access logs are collected as the original logs and combined with a third-party IP blacklist and a third-party IP address library. Using the IP address analysis method provided by the embodiments, three kinds of data are obtained: IP user attribute data consisting mainly of the final specific indicators, IP smear data generated mainly from the third-party IP blacklist, and IP address library data obtained from the third-party IP address library. These are integrated into the IP credit rating platform, which rates the credit of each IP address to obtain the credit data of the IP address.
  • the credit data of the IP address comprehensively describes the characteristics of the IP address, and can be used for confirming the security of the IP address in the information security field, or realizing the IP address based on the big data analysis.
  • the application results can also be fed back to the IP credit rating platform for algorithm iteration and parameter adjustment on the existing results, giving the system the ability to self-learn and self-adjust and further improving the accuracy of the platform's IP analysis.
  • An embodiment of the present invention provides an IP address analysis system, and the structure thereof is as shown in FIG. 3, including:
  • the big data platform includes a Hadoop computing platform and a Spark computing platform;
  • the offline computing platform includes a server or a server cluster.
  • the big data platform is configured to store original logs, calculate original logs, and collect and store historical data of IP addresses;
  • the offline computing platform is configured to analyze historical data of an IP address collected by the big data platform to generate credit data of an IP address.
  • the offline computing platform is further configured to communicate with third parties through the big data platform, to provide the credit data of IP addresses to a third party, or to receive a third party's query request and return verification information for the IP address according to the credit data.
  • the IP address analysis system further includes a storage platform that supports MySQL and can be configured to store the original logs, the credit data of IP addresses, and the third-party IP library and third-party IP blacklist obtained from third parties.
  • an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the foregoing method are implemented.
  • an embodiment of the present invention further provides a computer device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor; the processor implements the steps of the above method when executing the program.
  • the IP address analysis system provided by this embodiment can be combined with the IP address analysis method provided by the embodiments of the present invention: it collects historical data of IP addresses and analyzes that data to generate credit data of the IP addresses. This achieves fine-grained, accurate analysis of IP addresses, determines IP address attributes from big data, and gives a comprehensive and accurate picture of the credit status of an IP address. It can be applied to scenarios such as IP address legitimacy verification and IP address interception, solves the problem of misjudging IP address security, and effectively prevents misjudgment of IP address legitimacy and mistaken interception of IP addresses.
  • the term computer storage medium includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.
  • computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and can include any information delivery media.
  • the present invention collects historical data of IP addresses and analyzes that data to generate credit data of the IP addresses. This achieves fine-grained, accurate analysis of IP addresses, determines IP address attributes from big data, and gives a comprehensive and accurate picture of the credit status of an IP address. It can be applied to scenarios such as IP address legitimacy verification and IP address interception, solves the problem of misjudging IP address security, and effectively prevents misjudgment of IP address legitimacy and mistaken interception of IP addresses.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Provided are a method for analyzing an IP address, a system, a computer readable storage medium, and a computer device. The method comprises: collecting historical data of an IP address; and performing analysis on the historical data of the IP address to generate reliability data of the IP address. The present invention is applicable to aspects such as legitimacy authentication of an IP address and interception of an IP address, thereby solving a problem in which a security class of an IP address is mistakenly identified, effectively preventing misjudgment of legitimacy of an IP address, and preventing an erroneous interception of an IP address.

Description

IP Address Analysis Method and System, Computer-Readable Storage Medium, and Computer Device
This application claims priority to Chinese Patent Application No. 201710216069.1, entitled "IP Address Analysis Method and System", filed with the Chinese Patent Office on April 1, 2017, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present invention relate to, but are not limited to, the field of Internet user profiling, and in particular to an IP address analysis method and system, a computer-readable storage medium, and a computer device.
Background
A network-layer firewall can be regarded as an IP packet filter operating on the underlying TCP/IP protocol stack. It can work by enumeration, allowing only packets that match specific rules to pass and forbidding all others from crossing the firewall (viruses excepted: a firewall cannot prevent virus intrusion). These rules can usually be defined or modified by an administrator, although some firewall devices can only apply built-in rules.
Firewall rules can also be formulated from a more permissive angle: a packet is let through as long as it does not match any "deny rule". Most operating systems and network devices have built-in firewall functionality.
Newer firewalls can filter on many attributes of a packet, for example source IP address, source port number, destination IP address or port number, and service type (such as HTTP or FTP). They can also filter on attributes such as communication protocol, TTL value, and the domain name or network segment of the source.
Existing interception schemes intercept using a single fixed rule. They analyze access requests at insufficient granularity and in too few dimensions, and they lack knowledge of the visitor's identity, so they easily intercept traffic by mistake.
Summary of the Invention
Embodiments of the present invention aim to solve the problems described above. An object of the embodiments is to provide a solution to any one of the above problems. Specifically, embodiments of the present invention provide the capability.
According to a first aspect of the embodiments of the present invention, an IP address analysis method is provided, comprising:
collecting historical data of an IP address;
analyzing the historical data of the IP address to generate credit data of the IP address.
The step of collecting historical data of the IP address comprises:
collecting and parsing original logs within a first period;
formatting the information in the original logs to obtain predefined indicators, the predefined indicators comprising at least any one or more of the following:
time, IP, number of requests in the working period, number of requests in the rest period, number of requests in the sleep period, requested file size in the working period, requested file size in the rest period, requested file size in the sleep period, number of user agents (UserAgent) in the working period, number of UserAgents in the rest period, number of UserAgents in the sleep period, number of mobile UserAgents, number of PC UserAgents, number of access sources, number of accessed domain names, hours of occurrence;
storing the predefined indicators corresponding to each IP address, each IP address corresponding to one or more predefined indicators.
The step of collecting historical data of the IP address further comprises:
within a second period, obtaining a third-party IP library and/or a third-party IP blacklist from a third-party platform one or more times.
Analyzing the historical data of the IP address to generate credit data of the IP address comprises:
preprocessing and normalizing the predefined indicators of the first periods that correspond to working days within the second period to obtain an intermediate value for each working day, and preprocessing and normalizing the predefined indicators of the first periods that correspond to rest days within the second period to obtain an intermediate value for each rest day, the second period containing multiple first periods corresponding to working days and multiple first periods corresponding to rest days;
performing weighted averaging on the working-day intermediate values to obtain working-day weighted means;
performing weighted averaging or maximum-value processing on the rest-day intermediate values to obtain rest-day weighted means or maxima;
calculating, from one or more working-day weighted means and one or more rest-day weighted means, the temporary specific indicators of the current second period, which comprise:
the probability of being an office egress IP in this period, the probability of being a home egress IP in this period, the probability of being a real person in this period, the activity score for this period, and the headcount group for this period;
adjusting the temporary specific indicators of the current second period according to the final specific indicators of the previous second period and the third-party IP library and/or third-party IP blacklist to obtain the final specific indicators of the current second period, which serve as the credit data of the IP address.
The step of preprocessing and normalizing the predefined indicators of the first periods corresponding to working days within the second period to obtain the working-day intermediate values, and preprocessing and normalizing the predefined indicators of the first periods corresponding to rest days within the second period to obtain the rest-day intermediate values, comprises:
calculating, for the first period corresponding to a working day, the hours-of-occurrence score, the mobile UserAgent score, the PC UserAgent count score, the rest-period vs. working-period request count score, the rest-period vs. sleep-period request count score, and the accessed-domain count score, and taking the weighted mean of these scores to obtain the working-day intermediate value of the home egress IP probability;
calculating, for the first period corresponding to a working day, the working-period request count score, the working-period vs. rest-period request count score, the rest-period vs. sleep-period request count score, the PC UserAgent count score, and the working-period vs. rest-period UserAgent count score, and taking the weighted mean of these scores to obtain the working-day intermediate value of the office egress IP probability;
calculating, for the first period corresponding to a working day, the request count distribution score, the UserAgent count distribution score, the hours-of-occurrence score, the domain count vs. source count score, and the mobile vs. PC UserAgent count score, and taking the weighted mean of these scores to obtain the working-day intermediate value of the real-person probability;
calculating, for the first period corresponding to a working day, the accessed-domain count score, the working-period request count score, the rest-period request count score, the sleep-period request count score, the hours-of-occurrence score, and the request-source count score, and taking the weighted mean of these scores to obtain the working-day intermediate value of activity;
calculating, for the first period corresponding to a rest day, the hours-of-occurrence score, the mobile UserAgent score, the PC UserAgent count score, the rest-period vs. working-period request count score, the rest-period vs. sleep-period request count score, and the accessed-domain count score, and taking the weighted mean of these scores to obtain the rest-day intermediate value of the home egress IP probability;
calculating, for the first period corresponding to a rest day, the rest-period request count score, the UserAgent count distribution score, the hours-of-occurrence score, the domain count vs. source count score, and the mobile vs. PC UserAgent count score, and taking the weighted mean of these scores to obtain the rest-day intermediate value of the real-person probability;
calculating, for the first period corresponding to a rest day, the accessed-domain count score, the working-period request count score, the rest-period request count score, the sleep-period request count score, the hours-of-occurrence score, and the request-source count score, and taking the weighted mean of these scores to obtain the rest-day intermediate value of activity.
The step of calculating the temporary specific indicators of the current second period from one or more working-day weighted means and one or more rest-day weighted means comprises:
obtaining, after preprocessing and normalization, the working-day vs. rest-day PC UserAgent count score, the working-day vs. rest-day mobile UserAgent count score, and the working-day vs. rest-day request count score; taking the weighted mean of these three scores; and using the weighted mean of that result and the working-day intermediate value of the office egress IP probability as the probability of being an office egress IP in this period;
using the weighted mean of the working-day and rest-day intermediate values of the home egress IP probability as the probability of being a home egress IP in this period;
using the weighted mean of the working-day and rest-day intermediate values of the real-person probability as the probability of being a real person in this period;
using the weighted mean of the working-day and rest-day intermediate values of activity as the activity score for this period;
grouping by the maximum of the working-day mobile UserAgent count, the rest-day mobile UserAgent count, the working-day PC UserAgent count, and the rest-day PC UserAgent count, and using the result as the headcount group for this period.
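The weighted-mean combination used throughout the steps above can be sketched as follows. The text does not state any weights, so the 5:2 working-day to rest-day weighting below is purely an assumed placeholder, and the function names are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean of `values`; raises if the inputs are unusable."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


def period_indicator(workday_value, restday_value,
                     workday_weight=5, restday_weight=2):
    """Combine a working-day and a rest-day intermediate value.

    The default weights are an assumption (five working days and two
    rest days per week); the source does not specify them.
    """
    return weighted_mean([workday_value, restday_value],
                         [workday_weight, restday_weight])
```

For example, a working-day intermediate value of 0.8 and a rest-day intermediate value of 0.1 combine to about 0.6 under these assumed weights.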
The final specific indicators comprise at least any one or more of the following:
IP, IPInt, update ID, number of updates for this IP, final headcount group, final sum of office egress IP probabilities, final sum of home egress IP probabilities, final sum of real-person probabilities, final sum of activity scores,
where "IPInt" is the long integer corresponding to the IP address, "update ID" is the number of times the final specific indicators of the second period have been updated, and "number of updates for this IP" is the number of times the final specific indicators of the second period have been updated for a given IP address.
The step of adjusting the temporary specific indicators of the current second period according to the final specific indicators of the previous second period and the third-party IP library and/or third-party IP blacklist to obtain the final specific indicators of the current second period comprises:
for an IP address that appears in both the final specific indicators of the previous second period and the temporary specific indicators of the current second period, obtaining the final specific indicators of the current second period by the following calculation:
when the headcount group in the temporary specific indicators of the current second period and the final headcount group in the final specific indicators of the previous second period are adjacent groups, selecting the group with the larger headcount as the final headcount group of the current second period; otherwise selecting the headcount group of the temporary specific indicators of the current second period;
using the office egress IP probability in the temporary specific indicators of the current second period plus the final sum of office egress IP probabilities in the final specific indicators of the previous second period as the final sum of office egress IP probabilities of the current second period;
using the home egress IP probability in the temporary specific indicators of the current second period plus the final sum of home egress IP probabilities in the final specific indicators of the previous second period as the final sum of home egress IP probabilities of the current second period;
using the real-person probability in the temporary specific indicators of the current second period plus the final sum of real-person probabilities in the final specific indicators of the previous second period as the final sum of real-person probabilities of the current second period;
using the activity score in the temporary specific indicators of the current second period plus the final sum of activity scores of the previous second period, divided by the final update count recorded by the update ID and multiplied by the number of updates for this IP, as the final sum of activity scores of the current second period;
for an IP address that does not appear in the final specific indicators of the previous second period but appears in the temporary specific indicators of the current second period, obtaining the temporary specific indicators of the current second period by the following calculation:
using the headcount group in the temporary specific indicators of the current second period as the final headcount group of the current second period;
using the office egress IP probability in the temporary specific indicators of the current second period as the final sum of office egress IP probabilities of the current second period;
using the home egress IP probability in the temporary specific indicators of the current second period as the final sum of home egress IP probabilities of the current second period;
using the real-person probability in the temporary specific indicators of the current second period as the final sum of real-person probabilities of the current second period;
using the activity score in the temporary specific indicators of the current second period, divided by the update ID and multiplied by the number of updates for this IP, as the final sum of activity scores of the current second period;
overwriting the final specific indicators of the previous second period with the final specific indicators of the current second period, and recording the number of times the final specific indicators of the second period have been updated and the number of updates for the corresponding IP address.
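The update rules above can be sketched as a single merge function. Several details are assumptions rather than statements from the text: headcount groups are modeled as ordered integer indexes, "adjacent" is taken to mean the indexes differ by one, and `update_id` and `ip_updates` are passed in as-is because the text does not say whether they are incremented before or after the calculation. The names `Indicator` and `merge` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Indicator:
    group: int       # headcount group, assumed to be an ordered index
    office: float    # office egress IP probability (or running sum)
    home: float      # home egress IP probability (or running sum)
    human: float     # real-person probability (or running sum)
    activity: float  # activity score (or running sum)


def merge(current: Indicator, previous: Optional[Indicator],
          update_id: int, ip_updates: int) -> Indicator:
    """Fold the current temporary indicators into the previous final ones."""
    if previous is None:
        # IP not seen before: carry current values over, scale the activity.
        return Indicator(current.group, current.office, current.home,
                         current.human,
                         current.activity / update_id * ip_updates)
    if abs(current.group - previous.group) == 1:
        group = max(current.group, previous.group)  # adjacent: keep larger
    else:
        group = current.group
    activity = (current.activity + previous.activity) / update_id * ip_updates
    return Indicator(group,
                     current.office + previous.office,
                     current.home + previous.home,
                     current.human + previous.human,
                     activity)
```

The three probability fields accumulate as running sums, while the activity sum is rescaled by the ratio of per-IP updates to total updates, as the steps describe.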
The step of adjusting the temporary specific indicators of the current second period according to the final specific indicators of the previous second period and the third-party IP library and/or third-party IP blacklist to obtain the final specific indicators of the current second period further comprises:
filtering out data in the temporary specific indicators of the current second period whose IP is syntactically invalid or is a LAN IP;
adjusting the office egress IP probability, home egress IP probability, and real-person probability in the temporary specific indicators of the current second period according to the additional IP address information contained in the third-party IP library;
generating IP credit smear data from the third-party IP blacklist and adding the IP credit smear data to the final specific indicators of the current second period.
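The filtering step above might look like the following sketch, which uses Python's standard `ipaddress` module. Treating `is_private` as the test for a LAN IP is an assumption: that property also covers other reserved ranges such as loopback and link-local, which is broader than the RFC 1918 LAN ranges. The function names and the dictionary input shape are hypothetical.

```python
import ipaddress


def is_usable_public_ip(text):
    """Return True if `text` parses as an IP address and is not private."""
    try:
        addr = ipaddress.ip_address(text)
    except ValueError:
        return False  # syntactically invalid
    # Assumption: "LAN IP" is approximated by the is_private property.
    return not addr.is_private


def filter_indicators(indicators_by_ip):
    """Drop entries whose key is an invalid or private IP address."""
    return {ip: value for ip, value in indicators_by_ip.items()
            if is_usable_public_ip(ip)}
```

Parsing with `ipaddress.ip_address` rejects malformed strings such as `999.1.1.1`, so the syntax check and the LAN check share one code path.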
The method further comprises:
providing an interface to a third party, allowing the credit data of the IP address to be accessed through the interface; or
receiving an IP verification request for an IP address from a third party, looking up the credit data corresponding to the IP address, evaluating the credit rating of the IP address according to the credit data, and returning the evaluation result to the third party.
According to another aspect of the embodiments of the present invention, an IP address analysis system is further provided, comprising a big data platform and an offline computing platform;
the big data platform is configured to store original logs, compute over the original logs, and collect and store historical data of IP addresses;
the offline computing platform is configured to analyze the historical data of IP addresses collected by the big data platform to generate credit data of the IP addresses.
According to another aspect of the embodiments of the present invention, a computer-readable storage medium is further provided, on which a computer program is stored, the program implementing the steps of the above method when executed by a processor.
According to another aspect of the embodiments of the present invention, a computer device is further provided, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, the processor implementing the steps of the above method when executing the program.
The method and system of the embodiments of the present invention collect historical data of IP addresses and analyze that data to generate credit data of the IP addresses. This achieves fine-grained, accurate analysis of IP addresses, determines IP address attributes from big data, and gives a comprehensive and accurate picture of the credit status of an IP address. It can be applied to scenarios such as IP address legitimacy verification and IP address interception, solves the problem of misjudging IP address security, and effectively prevents misjudgment of IP address legitimacy and mistaken interception of IP addresses.
Other features and advantages of the present invention will become clear from the following description of exemplary embodiments, read with reference to the accompanying drawings.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the present invention and, together with the description, serve to explain the principles of the invention. In these drawings, like reference numerals denote like elements. The drawings described below show some, but not all, embodiments of the invention. A person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 exemplarily shows the flow of an IP address analysis method provided by Embodiment 1 of the present invention;
FIG. 2 exemplarily shows the application principle of the technical solution provided by the embodiments of the present invention;
FIG. 3 exemplarily shows the architecture of an IP address analysis system provided by Embodiment 2 of the present invention.
具体实施方式detailed description
为使本发明实施例的目的、技术方案和优点更加清楚,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。需要说明的是,在不冲突的情况下,本申请中的实施例及实施例中的特征可以相互任意组合。The technical solutions in the embodiments of the present invention will be clearly and completely described in conjunction with the drawings in the embodiments of the present invention. It is a partial embodiment of the invention, and not all of the embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative efforts are within the scope of the present invention. It should be noted that, in the case of no conflict, the features in the embodiments and the embodiments in the present application may be arbitrarily combined with each other.
现有的拦截方案通过单一固定规则拦截,对访问请求分析的粒度、维度不够,对访问者的身份缺乏认知,容易造成误拦。The existing interception scheme is intercepted by a single fixed rule, and the granularity and dimension of the access request analysis are insufficient, and the identity of the visitor is not recognized, which is likely to cause a mistake.
为了解决上述问题,本发明的实施例提供了一种IP地址分析方法,下面结合附图,对本发明的实施例一进行说明。In order to solve the above problems, an embodiment of the present invention provides an IP address analysis method, and Embodiment 1 of the present invention will be described below with reference to the accompanying drawings.
本发明实施例提供了一种IP地址分析方法,基于IP地址过往访问历史数据、相关附加信息等数据,对IP地址进行详细的分析,获得IP地址的信用数据,依据该IP地址的信用数据对IP地址进行评价,实现了对IP地址细化准确的分析判断,以信用数据评价IP地址安全程度,有效的防止了误判误拦的发生,具体流程如图1所示,包括:The embodiment of the present invention provides an IP address analysis method, which performs detailed analysis on an IP address based on historical data of an IP address and related additional information, and obtains credit data of an IP address according to the credit data of the IP address. The IP address is evaluated, and the accurate analysis and judgment of the IP address refinement is realized. The security level of the IP address is evaluated by the credit data, and the occurrence of misjudgment and error prevention is effectively prevented. The specific process is as shown in FIG. 1 and includes:
Step 101: Collect historical data of the IP address.
In this step, raw logs such as CDN server logs are collected once per first period and parsed, and the information in the CDN server logs is formatted to obtain predefined indicators. The CDN server logs are specifically CDN nginx logs; other network access logs may also be used for the calculation.
The predefined indicators include at least one or more of the following: time, IP, working-period request count, rest-period request count, sleep-period request count, working-period requested file size, rest-period requested file size, sleep-period requested file size, working-period UserAgent count, rest-period UserAgent count, sleep-period UserAgent count, mobile UserAgent count, PC UserAgent count, number of access sources, number of accessed domain names, and number of hours present. The first period comprises a working period, a rest period, and a sleep period, and the indicators are defined as follows:
"Time": the time at which the corresponding CDN server log entry was generated (that is, the user access time);
"IP": the IP address concerned;
"Working-period request count", "rest-period request count", "sleep-period request count": the number of requests issued by the IP address during the working period, the rest period, and the sleep period, respectively;
"Working-period requested file size", "rest-period requested file size", "sleep-period requested file size": the total size of the files requested by the IP address during the working period, the rest period, and the sleep period, respectively;
"Working-period UserAgent count", "rest-period UserAgent count", "sleep-period UserAgent count": the number of UserAgents appearing under the IP address during the working period, the rest period, and the sleep period, respectively;
"Mobile UserAgent count": the number of UserAgents with which the IP address accessed from mobile devices;
"PC UserAgent count": the number of UserAgents with which the IP address accessed from PCs;
"Number of access sources": the number of access sources of the IP address within the first period;
"Number of accessed domain names": the number of domain names accessed by the IP address within the first period;
"Number of hours present": the number of hours within the first period in which the IP address made at least one access (each hour with an access counts as one; for example, if the IP address made accesses at 2 o'clock and at 4 o'clock, the number of hours present is 2).
The predefined indicators corresponding to each IP address are stored; each IP address corresponds to one or more predefined indicators. Specifically, the predefined indicators may be stored in a Hive table.
The first period in this step is preferably one calendar day (24 hours).
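The per-IP aggregation described in this step can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: the record keys (ip, hour, bytes, user_agent, domain, referer) are hypothetical, and the hour boundaries of the three periods are assumptions, since the text does not state them (the later divisors 12, 4, and 8 only suggest period lengths of 12, 4, and 8 hours).

```python
from collections import defaultdict

# Assumption: the text gives no period boundaries. The hours below are
# illustrative only (12-hour working, 4-hour rest, 8-hour sleep period).
def period_of(hour: int) -> str:
    if 8 <= hour < 20:
        return "work"
    if 20 <= hour < 24:
        return "rest"
    return "sleep"

def daily_indicators(records):
    """Aggregate one day of parsed log records into per-IP indicators.

    records: iterable of dicts with hypothetical keys
    ip, hour, bytes, user_agent, domain, referer (one dict per request).
    """
    stats = defaultdict(lambda: {
        "requests": defaultdict(int),      # request count per period
        "bytes": defaultdict(int),         # requested file size per period
        "user_agents": defaultdict(set),   # distinct UserAgents per period
        "domains": set(),                  # distinct accessed domain names
        "sources": set(),                  # distinct access sources
        "hours": set(),                    # hours with at least one access
    })
    for r in records:
        s = stats[r["ip"]]
        p = period_of(r["hour"])
        s["requests"][p] += 1
        s["bytes"][p] += r["bytes"]
        s["user_agents"][p].add(r["user_agent"])
        s["domains"].add(r["domain"])
        s["sources"].add(r["referer"])
        s["hours"].add(r["hour"])
    return stats
```

With accesses at 2 o'clock and 4 o'clock only, len(stats[ip]["hours"]) is 2, which matches the example given for the number of hours present.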
Preferably, historical data of IP addresses may also be obtained from third-party platforms, for example by obtaining a third-party IP library and/or a third-party IP blacklist from a third-party platform one or more times within the second period. A third-party IP library usually contains additional information about IP addresses, such as the distribution of an IP address or IP address segment: the country, province, city, and operator corresponding to the segment, and possibly a company name or a data-center label. Third-party platform data is updated at irregular intervals, so it may be obtained after the third-party data is updated, or just before the second-period calculation is run.
Step 102: Analyze the historical data of the IP address and generate credit data for the IP address.
In this step, the predefined indicators of each first period that corresponds to a working day within the second period are preprocessed and normalized to obtain an intermediate value for each working day, and the predefined indicators of each first period that corresponds to a rest day within the second period are preprocessed and normalized to obtain an intermediate value for each rest day. The second period contains multiple first periods corresponding to working days and multiple first periods corresponding to rest days. The working-day intermediate values are combined by weighted averaging (arithmetic or geometric) to obtain working-day weighted means; the rest-day intermediate values are combined by weighted averaging or by taking the maximum to obtain rest-day weighted means or maxima. From one or more working-day weighted means and one or more rest-day weighted means, the temporary specific indicators of the current second period are calculated. These include: the probability for this cycle of being an office egress IP, the probability for this cycle of being a home egress IP, the probability for this cycle of being a real person, the activity score for this cycle, and the headcount group for this cycle. The temporary specific indicators of the current second period are then adjusted according to the final specific indicators of the previous second period and the third-party IP library and/or third-party IP blacklist to obtain the final specific indicators of the current second period, which serve as the credit data of the IP address.
The second period is an integer multiple of the first period; preferably, when the first period is a day, the second period is a month or a week.
A specific algorithm for this step is described below by way of example. The first period is one day and the second period is one month. The normalization used in this embodiment is a modified sigmoid function, 1.0/(1.0+math.exp(-numerator/denominator+4.0)); because the input values are all greater than or equal to 0, the offset 4.0 is added.
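The normalization formula above can be written as a small Python function. The expression is taken from the text; the handling of a zero denominator is an assumption, because the text does not say what happens when, for example, an IP has no requests in the period used as the denominator.

```python
import math

def normalize(numerator: float, denominator: float) -> float:
    """Shifted sigmoid from the text: 1 / (1 + exp(-numerator/denominator + 4))."""
    if denominator == 0:
        # Assumption (not in the text): a positive numerator over a zero
        # denominator is treated as an unbounded ratio, and 0/0 as ratio 0.
        ratio = math.inf if numerator > 0 else 0.0
    else:
        ratio = numerator / denominator
    return 1.0 / (1.0 + math.exp(-ratio + 4.0))
```

Because the ratio is never negative, the exponent never exceeds 4, so a ratio of 0 maps to a score just under 0.02, a ratio of 4 maps to exactly 0.5, and larger ratios approach 1.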
1. Working-day data summary:
a) Compute the intermediate values of each working day's data
In this step, with IP as the dimension, the intermediate values of the mobile UserAgent count, the PC UserAgent count, and the request count are calculated for each first period that corresponds to a working day.
Request count: the working-period request count plus the rest-period request count plus the sleep-period request count.
(1) Working-day probability that the IP address is a home egress IP:
Hours-present score: normalization; the numerator is 6, and the denominator is the number of hours present;
Mobile UserAgent score: normalization; the numerator is the mean number of mobile UserAgents per IP over a period of time (for example 10, updated at irregular intervals), and the denominator is the mobile UserAgent count;
PC UserAgent score: normalization; the numerator is the mean number of PC UserAgents per IP over a period of time (for example 5, updated at irregular intervals), and the denominator is the PC UserAgent count;
Rest-period vs. working-period request score: normalization; the numerator is the rest-period request count divided by 4, and the denominator is the working-period request count divided by 12;
Rest-period vs. sleep-period request score: normalization; the numerator is the rest-period request count divided by 4, and the denominator is the sleep-period request count divided by 8;
Accessed-domain score: normalization; the numerator is the average daily number of domain names accessed per IP over a period of time (updated at irregular intervals), and the denominator is the number of domain names accessed;
The weighted mean of the above hours-present score, mobile UserAgent score, PC UserAgent score, rest-period vs. working-period request score, rest-period vs. sleep-period request score, and accessed-domain score is the intermediate value of the working-day probability of being a home egress IP.
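The six scores of item (1) and their weighted mean can be sketched as below. This is an illustration under stated assumptions: the dictionary keys are hypothetical, the weights are not given in the text and default to equal weights here, the population means 10 and 5 are the examples quoted above, mean_domains has no value in the text and must be supplied by the caller, and the zero-denominator rule in normalize is an assumption.

```python
import math

def normalize(numerator, denominator):
    # Shifted sigmoid from the text; zero-denominator handling is an assumption.
    if denominator == 0:
        ratio = math.inf if numerator > 0 else 0.0
    else:
        ratio = numerator / denominator
    return 1.0 / (1.0 + math.exp(-ratio + 4.0))

def weighted_mean(scores, weights=None):
    # Assumption: equal weights when none are supplied.
    weights = weights or [1.0] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def home_egress_probability(day, *, mean_domains, mean_mobile_ua=10,
                            mean_pc_ua=5, weights=None):
    """day: one working day's indicators for one IP (hypothetical keys)."""
    scores = [
        normalize(6, day["hours_seen"]),                                 # hours present
        normalize(mean_mobile_ua, day["mobile_ua"]),                     # mobile UserAgents
        normalize(mean_pc_ua, day["pc_ua"]),                             # PC UserAgents
        normalize(day["rest_requests"] / 4, day["work_requests"] / 12),  # rest vs. work
        normalize(day["rest_requests"] / 4, day["sleep_requests"] / 8),  # rest vs. sleep
        normalize(mean_domains, day["domains"]),                         # accessed domains
    ]
    return weighted_mean(scores, weights)
```

The other intermediate values (office egress IP, real person, activity) follow the same pattern with their own numerators and denominators.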
(2) Intermediate value of the working-day probability of being an office egress IP:
Working-period request score: normalization; the numerator is the working-period request count divided by 12, and the denominator is the hourly average of working-period requests per IP over a period of time (updated at irregular intervals);
Working-period vs. rest-period request score: preprocessing and normalization; the numerator is the working-period request count divided by 12, and the denominator is the rest-period request count divided by 4;
Rest-period vs. sleep-period request score: preprocessing and normalization; the numerator is the rest-period request count divided by 4, and the denominator is the sleep-period request count divided by 8;
PC UserAgent score: preprocessing and normalization; the numerator is the mean number of PC UserAgents per IP over a period of time (for example 10, updated at irregular intervals), and the denominator is the PC UserAgent count;
Working-period vs. rest-period UserAgent score: preprocessing and normalization; the numerator is the working-period UserAgent count, and the denominator is the rest-period UserAgent count;
The weighted mean of the above working-period request score, working-period vs. rest-period request score, rest-period vs. sleep-period request score, PC UserAgent score, and working-period vs. rest-period UserAgent score is the intermediate value of the working-day probability of being an office egress IP;
(3) Intermediate value of the working-day probability of being a real person:
Request distribution score: preprocessing and normalization; the numerator is the standard deviation of the working-period request count divided by 12, the rest-period request count divided by 4, and the sleep-period request count divided by 8, and the denominator is 1;
UserAgent distribution score: normalization; the numerator is the standard deviation of the working-period UserAgent count, the rest-period UserAgent count, and the sleep-period UserAgent count, and the denominator is 1;
Hours-present score: normalization; the numerator is 6, and the denominator is the number of hours present;
Domain count vs. source count score: normalization; the numerator is the number of sources, and the denominator is the number of domain names;
Mobile vs. PC UserAgent score: normalization; the numerator is the mobile UserAgent count, and the denominator is the PC UserAgent count.
The weighted mean of the above scores is the intermediate value of the working-day probability of being a real person.
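The two distribution scores in item (3) feed a standard deviation into the normalization with a denominator of 1. A minimal sketch follows; the text does not say whether the population or the sample standard deviation is meant, so the use of statistics.pstdev (population) is an assumption.

```python
import math
import statistics

def normalize(numerator, denominator):
    # Shifted sigmoid from the text; the denominator is always 1 in this sketch.
    return 1.0 / (1.0 + math.exp(-numerator / denominator + 4.0))

def spread_score(values):
    # Assumption: population standard deviation of the three period values.
    return normalize(statistics.pstdev(values), 1)

def request_distribution_score(work, rest, sleep):
    # Per-hour request rates: the periods are scaled by 12, 4, and 8 as in the text.
    return spread_score([work / 12, rest / 4, sleep / 8])

def user_agent_distribution_score(work_ua, rest_ua, sleep_ua):
    # UserAgent counts are used unscaled, as in the text.
    return spread_score([work_ua, rest_ua, sleep_ua])
```

An IP whose hourly request rate is identical in all three periods has a standard deviation of 0 and therefore the lowest possible distribution score; an IP whose traffic is concentrated in one period scores close to 1.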
(4) Intermediate value of working-day activity:
Accessed-domain score: normalization; the numerator is the number of domain names accessed, and the denominator is 10;
Working-period request score: normalization; the numerator is the working-period request count divided by 12, and the denominator is the hourly average of working-period requests per IP over a period of time (updated at irregular intervals);
Rest-period request score: normalization; the numerator is the rest-period request count divided by 4, and the denominator is the hourly average of rest-period requests per IP over a period of time (updated at irregular intervals);
Sleep-period request score: normalization; the numerator is the sleep-period request count divided by 8, and the denominator is the hourly average of sleep-period requests per IP over a period of time (updated at irregular intervals);
Hours-present score: normalization; the numerator is the number of hours present, and the denominator is 6;
Request-source score: normalization; the numerator is the number of request sources, and the denominator is the average number of daily request sources per IP (updated at irregular intervals).
The weighted mean of all the above scores is the intermediate value of working-day activity.
b) Compute the weighted means of the daily working-day intermediate values:
With IP as the dimension, the following are calculated over the first periods that correspond to working days: the weighted mean of the mobile UserAgent count, the weighted mean of the PC UserAgent count, the weighted mean of the request count, the weighted mean of the working-day probability of being a home egress IP, the weighted mean of the working-day probability of being an office egress IP, the weighted mean of the working-day probability of being a real person, and the weighted mean of working-day activity.
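Step 102 allows the daily values to be combined by either an arithmetic or a geometric weighted average. Both can be sketched as below; the weights themselves are not specified in the text, so they are left to the caller.

```python
import math

def weighted_arithmetic_mean(values, weights):
    """Sum of w_i * v_i divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def weighted_geometric_mean(values, weights):
    """exp of the weighted mean of log(v_i); every value must be strictly positive."""
    total = sum(w * math.log(v) for v, w in zip(values, weights))
    return math.exp(total / sum(weights))
```

For example, the per-day home egress IP probabilities of all working days in the month would be passed as values, with one weight per day. The geometric form is only defined for positive inputs, which holds for scores produced by the sigmoid normalization but not necessarily for raw counts.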
2. Rest-day data summary:
a) Compute the intermediate values of each rest day's data
With IP as the dimension, the intermediate values of the mobile UserAgent count, the PC UserAgent count, and the request count are calculated for each first period that corresponds to a rest day.
Request count: the working-period request count plus the rest-period request count plus the sleep-period request count.
(1) Intermediate value of the rest-day probability of being a home egress IP: calculated similarly to the working-day algorithm;
(2) Intermediate value of the rest-day probability of being a real person: calculated similarly to the working-day algorithm;
(3) Intermediate value of rest-day activity: calculated similarly to the working-day algorithm.
b) Compute the weighted means or maxima of the daily rest-day intermediate values:
With IP as the dimension, the following are calculated over the first periods that correspond to rest days: the maximum mobile UserAgent count, the maximum PC UserAgent count, the weighted mean of the request count, the weighted mean of the rest-day probability of being a home egress IP, the weighted mean of the rest-day probability of being a real person, and the weighted mean of rest-day activity.
3. Combine the working-day and rest-day data to obtain the temporary specific indicators of the current second period:
The working-day and rest-day data are joined by IP and calculated together to obtain:
IPInt: the IP converted to its corresponding long integer.
Headcount group for this cycle: The working-day PC UserAgent count is the weighted average of the PC UserAgent counts of the first periods corresponding to working days. The working-day mobile UserAgent count is the weighted average of the mobile UserAgent counts of the first periods corresponding to working days. The rest-day PC UserAgent count is the weighted average of the PC UserAgent counts of the first periods corresponding to rest days. The rest-day mobile UserAgent count is the weighted average of the mobile UserAgent counts of the first periods corresponding to rest days. The maximum of the working-day PC UserAgent count, the working-day mobile UserAgent count, the rest-day PC UserAgent count, and the rest-day mobile UserAgent count is taken, and the IP is grouped as follows: 1: 0-1, 2: 2-5, 3: 6-10, 4: 11-30, 5: 31-50, 6: 51-100, 7: 101-500, 8: 501-2000, 9: >2000.
Probability for this cycle of being an office egress IP:
Working-day vs. rest-day PC UserAgent score: normalization; the numerator is the working-day PC UserAgent count, and the denominator is the rest-day PC UserAgent count;
Working-day vs. rest-day mobile UserAgent score: normalization; the numerator is the working-day mobile UserAgent count, and the denominator is the rest-day mobile UserAgent count;
Working-day vs. rest-day request score: normalization; the numerator is the working-day request count, and the denominator is the rest-day request count;
The weighted mean of the above three scores is itself combined, by weighted mean, with the intermediate value of the working-day probability of being an office egress IP; the result is the probability for this cycle of being an office egress IP.
Probability for this cycle of being a home egress IP: the weighted mean of the working-day and rest-day intermediate values of the probability of being a home egress IP.
Probability for this cycle of being a real person: the weighted mean of the working-day and rest-day intermediate values of the probability of being a real person.
Activity score for this cycle: the weighted mean of the working-day and rest-day intermediate values of activity.
Headcount group for this cycle: the group determined by the maximum of the working-day mobile UserAgent count, the rest-day mobile UserAgent count, the working-day PC UserAgent count, and the rest-day PC UserAgent count.
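The headcount grouping can be sketched as a lookup over the upper bounds of groups 1 to 8. The group ranges come from the text; how a fractional value such as 1.5 (possible because the inputs are weighted averages) should be grouped is not stated, so placing it in the next higher group is an assumption.

```python
import bisect

# Inclusive upper bounds of groups 1..8; anything above 2000 falls in group 9.
_UPPER_BOUNDS = [1, 5, 10, 30, 50, 100, 500, 2000]

def headcount_group(workday_pc, workday_mobile, restday_pc, restday_mobile):
    """Group number (1-9) from the largest of the four UserAgent counts."""
    peak = max(workday_pc, workday_mobile, restday_pc, restday_mobile)
    # bisect_left returns the index of the first bound >= peak.
    return bisect.bisect_left(_UPPER_BOUNDS, peak) + 1
```

For instance, counts of 3, 12, 1, and 7 have a maximum of 12, which lies in the 11-30 range and therefore in group 4.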
The temporary specific indicators of the current second period are stored in a MySQL temporary data table, and the adjustment phase begins.
Adjustment phase
The inputs to this phase are the temporary specific indicators of the current second period stored in MySQL and the final specific indicators of the previous second period, which were stored in MySQL during the previous second period.
All IPs in the MySQL temporary data table are traversed; each IP has corresponding temporary specific indicators for the current second period.
1. Filter out records whose IP is syntactically invalid, and filter out LAN IPs.
2. Obtain the third-party IP library information and check whether the additional-information string it contains includes any of the following sensitive strings, returning the corresponding adjustment indices: "公司" (company), "数据中心" (data center), "GSM/TD-SCDMA/LTE". The adjustment indices cover three probabilities: the real-person probability, the office egress IP probability, and the home egress IP probability. If no sensitive string is present, all three adjustment indices are 1. Each of the three probabilities is multiplied by its adjustment index, and each probability is constrained to the range [0.05, 0.95]: a value below 0.05 is returned as 0.05, and a value above 0.95 is returned as 0.95.
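Steps 1 and 2 above can be sketched as follows. Several points are assumptions: Python's ipaddress.is_private flag is used as a stand-in for "LAN IP" (it also covers loopback and link-local ranges), the adjustment table is passed in by the caller because the text gives no numeric indices, and when several keywords match, the first match wins.

```python
import ipaddress

def is_usable_ip(text: str) -> bool:
    """False for syntactically invalid addresses and for private (LAN) addresses."""
    try:
        addr = ipaddress.ip_address(text)
    except ValueError:
        return False
    return not addr.is_private

def clamp(p: float, low: float = 0.05, high: float = 0.95) -> float:
    # Range limits are taken from the text.
    return max(low, min(high, p))

def adjust(probs, info, table):
    """probs: (real_person, office_egress, home_egress) probabilities.
    info: additional-information string from the third-party IP library.
    table: keyword -> three adjustment indices (hypothetical values, caller-supplied).
    """
    factors = (1.0, 1.0, 1.0)  # no sensitive string: all indices are 1
    for keyword, candidate in table.items():
        if keyword in info:
            factors = candidate
            break  # assumption: the first matching keyword wins
    return tuple(clamp(p * f) for p, f in zip(probs, factors))
```

A call such as adjust((0.5, 0.9, 0.01), "某某公司", {"公司": (1.0, 1.2, 0.8)}) uses placeholder indices purely for illustration; the office probability 1.08 is clipped to 0.95 and the home probability 0.008 is raised to 0.05.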
3. Obtain the final specific indicators stored in MySQL in the previous second period.
For a given IP address, the way the final specific indicators of the current second period are generated depends on whether that IP address has final specific indicators from the previous second period, as follows:
a) If final specific indicators of the previous second period exist for this IP, the following indicators are handled as follows:
Update ID: the ordinal number of this update of the second-period data.
IPInt: the IP converted to its corresponding long integer.
Update count for this IP: the IP's update count from the previous second period plus 1.
Final headcount group: if the headcount group in the temporary specific indicators of the current second period and the final headcount group in the final specific indicators of the previous second period are adjacent groups, the final group is the larger of the two; otherwise the final group is the group from the temporary data.
Final sum of office egress IP probabilities: the office egress IP probability in the temporary specific indicators of the current second period plus the final sum of office egress IP probabilities in the final specific indicators of the previous second period.
Final sum of home egress IP probabilities: the home egress IP probability in the temporary specific indicators of the current second period plus the final sum of home egress IP probabilities in the final specific indicators of the previous second period.
Final sum of real-person probabilities: the real-person probability in the temporary specific indicators of the current second period plus the final sum of real-person probabilities in the final specific indicators of the previous second period.
Final sum of activity scores: the activity score in the temporary specific indicators of the current second period plus the final sum of activity scores of the previous second period, divided by the total update count recorded by the update ID and multiplied by this IP's update count.
b) If no final specific indicators of the previous second period exist for this IP, the following indicators are handled as follows:
IPInt: the IP converted to its corresponding long integer.
Update ID: the ordinal number of this update of the second-period data.
Update count for this IP: 1.
Final headcount group: the headcount group in the temporary specific indicators of the current second period.
Final sum of office egress IP probabilities: the office egress IP probability in the temporary specific indicators of the current second period.
Final sum of home egress IP probabilities: the home egress IP probability in the temporary specific indicators of the current second period.
Final sum of real-person probabilities: the real-person probability in the temporary specific indicators of the current second period.
Final sum of activity scores: the activity score in the temporary specific indicators of the current second period, divided by the update ID and multiplied by this IP's update count.
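The two branches of step 3 can be sketched as one merge function. The dictionary keys are hypothetical, "adjacent groups" is read as group numbers that differ by exactly 1, and the activity formula is applied literally as worded above, although the text leaves open whether the stored activity value is the raw sum or the scaled result.

```python
def merge_final(current, previous, update_id):
    """Combine this cycle's temporary indicators with last cycle's final ones.

    current: temporary indicators for one IP (hypothetical keys
             group, office, home, real, activity).
    previous: that IP's final indicators from the previous cycle, or None.
    update_id: ordinal number of this second-period update.
    """
    keys = ("office", "home", "real")
    if previous is None:
        ip_updates = 1
        group = current["group"]
        sums = {k: current[k] for k in keys}
        activity_total = current["activity"]
    else:
        ip_updates = previous["ip_updates"] + 1
        # Assumption: "adjacent" means the group numbers differ by exactly 1.
        adjacent = abs(current["group"] - previous["group"]) == 1
        group = max(current["group"], previous["group"]) if adjacent else current["group"]
        sums = {k: current[k] + previous[k] for k in keys}
        activity_total = current["activity"] + previous["activity"]
    return {
        "update_id": update_id,
        "ip_updates": ip_updates,
        "group": group,
        **sums,
        # Divided by the update ID and multiplied by this IP's update count.
        "activity": activity_total / update_id * ip_updates,
    }
```

Under this reading, an IP that appears in every cycle keeps its accumulated activity, while an IP seen in only a few of the update cycles is scaled down by the ratio of its own update count to the total number of updates.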
In addition, IP credit stain data may be generated from the third-party IP blacklists and added to the final specific indicators of the current second period.
Specifically, the information in a blacklist generally includes blacklisted IP addresses or IP address segments. The credit stain data is preferably expressed as a credit stain score: for example, the more third-party blacklists an IP appears in, the higher its credit stain score, and an IP that appears in no blacklist has a credit stain score of 0.
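One simple way to realize such a score is to count the blacklists that contain the address. The text only requires that the score grows with the number of lists and is 0 when there is no hit, so the plain count used here is an assumption, as is the representation of each blacklist as a list of address or CIDR strings.

```python
import ipaddress

def stain_score(ip: str, blacklists) -> int:
    """Number of blacklists that contain the address.

    blacklists: iterable of lists; each entry is a single IP ("1.2.3.4")
    or an address segment in CIDR notation ("1.2.3.0/24").
    """
    addr = ipaddress.ip_address(ip)
    score = 0
    for entries in blacklists:
        # A single IP parses as a /32 (or /128) network, so both forms work.
        if any(addr in ipaddress.ip_network(e, strict=False) for e in entries):
            score += 1
    return score
```

An address listed in two of three blacklists scores 2; an address in none scores 0.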
The final specific indicators of the current second period overwrite those of the previous second period, and the number of times the second-period final specific indicators have been updated and the number of updates for the corresponding IP address are recorded. All of the above data is then written to MySQL.
When the temporary and final specific indicators of the current second period are stored in MySQL, the IP may be used as the index. Alternatively, the IP may be converted to its corresponding long integer and that integer used as the index, with the data split evenly into 256 tables by IPInt, to simplify queries and improve query speed.
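The IPInt conversion and the 256-way split can be sketched as below for IPv4. The text does not say how the 256 equal parts are formed; splitting the 32-bit range by its top 8 bits (the first octet) is one reading and is an assumption, and the table name prefix is hypothetical.

```python
import ipaddress

def ip_to_int(ip: str) -> int:
    """IPv4 dotted-quad string to its 32-bit integer value (IPInt)."""
    return int(ipaddress.IPv4Address(ip))

def shard_index(ip: str) -> int:
    # Assumption: 256 equal ranges of the 32-bit space, selected by the top 8 bits.
    return ip_to_int(ip) >> 24

def table_name(ip: str, prefix: str = "ip_credit_") -> str:
    # Hypothetical naming scheme: ip_credit_000 ... ip_credit_255.
    return f"{prefix}{shard_index(ip):03d}"
```

With this scheme a lookup for one address touches exactly one table, and a lookup for an address segment touches a contiguous run of tables.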
Once the final specific indicators updated for the current second period have been obtained, the final specific indicators corresponding to an IP address are used as that IP address's credit data, and the IP address is evaluated according to this credit data. An interface may be provided to third parties through which the credit data of the IP address can be accessed. Alternatively, an IP verification request for an IP address may be received from a third party; the credit data corresponding to the IP address is then looked up, a credit rating is assigned to the IP address based on that credit data, and the evaluation result is returned to the third party.
The method can be applied to IP interception by a firewall: the firewall determines the legitimacy of an IP address from its credit data. It can also be deployed as an independent IP credit rating platform that provides IP verification results to firewalls. A secondary verification mechanism can also be provided without affecting existing firewall functions: when the firewall judges an IP address to be suspicious, the IP credit rating platform performs a second verification based on the credit data, further improving the accuracy of firewall interception and preventing mistaken blocking.
The IP address analysis method provided by this embodiment can be combined with existing Internet architecture. As shown in FIG. 2, user access logs are collected as raw logs and combined with third-party IP blacklists and IP address libraries. Applying the method yields IP user attribute data, composed mainly of the final specific indicators; IP stain data, generated mainly from the third-party IP blacklists; and IP address library data, obtained from the third-party IP address libraries. The IP user attribute data, IP stain data, and IP address library data are integrated into the IP credit rating platform, which rates the IP address and produces its credit data. The credit data describes the characteristics of the IP address comprehensively and can be used to confirm the security of an IP address in the field of information security, or in the field of IP-based user profiling, providing an accurate description of IP addresses based on big data analysis. Application results can also be fed back to the IP credit rating platform for algorithm iteration and parameter adjustment on existing results, giving the system the ability to learn and adjust itself and further improving the accuracy of the platform's IP analysis.
Embodiment 2 of the present invention is described below with reference to the accompanying drawings.
An embodiment of the present invention provides an IP address analysis system whose structure, as shown in FIG. 3, includes:
a big data computing platform and an offline computing platform.
The big data platform includes a Hadoop computing platform and a Spark computing platform;
the offline computing platform includes a server or a server cluster.
The big data platform is configured to store the original logs, process the original logs, and collect and store historical data of IP addresses;
The offline computing platform is configured to analyze the historical data of IP addresses collected by the big data platform and to generate credit data of the IP addresses.
The offline computing platform can also communicate with a third party through the big data platform, either providing the credit data of the IP address to the third party, or receiving a third-party query request and returning auxiliary information for IP address verification based on the credit data.
Preferably, the IP address analysis system further includes a storage platform. The storage platform supports the MySQL system and can be configured to store the original logs, the credit data of IP addresses, the third-party IP library and third-party IP blacklist obtained from third parties, the final specific indicators of the current period, the final specific indicators of the previous second period, the temporary specific indicators of the current second period, and intermediate data generated during computation.
An embodiment of the present invention further provides a computer readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the above method are implemented.
An embodiment of the present invention further provides a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor; the processor implements the steps of the above method when executing the program.
The embodiment of the present invention provides an IP address analysis system that can be combined with the IP address analysis method provided by the embodiment of the present invention. By collecting historical data of IP addresses and analyzing it to generate credit data of the IP addresses, the system achieves a fine-grained, accurate analysis of IP addresses, determines IP address attributes from big data, and gives a comprehensive and accurate picture of the credit status of an IP address. It can be applied in scenarios such as IP address legitimacy verification and IP address interception, solves the problem of misjudging IP address security, and effectively prevents misjudgment of IP address legitimacy and erroneous blocking of IP addresses.
The content described above may be implemented individually or combined in various ways, and all such variations fall within the scope of protection of the present invention.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the embodiments of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or replace some of the technical features with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods disclosed above, and the functional modules/units of the systems and devices, may be implemented as software, firmware, hardware, or suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Moreover, as is well known to those of ordinary skill in the art, communication media typically contain computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
Industrial applicability
In the embodiments of the present invention, historical data of IP addresses is collected and analyzed to generate credit data of the IP addresses. This achieves a fine-grained, accurate analysis of IP addresses, determines IP address attributes from big data, and gives a comprehensive and accurate picture of the credit status of an IP address. The approach can be applied in scenarios such as IP address legitimacy verification and IP address interception, solves the problem of misjudging IP address security, and effectively prevents misjudgment of IP address legitimacy and erroneous blocking of IP addresses.

Claims (12)

  1. An IP address analysis method, comprising:
    collecting historical data of an IP address;
    analyzing the historical data of the IP address to generate credit data of the IP address.
  2. The IP address analysis method according to claim 1, wherein the step of collecting historical data of the IP address comprises:
    collecting and parsing original logs within a first period;
    formatting the information in the original logs to obtain predefined indicators, the predefined indicators including at least any one or more of the following:
    time, IP, number of requests in the working period, number of requests in the rest period, number of requests in the sleep period, requested file size in the working period, requested file size in the rest period, requested file size in the sleep period, number of user agents (UserAgents) in the working period, number of UserAgents in the rest period, number of UserAgents in the sleep period, number of mobile UserAgents, number of PC UserAgents, number of access sources, number of domain names accessed, number of hours in which the IP appeared;
    storing the predefined indicators corresponding to each IP address, each IP address corresponding to one or more predefined indicators.
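The indicator collection of claim 2 can be pictured as a per-IP aggregation over parsed log records. The following is a minimal sketch, not the patented implementation: the record fields (`ip`, `hour`, `size`, `ua`, `referrer`, `domain`), the hour boundaries of the working, rest, and sleep periods, and the substring test for mobile UserAgents are all assumptions, since the claim does not specify them.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Assumed period boundaries (hours of the day); the claim does not fix them.
SLEEP_HOURS = range(0, 7)
WORK_HOURS = range(9, 18)


def period_of(hour: int) -> str:
    """Map an hour (0-23) to 'work', 'sleep', or 'rest'."""
    if hour in WORK_HOURS:
        return "work"
    if hour in SLEEP_HOURS:
        return "sleep"
    return "rest"


def _counts() -> dict:
    return {"work": 0, "rest": 0, "sleep": 0}


def _sets() -> dict:
    return {"work": set(), "rest": set(), "sleep": set()}


@dataclass
class Indicators:
    """Predefined indicators for one IP over one first period (assumed: one day)."""
    requests: dict = field(default_factory=_counts)
    file_size: dict = field(default_factory=_counts)
    user_agents: dict = field(default_factory=_sets)
    mobile_uas: set = field(default_factory=set)
    pc_uas: set = field(default_factory=set)
    sources: set = field(default_factory=set)
    domains: set = field(default_factory=set)
    hours: set = field(default_factory=set)


def aggregate(records):
    """Fold parsed log records (dicts with a hypothetical schema) into per-IP indicators."""
    table = defaultdict(Indicators)
    for r in records:
        ind = table[r["ip"]]
        p = period_of(r["hour"])
        ind.requests[p] += 1
        ind.file_size[p] += r["size"]
        ind.user_agents[p].add(r["ua"])
        # Crude mobile/PC split, assumed for illustration only.
        (ind.mobile_uas if "Mobile" in r["ua"] else ind.pc_uas).add(r["ua"])
        ind.sources.add(r["referrer"])
        ind.domains.add(r["domain"])
        ind.hours.add(r["hour"])
    return table
```

The counts named in the claim (number of UserAgents, number of domains, number of hours) are then the sizes of the corresponding sets.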
  3. The IP address analysis method according to claim 2, wherein the step of collecting historical data of the IP address further comprises:
    obtaining a third-party IP library and/or a third-party IP blacklist from a third-party platform one or more times within a second period.
  4. The IP address analysis method according to claim 3, wherein analyzing the historical data of the IP address to generate the credit data of the IP address comprises:
    preprocessing and normalizing the predefined indicators of the first periods that correspond to working days within the second period to obtain an intermediate value for each working day, and preprocessing and normalizing the predefined indicators of the first periods that correspond to rest days within the second period to obtain an intermediate value for each rest day, the second period containing a plurality of first periods corresponding to working days and a plurality of first periods corresponding to rest days;
    performing weighted averaging on the working-day intermediate values to obtain working-day weighted means;
    performing weighted averaging or maximum-value processing on the rest-day intermediate values to obtain rest-day weighted means or maxima;
    calculating temporary specific indicators of the current second period from one or more working-day weighted means and one or more rest-day weighted means, the temporary specific indicators of the current second period including:
    the office exit IP probability for this period, the household exit IP probability for this period, the real person probability for this period, the activity score for this period, and the user-count group for this period;
    adjusting the temporary specific indicators of the current second period according to the final specific indicators of the previous second period and the third-party IP library and/or third-party IP blacklist to obtain the final specific indicators of the current second period, and using the final specific indicators of the current second period as the credit data of the IP address.
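One way to picture the per-day aggregation in claim 4: each first period (assumed here to be one calendar day) contributes an intermediate value, working days are combined by a weighted mean, and rest days are either averaged or reduced to their maximum. This is a sketch under stated assumptions: the Monday-to-Friday rule for working days and the uniform default weights are not taken from the claim, which does not say how days are classified or weighted.

```python
from datetime import date


def is_working_day(d: date) -> bool:
    """Assumption: Monday to Friday are working days; public holidays are ignored."""
    return d.weekday() < 5


def weighted_mean(values, weights=None):
    """Weighted mean of a non-empty sequence; uniform weights when none are given."""
    values = list(values)
    if weights is None:
        weights = [1.0] * len(values)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def summarize_period(daily, rest_mode="mean"):
    """daily: {date: intermediate value} covering one second period.

    Returns (working-day weighted mean, rest-day weighted mean or maximum).
    The period is expected to contain both working days and rest days.
    """
    work = [v for d, v in daily.items() if is_working_day(d)]
    rest = [v for d, v in daily.items() if not is_working_day(d)]
    work_value = weighted_mean(work)
    rest_value = max(rest) if rest_mode == "max" else weighted_mean(rest)
    return work_value, rest_value
```

Passing `rest_mode="max"` selects the maximum-value branch that the claim allows for rest days.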
  5. The IP address analysis method according to claim 4, wherein the step of preprocessing and normalizing the predefined indicators of the first periods that correspond to working days within the second period to obtain an intermediate value for each working day, and preprocessing and normalizing the predefined indicators of the first periods that correspond to rest days within the second period to obtain an intermediate value for each rest day, comprises:
    calculating, for the first period corresponding to a working day, the hours-present score, the mobile UserAgent count score, the PC UserAgent count score, the rest-period vs. working-period request count score, the rest-period vs. sleep-period request count score, and the accessed-domain count score, and taking the weighted mean of these scores to obtain the working-day intermediate value of the household exit IP probability;
    calculating, for the first period corresponding to a working day, the working-period request count score, the working-period vs. rest-period request count score, the rest-period vs. sleep-period request count score, the PC UserAgent count score, and the working-period vs. rest-period UserAgent count score, and taking the weighted mean of these scores to obtain the working-day intermediate value of the office exit IP probability;
    calculating, for the first period corresponding to a working day, the request count distribution score, the UserAgent count distribution score, the hours-present score, the domain count vs. source count score, and the mobile vs. PC UserAgent count score, and taking the weighted mean of these scores to obtain the working-day intermediate value of the real person probability;
    calculating, for the first period corresponding to a working day, the accessed-domain count score, the working-period request count score, the rest-period request count score, the sleep-period request count score, the hours-present score, and the request source count score, and taking the weighted mean of these scores to obtain the working-day intermediate value of the activity;
    calculating, for the first period corresponding to a rest day, the hours-present score, the mobile UserAgent count score, the PC UserAgent count score, the rest-period vs. working-period request count score, the rest-period vs. sleep-period request count score, and the accessed-domain count score, and taking the weighted mean of these scores to obtain the rest-day intermediate value of the household exit IP probability;
    calculating, for the first period corresponding to a rest day, the rest-period request count score, the UserAgent count distribution score, the hours-present score, the domain count vs. source count score, and the mobile vs. PC UserAgent count score, and taking the weighted mean of these scores to obtain the rest-day intermediate value of the real person probability;
    calculating, for the first period corresponding to a rest day, the accessed-domain count score, the working-period request count score, the rest-period request count score, the sleep-period request count score, the hours-present score, and the request source count score, and taking the weighted mean of these scores to obtain the rest-day intermediate value of the activity.
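As an illustration of claim 5, each intermediate value is a weighted mean of named sub-scores. The sketch below computes the working-day intermediate value of the household exit IP probability. The score keys, the example values, and the weights are hypothetical: the claim names the scores but gives no formulas or weights, and the sub-scores are assumed to be already normalized to [0, 1].

```python
def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted mean of named sub-scores, each assumed to lie in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing sub-scores: {sorted(missing)}")
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total


# Hypothetical weights for the working-day household exit IP intermediate value.
HOME_WEIGHTS = {
    "hours_present": 1.0,
    "mobile_ua_count": 2.0,
    "pc_ua_count": 1.0,
    "rest_vs_work_requests": 2.0,
    "rest_vs_sleep_requests": 1.0,
    "domain_count": 1.0,
}

# Hypothetical, already normalized sub-scores for one IP on one working day.
example_scores = {
    "hours_present": 0.5,
    "mobile_ua_count": 1.0,
    "pc_ua_count": 0.5,
    "rest_vs_work_requests": 0.75,
    "rest_vs_sleep_requests": 0.5,
    "domain_count": 0.25,
}

home_intermediate = weighted_score(example_scores, HOME_WEIGHTS)
```

The other intermediate values in the claim would use the same function with a different weight table, one per indicator and day type.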
  6. The IP address analysis method according to claim 4, wherein the step of calculating the temporary specific indicators of the current second period from one or more working-day weighted means and one or more rest-day weighted means comprises:
    obtaining, after preprocessing and normalization, the working-day vs. rest-day PC UserAgent count score, the working-day vs. rest-day mobile UserAgent count score, and the working-day vs. rest-day request count score, taking the weighted mean of these three scores, and taking the weighted mean of that result and the working-day intermediate value of the office exit IP probability as the office exit IP probability for this period;
    taking the weighted mean of the working-day intermediate value and the rest-day intermediate value of the household exit IP probability as the household exit IP probability for this period;
    taking the weighted mean of the working-day intermediate value and the rest-day intermediate value of the real person probability as the real person probability for this period;
    taking the weighted mean of the working-day intermediate value and the rest-day intermediate value of the activity as the activity score for this period;
    grouping by the maximum of the working-day mobile UserAgent count, the rest-day mobile UserAgent count, the working-day PC UserAgent count, and the rest-day PC UserAgent count, and using the result as the user-count group for this period.
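The period-level combination in claim 6 reduces to two small operations: blending a working-day value with a rest-day value, and bucketing the largest UserAgent count into a group. The following sketch assumes an equal default blend weight and a set of group boundaries; neither is given in the claim, and both are hypothetical.

```python
from bisect import bisect_right

# Hypothetical lower bounds of groups 1..4; counts below 2 fall into group 0.
GROUP_BOUNDS = [2, 6, 21, 101]


def blend(work_value: float, rest_value: float, work_weight: float = 0.5) -> float:
    """Weighted mean of a working-day and a rest-day intermediate value."""
    return work_weight * work_value + (1.0 - work_weight) * rest_value


def user_count_group(work_mobile: int, rest_mobile: int,
                     work_pc: int, rest_pc: int) -> int:
    """Group index derived from the largest of the four UserAgent counts."""
    largest = max(work_mobile, rest_mobile, work_pc, rest_pc)
    return bisect_right(GROUP_BOUNDS, largest)
```

Using ordered integer group indices keeps the later "adjacent group" comparison simple, since two groups are adjacent exactly when their indices differ by one.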
  7. The IP address analysis method according to claim 4, wherein the final specific indicators include at least any one or more of the following:
    IP, IPInt, update ID, update count of the IP, final user-count group, final sum of office exit IP probabilities, final sum of household exit IP probabilities, final sum of real person probabilities, final sum of activity scores,
    wherein "IPInt" is the long integer corresponding to the IP address, "update ID" is the number of times the final specific indicators of the second period have been updated, and "update count of the IP" is the number of times the final specific indicators of the second period have been updated for a given IP address,
    the step of adjusting the temporary specific indicators of the current second period according to the final specific indicators of the previous second period and the third-party IP library and/or third-party IP blacklist to obtain the final specific indicators of the current second period comprises:
    for an IP address that appears in both the final specific indicators of the previous second period and the temporary specific indicators of the current second period, obtaining the final specific indicators of the current second period by the following calculation:
    when the user-count group in the temporary specific indicators of the current second period and the final user-count group in the final specific indicators of the previous second period are adjacent groups, selecting the group with the larger number of people as the final user-count group of the current second period, and otherwise selecting the user-count group in the temporary specific indicators of the current second period,
    taking the office exit IP probability in the temporary specific indicators of the current second period plus the final sum of office exit IP probabilities in the final specific indicators of the previous second period as the final sum of office exit IP probabilities of the current second period,
    taking the household exit IP probability in the temporary specific indicators of the current second period plus the final sum of household exit IP probabilities in the final specific indicators of the previous second period as the final sum of household exit IP probabilities of the current second period,
    taking the real person probability in the temporary specific indicators of the current second period plus the final sum of real person probabilities in the final specific indicators of the previous second period as the final sum of real person probabilities of the current second period,
    adding the activity score in the temporary specific indicators of the current second period to the final sum of activity scores of the previous second period, dividing by the final update count recorded by the update ID, and multiplying by the update count of the IP, to give the final sum of activity scores of the current second period;
    for an IP address that does not appear in the final specific indicators of the previous second period but appears in the temporary specific indicators of the current second period, obtaining the temporary specific indicators of the current second period by the following calculation:
    taking the user-count group in the temporary specific indicators of the current second period as the final user-count group of the current second period,
    taking the office exit IP probability in the temporary specific indicators of the current second period as the final sum of office exit IP probabilities of the current second period,
    taking the household exit IP probability in the temporary specific indicators of the current second period as the final sum of household exit IP probabilities of the current second period,
    taking the real person probability in the temporary specific indicators of the current second period as the final sum of real person probabilities of the current second period,
    dividing the activity score in the temporary specific indicators of the current second period by the update ID and multiplying by the update count of the IP, to give the final sum of activity scores of the current second period;
    overwriting the final specific indicators of the previous second period with the final specific indicators of the current second period, and recording the number of times the final specific indicators of the second period have been updated and the number of times the corresponding IP address has been updated.
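The update rule of claim 7 can be sketched as a merge of this period's temporary indicators into a running per-IP record. The sketch below makes several assumptions that the claim leaves open: groups are ordered integer indices (so "adjacent" means the indices differ by one and "larger" means the higher index), `update_id` and the per-IP update count are read after being incremented for the current period, and the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Temporary:
    """Temporary specific indicators of the current second period."""
    group: int
    office: float
    home: float
    human: float
    activity: float


@dataclass
class Final:
    """Final specific indicators carried from one second period to the next."""
    group: int
    office_sum: float
    home_sum: float
    human_sum: float
    activity_sum: float
    ip_updates: int


def merge(temp: Temporary, prev: Optional[Final], update_id: int) -> Final:
    """Fold the temporary indicators into the previous final record.

    update_id: global number of second-period updates, counting this one
    (assumption: the claim does not say when the counter is incremented).
    """
    if prev is None:
        # IP not seen before: the temporary values start the running sums.
        ip_updates = 1
        return Final(temp.group, temp.office, temp.home, temp.human,
                     temp.activity / update_id * ip_updates, ip_updates)
    ip_updates = prev.ip_updates + 1
    # Adjacent groups: keep the larger one; otherwise use the current period's group.
    if abs(temp.group - prev.group) == 1:
        group = max(temp.group, prev.group)
    else:
        group = temp.group
    return Final(
        group,
        temp.office + prev.office_sum,
        temp.home + prev.home_sum,
        temp.human + prev.human_sum,
        (temp.activity + prev.activity_sum) / update_id * ip_updates,
        ip_updates,
    )
```

The returned record overwrites the previous one for that IP, which corresponds to the final overwrite step of the claim.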
  8. The IP address analysis method according to claim 7, wherein the step of adjusting the temporary specific indicators of the current second period according to the final specific indicators of the previous second period and the third-party IP library and/or third-party IP blacklist to obtain the final specific indicators of the current second period further comprises:
    filtering out, from the temporary specific indicators of the current second period, data whose corresponding IP is syntactically invalid or is a local area network IP;
    adjusting the office exit IP probability, the household exit IP probability, and the real person probability in the temporary specific indicators of the current second period according to the additional IP address information contained in the third-party IP library;
    generating IP credit taint data from the third-party IP blacklist, and adding the IP credit taint data to the final specific indicators of the current second period.
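The filtering and taint steps of claim 8, together with the "IPInt" field of claim 7, can be sketched with Python's standard `ipaddress` module. This is a simplified illustration under stated assumptions: the taint data is reduced to a boolean flag, the blacklist is a plain set of strings, the dictionary layout is hypothetical, and the library-based adjustment of probabilities is omitted because the claim does not describe it.

```python
import ipaddress


def ip_to_int(ip: str) -> int:
    """Integer form of an IP address (the 'IPInt' field)."""
    return int(ipaddress.ip_address(ip))


def is_usable(ip: str) -> bool:
    """False for strings that are not valid IPs and for non-public addresses."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False
    # is_private covers the RFC 1918 LAN ranges and also loopback, link-local,
    # and similar reserved ranges, which is broader than "LAN IP" (assumption).
    return not addr.is_private


def filter_and_taint(temp_indicators: dict, blacklist: set) -> dict:
    """Drop unusable IPs and attach IPInt plus a taint flag from the blacklist."""
    return {
        ip: {**values, "ip_int": ip_to_int(ip), "tainted": ip in blacklist}
        for ip, values in temp_indicators.items()
        if is_usable(ip)
    }
```

A real third-party blacklist would likely carry categories or timestamps rather than bare addresses, in which case the flag would become a richer record.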
  9. The IP address analysis method according to claim 1, wherein the method further comprises:
    providing an interface to a third party and allowing the credit data of the IP address to be accessed through the interface; or,
    receiving an IP verification request for an IP address from a third party, looking up the credit data corresponding to the IP address, evaluating the credit rating of the IP address according to the credit data, and returning the evaluation result to the third party.
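The verification request of claim 9 amounts to a lookup followed by a rating. The sketch below is purely illustrative: the in-memory store, its field names, the idea of averaging the summed real person probability over the IP's update count, and the rating thresholds are all assumptions, since the claim does not define how a credit rating is derived from the credit data. The addresses are taken from the reserved documentation ranges.

```python
from typing import Optional

# Hypothetical credit data store: IP -> fields assumed for this example.
CREDIT_DATA = {
    "192.0.2.1": {"human_sum": 4.5, "ip_updates": 5, "tainted": False},
    "198.51.100.7": {"human_sum": 0.5, "ip_updates": 5, "tainted": False},
    "203.0.113.9": {"human_sum": 4.5, "ip_updates": 5, "tainted": True},
}


def rate_ip(ip: str, store: dict = CREDIT_DATA) -> Optional[str]:
    """Return 'high', 'medium', or 'low', or None when the IP is unknown."""
    data = store.get(ip)
    if data is None:
        return None
    if data["tainted"]:
        return "low"
    # Assumed: average the summed real person probability over the update count.
    human = data["human_sum"] / data["ip_updates"]
    if human >= 0.75:  # hypothetical threshold
        return "high"
    if human >= 0.25:  # hypothetical threshold
        return "medium"
    return "low"
```

Returning `None` for an unknown address leaves the decision to the caller, which matches the secondary-verification role in which the firewall keeps its own verdict when no credit data exists.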
  10. An IP address analysis system, comprising a big data platform and an offline computing platform;
    the big data platform is configured to store original logs, process the original logs, and collect and store historical data of IP addresses;
    the offline computing platform is configured to analyze the historical data of IP addresses collected by the big data platform and to generate credit data of the IP addresses.
  11. A computer readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.
  12. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 9 when executing the program.
PCT/CN2018/079732 2017-04-01 2018-03-21 Method for analyzing ip address, system, computer readable storage medium, and computer device WO2018177167A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710216069.1A CN107707516B (en) 2017-04-01 2017-04-01 A kind of IP address analysis method and system
CN201710216069.1 2017-04-01

Publications (1)

Publication Number Publication Date
WO2018177167A1 (en) 2018-10-04

Family

ID=61169473

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/079732 WO2018177167A1 (en) 2017-04-01 2018-03-21 Method for analyzing ip address, system, computer readable storage medium, and computer device

Country Status (2)

Country Link
CN (1) CN107707516B (en)
WO (1) WO2018177167A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107707516B (en) * 2017-04-01 2018-11-13 贵州白山云科技有限公司 A kind of IP address analysis method and system
CN110401727B (en) * 2018-04-24 2022-04-19 北京数安鑫云信息技术有限公司 IP address analysis method and device
CN108683531B (en) * 2018-05-02 2019-06-21 百度在线网络技术(北京)有限公司 Method and apparatus for handling log information
CN109873811A (en) * 2019-01-16 2019-06-11 光通天下网络科技股份有限公司 Network safety protection method and its network security protection system based on attack IP portrait

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1746916A (en) * 2005-10-25 2006-03-15 二六三网络通信股份有限公司 Network IP address credit assessment and use in electronic mail system
CN104506356A (en) * 2014-12-24 2015-04-08 网易(杭州)网络有限公司 Method and device for determining credibility of IP (Internet protocol) address
US20150215334A1 (en) * 2012-09-28 2015-07-30 Level 3 Communications, Llc Systems and methods for generating network threat intelligence
US9319382B2 (en) * 2014-07-14 2016-04-19 Cautela Labs, Inc. System, apparatus, and method for protecting a network using internet protocol reputation information
CN107707516A (en) * 2017-04-01 2018-02-16 贵州白山云科技有限公司 A kind of IP address analysis method and system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7761558B1 (en) * 2006-06-30 2010-07-20 Google Inc. Determining a number of users behind a set of one or more internet protocol (IP) addresses
CN101014072A (en) * 2007-02-15 2007-08-08 北京互联易通信息技术有限公司 Method and apparatus for obtaining and analyzing data information aimed at data object
CN101719824B (en) * 2009-11-24 2012-07-25 北京信息科技大学 Network behavior detection-based trust evaluation system and network behavior detection-based trust evaluation method
US20130067062A1 (en) * 2011-09-12 2013-03-14 Microsoft Corporation Correlation of Users to IP Address Lease Events
CN103475637B (en) * 2013-04-24 2018-03-27 携程计算机技术(上海)有限公司 The method for network access control and system of behavior are accessed based on IP
CN104954188B (en) * 2015-06-30 2018-05-01 北京奇安信科技有限公司 Web log file safety analytical method based on cloud, device and system
CN105610616B (en) * 2015-12-29 2019-04-26 赛尔网络有限公司 The single IP average flow rate statistical method of access net and system based on ICP liveness
CN106230890A (en) * 2016-07-15 2016-12-14 中电长城网际系统应用有限公司 A kind of message normalization processing method and system
CN106254096A (en) * 2016-07-21 2016-12-21 柳州龙辉科技有限公司 A kind of processing means of Linux daily record

Also Published As

Publication number Publication date
CN107707516B (en) 2018-11-13
CN107707516A (en) 2018-02-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18774565

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18774565

Country of ref document: EP

Kind code of ref document: A1