CN112800419A - Method, apparatus, medium and device for identifying IP group - Google Patents

Method, apparatus, medium and device for identifying IP group

Info

Publication number
CN112800419A
Authority
CN
China
Prior art keywords
cluster
behavior
same
ips
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911109108.3A
Other languages
Chinese (zh)
Inventor
潘廷珅
丛磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shuan Xinyun Information Technology Co ltd
Original Assignee
Beijing Shuan Xinyun Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/552 Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2463/00 Additional details relating to network architectures or network communication protocols for network security covered by H04L63/00
    • H04L2463/146 Tracing the source of attacks

Abstract

A method, apparatus, medium and device for identifying IP groups. The method comprises: acquiring, from log data within a preset time period, the URLs accessed by each IP and N behavior features of each IP; aggregating all IPs into different clusters based on the URLs accessed by each IP and the N behavior features; and, when a cluster meets a preset condition, determining that the users corresponding to the IPs in that cluster are an IP group. The method clusters groups more accurately from user behavior in web logs by combining users who show similar access behavior and access similar URLs within a period of time. This supports the accuracy of the clustering result, remains effective for low-frequency groups, and keeps the identification result interpretable and flexible through specific rule parameters.

Description

Method, apparatus, medium and device for identifying IP group
Technical Field
This document relates to Web network security and, more particularly, to methods, apparatus, media and devices for identifying IP groups.
Background
IP group behavior refers to a group of organized bots carrying out attacks together over a period of time. In Web security, the industry generally applies data mining algorithms to internet traffic data collected by security devices in order to analyze abnormal user behavior, such as CC attacks, crawlers, and SQL injection.
In the related art, existing Web application firewalls generally analyze abnormal user behavior by applying data mining algorithms to offline Web logs, and there is currently no mature scheme for identifying group behavior specifically. Existing Web application firewalls are weak at recognizing low-frequency group behavior, with low accuracy and a high risk of misjudgment. In the group behavior they do identify, the similarity between user behaviors is not high and the results are not easily interpretable.
Disclosure of Invention
To overcome the problems in the related art, a method, apparatus, medium, and device for identifying an IP group are provided.
According to a first aspect herein, there is provided a method of identifying IP groups, comprising:
acquiring, from log data within a preset time period, the URLs accessed by each IP and N behavior features of each IP;
aggregating all IPs into different clusters based on the URLs accessed by each IP and the N behavior features;
and, when a cluster meets a preset condition, determining that the users corresponding to the IPs in the cluster are an IP group.
The behavior features include: page view count (PV), number of URLs accessed, number of user-agents used, number of parameters, and number of GET/POST requests.
Aggregating all IPs into different clusters based on the URLs accessed by each IP and the N behavior features comprises:
counting the M URLs with the highest number of visits within the preset time period;
constructing, for each IP, a bag-of-words model over the M URLs;
inputting the bag-of-words models of all IPs into an LDA model for training, and aggregating IPs that access similar URLs under the same topic;
establishing behavior feature vectors for the IPs aggregated under the same topic;
inputting the behavior feature vectors into a K-means model, and aggregating into the same cluster those IPs, among the IPs aggregated under the same topic, that have similar behavior features.
Establishing behavior feature vectors for the IPs aggregated under the same topic comprises:
establishing the behavior feature vectors based on the IPs aggregated under the same topic and the N behavior features;
or establishing the behavior feature vectors based on the IPs aggregated under the same topic and the widened features of the N behavior features.
The N behavior features are each combined with one specific behavior feature among them to obtain N-1 new features, and the widened features of the N behavior features are the union of the N behavior features and the N-1 new features.
Determining, when a cluster meets the preset condition, that the users corresponding to the IPs in the cluster are an IP group comprises:
when the number of IPs in a cluster is greater than 5 and the page view count of each IP is greater than 4,
if the IP users in the cluster all use the same user-agent and, when all user-agents are ranked by number of users from high to low, that user-agent falls in the last 30%, determining that the IP users in the cluster are a group; or
if the IPs in the cluster share the same B segment and use the same user-agent, determining that the IP users in the cluster are a group; or
if the ratio of the number of IPs in the cluster to the number of IPs in the cluster that share the same C segment is greater than 2, the ratio of the number of IPs in the cluster to the number of IPs in the cluster that share the same B segment is less than or equal to 3, and the average number of distinct URL patterns accessed by the IPs in the cluster is less than or equal to 5, determining that the IP users in the cluster are a group.
According to another aspect herein, there is provided an apparatus for identifying IP groups, comprising:
a log analysis module, configured to acquire, from log data within a preset time period, the URLs accessed by each IP and N behavior features of each IP;
an aggregation module, configured to aggregate all IPs into different clusters based on the URLs accessed by each IP and the N behavior features;
and a determining module, configured to determine, when a cluster meets a preset condition, that the users corresponding to the IPs in the cluster are an IP group.
The behavior features include: page view count (PV), number of URLs accessed, number of user-agents used, number of parameters, and number of GET/POST requests.
The aggregation module is configured for:
counting the M URLs with the highest number of visits within the preset time period;
constructing, for each IP, a bag-of-words model over the M URLs;
inputting the bag-of-words models of all IPs into an LDA model for training, and aggregating IPs that access similar URLs under the same topic;
establishing behavior feature vectors for the IPs aggregated under the same topic;
inputting the behavior feature vectors into a K-means model, and aggregating into the same cluster those IPs, among the IPs aggregated under the same topic, that have similar behavior features.
Establishing behavior feature vectors for the IPs aggregated under the same topic comprises:
establishing the behavior feature vectors based on the IPs aggregated under the same topic and the N behavior features;
or establishing the behavior feature vectors based on the IPs aggregated under the same topic and the widened features of the N behavior features.
The N behavior features are each combined with one specific behavior feature among them to obtain N-1 new features, and the widened features of the N behavior features are the union of the N behavior features and the N-1 new features.
Determining, when a cluster meets the preset condition, that the users corresponding to the IPs in the cluster are an IP group comprises:
when the number of IPs in a cluster is greater than 5 and the page view count of each IP is greater than 4,
if the IP users in the cluster all use the same user-agent and, when all user-agents are ranked by number of users from high to low, that user-agent falls in the last 30%, determining that the IP users in the cluster are a group; or
if the IPs in the cluster share the same B segment and use the same user-agent, determining that the IP users in the cluster are a group; or
if the ratio of the number of IPs in the cluster to the number of IPs in the cluster that share the same C segment is greater than 2, the ratio of the number of IPs in the cluster to the number of IPs in the cluster that share the same B segment is less than or equal to 3, and the average number of distinct URL patterns accessed by the IPs in the cluster is less than or equal to 5, determining that the IP users in the cluster are a group.
According to another aspect herein, there is provided a computer readable storage medium having stored thereon a computer program which, when executed, carries out the steps of the method of identifying IP groups.
According to another aspect herein, there is provided a computer device comprising a processor, a memory and a computer program stored on the memory, wherein the processor when executing the computer program implements the steps of the method of identifying IP groups.
By analyzing log data, the method aggregates IPs that access similar URLs and have similar behavior features into a cluster, analyzes the features of the IPs in the cluster, and determines that the users corresponding to the IPs in the cluster are an IP group when the preset conditions are met. The method clusters groups more accurately from user behavior in web logs, supports the accuracy of the clustering result, remains effective for low-frequency groups, and keeps the identification result interpretable and flexible through specific rule parameters.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the disclosure. In the drawings:
Fig. 1 is a flow diagram illustrating a method of identifying IP groups in accordance with an example embodiment.
Fig. 2 is a block diagram illustrating an apparatus for identifying IP groups, according to an example embodiment.
Fig. 3 is a block diagram illustrating a computer device according to an example embodiment.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings of the embodiments of the present invention, and it is obvious that the described embodiments are some but not all of the embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments herein without making any creative effort, shall fall within the scope of protection. It should be noted that the embodiments and features of the embodiments may be arbitrarily combined with each other without conflict.
Fig. 1 is a flow diagram illustrating a method of identifying IP groups in accordance with an example embodiment. Referring to Fig. 1, the method of identifying IP groups comprises:
Step S11: based on the log data within a preset time period, the URLs accessed by each IP and N behavior features of each IP are obtained.
Step S12: all IPs are aggregated into different clusters based on the URLs accessed by each IP and the N behavior features.
Step S13: when a cluster meets the preset condition, the users corresponding to the IPs in the cluster are determined to be an IP group.
First, log data within a preset time period is ingested; the time period is set according to the specific situation. In one embodiment, the clustering analysis requires at least 10,000 and at most 200,000 IPs. In practice, the time period is chosen so that the number of user IPs within it satisfies this condition.
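The window-size condition can be illustrated with a small check. This is only a sketch: the function name `window_is_usable` and the input layout (a plain iterable of IP strings seen in the candidate window) are assumptions made for illustration, while the two bounds are the values quoted in the embodiment.

```python
MIN_IPS = 10_000    # minimum number of IPs quoted in the embodiment
MAX_IPS = 200_000   # maximum number of IPs quoted in the embodiment


def window_is_usable(ips, min_ips=MIN_IPS, max_ips=MAX_IPS):
    """Return True if the number of distinct IPs lies within the bounds.

    `ips` is a hypothetical iterable of IP strings, one per log record.
    """
    return min_ips <= len(set(ips)) <= max_ips


# Tiny bounds are used here only so the example stays readable.
sample = ["1.1.1.1", "1.1.1.1", "2.2.2.2"]
print(window_is_usable(sample, min_ips=2, max_ips=5))
```

If the check fails, the time period would be lengthened or shortened and the check repeated.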
After a sufficient amount of log data has been ingested, it is parsed to obtain the URLs accessed by each IP and N behavior features of each IP when accessing pages. In one embodiment, the behavior features include: page view count (PV), number of URLs accessed, number of user-agents used, number of parameters, and number of GET/POST requests. Note that these are only some of the behavior features used in one embodiment, not an exhaustive list. To make the result of the clustering analysis more accurate, the behavior features may be divided more finely, or other behavior features may be added; this document places no limit on them.
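A minimal sketch of this feature extraction follows. The record layout (dicts with `ip`, `method`, `url` and `user_agent` keys) and the function name `extract_features` are hypothetical. Two readings are also assumptions: "number of parameters" is taken as the total number of query-string parameters across an IP's requests, and the request-method feature is taken as the count of requests whose method is GET or POST.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def extract_features(records):
    """Return {ip: [PV, distinct URLs, distinct user-agents, params, GET/POST]}."""
    pv = defaultdict(int)
    urls = defaultdict(set)
    agents = defaultdict(set)
    params = defaultdict(int)
    get_post = defaultdict(int)
    for rec in records:
        ip = rec["ip"]
        parts = urlsplit(rec["url"])
        pv[ip] += 1                          # every request counts as one page view
        urls[ip].add(parts.path)             # distinct URL paths
        agents[ip].add(rec["user_agent"])    # distinct user-agent strings
        params[ip] += len(parse_qsl(parts.query, keep_blank_values=True))
        if rec["method"] in ("GET", "POST"):
            get_post[ip] += 1
    return {
        ip: [pv[ip], len(urls[ip]), len(agents[ip]), params[ip], get_post[ip]]
        for ip in pv
    }


records = [
    {"ip": "10.0.0.1", "method": "GET", "url": "/item?id=1&ref=a", "user_agent": "bot/1.0"},
    {"ip": "10.0.0.1", "method": "POST", "url": "/login", "user_agent": "bot/1.0"},
    {"ip": "10.0.0.2", "method": "HEAD", "url": "/item?id=2", "user_agent": "curl/8"},
]
features = extract_features(records)
print(features["10.0.0.1"])  # [2, 2, 1, 2, 2]
```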
In an embodiment, step S12 specifically includes:
Step S121: the M URLs with the highest number of visits within the preset time period are counted. From the log data acquired for the preset time period, the top M URLs by visit count can be determined from the URLs accessed by each IP during that period. In this embodiment, M is set to 100, that is, the 100 most-visited URLs among all accessed URLs are counted. In practical applications, the value of M should be chosen according to the actual situation and is not limited herein.
Step S122: a bag-of-words model over the M URLs is constructed for each IP.
For each IP and for each of the M URLs, the feature value of a URL the IP accessed is set to its access count and the feature value of a URL it did not access is set to 0, which yields the bag-of-words model.
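Steps S121 and S122 can be sketched as follows. The input layout (a sequence of `(ip, url)` pairs) and the name `build_bags` are assumptions made for illustration; M is passed in as `m`.

```python
from collections import Counter, defaultdict


def build_bags(visits, m):
    """Return (top_urls, {ip: count vector over top_urls}).

    `visits` is a hypothetical sequence of (ip, url) pairs from the log.
    """
    visits = list(visits)
    totals = Counter(url for _, url in visits)
    top_urls = [url for url, _ in totals.most_common(m)]   # the M most-visited URLs
    index = {url: i for i, url in enumerate(top_urls)}
    bags = defaultdict(lambda: [0] * len(top_urls))
    for ip, url in visits:
        vec = bags[ip]            # IPs that hit no top URL still get a zero vector
        if url in index:
            vec[index[url]] += 1  # feature value = access count
    return top_urls, dict(bags)


visits = [("a", "/x"), ("a", "/x"), ("a", "/y"), ("b", "/x"), ("b", "/z"), ("c", "/z")]
top_urls, bags = build_bags(visits, m=2)
print(top_urls)  # ['/x', '/z']
print(bags)      # {'a': [2, 0], 'b': [1, 1], 'c': [0, 1]}
```

URLs outside the top M (here `/y`) are simply ignored, so every IP ends up with a vector of the same length M.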
Step S123: the bag-of-words models of all IPs are input into an LDA model for training, and IPs that access similar URLs are aggregated under the same topic.
By training the LDA model on the bag-of-words models of all IPs, IPs that access similar URLs, that is, IPs whose users have similar access preferences, are aggregated under the same topic. Suppose user A visits Taobao, Baidu and Jingdong, user B visits Google, Taobao and Jingdong, and user C visits mother-and-baby, health and automobile websites. After LDA training, the websites visited by user A and user B are found to be similar, since both visit Taobao and Jingdong, so the access preferences of user A and user B are similar. Correspondingly, the URLs accessed by the IPs of user A and user B are similar, and those IPs are aggregated under the same topic, for example a shopping topic. In this way, all IPs can be aggregated under different topics, such as a shopping topic, a scientific research topic, and so on. In practical applications, the categories of the output topics are adjusted through the parameters of the LDA model; these parameters should be chosen according to the actual situation and are not limited herein.
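To make the topic-assignment step concrete, the sketch below runs a very small collapsed Gibbs sampler for LDA over per-IP URL count vectors and then assigns each IP to its dominant topic. This is an illustration under stated assumptions, not the patent's implementation: the function name `lda_assign_topics`, the input layout, and the hyperparameters `alpha`, `beta` and `iters` are all hypothetical, and a production system would more likely rely on an existing LDA library.

```python
import random


def lda_assign_topics(bags, n_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Assign each IP to its dominant topic with a tiny collapsed Gibbs sampler.

    `bags` maps ip -> list of visit counts over the M URLs (hypothetical layout).
    """
    rng = random.Random(seed)
    ips = list(bags)
    n_vocab = len(bags[ips[0]])
    # Expand each count vector into a list of URL indices ("tokens").
    docs = [[w for w, c in enumerate(bags[ip]) for _ in range(c)] for ip in ips]
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_vocab for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token's current assignment before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Dominant topic = the topic holding most of the IP's tokens.
    return {ip: doc_topic[d].index(max(doc_topic[d])) for d, ip in enumerate(ips)}


bags = {"a": [3, 2, 0, 0], "b": [2, 3, 0, 0], "c": [0, 0, 3, 2]}
topics = lda_assign_topics(bags, n_topics=2)
print(topics)
```

In this toy input, IPs `a` and `b` visit the same two URLs and `c` visits two different ones, so one would expect `a` and `b` to tend to share a topic, although a sampler gives no hard guarantee. The number of topics plays the role of the LDA parameter that controls how many topic categories are output.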
Step S124: behavior feature vectors are established for the IPs aggregated under the same topic. After IPs that access similar URLs have been aggregated under the same topic, the behavior feature vectors can be established from the number of IPs under the topic and the N behavior features of each IP. For example, if K IPs are aggregated under the shopping topic and each IP has the following 4 behavior features: page view count (PV), number of URLs accessed, number of user-agents used, and number of parameters, then behavior feature vectors of shape (K, 4) can be established.
In one embodiment, establishing behavior feature vectors for the IPs aggregated under the same topic comprises:
establishing the behavior feature vectors based on the IPs aggregated under the same topic and the N behavior features;
or establishing the behavior feature vectors based on the IPs aggregated under the same topic and the widened features of the N behavior features.
For example, the N behavior features are each combined with one specific behavior feature among them to obtain N-1 new features, and the widened features of the N behavior features are the union of the N behavior features and the N-1 new features. Suppose K IPs are aggregated under the shopping topic and each IP has the following 4 behavior features: page view count (PV), number of URLs accessed, number of user-agents used, and number of parameters. Behavior feature vectors of shape (K, 4) can then be established. Dividing the other features by the page view count PV yields three new features: number of URLs accessed / PV, number of user-agents used / PV, and number of parameters / PV. The original 4 features thus become 7 features and behavior feature vectors of shape (K, 7) are established. Widening the existing behavior features in this way can further improve clustering accuracy.
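The widening step can be sketched in a few lines. The function name `widen` and the choice to return 0.0 when the base feature is zero are assumptions made for illustration; the division by PV follows the example in the text.

```python
def widen(features, base_index=0):
    """Append the N-1 ratio features (each other feature divided by the base one).

    `base_index` selects the specific behavior feature used as the divisor;
    index 0 is assumed to hold PV. A zero base yields 0.0 ratios (an assumption).
    """
    base = features[base_index]
    ratios = [
        (value / base if base else 0.0)
        for i, value in enumerate(features)
        if i != base_index
    ]
    return list(features) + ratios


# PV=10, URLs=5, user-agents=2, parameters=20
print(widen([10, 5, 2, 20]))  # [10, 5, 2, 20, 0.5, 0.2, 2.0]
```

Applying `widen` to each of the K rows turns a (K, 4) feature matrix into a (K, 7) one.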
Step S125: the behavior feature vectors are input into a K-means model, and those IPs, among the IPs aggregated under the same topic, that have similar behavior features are aggregated into the same cluster.
In this embodiment, a bag-of-words model is built for every IP and input into the LDA model, so that the IPs of users with similar preferences are aggregated under the same topic. Behavior feature vectors are then built for the IPs under each topic and input into the K-means model, which clusters together the IPs under the same topic that have similar behavior features. The interpretability and flexibility of the identification result are ensured through specific rule parameters.
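The K-means step can be illustrated with a compact pure-Python version of Lloyd's algorithm. The function names and the deterministic farthest-first initialization are choices made for this sketch, not details given in the text; a library implementation would normally be used instead.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Return a cluster label for each point (Lloyd's algorithm)."""
    # Farthest-first initialization keeps the example deterministic.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:   # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:            # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Six behavior feature vectors from one topic: two visibly different behaviors.
vectors = [(1, 1), (1.2, 0.9), (0.8, 1.1), (10, 10), (10.2, 9.9), (9.8, 10.1)]
print(kmeans(vectors, k=2))  # [0, 0, 0, 1, 1, 1]
```

Because K-means is distance based, features on very different scales would usually be normalized before clustering; that preprocessing is not shown here.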
In an embodiment, determining, when a cluster satisfies the preset condition, that the users corresponding to the IPs in the cluster are an IP group comprises:
when the number of IPs in a cluster is greater than 5 and the page view count of each IP is greater than 4,
if the IP users in the cluster all use the same user-agent and, when all user-agents are ranked by number of users from most to least, that user-agent falls in the last 30 percent, the IP users in the cluster are determined to be a group; or
if the IPs in the cluster share the same B segment and use the same user-agent, the IP users in the cluster are determined to be a group; or
if the ratio of the number of IPs in the cluster to the number of IPs in the cluster that share the same C segment is greater than 2, the ratio of the number of IPs in the cluster to the number of IPs in the cluster that share the same B segment is less than or equal to 3, and the average number of distinct URL patterns accessed by the IPs in the cluster is less than or equal to 5, the IP users in the cluster are determined to be a group.
Group behavior is typically driven by the same program, which attacks similar target addresses at the same time. The user-agents used by the attackers during this period are therefore mostly the same, and in the IP dimension there are many users sharing the same B or C segment, so IP groups can be identified by extracting these features within a cluster. In this embodiment, to prevent accidental clicks by ordinary users from being identified as group behavior, a precondition must be met before the IP users in a cluster are determined to be a group: the cluster must contain more than 5 IPs to be considered a group at all, and the page view count of each IP in the cluster must be greater than 4, which prevents a user from being identified as part of a group because of an accidental click. These values should of course be adjusted to the actual situation; they are listed only to aid understanding of the scheme and do not limit it.
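The rule check can be sketched as below. Several points are assumptions made for illustration and are not confirmed by the text: the B segment and C segment are read as the first two and first three octets of an IPv4 address; the "number of IPs that share the same segment" is read as the size of the largest group of IPs sharing one segment; and the names `segment`, `rare_user_agents`, `is_group` and the cluster layout (dicts with `ip`, `pv`, `user_agent`) are hypothetical.

```python
from collections import Counter


def segment(ip, octets):
    """First `octets` octets of an IPv4 address (2 = B segment, 3 = C segment)."""
    return ".".join(ip.split(".")[:octets])


def rare_user_agents(agent_user_counts):
    """User-agents ranked in the last 30% by number of users (most to least)."""
    ranked = sorted(agent_user_counts, key=agent_user_counts.get, reverse=True)
    return set(ranked[len(ranked) * 7 // 10:])


def is_group(cluster, rare_agents, avg_url_patterns):
    """Apply the precondition and the three alternative rules to one cluster."""
    # Precondition: more than 5 IPs, and every IP has more than 4 page views.
    if len(cluster) <= 5 or any(m["pv"] <= 4 for m in cluster):
        return False
    agents = {m["user_agent"] for m in cluster}
    same_agent = len(agents) == 1
    b_counts = Counter(segment(m["ip"], 2) for m in cluster)
    c_counts = Counter(segment(m["ip"], 3) for m in cluster)
    # Rule 1: one shared user-agent that is rarely used overall.
    if same_agent and next(iter(agents)) in rare_agents:
        return True
    # Rule 2: one shared user-agent and one shared B segment.
    if same_agent and len(b_counts) == 1:
        return True
    # Rule 3: segment ratios (assumed reading) plus few distinct URL patterns.
    c_ratio = len(cluster) / max(c_counts.values())
    b_ratio = len(cluster) / max(b_counts.values())
    return c_ratio > 2 and b_ratio <= 3 and avg_url_patterns <= 5


cluster = [
    {"ip": f"10.1.{i}.7", "pv": 5, "user_agent": "bot/1.0"} for i in range(6)
]
print(is_group(cluster, rare_agents=set(), avg_url_patterns=10))
```

The thresholds 5, 4, 30%, 2, 3 and 5 are the example values from the embodiment and would be tuned in practice.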
By analyzing log data, IPs that access similar URLs and have similar behavior features are aggregated into a cluster, and the users corresponding to the IPs in the cluster are determined to be an IP group when the preset conditions are met. The method clusters groups more accurately from user behavior in Web logs by combining users with similar access preferences and access behavior over a period of time. This supports the accuracy of the clustering result, remains effective for low-frequency groups, and keeps the identification result interpretable and flexible through specific rule parameters.
Fig. 2 is a block diagram illustrating an apparatus for identifying IP groups, according to an example embodiment. Referring to Fig. 2, the apparatus for identifying IP groups comprises a log analysis module 201, an aggregation module 202 and a determining module 203.
The log analysis module 201 is configured to obtain, based on log data within a preset time period, the URLs accessed by each IP and N behavior features of each IP;
the aggregation module 202 is configured to aggregate all IPs into different clusters based on the URLs accessed by each IP and the N behavior features;
the determining module 203 is configured to determine, when a cluster satisfies a preset condition, that the users corresponding to the IPs in the cluster are an IP group.
The behavior features comprise: page view count (PV), number of URLs accessed, number of user-agents used, number of parameters, and number of GET/POST requests.
The aggregation module is configured for:
counting the M URLs with the highest number of visits within the preset time period;
constructing, for each IP, a bag-of-words model over the M URLs;
inputting the bag-of-words models of all IPs into an LDA model for training, and aggregating IPs that access similar URLs under the same topic;
establishing behavior feature vectors for the IPs aggregated under the same topic;
and inputting the behavior feature vectors into a K-means model, and aggregating into the same cluster those IPs, among the IPs aggregated under the same topic, that have similar behavior features.
Establishing behavior feature vectors for the IPs aggregated under the same topic comprises:
establishing the behavior feature vectors based on the IPs aggregated under the same topic and the N behavior features;
or establishing the behavior feature vectors based on the IPs aggregated under the same topic and the widened features of the N behavior features.
The other behavior features are divided by one specific behavior feature to obtain N-1 new features, and the widened features of the N behavior features are the union of the N behavior features and the N-1 new features.
Determining, when a cluster meets the preset condition, that the users corresponding to the IPs in the cluster are an IP group comprises:
when the number of IPs in a cluster is greater than 5 and the page view count of each IP is greater than 4,
if the IP users in the cluster all use the same user-agent and, when all user-agents are ranked by number of users from most to least, that user-agent falls in the last 30 percent, the IP users in the cluster are determined to be a group; or
if the IPs in the cluster share the same B segment and use the same user-agent, the IP users in the cluster are determined to be a group; or
if the ratio of the number of IPs in the cluster to the number of IPs in the cluster that share the same C segment is greater than 2, the ratio of the number of IPs in the cluster to the number of IPs in the cluster that share the same B segment is less than or equal to 3, and the average number of distinct URL patterns accessed by the IPs in the cluster is less than or equal to 5, the IP users in the cluster are determined to be a group.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 3 is a block diagram illustrating a computer device 300 for identifying IP groups, according to an example embodiment. For example, the computer device 300 may be provided as a server. Referring to Fig. 3, the computer device 300 includes a processor 301; the number of processors may be one or more as needed. The computer device 300 further comprises a memory 302 for storing instructions executable by the processor 301, such as application programs. The number of memories may be one or more as needed, and the memory may store one or more application programs. The processor 301 is configured to execute the instructions to perform the above-described method of identifying IP groups.
As will be appreciated by one skilled in the art, the embodiments herein may be provided as a method, apparatus (device), or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the medium. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, including, but not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer, and the like. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices) and computer program products according to embodiments herein. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that an article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such article or apparatus. Without further limitation, an element defined by the phrase "comprising …" does not exclude the presence of additional like elements in the article or apparatus comprising the element.
While the preferred embodiments herein have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following appended claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of this disclosure.
It will be apparent to those skilled in the art that various changes and modifications may be made herein without departing from the spirit and scope thereof. Thus, it is intended that such changes and modifications be included herein, provided they come within the scope of the appended claims and their equivalents.
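As a rough illustration of the feature-preparation steps in the embodiments (counting the M most-visited URLs, building a per-IP bag-of-words vector over them, and widening the N behavior characteristics), the following is a minimal Python sketch. The function names, the `(ip, url)` log-row layout, and the choice of division by the first characteristic (for example PV) as the way of "combining" characteristics are assumptions made for illustration only; the disclosure does not fix any of them. The LDA and K-means stages are not shown.

```python
from collections import Counter


def top_m_urls(log_rows, m):
    """Return the m most-visited URLs in the log window, most frequent first.

    log_rows is an iterable of (ip, url) pairs (hypothetical layout).
    """
    counts = Counter(url for _ip, url in log_rows)
    return [url for url, _n in counts.most_common(m)]


def bag_of_words(log_rows, vocab):
    """Map each IP to a visit-count vector over the URLs in vocab.

    Visits to URLs outside vocab are ignored.
    """
    index = {url: i for i, url in enumerate(vocab)}
    bows = {}
    for ip, url in log_rows:
        vec = bows.setdefault(ip, [0] * len(vocab))
        if url in index:
            vec[index[url]] += 1
    return bows


def widen(features, base=0):
    """Return the N features followed by N-1 derived ones (2N-1 values).

    Assumption: each non-base feature is combined with the base feature
    by division; a zero base is replaced by 1 to avoid dividing by zero.
    """
    denom = features[base] or 1
    ratios = [f / denom for i, f in enumerate(features) if i != base]
    return list(features) + ratios
```

With five behavior characteristics per IP, `widen` returns nine values: the original five plus four ratios against the chosen base characteristic.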

Claims (14)

1. A method of identifying IP groups, comprising:
acquiring, based on log data in a preset time period, the URLs accessed by each IP and N behavior characteristics of each IP;
aggregating all IPs into different clusters based on the URLs accessed by each IP and the N behavior characteristics;
and when a cluster meets a preset condition, determining that the users corresponding to the IPs in the cluster are an IP group.
2. The method of identifying IP groups according to claim 1, wherein the behavior characteristics include: number of page views (PV), number of accessed URLs, number of user-agents used, number of parameters, and number of GETTPT requests.
3. The method of identifying IP groups according to claim 2, wherein said aggregating all IPs into different clusters based on the URLs accessed by each IP and the N behavior characteristics comprises:
counting the M URLs with the largest number of visits in the preset time period;
constructing, for each IP, a bag-of-words model over the M URLs;
inputting the bag-of-words models of all IPs into an LDA model for training, and aggregating IPs that access similar URLs under the same topic;
establishing a behavior feature vector for each IP aggregated under the same topic;
inputting the behavior feature vectors into a K-means model, and aggregating, among the IPs aggregated under the same topic, the IPs with similar behavior characteristics into the same cluster.
4. The method of identifying IP groups according to claim 3, wherein said establishing a behavior feature vector for each IP aggregated under the same topic comprises:
establishing the behavior feature vector based on the IPs aggregated under the same topic and the N behavior characteristics;
or establishing the behavior feature vector based on the IPs aggregated under the same topic and the broad-dimension features of the N behavior characteristics.
5. The method of identifying IP groups according to claim 4, wherein each of the N behavior characteristics other than a specific behavior characteristic is combined with that specific behavior characteristic to obtain N-1 new characteristics, and the broad-dimension features of the N behavior characteristics are: the union of the N behavior characteristics and the N-1 new characteristics.
6. The method of identifying IP groups according to any one of claims 1-5, wherein said determining that the users corresponding to the IPs in the cluster are an IP group when the cluster meets the preset condition comprises:
when the number of IPs in a cluster is greater than 5 and the page view count per IP is greater than 4,
if the IP users in the cluster use the same user-agent and, when all user-agents are sorted by number of users from high to low, that user-agent ranks in the last 30%, determining that the IP users in the cluster are a group; or
if the B segments of the IPs in the cluster are the same and the same user-agent is used, determining that the IP users in the cluster are a group; or
if the ratio of the number of IPs in the cluster to the number of IPs in the cluster having the same C segment is greater than 2, the ratio of the number of IPs in the cluster to the number of IPs in the cluster having the same B segment is less than or equal to 3, and the average number of different URLPattern accessed by the IPs in the cluster is less than or equal to 5, determining that the IP users in the cluster are a group.
7. An apparatus for identifying IP groups, comprising:
a log analysis module, configured to acquire, based on log data in a preset time period, the URLs accessed by each IP and N behavior characteristics of each IP;
an aggregation module, configured to aggregate all IPs into different clusters based on the URLs accessed by each IP and the N behavior characteristics;
and a determining module, configured to determine that the users corresponding to the IPs in a cluster are an IP group when the cluster meets a preset condition.
8. The apparatus for identifying IP groups according to claim 7, wherein the behavior characteristics include: number of page views (PV), number of accessed URLs, number of user-agents used, number of parameters, and number of GETTPT requests.
9. The apparatus for identifying IP groups according to claim 8, wherein the aggregation module is configured for:
counting the M URLs with the largest number of visits in the preset time period;
constructing, for each IP, a bag-of-words model over the M URLs;
inputting the bag-of-words models of all IPs into an LDA model for training, and aggregating IPs that access similar URLs under the same topic;
establishing a behavior feature vector for each IP aggregated under the same topic;
inputting the behavior feature vectors into a K-means model, and aggregating, among the IPs aggregated under the same topic, the IPs with similar behavior characteristics into the same cluster.
10. The apparatus for identifying IP groups according to claim 9, wherein said establishing a behavior feature vector for each IP aggregated under the same topic comprises:
establishing the behavior feature vector based on the IPs aggregated under the same topic and the N behavior characteristics;
or establishing the behavior feature vector based on the IPs aggregated under the same topic and the broad-dimension features of the N behavior characteristics.
11. The apparatus for identifying IP groups according to claim 10, wherein each of the N behavior characteristics other than a specific behavior characteristic is combined with that specific behavior characteristic to obtain N-1 new characteristics, and the broad-dimension features of the N behavior characteristics are: the union of the N behavior characteristics and the N-1 new characteristics.
12. The apparatus for identifying IP groups according to any one of claims 7-11, wherein said determining that the users corresponding to the IPs in the cluster are an IP group when the cluster meets the preset condition comprises:
when the number of IPs in a cluster is greater than 5 and the page view count per IP is greater than 4,
if the IP users in the cluster use the same user-agent and, when all user-agents are sorted by number of users from high to low, that user-agent ranks in the last 30%, determining that the IP users in the cluster are a group; or
if the B segments of the IPs in the cluster are the same and the same user-agent is used, determining that the IP users in the cluster are a group; or
if the ratio of the number of IPs in the cluster to the number of IPs in the cluster having the same C segment is greater than 2, the ratio of the number of IPs in the cluster to the number of IPs in the cluster having the same B segment is less than or equal to 3, and the average number of different URLPattern accessed by the IPs in the cluster is less than or equal to 5, determining that the IP users in the cluster are a group.
13. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed, implements the steps of the method according to any one of claims 1-6.
14. A computer device comprising a processor, a memory and a computer program stored on the memory, characterized in that the steps of the method according to any one of claims 1-6 are implemented when the computer program is executed by the processor.
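The cluster test recited in claims 6 and 12 can be pictured with a minimal Python sketch. It covers only the size and page-view precondition and the first two alternatives (a rare shared user-agent, or a shared B segment with a shared user-agent); the third, ratio-based alternative is left out because its wording in the translation is ambiguous. Everything named here is hypothetical: the record layout, the function names, reading "B segment" as the first two octets of an IPv4 address, reading "per IP" as "every IP", and reading "last 30%" as a 1-based rank position greater than 70% of the number of distinct user-agents.

```python
def b_segment(ip):
    """First two octets of a dotted IPv4 address (assumed meaning of 'B segment')."""
    return ".".join(ip.split(".")[:2])


def is_rare_user_agent(ua, ua_user_counts, tail_percent=30):
    """True if ua falls in the last tail_percent of user-agents ranked by user count.

    ua_user_counts maps every observed user-agent to its number of users.
    Ties are broken alphabetically so the ranking is deterministic.
    """
    if ua not in ua_user_counts:
        return False
    ranked = sorted(ua_user_counts, key=lambda u: (-ua_user_counts[u], u))
    position = ranked.index(ua) + 1  # 1-based rank, most-used first
    return position * 100 > len(ranked) * (100 - tail_percent)


def is_group(cluster, ua_user_counts):
    """Apply the precondition and the first two alternatives to one cluster.

    cluster is a list of dicts with keys 'ip', 'pv' and 'user_agent'.
    """
    # Precondition: more than 5 IPs, and every IP has more than 4 page views.
    if len(cluster) <= 5 or any(row["pv"] <= 4 for row in cluster):
        return False
    # Both alternatives require a single shared user-agent.
    agents = {row["user_agent"] for row in cluster}
    if len(agents) != 1:
        return False
    ua = next(iter(agents))
    same_b = len({b_segment(row["ip"]) for row in cluster}) == 1
    return same_b or is_rare_user_agent(ua, ua_user_counts)
```

Integer arithmetic is used for the 30% cut-off so the boundary does not depend on floating-point rounding.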
CN201911109108.3A 2019-11-13 2019-11-13 Method, apparatus, medium and device for identifying IP group Pending CN112800419A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911109108.3A CN112800419A (en) 2019-11-13 2019-11-13 Method, apparatus, medium and device for identifying IP group

Publications (1)

Publication Number Publication Date
CN112800419A true CN112800419A (en) 2021-05-14

Family

ID=75803511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911109108.3A Pending CN112800419A (en) 2019-11-13 2019-11-13 Method, apparatus, medium and device for identifying IP group

Country Status (1)

Country Link
CN (1) CN112800419A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113726783A (en) * 2021-08-31 2021-11-30 北京知道创宇信息技术股份有限公司 Abnormal IP address identification method and device, electronic equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202480A (en) * 2016-07-19 2016-12-07 淮阴工学院 A kind of network behavior based on K means and LDA bi-directional verification custom clustering method
CN107800684A (en) * 2017-09-20 2018-03-13 贵州白山云科技有限公司 A kind of low frequency reptile recognition methods and device
CN109145934A (en) * 2017-12-22 2019-01-04 北京数安鑫云信息技术有限公司 User behavior data processing method, medium, equipment and device based on log
CN109271418A (en) * 2018-08-14 2019-01-25 阿里巴巴集团控股有限公司 Suspicious clique's recognition methods, device, equipment and computer readable storage medium
CN109325232A (en) * 2018-09-25 2019-02-12 北京明朝万达科技股份有限公司 A kind of user behavior exception analysis method, system and storage medium based on LDA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination