CN110324327B - User and server IP address calibration device and method based on specific enterprise domain name data


Info

Publication number
CN110324327B
Authority
CN
China
Prior art keywords
calibration
server
data
judged
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910537333.0A
Other languages
Chinese (zh)
Other versions
CN110324327A (en)
Inventor
窦禹
任彦
王一宇
薛晨
陆希玉
易立
王云荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Application filed by National Computer Network and Information Security Management Center
Priority to CN201910537333.0A
Publication of CN110324327A
Application granted
Publication of CN110324327B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/30 Managing network names, e.g. use of aliases or nicknames
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/02 Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227 Filtering policies
    • H04L63/0236 Filtering by address, protocol, port number or service, e.g. IP-address or URL
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L63/102 Entity profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a device and method for calibrating user and server IP addresses based on specific enterprise domain name data, and belongs to the technical field of communication. The device runs on a processor and comprises a data acquisition, cleaning and storage module, a flow data processing module, a domain name data processing module, a fusion calibration module, and other modules. The method collects flow data from the enterprise's private routes and domain name data from its private DNS. After cleaning and storage, it computes the IP type and confidence for each of the two data sources from the extracted IP behavior characteristics, stores the results in two separate calibration libraries, and then performs fusion calibration on the IPs in the two libraries. During traffic analysis and supervision, the calibrated types are used for traffic processing, whitelist configuration, and similar tasks. The calibration is fast enough to run in real time, the results are accurate, and they provide a solid data basis for subsequent traffic analysis and monitoring.

Description

User and server IP address calibration device and method based on specific enterprise domain name data
Technical Field
The invention relates to the technical field of communication, and in particular to an accurate and efficient device and method for calibrating IP types based on IP characteristics and behavior analysis.
Background
IP profiling (IP portrait) technology plays an increasingly prominent role in the internet industry. Traditional work on IP address attributes mostly concerns geolocation: active measurement and routing analysis are used to obtain the location attribute of an IP address. Whether an IP is a user IP or a server IP, however, is information hidden across different data sources that are correlated with one another, and it has to be extracted by data mining. Identifying whether an IP address belongs to a user or a server is essential for completing the IP profile and for implementing IP identity authentication.
The invention addresses the need to calibrate server IPs and user IPs more accurately and efficiently by jointly analyzing the flow data of enterprise private routing equipment and the domain name data of the enterprise private DNS, to automatically update the IP attribute values in the IP resource library, to complete the basic attributes of enterprise-related IPs, and to build profiles of enterprise-related IP users.
Disclosure of Invention
To meet these requirements and calibrate user and server IP addresses, the invention provides a user and server IP calibration device, calibration method and calibration library with good real-time performance and high accuracy. Feature analysis is performed on enterprise private routing equipment flow data and on private DNS domain name data to form two basic calibration methods, and the results and features of the two methods are finally fused.
The user and server IP address calibration device based on specific enterprise domain name data is characterized in that the following modules are provided on a processor:
A data acquisition module, used for acquiring the flow data of the enterprise private route and the private DNS domain name data.
A data cleaning module, used for cleaning the data acquired by the data acquisition module.
A data storage module, used for storing the cleaned data files and provided with a first calibration library, a second calibration library and a third calibration library.
A flow data processing module, used for analyzing the flow data of the enterprise private route and identifying the type of each IP according to IP behavior characteristics. Each IP behavior characteristic outputs a type and a confidence for the IP to be judged; the output confidences are weighted and summed to form a first calibration result, which is stored in the first calibration library. The IP types are user IP and server IP.
A domain name data processing module, used for analyzing the domain name data of the enterprise private DNS and identifying the type of each IP according to IP behavior characteristics. Each IP behavior characteristic outputs a type and a confidence for the IP to be judged; the output confidences are weighted and summed to form a second calibration result, which is stored in the second calibration library.
A fusion calibration module, used for IPs to be judged that have calibration results in both the first and the second calibration library. It either weights and sums the two calibration results to determine the type and confidence of the IP to be judged and stores them in the third calibration library; or it selects IP behavior characteristics and their outputs for the IP to be judged, weights and sums the output confidences, determines the type and confidence, and stores them in the third calibration library. The selected IP behavior characteristics include, for the flow data, the ratio of uplink to downlink traffic of the IP, the ratio of its out-degree to its in-degree, and its active-time ratio within a set period; and, for the domain name data, whether the IP address is bound to a domain name. For every IP stored in the third calibration library, the calibration results stored in the first and second calibration libraries are deleted.
A flow analysis and supervision module, which reads the IP types from the three calibration libraries, sets a whitelist according to the IP types, and filters the network traffic.
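The whitelist step of the flow analysis and supervision module might be sketched as follows. This is only an illustration under stated assumptions: each calibration library is modelled as a dict mapping an IP to a `(type, confidence)` tuple, and the policy (whitelist every IP calibrated as a server above a hypothetical `min_conf` threshold, and skip any flow touching a whitelisted IP) is an assumption, not something the text specifies.

```python
def build_whitelist(libraries, trusted_type="server", min_conf=0.8):
    """Collect IPs of the trusted type from all calibration libraries.

    `libraries` is an iterable of dicts: ip -> (ip_type, confidence).
    `trusted_type` and `min_conf` are hypothetical policy parameters.
    """
    return {
        ip
        for lib in libraries
        for ip, (ip_type, conf) in lib.items()
        if ip_type == trusted_type and conf >= min_conf
    }


def split_traffic(flows, whitelist):
    """Split flows into whitelisted ones (skipped) and ones kept for analysis.

    Each flow is a dict with the hypothetical keys src_ip and dst_ip.
    """
    skipped, kept = [], []
    for flow in flows:
        if flow["src_ip"] in whitelist or flow["dst_ip"] in whitelist:
            skipped.append(flow)
        else:
            kept.append(flow)
    return skipped, kept
```

Because fused IPs are deleted from the first and second libraries, each IP appears in at most one library, so the order in which the libraries are read does not matter in this sketch.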
The method for calibrating user and server IP addresses based on specific enterprise domain name data mainly comprises the following three steps.
Step 1: calibrate user and server IP addresses based on enterprise private routing equipment flow data.
NETFLOW flow data is sampled in real time from the enterprise private routing equipment, and each IP address is judged to be a server IP address or a user IP address according to IP behavior characteristics.
The IP behavior characteristics comprise: (1) whether the stable ports of the IP include a well-known service port: a server IP opens well-known service ports, while a user IP does not; (2) the ratio of outgoing bytes to incoming bytes of the IP to be judged in a fixed time period: for a server IP the outgoing byte count exceeds the incoming byte count, while for a user IP the incoming byte count exceeds the outgoing byte count; (3) the ratio of well-known-port traffic to dynamic-port traffic of the IP to be judged: server IP traffic is concentrated on well-known ports, while user IP traffic is concentrated on dynamic ports; (4) the number of corresponding source IPs when the IP to be judged is the destination IP: this number is large for a server IP and small for a user IP.
For each IP behavior characteristic, confidences that the IP to be judged is a user IP and a server IP are output; the outputs are weighted and summed to form the first calibration result, which is updated in the first calibration library.
Step 2: calibrate user and server IP addresses based on the domain name data of the enterprise private DNS.
Server IPs and user IPs are distinguished according to IP behavior characteristics in the collected private DNS domain name data.
The IP behavior characteristics in step 2 comprise: (1) whether port 53 is among the stable ports of the IP: a domain name resolution server IP opens port 53, while a user IP does not; (2) the ratio of the numbers of packets in which the IP to be judged appears as source IP and as destination IP, in DNS request traffic and in DNS response traffic, within a fixed time period: a domain name resolution server IP appears mostly as destination IP in DNS request traffic and mostly as source IP in DNS response traffic, while a user IP appears only as source IP in DNS request traffic and as destination IP in DNS response traffic; (3) the number of distinct source IPs when the IP to be judged is the destination IP: this number is large for a domain name resolution server IP and small for a user IP.
For each IP behavior characteristic, confidences that the IP to be judged is a user IP and a server IP are output; the outputs are weighted and summed to form the second calibration result, which is updated in the second calibration library.
Step 3: fusion calibration of user and server IP addresses. Fusion calibration is performed on IPs that have calibration results from both step 1 and step 2. An IP with a calibration result from only step 1 or only step 2 is not fused, and that single calibration result is taken as final.
For the IP to be judged, the first calibration result and the second calibration result obtained in steps 1 and 2 are weighted and summed to obtain its type; both calibration results are expressed as the confidence that the IP to be judged is a server IP.
Alternatively, outputs of IP behavior characteristics used in steps 1 and 2 are selected and weighted and summed to obtain the type of the IP to be judged; each characteristic output is expressed as the confidence that the IP to be judged is a server IP.
After the type of the IP to be judged is determined, a type identifier is added to the recorded IP information, and traffic is analyzed and monitored according to the type identifier.
The advantages and positive effects of the invention are as follows. The invention can mark whether an IP is a server IP or a user IP and thereby complete the IP profile. The marked results can be used to filter whitelisted traffic and to analyze suspicious traffic in depth, serving enterprise tasks such as traffic supervision and firewall configuration. In terms of enterprise data processing performance, processing and analyzing uncalibrated IPs consumes resources when traffic volumes are very large; the calibration results avoid this waste and speed up data processing. The judgment involves little computation and is fast, so it can be made in real time; because IP behavior characteristics are taken into account, the calibration results are accurate and provide a solid data basis for traffic analysis and monitoring.
Drawings
FIG. 1 is a schematic diagram of the pre-processing of collected data in the present invention;
FIG. 2 is a schematic diagram of user and server IP address calibration based on enterprise private routing device flow data in accordance with the present invention;
FIG. 3 is a schematic diagram of user and server IP address calibration based on enterprise private DNS domain name data in accordance with the present invention;
FIG. 4 is a schematic diagram of user and server IP address fusion calibration;
FIG. 5 is a diagram of a calibration result statistical query and verification module according to an embodiment of the present invention.
Detailed Description
The following describes implementation of the technical solution of the present invention with reference to the drawings and embodiments.
The user and server IP address calibration method of the invention mainly comprises a data preprocessing stage, user and server IP address calibration based on enterprise private routing equipment flow data, user and server IP address calibration based on enterprise private DNS domain name data, and fusion calibration of user and server IP addresses.
As shown in fig. 1, data preprocessing comprises data acquisition, data cleaning, and data storage. Data acquisition covers two parts, enterprise flow data and domain name data. It relies on the enterprise's private equipment and data: dedicated equipment copies the enterprise private route and DNS traffic directly to the processor.
Because the enterprise's private data contains a huge amount of information, it must be cleaned according to the actual data requirements of the invention: cleaning rules are configured so that the useful information in the flow data is retained. The cleaned data is written to a Kafka message queue cluster in real time to await further processing. In the implementation of the invention, a separate topic is created for the data source of each method, and the retention time of messages in the Kafka message queue is set to 1 day.
To improve the utilization of the cleaned data and to make backtracking and searching easier, the implementation stores the cleaned data in files that are loaded into the Hive data warehouse in batches at regular intervals. The user and server IP calibration results of the three methods (flow calibration, domain name calibration, and fusion calibration) are also stored in separate tables in the Hive data warehouse for subsequent analysis, display, and accuracy verification.
As shown in fig. 2, the enterprise private routing device flow data stored in the data warehouse is processed to calibrate the user and server IP addresses.
A data stream is an ordered sequence of bytes with a start and an end, representing a sequence of digitally encoded signals used to transmit information. Based on analysis of the NETFLOW flow data sampled in real time by the enterprise private routing equipment, the invention calibrates the user/server attribute of the IPs in the flow data, that is, it judges each IP to be a server IP or a user IP.
Server IPs and user IPs have different behavioral characteristics, which cause them to appear differently in network traffic. Therefore, by analyzing the flow data of an IP, the behavior characteristics of different IPs can be told apart, and the IP can be classified as a server IP or a user IP.
As shown in fig. 2, features that can distinguish user IP addresses from server IP addresses are first constructed from the NETFLOW flow data; the features are then combined by weighted voting to calibrate each IP as user or server; finally, the first calibration library of user and server IPs, which stores the IP calibration results, is updated.
The distinguishing features of the server IP address and the user IP address include the following.
(1) Whether the stable port of the IP has a known service port.
Well-known port numbers are those reserved by the Internet Corporation for Assigned Names and Numbers (ICANN) for use by the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP). The well-known port numbers are 0-1023. Some common well-known port numbers and their corresponding applications are listed below:
[Table (images BDA0002101514950000041 and BDA0002101514950000051 in the original): common well-known port numbers and their corresponding applications]
Since a server IP needs to provide services externally, it usually opens a well-known port, such as port 80 or 443 for a web server or port 110 for a mail server. A user IP is typically a dynamic IP and does not need to provide services externally, so it typically does not open a well-known port.
(2) The ratio of the number of outgoing bytes to the number of incoming bytes of the IP in a fixed time period, namely the ratio of the IP's uplink traffic to its downlink traffic. A server IP provides services externally, so it generally has a large amount of outgoing traffic; examples are web servers and FTP servers. A user IP is the reverse: it receives service traffic, so its incoming traffic is large. The ratio of outgoing to incoming bytes can therefore distinguish a server IP from a user IP.
(3) The ratio of the IP's low-port (well-known port) traffic to its high-port (dynamic port) traffic. The traffic of a server IP is essentially concentrated on well-known ports, i.e. low ports, while user IP traffic is concentrated on dynamic ports, i.e. high ports.
(4) The number of corresponding source IPs when the IP is the destination IP. When a server IP is the destination IP, the corresponding source IPs are user IPs, and because the server serves IPs across the whole network, the number of source IPs is large. When a user IP is the destination IP, the corresponding source IPs are server IPs, and because a user accesses a limited set of services, the number of source IPs is small.
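The four flow features above could be computed from flow records roughly as follows. This is a minimal sketch, not the patent's implementation: the record keys (`src_ip`, `dst_ip`, `src_port`, `dst_port`, `bytes`), the function name, and the rule that a port counts as "stable" once it appears in at least `stable_min_count` records are all assumptions made for illustration.

```python
from collections import Counter

WELL_KNOWN_MAX = 1023  # ports 0-1023 are the well-known range named in the text


def flow_features(records, ip, stable_min_count=2):
    """Compute the four flow features for one IP.

    `records` is an iterable of dicts with the hypothetical keys
    src_ip, dst_ip, src_port, dst_port and bytes.
    """
    out_bytes = in_bytes = 0
    known_bytes = dynamic_bytes = 0
    port_counts = Counter()  # how often the IP used each of its own ports
    sources = set()          # distinct peers that sent traffic to the IP

    for r in records:
        if r["src_ip"] == ip:
            out_bytes += r["bytes"]
            own_port = r["src_port"]
        elif r["dst_ip"] == ip:
            in_bytes += r["bytes"]
            own_port = r["dst_port"]
            sources.add(r["src_ip"])
        else:
            continue  # record does not involve this IP
        port_counts[own_port] += 1
        if own_port <= WELL_KNOWN_MAX:
            known_bytes += r["bytes"]
        else:
            dynamic_bytes += r["bytes"]

    # Assumption: a port is "stable" if the IP used it in at least
    # `stable_min_count` records.
    stable_ports = {p for p, n in port_counts.items() if n >= stable_min_count}
    return {
        "has_well_known_port": any(p <= WELL_KNOWN_MAX for p in stable_ports),
        "out_in_byte_ratio": out_bytes / max(in_bytes, 1),
        "known_dynamic_ratio": known_bytes / max(dynamic_bytes, 1),
        "distinct_sources": len(sources),
    }
```

The `max(..., 1)` guards only avoid division by zero; how each raw feature value is turned into a confidence is left open here, as the text does not specify it.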
Each feature alone can judge whether an IP is a user IP or a server IP, but only with moderate accuracy. The classification should integrate as many of the features as possible, and the invention integrates them by weighted voting. The weighted voting formula is as follows:
$$H(x) = \arg\max_{j} \sum_{i=1}^{T} \omega_i \, h_i^j(x)$$
where $x$ represents the IP address to be detected, $H(x)$ represents the calibration result of $x$, $T$ is the number of features, $\omega_i$ is the weight of feature $i$, and $j$ denotes the usage of the IP, namely user IP or server IP. $h_i^j(x)$ is the class output of feature $i$ for $x$ with respect to usage $j$; $h_i^j(x) \in [0,1]$ represents the confidence that the IP to be detected is of type $j$. Under each feature $i$, the confidences that the IP belongs to each of the two categories are computed, and the category output is determined by comparing them. The final output $H(x)$ is the IP usage with the largest weighted vote. $H(x)$ carries a value in (0,1) indicating the type and confidence of the IP to be detected, and the IP usage with the higher confidence is obtained by the argmax function.
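The weighted vote can be illustrated with a short sketch. It assumes each feature reports a confidence in [0,1] for both classes, and it divides by the sum of the weights so that the winning score stays in [0,1]; that normalization, the class names, and the function name are assumptions for illustration, not taken from the text.

```python
def weighted_vote(confidences, weights):
    """Combine per-feature class confidences by weighted voting.

    `confidences` holds one dict per feature: {"user": c_u, "server": c_s}.
    `weights` holds one non-negative weight per feature.
    Returns (winning_class, normalized_weighted_score).
    """
    if len(confidences) != len(weights):
        raise ValueError("need exactly one weight per feature")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    scores = {
        j: sum(w * h[j] for w, h in zip(weights, confidences)) / total
        for j in ("user", "server")
    }
    winner = max(scores, key=scores.get)  # argmax over the two classes
    return winner, scores[winner]
```

For example, four features that mostly favour "server" and carry weights 0.4, 0.2, 0.1 and 0.3 yield the label "server" together with its weighted score.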
Since a server IP may be replaced and most user IPs are dynamic, the usage of an IP may change. The method therefore has to run on the latest NETFLOW flow data, and after the IP addresses are calibrated it updates the labels in the user/server IP address library stored in the database.
As shown in fig. 3, user and server IP addresses are calibrated based on enterprise private DNS domain name data.
The Domain Name System (DNS) is an important piece of internet service infrastructure, used primarily to translate domain names into IP addresses. Before a user accesses an internet application, the domain name must be resolved to an IP address through the domain name system. Based on the domain name data collected from the enterprise private DNS, the invention calibrates the user and server IP addresses in the domain name access data, judging each IP to be a domain name resolution server IP or a user IP.
Server IPs and user IPs have different behavior characteristics, which cause them to appear differently in domain name access traffic. Therefore, by analyzing the domain name access data of an IP, the behavior characteristics of different IPs can be told apart, and the IP can be classified as a domain name resolution server IP or a user IP.
As shown in fig. 3, calibration from enterprise private DNS domain name monitoring data proceeds as follows: features that can distinguish user usage from domain name resolution server usage are first constructed; the features are then combined to calibrate each IP as user or server; finally, the second calibration library of user and server IPs is updated.
The following lists some typical features that can directly distinguish the usage of the user and the domain name server:
(1) Whether port 53 is among the stable ports of the IP. A domain name server provides domain name resolution externally and usually needs to open port 53. A user IP is typically a dynamic IP and does not need to provide services externally, so it typically does not open port 53.
(2) The ratio of the number of packets in which the IP to be judged appears as source IP to the number in which it appears as destination IP, in DNS request traffic and in DNS response traffic, within a fixed time period.
An enterprise private DNS resolution server IP provides service externally and also has to send DNS requests to other resolvers for domain names for which it is not authoritative. It therefore appears mostly as destination IP in DNS request traffic and mostly as source IP in DNS response traffic, whereas a user IP appears essentially only as source IP in DNS request traffic and as destination IP in DNS response traffic.
(3) The number of distinct source IPs when the IP is the destination IP.
When a domain name resolution server IP is the destination IP, the corresponding source IPs are user IPs, and because the server serves IPs across the whole network or a region of it, the number of source IPs is large. When a user IP is the destination IP, the corresponding source IPs are server IPs, and because a user accesses a limited set of services, the number of source IPs is small.
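The three DNS features above might be extracted along these lines. The packet keys (`src_ip`, `dst_ip`, `src_port`, `dst_port`, `is_response`), the function name, and the single "server role share" that summarizes feature (2) are assumptions for illustration; the sketch also treats any observed use of port 53 as open, without modelling port stability.

```python
def dns_features(packets, ip, dns_port=53):
    """Compute three DNS-based features for one IP.

    `packets` is an iterable of dicts with the hypothetical keys
    src_ip, dst_ip, src_port, dst_port and is_response (bool).
    """
    req_as_src = req_as_dst = resp_as_src = resp_as_dst = 0
    sources = set()
    opens_dns_port = False

    for p in packets:
        if p["src_ip"] == ip:
            if p["is_response"]:
                resp_as_src += 1
                # Answering from port 53 means the IP serves DNS there.
                opens_dns_port = opens_dns_port or p["src_port"] == dns_port
            else:
                req_as_src += 1
        elif p["dst_ip"] == ip:
            sources.add(p["src_ip"])
            if p["is_response"]:
                resp_as_dst += 1
            else:
                req_as_dst += 1
                # Receiving queries on port 53 also indicates a DNS service.
                opens_dns_port = opens_dns_port or p["dst_port"] == dns_port

    server_role = req_as_dst + resp_as_src  # typical for a resolver
    client_role = req_as_src + resp_as_dst  # typical for a user
    total = server_role + client_role
    return {
        "opens_port_53": opens_dns_port,
        "server_role_share": server_role / total if total else 0.0,
        "distinct_sources": len(sources),
    }
```

A resolver that also forwards queries upstream ends up with a share below 1 but well above 0, which matches the text's observation that server IPs appear in both roles while user IPs appear essentially only in the client role.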
The above are typical statistical features that can distinguish IP usage patterns; other features can also do so and are not described in detail here. Each of the above features alone can judge whether an IP is a user IP or a domain name server IP, but only with moderate accuracy. The classification should integrate as many of the features as possible, and the invention integrates them by weighted voting. The weighted voting formula is as follows:
$$H(x) = \arg\max_{j} \sum_{i=1}^{T} \omega_i \, h_i^j(x)$$
where $x$ represents the IP to be detected, $H(x)$ represents the calibration result of $x$, $T$ is the number of features, $\omega_i$ is the weight of feature $i$, and $j$ denotes the usage of the IP, namely user IP or server IP. $h_i^j(x)$ is the class output of feature $i$ for $x$ with respect to usage $j$; $h_i^j(x) \in [0,1]$ represents the confidence that the IP to be detected is of type $j$. Under each feature $i$, the confidences of belonging to each of the two categories are computed to determine the category output. The final output $H(x)$ is the IP usage with the largest weighted vote. $H(x)$ carries a value in (0,1) indicating the type and confidence of the IP to be detected, and the IP usage with the higher confidence is obtained by the argmax function.
As shown in fig. 4, for IPs that appear in both the flow data analysis results and the DNS data analysis results, the method further performs fusion calibration of user and server IP addresses. The two analyses above classify an IP address from two angles, namely its access relations in the enterprise flow data and its request relations as a source IP address in the enterprise DNS data, to determine whether it belongs to a user or a server. The method then fuses the information from the two analyses and cross-checks their classification results to improve classification accuracy. The model used for information fusion and the method for selecting its parameters are described below.
The information fusion model is built on two levels, as shown in fig. 4: on one level, the results of the flow data analysis and the DNS data analysis are fused; on the other, the features used in the two analyses are fused.
First, the analysis results of the flow data analysis and the DNS data analysis are fused. Each of the two analyses described above yields, for every IP address, a value in (0,1) indicating the type it assigns to that IP address and its confidence. A score of 0.99, for example, indicates that the IP address is considered to belong to a server with high confidence. Either analysis can be regarded as a function $h_A$ whose input is an IP address and whose output $y_A$ is a real number in (0,1):
$$y_A = h_A(\mathrm{IP})$$
Suppose an IP address appears in both analysis results. Let $h_F(\mathrm{IP})$ denote the function based on flow data analysis and $h_D(\mathrm{IP})$ the function based on DNS domain name data analysis. Analyzing the IP by the above methods yields two values between 0 and 1, $y_F$ and $y_D$; for convenience of calculation, both are taken as the confidence that the IP is a server IP. The two outputs can be fused by linear combination:
$$y_A = h_A(\mathrm{IP}) = w_F \cdot h_F(\mathrm{IP}) + w_D \cdot h_D(\mathrm{IP})$$
where $w_F$ and $w_D$ are adjustable weight parameters with $w_F + w_D = 1$, representing the relative importance of the two results. The determination of these two parameters is further described below. $y_A$ is the fused classification result: the closer it is to 1, the higher the confidence that the IP belongs to a server.
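The result-level fusion can be sketched in a few lines. The sketch assumes each calibration library is a dict mapping an IP to its server confidence, and the default weights of 0.5 are placeholders, since the text defers the choice of the two weights; the function name is likewise hypothetical.

```python
def fuse_results(lib_flow, lib_dns, w_flow=0.5, w_dns=0.5):
    """Linearly combine server confidences for IPs present in both libraries.

    `lib_flow` and `lib_dns` map ip -> server confidence in (0, 1).
    The two weights must sum to 1; their default values are placeholders.
    """
    if abs(w_flow + w_dns - 1.0) > 1e-9:
        raise ValueError("w_flow and w_dns must sum to 1")
    fused = {}
    for ip in lib_flow.keys() & lib_dns.keys():  # only IPs seen by both analyses
        fused[ip] = w_flow * lib_flow[ip] + w_dns * lib_dns[ip]
    return fused
```

IPs that appear in only one library are deliberately left out of the result, matching the rule that a single available calibration result is taken as final without fusion.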
Second, the features used within the flow data analysis and the DNS data analysis are fused. Both analyses examine the data from several different angles and classify IP addresses according to characteristic indicators. For example, the flow data analysis computes, for each IP address, indicators such as its in-degree and out-degree within a given period and the ratio of its uplink traffic to its downlink traffic. The invention combines all indicators considered by the two analysis methods for joint classification, so as to achieve a better classification effect. The advantage of this framework is that weights can be assigned more finely and accurately, so that the model is more accurate and classifies better.
The indicators fall into two categories: Boolean indicators and numerical indicators. A Boolean indicator expresses the presence or absence of a feature, for example whether an IP address is bound to a domain name, which can be derived from the DNS data. A numerical indicator is continuous and can be compared by magnitude, for example the ratio of uplink traffic to downlink traffic of an IP address within a given period, which can be computed from the flow data. Each indicator can be understood as a function of the IP address, written $I_k = f_k(\mathrm{IP})$. Several such indicators are extracted from the flow data and the DNS data respectively and then combined linearly as follows:
$$I = \sum_{k} w_k I_k$$
where $k$ is the feature index and $w_k$ is the weight of the corresponding indicator $I_k$; the method for determining the weights is described further below. Because each indicator outputs 0, 1 or an arbitrary real number, $I$ ranges over all real numbers. To map this range into (0, 1) so that it expresses the confidence that the IP belongs to a server, a Logistic function is applied to $I$, as follows:
$$y = \frac{1}{1 + e^{-I}}$$
This function compresses any real-valued input into the interval (0, 1), so its output can represent the confidence of the model's classification. The larger the output value, the higher the probability that the IP address is a server.
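The weighted combination of indicators followed by the Logistic mapping can be sketched in Python as below. The function name `server_confidence` and the absence of a bias term are assumptions; the text describes only a weighted sum of indicators passed through a Logistic function.

```python
import math
from typing import Sequence


def server_confidence(indicators: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Weighted sum of indicators, squashed toward (0, 1) by the logistic function."""
    if len(indicators) != len(weights):
        raise ValueError("need exactly one weight per indicator")
    i = sum(w * x for w, x in zip(weights, indicators))
    # Numerically stable logistic: never exponentiates a large positive number.
    if i >= 0:
        return 1.0 / (1.0 + math.exp(-i))
    e = math.exp(i)
    return e / (1.0 + e)
```

A weighted sum of zero maps to a confidence of exactly 0.5, which is the natural decision boundary between the server class and the user class.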
Examples of indicators available to the model from the flow data and the DNS data are listed below; in practice, further indicators will continue to be extracted in the hope of achieving better classification.
The flow data indicators include:
(1) the ratio of uplink traffic to downlink traffic of the IP address under test within a given period;
(2) the ratio of the out-degree to the in-degree of the IP address under test within a given period, that is, the ratio of its appearances as a source IP to its appearances as a destination IP;
(3) the proportion of time during which the IP address under test is active within a given period.
The DNS data indicators include:
(1) whether the IP address under test is bound to a domain name.
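As a rough illustration of how the three flow indicators might be computed, the following Python sketch works on a hypothetical `FlowRecord` structure. Several details are assumptions not fixed by the text: "uplink" is taken to mean bytes sent by the IP, the degrees are counted as distinct peers, activity is measured in fixed time slots, and add-one smoothing keeps ratios finite.

```python
import math
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class FlowRecord:  # hypothetical record layout
    src: str       # source IP
    dst: str       # destination IP
    n_bytes: int   # bytes carried by the flow
    ts: float      # flow start time in seconds


def _smoothed_ratio(a: float, b: float) -> float:
    # Add-one smoothing (an assumption) keeps the ratio finite when b == 0.
    return (a + 1.0) / (b + 1.0)


def flow_indicators(ip: str, records: Iterable[FlowRecord],
                    start: float, end: float,
                    slot: float = 60.0) -> Dict[str, float]:
    """Compute the three flow indicators for `ip` over the window [start, end)."""
    sent = received = 0
    dst_peers, src_peers, active_slots = set(), set(), set()
    for r in records:
        if not (start <= r.ts < end) or ip not in (r.src, r.dst):
            continue
        active_slots.add(int((r.ts - start) // slot))
        if r.src == ip:   # the IP acts as source: uplink bytes, out-degree
            sent += r.n_bytes
            dst_peers.add(r.dst)
        if r.dst == ip:   # the IP acts as destination: downlink bytes, in-degree
            received += r.n_bytes
            src_peers.add(r.src)
    total_slots = max(1, math.ceil((end - start) / slot))
    return {
        "up_down_ratio": _smoothed_ratio(sent, received),
        "out_in_degree_ratio": _smoothed_ratio(len(dst_peers), len(src_peers)),
        "active_time_ratio": len(active_slots) / total_slots,
    }


records = [
    FlowRecord("10.0.0.5", "10.0.0.9", 9000, 0.0),
    FlowRecord("10.0.0.7", "10.0.0.5", 500, 10.0),
    FlowRecord("10.0.0.8", "10.0.0.5", 499, 70.0),
]
ind = flow_indicators("10.0.0.5", records, start=0.0, end=180.0)
```

The Boolean DNS indicator needs no computation beyond a lookup, so it is omitted here; it would enter the model as 0 or 1.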
In both models for fusing information from the different levels, determining the weight parameters is critical: it directly decides whether the fusion improves classification accuracy. Two methods for determining the weights are described below, together with their advantages and disadvantages.
The first weight determination method is the heuristic method. In the heuristic method the weights are distributed evenly, that is, all terms in the fusion model have equal weight. Its advantages are that it requires no significant extra computation and is highly efficient. Its disadvantage is that the weights may differ greatly from the optimal values, in which case the model performs poorly. The heuristic method is better suited to fusing the results of the flow data analysis and the DNS data analysis. Fusing in this way is equivalent to letting the two methods cast equal votes and jointly decide the class of an IP.
The second weight determination method is the learning method. When the features used within the flow data analysis and the DNS data analysis are fused, the heuristic method is unsuitable, because at this level the indicators drawn from the flow data and the DNS data inevitably differ greatly in importance. Simple equal weighting would give a poor classification effect. The invention therefore proposes to determine these weights by a learning method.
From the user IP address resource library and the server IP address resource library, a batch of IP addresses of known type can be obtained, and the corresponding indicators can be computed by analyzing the traffic and DNS accesses related to these addresses. Combining the computed indicators with the corresponding IP types yields a training data set. An optimal set of weights is then sought on this training set so that the classification accuracy of the fused model is highest. Once training is complete the model parameters are fixed, and samples outside the training set, i.e. the IPs to be detected, can be classified with the learned weights.
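The learning method can be illustrated with a small, dependency-free Python sketch. The text does not name a training algorithm at this point, so batch gradient descent on the logistic loss is used here purely as an assumption; the function names, hyperparameters and toy data are likewise hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (indicators, label): 1 = server, 0 = user


def _sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _score(x: Sequence[float], weights: Sequence[float]) -> float:
    return sum(w * v for w, v in zip(weights, x))


def train_weights(samples: Sequence[Sample],
                  epochs: int = 2000, lr: float = 0.5) -> List[float]:
    """Fit one weight per indicator by batch gradient descent on the log loss."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for x, label in samples:
            err = _sigmoid(_score(x, weights)) - label
            for k, v in enumerate(x):
                grad[k] += err * v
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad)]
    return weights


def classify(x: Sequence[float], weights: Sequence[float]) -> int:
    """Return 1 (server) if the fused confidence is at least 0.5, else 0 (user)."""
    return 1 if _sigmoid(_score(x, weights)) >= 0.5 else 0


# Toy training set: [domain-bound flag, centered traffic-ratio indicator].
training = [
    ([1.0, 2.0], 1), ([1.0, 1.5], 1), ([0.0, 1.0], 1),
    ([0.0, -1.0], 0), ([0.0, -2.0], 0), ([1.0, -1.5], 0),
]
learned = train_weights(training)
```

After training, `classify(x, learned)` can be applied to indicator vectors of IPs that were not in the training set, which corresponds to classifying the IPs to be detected.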
As shown in fig. 5, after the calibration library has been obtained by fusion, the invention further provides a web interface for analysis and display, which counts the total numbers of generated server IPs and user IPs in real time and provides a batch query function for use in accuracy verification. During accuracy verification, a batch of IPs whose user/server attribute is already known is uploaded to the system through the web project, a comparative analysis is carried out, and the accuracy of the results is output.
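The comparative analysis used for accuracy verification can be sketched as follows. The function name `verify_accuracy`, the dictionary representation of the calibration library, and the choice to skip uploaded IPs that are absent from the library are all assumptions made for illustration.

```python
from typing import Dict


def verify_accuracy(calibrated: Dict[str, str],
                    ground_truth: Dict[str, str]) -> float:
    """Fraction of uploaded known-type IPs whose calibrated type matches."""
    # Only IPs present in the calibration library can be compared (assumption).
    checked = [ip for ip in ground_truth if ip in calibrated]
    if not checked:
        raise ValueError("no uploaded IP is present in the calibration library")
    correct = sum(calibrated[ip] == ground_truth[ip] for ip in checked)
    return correct / len(checked)


calibrated = {"192.0.2.1": "server", "192.0.2.2": "user", "192.0.2.3": "server"}
uploaded = {"192.0.2.1": "server", "192.0.2.2": "server", "192.0.2.9": "user"}
accuracy = verify_accuracy(calibrated, uploaded)
```

In this toy batch two of the three uploaded IPs are found in the library and one of them matches, so the reported accuracy is one half.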
Correspondingly, the invention provides a user and server IP address calibration device based on specific enterprise domain name data, comprising a data acquisition module, a data cleaning module, a data storage module, a flow data processing module, a domain name data processing module, a fusion calibration module, a result storage and display module, a flow analysis and supervision module, and the like.
The data acquisition module is used for acquiring the flow data of the enterprise private routes and the domain name data of the private DNS. Traffic from the enterprise private routes and the DNS is replicated to the processing program by a dedicated traffic device.
The data cleaning module is used for cleaning the acquired data and retaining the useful fields, such as source IP, source port, destination IP, destination port and timestamp.
The data storage module is a Hive data warehouse used for storing the cleaned data files. It provides a first calibration library, a second calibration library and a third calibration library for the flow data processing module, the domain name data processing module and the fusion calibration module respectively, in which the IP calibration results are stored. The cleaned data is stored in vectorized form and fed in real time to the flow data processing module and the domain name data processing module for processing.
The flow data processing module is used for analyzing the flow data of the enterprise private routes, identifying the type of an IP according to IP behavior characteristics, outputting the type and confidence of the IP to be judged for each IP behavior characteristic, performing a weighted summation of the output confidences as a first calibration result, and storing the first calibration result in the first calibration library.
The domain name data processing module is used for analyzing the domain name data of the enterprise private DNS, identifying the type of an IP according to IP behavior characteristics, outputting the type and confidence of the IP to be judged for each IP behavior characteristic, performing a weighted summation of the output confidences as a second calibration result, and storing the second calibration result in the second calibration library. The IP behavior characteristics used by the flow data processing module and the domain name data processing module are as described in the method above.
The fusion calibration module is used for performing a weighted summation of the calibration results that the IP to be judged has in the first calibration library and the second calibration library, determining the type and confidence of the IP to be judged, and storing them in the third calibration library. Alternatively, the fusion calibration module selects IP behavior characteristics and their outputs for the IP to be judged, performs a weighted summation of the output confidences, determines the type and confidence of the IP to be judged, and stores them in the third calibration library; the selected IP behavior characteristics are as described in the method above.
The flow analysis and supervision module reads the IP types from the third calibration library, sets a white list according to the IP types, and filters network traffic.
In the embodiment of the invention, the cleaned data in the Hive library is converted into RDDs in real time, association analysis and preprocessing are carried out with Spark SQL and Spark Streaming, and the preprocessed result set is held temporarily in memory for machine learning with Spark MLlib. Some known user/server IPs are used as the result set, and a scikit-learn decision tree algorithm is used to calculate the weight of each feature. The IP address sets analyzed by the calibration method based on enterprise private routing equipment flow data and by the calibration method based on enterprise private DNS domain name data intersect but do not coincide completely; that is, a given IP address may be analyzed by both methods or by only one. Information fusion is performed on IPs that appear in both the flow data analysis results and the DNS data analysis results. If an IP appears in only one of them, no information fusion is performed and that analysis result is used directly. The data is fused with Spark, and the fused results are likewise stored in the Hive library.
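The handling of the two partially overlapping IP sets can be sketched in plain Python. This is only an in-memory illustration with hypothetical names (`fuse_libraries`, dictionaries mapping an IP to its server confidence); the embodiment itself performs the fusion with Spark and stores the results in Hive.

```python
from typing import Dict


def fuse_libraries(flow_lib: Dict[str, float], dns_lib: Dict[str, float],
                   w_f: float = 0.5) -> Dict[str, float]:
    """Merge two calibration libraries that map IP -> server confidence."""
    fused: Dict[str, float] = {}
    for ip in flow_lib.keys() | dns_lib.keys():
        if ip in flow_lib and ip in dns_lib:
            # The IP was analyzed by both methods: weighted fusion.
            fused[ip] = w_f * flow_lib[ip] + (1.0 - w_f) * dns_lib[ip]
        elif ip in flow_lib:
            fused[ip] = flow_lib[ip]  # flow result only: used directly
        else:
            fused[ip] = dns_lib[ip]   # DNS result only: used directly
    return fused


flow_lib = {"10.0.0.1": 0.9, "10.0.0.2": 0.2}
dns_lib = {"10.0.0.2": 0.4, "10.0.0.3": 0.8}
third_lib = fuse_libraries(flow_lib, dns_lib)
```

Only `10.0.0.2` appears in both inputs, so it is the only entry whose confidence is a weighted combination; the other two entries are copied unchanged.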

Claims (6)

1. A user and server IP address calibration device based on specific enterprise domain name data, wherein the following component modules are provided on a processor:
a data acquisition module, used for acquiring flow data of the enterprise private routes and domain name data of the private DNS;
a data cleaning module, used for cleaning the data acquired by the data acquisition module;
a data storage module, used for storing the cleaned data files and provided with a first calibration library, a second calibration library and a third calibration library;
a flow data processing module, used for analyzing the flow data of the enterprise private routes, identifying the type of an IP according to IP behavior characteristics, outputting the type and confidence of the IP to be judged for each IP behavior characteristic, performing a weighted summation of the output confidences as a first calibration result, and storing the first calibration result in the first calibration library; the types of IP being user IP and server IP;
a domain name data processing module, used for analyzing the domain name data of the enterprise private DNS, identifying the type of an IP according to IP behavior characteristics, outputting the type and confidence of the IP to be judged for each IP behavior characteristic, performing a weighted summation of the output confidences as a second calibration result, and storing the second calibration result in the second calibration library;
a fusion calibration module, used for performing a weighted summation of the two calibration results of an IP to be judged that has calibration results in both the first calibration library and the second calibration library, determining the type and confidence of the IP to be judged, storing them in the third calibration library, and deleting the calibration results stored in the first calibration library and the second calibration library; or, the fusion calibration module selects IP behavior characteristics and their outputs for the IP to be judged, performs a weighted summation of the output confidences, determines the type and confidence of the IP to be judged, and stores them in the third calibration library; the selected IP behavior characteristics include: for the flow data, the ratio of uplink traffic to downlink traffic of the IP under test, the ratio of the out-degree to the in-degree of the IP under test, and the active time ratio of the IP under test within a set time; for the domain name data, whether the IP address under test is bound to a domain name; for an IP stored in the third calibration library, the calibration results stored in the first calibration library and the second calibration library are deleted;
and a flow analysis and supervision module, which reads the IP types from the three calibration libraries, sets a white list according to the IP types, and filters network traffic.
2. The device of claim 1, wherein the fusion calibration module performs an equally weighted summation of the calibration results obtained from the first calibration library and the second calibration library to determine the type and confidence of the IP.
3. The device of claim 1, wherein the fusion calibration module determines the weights for the outputs of the selected IP behavior characteristics by a learning method and then performs the weighted summation; the learning method being: acquiring IP addresses of calibrated type and the behavior characteristic outputs corresponding to each IP, taking the acquired behavior characteristic outputs as input and the calibrated types as output, performing weight training, and finding the optimal weights that give the highest classification accuracy.
4. A method for calibrating user and server IP addresses based on specific enterprise domain name data, characterized by comprising the following steps:
step 1, calibrating user and server IP addresses based on flow data of the enterprise private routing equipment;
NETFLOW flow data is sampled in real time from the enterprise private routing equipment, and whether an IP address is a server IP address or a user IP address is judged according to IP behavior characteristics;
the IP behavior characteristics comprise: (1) whether a known service port is opened among the stable ports of the IP: a server IP opens known service ports, whereas a user IP does not; (2) the ratio of the number of outgoing bytes to the number of incoming bytes of the IP to be judged within a fixed time period: for a server IP the outgoing bytes exceed the incoming bytes, whereas for a user IP the incoming bytes exceed the outgoing bytes; (3) the ratio of the traffic on known ports to the traffic on dynamic ports of the IP to be judged: the traffic of a server IP is concentrated on known ports, whereas the traffic of a user IP is concentrated on dynamic ports; (4) the number of corresponding source IPs when the IP to be judged acts as a destination IP: the number is large when a server IP is the destination IP and small when a user IP is the destination IP;
according to each IP behavior characteristic, the confidences that the IP to be judged is a user IP and a server IP are output respectively, the outputs are weighted and summed as a first calibration result, and the first calibration library is updated;
step 2, calibrating user and server IP addresses based on the domain name data of the enterprise private DNS;
server IPs and user IPs are distinguished according to IP behavior characteristics of the collected private DNS domain name data;
the IP behavior characteristics in this step comprise: (1) whether port 53 is among the stable ports of the IP: a domain name resolution server IP opens port 53, whereas a user IP does not; (2) the ratio of the numbers of packets in which the IP to be judged acts as source IP and as destination IP in the DNS request traffic and the DNS response traffic within a fixed time period: a domain name resolution server IP acts mostly as the destination IP in DNS request traffic and mostly as the source IP in DNS response traffic, whereas a user IP acts only as the source IP in DNS request traffic and as the destination IP in DNS response traffic; (3) the number of distinct corresponding source IPs when the IP to be judged acts as a destination IP: the number is large when a domain name resolution server IP is the destination IP and small when a user IP is the destination IP;
according to each IP behavior characteristic, the confidences that the IP to be judged is a user IP and a server IP are output respectively, the outputs are weighted and summed as a second calibration result, and the second calibration library is updated;
step 3, fusion calibration of user and server IP addresses; in this step, fusion calibration is performed on IPs that have calibration results from both step 1 and step 2; an IP that has a calibration result from only step 1 or only step 2 is not subjected to fusion calibration, and its calibration result is used directly;
the first calibration result and the second calibration result obtained for the IP to be judged in step 1 and step 2 respectively are weighted and summed to obtain the type of the IP to be judged; the first calibration result and the second calibration result are both expressed as the confidence that the IP to be judged is a server IP;
further, the IP behavior characteristic outputs used in step 1 and step 2 are selected and weighted and summed for the IP to be judged to obtain the type of the IP to be judged; each IP behavior characteristic output is expressed as the confidence that the IP to be judged is a server IP;
and after the type of the IP to be judged is determined, a type identifier is added to the recorded IP information, and traffic is analyzed and supervised according to the type identifier.
5. The method according to claim 4, wherein in step 3, when the first calibration result and the second calibration result are weighted and summed, the weights are assigned by a heuristic method, the heuristic method being to distribute the weights evenly.
6. The method of claim 4, wherein in step 3, when the IP behavior characteristic outputs are weighted and summed, the weights are determined by a learning method, the learning method being: acquiring IP addresses of calibrated type and the behavior characteristic outputs corresponding to each IP, taking the acquired behavior characteristic outputs as input and the calibrated types as output, performing weight training, and finding the optimal weights that give the highest classification accuracy.
CN201910537333.0A 2019-06-20 2019-06-20 User and server IP address calibration device and method based on specific enterprise domain name data Active CN110324327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910537333.0A CN110324327B (en) 2019-06-20 2019-06-20 User and server IP address calibration device and method based on specific enterprise domain name data

Publications (2)

Publication Number Publication Date
CN110324327A CN110324327A (en) 2019-10-11
CN110324327B true CN110324327B (en) 2021-07-13

Family

ID=68120986

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910537333.0A Active CN110324327B (en) 2019-06-20 2019-06-20 User and server IP address calibration device and method based on specific enterprise domain name data

Country Status (1)

Country Link
CN (1) CN110324327B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112929458B (en) * 2019-12-06 2023-04-07 中国电信股份有限公司 Method and device for determining address of server of APP (application) and storage medium
CN112272245B (en) * 2020-10-22 2022-11-01 广州大学 Lightweight domain name system data generation system and method
CN112671952B (en) * 2020-12-31 2022-12-13 恒安嘉新(北京)科技股份公司 IP detection method, device, equipment and storage medium
CN113630409B (en) * 2021-08-05 2023-04-28 哈尔滨工业大学(威海) Abnormal flow identification method based on DNS analysis flow and IP flow fusion analysis
CN114466398A (en) * 2021-12-20 2022-05-10 中盈优创资讯科技有限公司 Method and device for analyzing 5G terminal user behaviors through netflow data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101990003B (en) * 2010-10-22 2012-11-28 西安交通大学 User action monitoring system and method based on IP address attribute
CN103412930A (en) * 2013-08-17 2013-11-27 北京品友互动信息技术有限公司 Method for identifying attributes of internet users
CN105704259A (en) * 2016-01-21 2016-06-22 中国互联网络信息中心 IP recognition method and system for domain name authority service source
CN106528561A (en) * 2015-09-11 2017-03-22 飞思达技术(北京)有限公司 An internet content resource detection method based on the internet crawler technology
CN107704586A (en) * 2017-10-09 2018-02-16 陈包容 A kind of methods, devices and systems of user's portrait based on User Activity address

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8935383B2 (en) * 2010-12-31 2015-01-13 Verisign, Inc. Systems, apparatus, and methods for network data analysis
US20150215334A1 (en) * 2012-09-28 2015-07-30 Level 3 Communications, Llc Systems and methods for generating network threat intelligence

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
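IP attribute analysis based on traceability data and flow data; Ou Xueting; China Master's Theses Full-text Database, Information Science and Technology; 20190115; full text *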

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant