WO2022067539A1 - Network traffic processing method, apparatus, storage medium and computer device - Google Patents

Network traffic processing method, apparatus, storage medium and computer device

Info

Publication number
WO2022067539A1
Authority
WO
WIPO (PCT)
Prior art keywords
clustering
network traffic
distance
network
result
Prior art date
Application number
PCT/CN2020/118964
Other languages
English (en)
French (fr)
Inventor
刘澍嶷
崔应杰
渠海峡
Original Assignee
山石网科通信技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 山石网科通信技术股份有限公司
Priority to CN202080002208.0A (patent CN112352412B)
Priority to US17/043,714 (patent US11874901B2)
Priority to PCT/CN2020/118964 (WO2022067539A1)
Publication of WO2022067539A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/02 Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45595 Network integration; Enabling network access in virtual machine instances
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Definitions

  • The present invention relates to the field of network security, and in particular to a network traffic processing method, apparatus, storage medium, and computer device.
  • Embodiments of the present invention provide a network traffic processing method, apparatus, storage medium, and computer device, so as to at least solve the technical problem that, because network topologies are complex, the related art cannot derive a basis for formulating a network control policy from complex network information.
  • A network traffic processing method is provided, including: acquiring network traffic and treating the acquired network traffic as discrete objects; clustering the discrete objects to obtain a clustering result; and outputting the clustering result.
  • Clustering the discrete objects to obtain the clustering result includes: clustering the discrete objects with a hierarchical agglomerative clustering (HAC) method to obtain the clustering result.
  • Clustering the discrete objects with the HAC method to obtain the clustering result includes: clustering the discrete objects with the HAC method according to the source IP addresses and destination IP addresses of the network traffic treated as discrete objects.
  • The agglomeration distance used in the HAC method includes the Chebyshev distance.
  • The agglomeration distance used in the HAC method is determined from a first distance, a second distance, and the respective weights of the two distances, where the first distance is the distance between the source virtual machines of the two network flows being clustered, and the second distance is the distance between their destination virtual machines.
  • Outputting the clustering result includes: when the network traffic is transmitted in a network that uses physical machines, outputting the clustering result in IP/mask format; when the network traffic is transmitted in a cloud network that uses virtual machines, outputting the clustering result in the form of an address book.
  • Clustering the discrete objects to obtain the clustering result includes: determining multiple agglomeration distance step values; and clustering the discrete objects according to the multiple step values to obtain multiple clustering results, one for each agglomeration distance step value.
  • Clustering the discrete objects to obtain the clustering result includes: acquiring a clustering control parameter; filtering the discrete objects according to the clustering control parameter to obtain a filtering result; and clustering the filtering result to obtain the clustering result.
  • The clustering control parameter includes a port of the network traffic.
  • Treating the acquired network traffic as discrete objects includes: extracting feature information of the network traffic; preprocessing the feature information to obtain a preprocessing result; and mapping the preprocessing result to a point in a plane rectangular coordinate system and using that point as the discrete object.
  • Mapping the preprocessing result to a point in the plane rectangular coordinate system includes: determining whether the degree of divergence of the feature information used for clustering reaches a predetermined threshold; when it does, adjusting the coordinates of the preprocessing result to obtain a coordinate-adjusted preprocessing result; and mapping the coordinate-adjusted preprocessing result to a point in the plane rectangular coordinate system.
  • The method is applied to flow control of firewalls.
  • A network traffic processing apparatus is provided, comprising: an acquisition module for acquiring network traffic and treating the acquired network traffic as discrete objects; a clustering module for clustering the discrete objects to obtain a clustering result; and an output module for outputting the clustering result.
  • A computer-readable storage medium is provided that includes a stored program, wherein, when the program runs, the device on which the storage medium is located is controlled to execute any of the network traffic processing methods described above.
  • A computer device is provided, including a memory and a processor, where the memory stores a computer program and the processor is configured to execute the computer program stored in the memory; when the computer program runs, the processor executes any of the network traffic processing methods described above.
  • In the embodiments of the present invention, network traffic is treated as discrete objects, the discrete objects are clustered, and the clustering result is output, which provides a basis for formulating a network control policy. This achieves the technical effect of formulating network control policies efficiently and reasonably, and thereby solves the technical problem that the related art cannot derive a basis for formulating a network control policy from complex network information because of complex network topology.
  • FIG. 1 is a flowchart of a network traffic processing method according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of a HAC clustering algorithm provided according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a stepwise clustering result provided according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a coordinate scaling optimization clustering result provided according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a binary tree formed by an IP address set provided according to an embodiment of the present invention.
  • FIG. 7 is a structural block diagram of a network traffic processing apparatus provided according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a network policy clustering apparatus provided according to an embodiment of the present invention.
  • Network traffic is the flow of data transmitted over the network.
  • Each specific piece of network traffic has some parameters, such as source IP address, destination IP address, source port number and destination port number.
  • Cluster: a concept in cluster analysis; a group obtained by grouping physical or abstract objects.
  • A cluster can include one or more objects.
  • Chebyshev distance: a distance measure in coordinate space that defines the distance between two points as the maximum of the absolute differences of their coordinate values.
  • According to an embodiment of the present invention, a method for processing network traffic is provided. It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
  • FIG. 1 is a flowchart of a method for processing network traffic according to an embodiment of the present invention. As shown in FIG. 1 , the flowchart includes the following steps:
  • Step S102: acquire network traffic and treat the acquired network traffic as discrete objects;
  • Step S104: cluster the discrete objects to obtain a clustering result;
  • Step S106: output the clustering result.
  • In the above steps, network traffic is treated as discrete objects, the discrete objects are clustered, and the clustering result is output, which provides a basis for formulating a network control policy. Because the clustering result reflects what some network traffic has in common, traffic sharing that commonality can be handled uniformly when a network control policy is formulated (for example, blocked or permitted as a group), which achieves the technical effect of formulating network control policies efficiently and reasonably. This in turn solves the technical problem that the related art cannot derive a basis for formulating a network control policy from complex network information because of complex network topology.
  • By clustering network traffic, the embodiments of the present invention give network administrators policy suggestions that are more efficient and easier to understand, help them understand the characteristics of the traffic in the current network, and help them formulate appropriate network control policies.
  • The embodiments of the present invention can also optimize adaptive learning, so that the result of adaptive learning supports the formulation of network control policies more effectively.
  • Clustering the discrete objects to obtain a clustering result includes: clustering the discrete objects with a hierarchical agglomerative clustering (HAC) method to obtain the clustering result.
  • The HAC method clusters multiple objects without requiring the initial cluster points or the number of clusters to be chosen manually, so it can run automatically. It is briefly described as follows. First, treat each object as an independent Cluster, traverse all Clusters, and calculate the distance between every two Clusters. Then merge the two closest Clusters into one; the coordinates of the new Cluster are the midpoint of the two Clusters that were merged. Repeat this agglomeration step: each merge reduces the total number of Clusters by 1, until all objects are eventually gathered into a single Cluster.
  • FIG. 2 is a schematic diagram of a HAC clustering algorithm provided according to an embodiment of the present invention.
  • The HAC algorithm has one adjustable parameter: the maximum agglomeration distance d. As can be seen from e) in FIG. 2, when the agglomeration distance is d1 there are 3 Clusters; when it is d2 there are 2 Clusters; and when it is d3 there is only one Cluster.
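  • The merge loop and the effect of the maximum agglomeration distance d can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the sample points, and the threshold values are hypothetical, and the Chebyshev distance is used here only because it is the distance the text names for physical networks.

```python
from itertools import combinations

def chebyshev(p, q):
    # Largest absolute coordinate difference between two points.
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def hac(points, max_distance):
    """Repeatedly merge the two closest clusters until the closest pair
    is farther apart than max_distance (hypothetical stopping rule)."""
    # Each cluster is (centre, members); every point starts as its own cluster.
    clusters = [(p, [p]) for p in points]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: chebyshev(clusters[ij[0]][0], clusters[ij[1]][0]),
        )
        (c1, m1), (c2, m2) = clusters[i], clusters[j]
        if chebyshev(c1, c2) > max_distance:
            break
        # The new cluster sits at the midpoint of the two merged clusters.
        merged = (((c1[0] + c2[0]) / 2, (c1[1] + c2[1]) / 2), m1 + m2)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [members for _, members in clusters]

points = [(0, 0), (1, 0), (10, 0), (11, 0), (30, 0)]
print([len(hac(points, d)) for d in (1, 10, 25)])  # -> [3, 2, 1]
```

  • With these sample points, raising the threshold from 1 to 10 to 25 reduces the number of Clusters from 3 to 2 to 1, which mirrors the d1, d2, d3 progression described for FIG. 2.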
  • Clustering the discrete objects with the HAC method to obtain a clustering result includes: clustering the discrete objects with the HAC method according to the source IP addresses and destination IP addresses of the network traffic treated as discrete objects.
  • The source and destination IP addresses of the network traffic serve as the basis for clustering: the similarity of two discrete objects is judged from their source IP addresses and destination IP addresses, and the discrete objects are clustered accordingly.
  • Clustering results for multiple agglomeration distances may be provided. For example, when clustering the discrete objects, multiple agglomeration distance step values can be determined and the discrete objects clustered according to each of them, yielding one clustering result per step value. As an optional embodiment, the step-by-step results can be displayed in descending order of maximum agglomeration distance, which makes it convenient for network administrators to choose the best clustering result.
  • FIG. 3 is a schematic diagram of a stepwise clustering result provided according to an embodiment of the present invention. As shown in FIG. 3, the step value l can be used to control the agglomeration result as required and to display the results step by step.
  • Clustering the discrete objects to obtain a clustering result includes: acquiring a clustering control parameter; filtering the discrete objects according to the clustering control parameter to obtain a filtering result; and clustering the filtering result to obtain the clustering result.
  • The clustering control parameter may be of several types; for example, it may include the port of the network traffic, where the port covers both the source port and the destination port of the discrete object.
  • The clustering control parameter can be chosen by the user to filter the discrete objects that take part in clustering. For example, if the user specifies that only HTTP traffic is to be clustered, only objects whose port numbers include 80 are clustered.
  • Treating the acquired network traffic as discrete objects can be implemented in various ways.
  • For example, the following method can be used: first, extract feature information of the network traffic, which can be of various kinds, such as the source IP address, destination IP address, source port number, and destination port number; then preprocess the feature information to obtain a preprocessing result, where the preprocessing can include various kinds of processing, for example normalizing the IP addresses; finally, map the preprocessing result to a point in the plane rectangular coordinate system and treat that point as a discrete object.
  • Mapping network traffic to points in the plane rectangular coordinate system makes the clustering result easy to see intuitively.
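  • As a minimal sketch of the mapping step, an IPv4 address can be read as a 32-bit integer, so one flow becomes the point (source IP, destination IP). The helper name `flow_to_point` is hypothetical, and this sketch leaves out the normalization that the text mentions as a possible preprocessing step.

```python
import ipaddress

def flow_to_point(src_ip, dst_ip):
    """Map a flow to (x, y): x from the source IP, y from the destination IP."""
    x = int(ipaddress.IPv4Address(src_ip))
    y = int(ipaddress.IPv4Address(dst_ip))
    return (x, y)

point = flow_to_point("10.100.1.5", "10.100.2.9")
```

  • Addresses in the same subnet differ only in their low-order bits, so they land close together on the corresponding axis.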
  • In some cases the clustering effect is not ideal.
  • To address this, the following processing may be adopted: determine whether the degree of divergence of the feature information used for clustering reaches a predetermined threshold, which can be set from experience or statistics; when it does, adjust the coordinates of the preprocessing result to obtain a coordinate-adjusted preprocessing result; then map the coordinate-adjusted preprocessing result to a point in the plane rectangular coordinate system.
  • The coordinate adjustment referred to above may rescale the axes of the coordinate system, or may directly rescale the coordinate values of the discrete objects taking part in clustering. Either approach can be chosen as needed.
  • For example, the policy configuration converted from a clustered Cluster is: from 10.100.1.101/30 to 10.100.2.0/24.
  • This Cluster corresponds to a long, narrow rectangle in the clustering coordinate system, and clustering can easily split it into multiple Clusters.
  • Coordinate scaling is therefore used so that the more divergent term is projected more compactly in the coordinate system, which yields a better clustering result.
  • FIG. 4 is a schematic diagram of a coordinate-scaling-optimized clustering result provided according to an embodiment of the present invention. As shown in FIG. 4, in a coordinate system with normal proportions, the 8 Clusters may be merged into two Clusters, A and B. Changing the coordinate scale makes the divergent term converge, and the algorithm then gives a better clustering result.
  • For example, a "converged source address" scheme can be specified.
  • In this scheme a coefficient less than 1 is introduced, so that source addresses are mapped into the coordinate system in a more concentrated way.
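  • Coordinate scaling can be sketched as multiplying one axis by a coefficient below 1 before clustering. The function name, the sample points, and the value 0.25 are assumptions for illustration; which axis to compress depends on which feature is the divergent one, and the example below compresses the ordinate.

```python
def chebyshev(p, q):
    # Largest absolute coordinate difference between two points.
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def scale_axis(points, axis, coefficient):
    """Multiply one coordinate (0 = abscissa, 1 = ordinate) of every point."""
    return [
        tuple(v * coefficient if i == axis else v for i, v in enumerate(p))
        for p in points
    ]

# A long, narrow group: tightly packed on x, widely spread on y.
points = [(0, 0), (0, 200), (3, 100)]
scaled = scale_axis(points, axis=1, coefficient=0.25)

before = chebyshev(points[0], points[1])  # spread along the divergent axis
after = chebyshev(scaled[0], scaled[1])   # same pair after scaling
```

  • After scaling, the two end points of the narrow group are four times closer in Chebyshev distance, so a given maximum agglomeration distance is more likely to keep the group in one Cluster.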
  • Different networks may be maintained in different ways; for example, a network that uses physical machines differs from one that uses virtual machines.
  • In the network traffic processing method of this solution, when clustering network traffic transmitted in different networks, different distance measures can be used so that clustering is efficient.
  • The agglomeration distance used in the HAC method includes the Chebyshev distance.
  • Clustering is based on the distance between two Clusters.
  • The HAC method merges the two Clusters that are closest in the coordinate space into a new Cluster, then traverses all Clusters again to find the closest pair and performs the next merge.
  • A characteristic of network traffic is that the source IP addresses of many flows are usually concentrated, and so are the destination IP addresses.
  • The region corresponding to a network policy obtained by clustering is therefore a square or rectangle in the coordinate space where clustering is performed.
  • For example, the policy configuration converted from a clustered Cluster is: from 10.100.1.0/24 to 10.100.2.0/24.
  • The region this Cluster occupies in the clustering coordinate space is a square.
  • Using the Chebyshev distance to evaluate the distance between two Clusters therefore gives a good clustering effect.
  • The Chebyshev distance formula is: $d = \max(|x_1 - x_2|, |y_1 - y_2|)$, where
  • $d$ is the distance between cluster1 and cluster2,
  • $x_1$ is the abscissa of cluster1,
  • $x_2$ is the abscissa of cluster2,
  • $y_1$ is the ordinate of cluster1,
  • $y_2$ is the ordinate of cluster2.
  • FIG. 5 is a schematic diagram of calculating the Chebyshev distance according to an embodiment of the present invention. As shown in FIG. 5, when a cluster is at the origin, the points at a Chebyshev distance of 1 from that cluster form a square.
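  • The definition above translates directly into code. The sketch below (the function name is an illustrative choice) also checks the square-shaped neighbourhood: a point on any side of the square with corners (±1, ±1) is at distance exactly 1 from the origin.

```python
def chebyshev_distance(cluster1, cluster2):
    """d = max(|x1 - x2|, |y1 - y2|) for two points given as (x, y)."""
    (x1, y1), (x2, y2) = cluster1, cluster2
    return max(abs(x1 - x2), abs(y1 - y2))

origin = (0, 0)
# Points on the four sides of the square with corners (+-1, +-1).
on_square = [(1, 0.5), (-1, -0.25), (0.75, 1), (0, -1), (1, 1)]
distances = [chebyshev_distance(origin, p) for p in on_square]
```

  • By contrast, the Euclidean distance from the origin to the corner (1, 1) is about 1.41, which is why Euclidean neighbourhoods are circles while Chebyshev neighbourhoods match the square and rectangular regions that IP/mask policies describe.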
  • The agglomeration distance used in the HAC method is determined from a first distance, a second distance, and the respective weights of the first distance and the second distance, where the first distance is the distance between the source VMs of the two network flows being clustered, and the second distance is the distance between their destination VMs.
  • In a cloud network, the network environment is flat and VM IP addresses are allocated far less regularly than in a traditional network environment. Defining the distance between two network flows by the Chebyshev distance therefore cannot achieve a good clustering effect, and the clustering effect may not be guaranteed.
  • To solve this problem, the distance between two VMs is defined using address books and joint probability.
  • The distance between two VMs according to an embodiment of the present invention is described below by way of example.
  • VMs in a cloud platform are generally divided according to functions, and each VM belongs to one or several address books.
  • VM ⁇ belongs to M address books among the N address books, and M is less than or equal to N.
  • conditional probability that the address book containing VM ⁇ x also contains VM ⁇ x is:
  • the distance between VM ⁇ x and VM ⁇ x is defined as:
  • the distance between the destination VMs corresponding to the two network flows F ⁇ and F ⁇ can be defined as
  • the distance between the source VMs and the distance between the destination VMs of the two network flows are obtained. Accordingly, the distance between the two network traffic can be defined according to the distance between the source VM and the destination VM, and the respective weights occupied by the two distances. specific:
  • is a weight parameter that characterizes the distance between the source VM and the destination VM.
  • In this way, the distance between two network flows transmitted in the cloud is characterized.
  • With this distance, network traffic in a cloud network environment can be given a good preliminary clustering. After several rounds of clustering, multiple Clusters are obtained, each containing multiple network flows and therefore multiple source VMs and destination VMs. At that point the distance between any two Clusters can be evaluated using multi-dimensional joint probability, so that HAC clustering with the address-book-based distance proceeds stably until the remaining Clusters meet the preset termination condition.
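  • The address-book distance can be illustrated with the sketch below. The formulas themselves are not reproduced in this text, so everything specific here is an assumption made for illustration: the conditional probability is estimated by counting address books, the VM distance is taken as 1 minus the product of the two conditional probabilities, and the flow distance is a weighted sum with a hypothetical `weight` parameter. All names and the sample address books are hypothetical.

```python
def conditional_probability(books, given_vm, other_vm):
    """Fraction of the address books containing given_vm that also contain other_vm."""
    containing = [b for b in books if given_vm in b]
    if not containing:
        return 0.0
    return sum(other_vm in b for b in containing) / len(containing)

def vm_distance(books, vm_a, vm_b):
    # ASSUMPTION: 1 minus the product of both conditional probabilities.
    if vm_a == vm_b:
        return 0.0
    p_ab = conditional_probability(books, vm_a, vm_b)
    p_ba = conditional_probability(books, vm_b, vm_a)
    return 1.0 - p_ab * p_ba

def flow_distance(books, flow_a, flow_b, weight=0.5):
    # ASSUMPTION: weighted sum of source-VM and destination-VM distances.
    # Each flow is a (source_vm, destination_vm) pair.
    d_src = vm_distance(books, flow_a[0], flow_b[0])
    d_dst = vm_distance(books, flow_a[1], flow_b[1])
    return weight * d_src + (1 - weight) * d_dst

address_books = [
    {"web1", "web2"},
    {"web1", "web2", "db1"},
    {"db1", "db2"},
]
d = flow_distance(address_books, ("web1", "db1"), ("web2", "db2"))  # -> 0.25
```

  • In this example web1 and web2 always appear together, so their distance is 0, while db1 and db2 share only one of db1's two address books, giving 0.5; with equal weights the flow distance is 0.25.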
  • Depending on the network, the clustering results may also be output differently.
  • If the network traffic is transmitted in a network that uses physical machines, the clustering result is output in IP/mask format; if the network traffic is transmitted in a cloud network that uses virtual machines, the clustering result is output in the form of an address book.
  • IP/mask format output expresses the source/destination IP addresses of all discrete objects in a Cluster as one or several address segments in IP/mask format.
  • When the network traffic is transmitted in a network that uses physical machines, terminals are connected to the Internet through a network built from multiple routing layers, so the source/destination IP addresses gathered in one Cluster are likely to be in the same network segment, and this output format suits such an environment.
  • Two IP/mask output modes are available for users to choose as needed: lite mode and strict mode.
  • The difference between the two modes is explained below through the principle of the output algorithm.
  • FIG. 6 is a schematic diagram of a binary tree formed by a set of IP addresses provided according to an embodiment of the present invention.
  • An IPv4 address has 32 bits. All IP addresses in the Cluster are inserted, from the most significant bit to the least significant bit, into a binary tree of depth 33 (the root node carries no meaning). A bit of 0 goes to the left subtree and a bit of 1 to the right subtree.
  • Lite mode: the IP/mask address is output according to the first fork point of the IP binary tree, shown as point A in FIG. 6.
  • The IP addresses that make up the binary tree have already been clustered and share certain commonalities, so outputting an IP/mask address at the position of point A presents the network situation as simply as possible while meeting most needs.
  • Strict mode: IP/mask addresses are output according to the root nodes of the largest full binary subtrees. As shown at point B in FIG. 6, the subtree rooted at B is a full binary tree, so a single IP/mask address segment can represent all nodes of that subtree. When the smallest subtree consists of a root node only (point C in FIG. 6), the output mask is 32. The output of strict mode matches the IP addresses in the Cluster exactly, no more and no fewer.
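  • The two output modes can be approximated with Python's standard `ipaddress` module instead of an explicit binary tree; this is a substitute technique chosen for brevity, not the tree walk described above. The function names and the sample addresses are hypothetical. Lite mode is computed as the longest bit prefix shared by all addresses (the first fork point), and strict mode as the smallest set of prefixes that covers exactly the given addresses.

```python
import ipaddress

def lite_mode(ips):
    """One covering prefix: the leading bits shared by every address."""
    values = [int(ipaddress.IPv4Address(ip)) for ip in ips]
    lo, hi = min(values), max(values)
    prefix_len = 32 - (lo ^ hi).bit_length()   # bits common to min and max
    host_bits = 32 - prefix_len
    base = (lo >> host_bits) << host_bits      # clear the bits below the fork
    return ipaddress.ip_network((base, prefix_len))

def strict_mode(ips):
    """Smallest list of prefixes covering exactly the given addresses."""
    hosts = {ipaddress.ip_network(ip) for ip in ips}   # each address as a /32
    return list(ipaddress.collapse_addresses(hosts))

cluster_ips = ["10.100.1.4", "10.100.1.5", "10.100.1.6", "10.100.1.7", "10.100.1.9"]
print(lite_mode(cluster_ips))                       # 10.100.1.0/28
print([str(n) for n in strict_mode(cluster_ips)])   # ['10.100.1.4/30', '10.100.1.9/32']
```

  • The lite result is a single segment that also covers addresses not present in the Cluster, while the strict result lists a /30 for the four contiguous addresses plus a /32 for the isolated one.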
  • the IP addresses in the Cluster can be matched to one or several address books, and the matching clustering results can be output in the form of address books, for example:
  • the address book matching algorithm is implemented as follows:
  • Table 1 shows, for each IP address in the Cluster, whether it is included in each address book, according to an embodiment of the present invention:
  • The network traffic processing solution provided by the embodiments of the present invention can be applied in various scenarios, for example adaptive policy recommendation for traditional firewall devices, public clouds, private clouds, and data centers, and firewall policies in other complex network environments.
  • In the embodiments of the present invention, the following processing is adopted: extracting source/destination IPs from traffic logs and projecting traffic log features into a plane rectangular coordinate system; analyzing the traffic logs with an unsupervised clustering algorithm and providing policy configuration suggestions; using the Chebyshev distance as the evaluation value of the agglomeration distance; using coordinate scaling to optimize the clustering effect; using the discrete joint probability of virtual machines to calculate the evaluation value of the agglomeration distance; and outputting results in address book format. This provides intelligent policy recommendations that help network administrators manage firewall policies more quickly and efficiently.
  • FIG. 7 is a structural block diagram of a network traffic processing apparatus provided according to an embodiment of the present invention. As shown in FIG. 7, the apparatus includes an acquisition module 72, a clustering module 74, and an output module 76, which are described below.
  • an acquisition module 72 configured to acquire network traffic, and use the acquired network traffic as discrete objects
  • a clustering module 74 connected to the aforementioned acquisition module 72, for clustering the discrete objects to obtain a clustering result
  • the output module 76 connected to the above-mentioned clustering module 74, is used for outputting the clustering result.
  • FIG. 8 is a schematic structural diagram of a network policy clustering apparatus provided according to an embodiment of the present invention. As shown in FIG. 8, the network policy clustering apparatus includes a log information extraction module 82, a feature information mapping module 84, a clustering module 86, and a policy output module 88. The log information extraction module 82 and the feature information mapping module 84 implement the functions of the above acquisition module 72, the clustering module 86 implements the functions of the above clustering module 74, and the policy output module 88 corresponds to the above output module 76. They are described below.
  • the log information extraction module 82 is configured to extract the source/destination IP and source/destination port number of each flow from the flow log, and perform deduplication processing. After processing, a plurality of discrete and deduplicated objects to be clustered are obtained, and each object contains parameters of 4 dimensions.
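  • The extraction and deduplication performed by module 82 might look like the sketch below. The log record layout (dictionaries with `src_ip`, `dst_ip`, `src_port`, `dst_port` keys) and the function name are assumptions, since the text does not specify the flow log format.

```python
def extract_flows(log_records):
    """Return the distinct (src_ip, dst_ip, src_port, dst_port) tuples,
    keeping the order in which they first appear in the log."""
    seen = set()
    flows = []
    for record in log_records:
        key = (record["src_ip"], record["dst_ip"],
               record["src_port"], record["dst_port"])
        if key not in seen:
            seen.add(key)
            flows.append(key)
    return flows

log = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.1.1", "src_port": 51000, "dst_port": 80},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.1.1", "src_port": 51000, "dst_port": 80},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.1.1", "src_port": 51001, "dst_port": 80},
]
objects = extract_flows(log)  # two distinct 4-dimensional objects
```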
  • The feature information mapping module 84, connected to the log information extraction module 82, is used to map each discrete object to a point in the plane rectangular coordinate system, with its source IP as the abscissa and its destination IP as the ordinate.
  • Taking the source IP address as an example: first obtain the minimum and maximum source IP addresses over all objects, then normalize the source IP address of each object.
  • The normalized value is mapped to an unsigned 32-bit number, which is used as the abscissa of the object in the plane rectangular coordinate system.
  • The destination IP is mapped to the ordinate of the plane rectangular coordinate system in the same way.
  • The clustering module 86 is connected to the feature information mapping module 84 and includes a clustering execution unit and a clustering mode control unit.
  • The clustering execution unit is configured to cluster the mapped discrete objects using a hierarchical agglomerative clustering (HAC) algorithm.
  • The clustering mode control unit is configured to receive user parameters and control the clustering execution unit so as to obtain a better clustering result.
  • The policy output module 88 is connected to the clustering module 86 and is configured to output the clustering result.
  • For example, all discrete source/destination IP addresses in a Cluster can be output in two forms: IP/mask format output and address book matching output.
  • Embodiments of the present invention may provide a computer terminal, which may be any computer terminal device in a group of computer terminals.
  • Optionally, in this embodiment, the computer terminal may also be replaced by a terminal device such as a mobile terminal.
  • Optionally, in this embodiment, the computer terminal may be located in at least one of multiple network devices of a computer network.
  • Optionally, the computer terminal may include one or more processors (only one is shown in the figure), a memory, and the like.
  • The memory stores a computer program. For example, it can store software programs and modules, such as the program instructions/modules corresponding to the network traffic processing method and apparatus in the embodiments of the present invention. By running the software programs and modules stored in the memory, the processor performs various functional applications and data processing, that is, implements the network traffic processing method described above.
  • The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • In some examples, the memory may further include memory located remotely from the processor, and such remote memory may be connected to the computer terminal through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • The processor may call the information and application programs stored in the memory through a transmission device to execute the computer program stored in the memory; when the computer program runs, the processor executes the network traffic processing method described in any one of the above.
  • Embodiments of the present invention also provide a storage medium.
  • Optionally, in this embodiment, the storage medium may be used to store the program code executed by the network traffic processing method provided in Embodiment 1; when the program runs, the device where the storage medium is located is controlled to execute any of the methods described above.
  • The serial numbers of the embodiments of the present invention are for description only and do not indicate that one embodiment is better or worse than another.
  • In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways.
  • The apparatus embodiments described above are only illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between units or modules through some interfaces, and may be electrical or take other forms.
  • The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
  • The integrated units may be implemented in the form of hardware or in the form of software functional units.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention.
  • The storage medium includes any medium that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

Abstract

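A network traffic processing method and apparatus, a storage medium, and a computer device. The method includes: acquiring network traffic and treating the acquired network traffic as discrete objects (S102); clustering the discrete objects to obtain a clustering result (S104); and outputting the clustering result (S106). The method solves the technical problem in the related art that, because network topologies are complex, no basis for formulating network control policies can be derived from complex network information.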

Description

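Network traffic processing method and apparatus, storage medium, and computer device
Technical Field
The present invention relates to the field of network security, and in particular to a network traffic processing method and apparatus, a storage medium, and a computer device.
Background
In today's Internet, as the number of network devices grows, network topologies become increasingly complex, and planning network control policies sensibly becomes increasingly difficult. In recent years in particular, data centers have developed rapidly, and the east-west traffic inside them is large and complex, which makes planning network control policies much harder. Managing network policies by human effort alone can hardly achieve efficient, accurate, and timely formulation and modification of control policies.
To address these problems, the related art has proposed concepts such as zero-trust networks and adaptive learning. Adaptive learning products typically output the network topology, in visual or command-line form, based on specific network information. For a complex network, however, simply laying out the topology does not help a network administrator formulate network control policies efficiently and sensibly, while other methods suffer from drawbacks such as expensive services and long computation cycles.
No effective solution to the above problems has yet been proposed.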
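Summary of the Invention
Embodiments of the present invention provide a network traffic processing method and apparatus, a storage medium, and a computer device, so as to at least solve the technical problem in the related art that, because network topologies are complex, no basis for formulating network control policies can be derived from complex network information.
According to one aspect of the embodiments of the present invention, a network traffic processing method is provided, including: acquiring network traffic and treating the acquired network traffic as discrete objects; clustering the discrete objects to obtain a clustering result; and outputting the clustering result.
Optionally, clustering the discrete objects to obtain the clustering result includes: clustering the discrete objects using a hierarchical agglomerative clustering (HAC) method to obtain the clustering result.
Optionally, clustering the discrete objects using the HAC method to obtain the clustering result includes: clustering the discrete objects using the HAC method according to the source IP addresses and destination IP addresses of the network traffic serving as the discrete objects, to obtain the clustering result.
Optionally, when the network traffic is transmitted in a network that uses physical machines, the agglomeration distance used in the HAC method includes the Chebyshev distance.
Optionally, when the network traffic is transmitted in a cloud network that uses virtual machines, the agglomeration distance used in the HAC method is determined from a first distance, a second distance, and the respective weights of the first and second distances, where the first distance is the distance between the source virtual machines of the two network flows being clustered, and the second distance is the distance between the destination virtual machines of the two network flows being clustered.
Optionally, outputting the clustering result includes: when the network traffic is transmitted in a network that uses physical machines, outputting the clustering result in IP/Mask format; and when the network traffic is transmitted in a cloud network that uses virtual machines, outputting the clustering result in the form of address books.
Optionally, clustering the discrete objects to obtain the clustering result includes: determining a plurality of agglomeration-distance ladder values; and clustering the discrete objects according to the plurality of ladder values to obtain a plurality of clustering results, one for each ladder value.
Optionally, clustering the discrete objects to obtain the clustering result includes: obtaining a clustering control parameter; filtering the discrete objects according to the clustering control parameter to obtain a filtering result; and clustering the filtering result to obtain the clustering result.
Optionally, the clustering control parameter includes a port of the network traffic.
Optionally, treating the acquired network traffic as discrete objects includes: extracting feature information of the network traffic; preprocessing the feature information to obtain a preprocessing result; and mapping the preprocessing result to a point in a plane rectangular coordinate system and using that point as the discrete object.
Optionally, mapping the preprocessing result to a point in the plane rectangular coordinate system includes: determining whether the degree of dispersion of the feature information used for clustering reaches a predetermined threshold; when it does, adjusting the coordinates of the preprocessing result to obtain a coordinate-adjusted preprocessing result; and mapping the coordinate-adjusted preprocessing result to a point in the plane rectangular coordinate system.
Optionally, the method is applied to traffic control of a firewall.
According to another aspect of the embodiments of the present invention, a network traffic processing apparatus is provided, including: an acquisition module, configured to acquire network traffic and treat the acquired network traffic as discrete objects; a clustering module, configured to cluster the discrete objects to obtain a clustering result; and an output module, configured to output the clustering result.
According to yet another aspect of the embodiments of the present invention, a computer-readable storage medium is provided. The storage medium includes a stored program, and when the program runs, the device where the storage medium is located is controlled to execute the network traffic processing method described in any one of the above.
According to still another aspect of the embodiments of the present invention, a computer device is provided, including a memory and a processor. The memory stores a computer program, and the processor is configured to execute the computer program stored in the memory; when the computer program runs, it causes the processor to execute the network traffic processing method described in any one of the above.
In the embodiments of the present invention, network traffic is treated as discrete objects, the discrete objects are clustered, and the clustering result is output, which provides a basis for formulating network control policies. This achieves the technical effect of formulating network control policies efficiently and sensibly, and thereby solves the technical problem in the related art that, because network topologies are complex, no basis for formulating network control policies can be derived from complex network information.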
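Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present invention and form a part of this application. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
FIG. 1 is a flowchart of a network traffic processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the HAC clustering algorithm provided according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of laddered clustering results provided according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of clustering results optimized by coordinate scaling, provided according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the Chebyshev distance calculation provided according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a binary tree formed from a set of IP addresses, provided according to an embodiment of the present invention;
FIG. 7 is a structural block diagram of a network traffic processing apparatus provided according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a network policy clustering apparatus provided according to an embodiment of the present invention.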
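Detailed Description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, claims, and drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
First, some nouns or terms that appear in the description of the embodiments of this application are explained as follows:
Network traffic: a data flow transmitted over a network. Each specific network flow has a number of parameters, for example a source IP address, a destination IP address, a source port number, and a destination port number.
Cluster: a concept in cluster analysis that refers to a group obtained by grouping physical or abstract objects. One Cluster may contain one or more objects.
Chebyshev distance: a distance metric in a coordinate space that defines the distance between two points as the maximum of the absolute differences of their coordinates.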
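Embodiment 1
According to an embodiment of the present invention, a network traffic processing method is provided. It should be noted that the steps shown in the flowchart of the drawings may be executed in a computer system, such as one running a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order.
FIG. 1 is a flowchart of a network traffic processing method according to an embodiment of the present invention. As shown in FIG. 1, the flow includes the following steps:
Step S102: acquire network traffic and treat the acquired network traffic as discrete objects;
Step S104: cluster the discrete objects to obtain a clustering result;
Step S106: output the clustering result.
Through the above steps, network traffic is treated as discrete objects, the discrete objects are clustered, and the clustering result is output, which provides a basis for formulating network control policies. Because the clustering result reflects what certain network flows have in common, flows that share such commonality can be handled uniformly when a network control policy is formulated (for example, blocked or permitted together). This achieves the technical effect of formulating network control policies efficiently and sensibly, and thereby solves the technical problem in the related art that, because network topologies are complex, no basis for formulating network control policies can be derived from complex network information.
In addition, by clustering network traffic, the embodiments of the present invention give network administrators policy suggestions that are more efficient and easier to understand, helping them understand the characteristics of traffic in the current network and formulate suitable network control policies. The embodiments of the present invention can also optimize adaptive learning, so that its results support the formulation of network control policies more effectively.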
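As an optional embodiment, clustering the discrete objects to obtain the clustering result includes: clustering the discrete objects using a hierarchical agglomerative clustering (HAC) method to obtain the clustering result. HAC is a method for clustering multiple objects that does not require the initial cluster points or the number of clusters to be chosen manually, so it can run automatically. The method is briefly described below. First, each object is treated as an independent Cluster, all Clusters are traversed, and the distance between every pair of Clusters is computed. Then the two closest Clusters are merged into one, and the coordinates of the new Cluster are the midpoint of the line connecting the two Clusters that formed it. This agglomeration operation is repeated; each time two Clusters are merged, the total number of Clusters decreases by 1, and eventually all objects are merged into a single Cluster.
FIG. 2 is a schematic diagram of the HAC clustering algorithm provided according to an embodiment of the present invention. As shown in FIG. 2, there are initially 4 Clusters: A, B, C, and D. In a), the two closest Clusters, A and B, are merged into 1 Cluster. In b), the closest Clusters, C and D, are merged into 1 Cluster. In c), the two merged Clusters are merged again into 1 Cluster. The HAC clustering algorithm has one tunable parameter: the maximum clustering distance d. As can be seen in e), there are 3 Clusters when the agglomeration distance is d1, 2 Clusters when it is d2, and only a single Cluster when it is d3.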
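The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the patented implementation: the function names `hac` and `chebyshev`, the dictionary layout of a cluster, and the brute-force pair search are all assumptions made for the example. It follows the text in placing a merged Cluster at the midpoint of the two Clusters that formed it and in stopping once the closest pair is farther apart than the maximum clustering distance.

```python
from itertools import combinations


def chebyshev(p, q):
    """Largest absolute coordinate difference between two 2-D points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def hac(points, max_distance, distance=chebyshev):
    """Merge the two closest clusters until the closest pair exceeds max_distance."""
    # Every object starts as its own cluster.
    clusters = [{"center": p, "members": [p]} for p in points]
    while len(clusters) > 1:
        # Brute-force search for the closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: distance(clusters[ij[0]]["center"], clusters[ij[1]]["center"]),
        )
        a, b = clusters[i], clusters[j]
        if distance(a["center"], b["center"]) > max_distance:
            break
        # The merged cluster sits at the midpoint of the two it replaces.
        merged = {
            "center": ((a["center"][0] + b["center"][0]) / 2,
                       (a["center"][1] + b["center"][1]) / 2),
            "members": a["members"] + b["members"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


if __name__ == "__main__":
    flows = [(0, 0), (1, 0), (10, 10), (11, 10)]
    for cluster in hac(flows, max_distance=2):
        print(cluster["center"], cluster["members"])
```

Raising `max_distance` merges more aggressively, which mirrors the d1, d2, d3 levels in FIG. 2. The pair search is quadratic per merge, so this form is only suitable for small inputs.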
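As an optional embodiment, clustering the discrete objects using the HAC method to obtain the clustering result includes: clustering the discrete objects using the HAC method according to the source IP addresses and destination IP addresses of the network traffic serving as the discrete objects. The source and destination IP addresses serve as the basis for clustering: the similarity or closeness of two discrete objects is judged from their source and destination IP addresses, and the discrete objects are clustered accordingly.
As the HAC method above shows, the maximum agglomeration distance is the main factor affecting the clustering result. Therefore, as an optional embodiment, clustering results corresponding to multiple agglomeration distances can be provided. For example, when clustering the discrete objects, a plurality of agglomeration-distance ladder values can be determined, and the discrete objects can be clustered according to these ladder values to obtain one clustering result per ladder value. As an optional embodiment, the aggregation results can be displayed as a ladder, ordered from the largest maximum agglomeration distance to the smallest, so that the network administrator can conveniently choose the best clustering result.
Suppose there are initially n Clusters. After n-1 agglomeration operations a single Cluster is obtained, and the process yields n-1 agglomeration distances: $d_1 \le d_2 \le \dots \le d_{n-1}$. Define the agglomeration-distance ladder values $l_0 \le l_1 \le \dots \le l_k$ such that the n-1 agglomeration distances are distributed non-uniformly between each pair of adjacent ladder values. The ladder values are then used as boundaries, and the clustering results are displayed as a ladder. FIG. 3 is a schematic diagram of laddered clustering results provided according to an embodiment of the present invention. As shown in FIG. 3, the aggregation result can be controlled by the ladder value l as needed and displayed in laddered form.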
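To make the ladder idea concrete, the sketch below counts how many Clusters remain at each ladder value, given the n-1 recorded agglomeration distances. It assumes, as the text does, that the distances are non-decreasing, so every merge with a distance at or below a ladder value has already happened at that level. The function name `clusters_per_ladder` and the sample numbers are hypothetical.

```python
from bisect import bisect_right


def clusters_per_ladder(merge_distances, ladder_values):
    """Map each ladder value to the number of clusters left at that level.

    merge_distances holds the n-1 agglomeration distances of a full HAC run
    over n objects; each merge at or below a ladder value removes one cluster.
    """
    n = len(merge_distances) + 1
    ordered = sorted(merge_distances)
    return {level: n - bisect_right(ordered, level) for level in ladder_values}


if __name__ == "__main__":
    # Five objects produce four merges.
    print(clusters_per_ladder([1, 1, 3, 10], ladder_values=[0, 2, 5, 20]))
    # {0: 5, 2: 3, 5: 2, 20: 1}
```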
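It should be noted that the agglomeration distance above can be regarded as one control parameter of the clustering; when clustering the discrete objects, the clustering can also be controlled by other parameters. As an optional embodiment, clustering the discrete objects to obtain the clustering result includes: obtaining a clustering control parameter; filtering the discrete objects according to the clustering control parameter to obtain a filtering result; and clustering the filtering result to obtain the clustering result. The clustering control parameter can take several forms; for example, it may include a port of the network traffic, where the port covers both the source port and the destination port of a discrete object.
The source/destination ports of a discrete object can take part in the clustering display in two ways:
They do not affect the clustering process and are shown only as supplementary content alongside the clustering result;
They act as a clustering control parameter that the user can select in order to filter the discrete objects that take part in clustering. For example, if the user specifies that only HTTP traffic is to be clustered, only objects whose port numbers include 80 are clustered.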
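The second option, filtering by port before clustering, might look like the following. The `Flow` record and the `filter_by_port` helper are illustrative names and not part of the patent; the sketch simply keeps flows whose source or destination port equals the selected port.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    """One deduplicated flow record with the four parameters named in the text."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int


def filter_by_port(flows, port):
    """Keep only flows whose source or destination port matches `port`."""
    return [f for f in flows if port in (f.src_port, f.dst_port)]


if __name__ == "__main__":
    flows = [
        Flow("10.100.1.5", "10.100.2.9", 51000, 80),
        Flow("10.100.1.6", "10.100.2.9", 51001, 443),
        Flow("10.100.2.9", "10.100.1.5", 80, 51000),
    ]
    http_only = filter_by_port(flows, 80)
    print(len(http_only))  # 2
```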
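As an optional embodiment, treating the acquired network traffic as discrete objects can be implemented in several ways. For example: first extract feature information of the network traffic, which may be of several kinds, such as the source IP address, destination IP address, source port number, and destination port number; then preprocess the feature information to obtain a preprocessing result, where the preprocessing may include several kinds of processing, for example normalizing the IP addresses; and finally map the preprocessing result to a point in a plane rectangular coordinate system and use that point as the discrete object. Mapping network traffic to points in a plane rectangular coordinate system makes the clustering result easy to visualize.
When one of the source address and destination address in the data being clustered is concentrated and the other is dispersed, the clustering effect is not ideal. As an optional embodiment, to obtain a better clustering effect, the following processing can be used when mapping the preprocessing result to a point in the plane rectangular coordinate system: determine whether the degree of dispersion of the feature information used for clustering reaches a predetermined threshold, which can be obtained from experience or statistics; when it does, adjust the coordinates of the preprocessing result to obtain a coordinate-adjusted preprocessing result; and map the coordinate-adjusted preprocessing result to a point in the plane rectangular coordinate system. The coordinate adjustment may scale the original axes of the coordinate system, or it may directly scale the coordinate values of the discrete objects taking part in clustering. Which approach to use can be chosen flexibly as needed.
For example, suppose the desired policy configuration derived from a Cluster is: from 10.100.1.101/30 to 10.100.2.0/24. This Cluster corresponds to a long, narrow rectangle in the clustering coordinate system and is easily split into several Clusters during clustering. To handle this case, coordinate scaling is used so that the more dispersed dimension is projected more compactly onto the coordinate system, yielding a better clustering result. FIG. 4 is a schematic diagram of clustering results optimized by coordinate scaling, provided according to an embodiment of the present invention. As shown in FIG. 4, in a coordinate system with normal proportions, the 8 Clusters may be clustered into two Clusters, A and B. After the proportions of the coordinates are changed, the dispersed dimension becomes concentrated, and the algorithm can produce a better clustering result.
In practice, if the network administrator knows that the source addresses are dispersed (either from knowledge of the network, or from a first clustering pass with this solution that gives a rough picture of the traffic), the "concentrated source address" option can be specified. In that case a coefficient γ (γ < 1) is applied when the source IP address is mapped to the plane rectangular coordinate system, so that the source addresses are mapped onto the coordinate system more compactly.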
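A minimal sketch of the "concentrated source address" option follows. It assumes that the abscissa carries the source address and that the coefficient γ is applied as a plain multiplier; the function name `scale_source_axis` and the sample points are hypothetical.

```python
def scale_source_axis(points, gamma):
    """Shrink the source-address axis (x) by gamma, leaving the destination axis (y) alone."""
    if not 0 < gamma < 1:
        raise ValueError("gamma is expected to lie strictly between 0 and 1")
    return [(x * gamma, y) for x, y in points]


if __name__ == "__main__":
    # Three flows with widely spread sources and one shared destination.
    points = [(0, 5), (100, 5), (200, 5)]
    print(scale_source_axis(points, gamma=0.25))
    # [(0.0, 5), (25.0, 5), (50.0, 5)]
```

After scaling, the gaps along the source axis shrink from 100 to 25, so a distance threshold that previously kept these flows apart may now merge them into one Cluster.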
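Different networks may be maintained in different ways; for example, a network that uses physical machines differs from one that uses virtual machines. In the network traffic processing method of this solution, different distance representations can be used to cluster traffic transmitted in different networks, so that clustering is efficient.
As an optional embodiment, when the network traffic is transmitted in a network that uses physical machines, the agglomeration distance used in the HAC method includes the Chebyshev distance. In the HAC method, clustering is based on the distance between two clusters: the two clusters closest to each other in the coordinate space are merged into a new cluster, then all clusters are traversed again to find the new closest pair, and the next merge is performed. When the network traffic is transmitted in a network that uses physical machines, the source IP addresses of many flows are usually concentrated, and so are their destination IP addresses. The network policies obtained by clustering therefore correspond to square or rectangular regions in the coordinate space where clustering takes place. For example, if the configuration policy derived from a Cluster is: from 10.100.1.0/24 to 10.100.2.0/24, the region corresponding to this Cluster in the coordinate space is a square. Given this property, using the Chebyshev distance to evaluate the distance between two Clusters gives a good clustering effect. The Chebyshev distance is:
$d = \max(|x_1 - x_2|, |y_1 - y_2|)$
where d is the distance between cluster1 and cluster2, $x_1$ and $x_2$ are the abscissas of cluster1 and cluster2, and $y_1$ and $y_2$ are the ordinates of cluster1 and cluster2.
FIG. 5 is a schematic diagram of the Chebyshev distance calculation provided according to an embodiment of the present invention. As shown in FIG. 5, when a cluster is at the origin, the region whose Chebyshev distance to that cluster is 1 is a square.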
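The following sketch shows why the Chebyshev distance suits a /24-to-/24 policy. For simplicity it uses the raw 32-bit integer value of each address as a coordinate instead of the normalized coordinates described elsewhere in this document; that simplification and the helper names `flow_point` and `chebyshev_distance` are assumptions made for the example.

```python
import ipaddress


def flow_point(src_ip, dst_ip):
    """Place a flow in the plane: x is the source address, y the destination address."""
    return (int(ipaddress.IPv4Address(src_ip)), int(ipaddress.IPv4Address(dst_ip)))


def chebyshev_distance(p, q):
    """d = max(|x1 - x2|, |y1 - y2|)."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


if __name__ == "__main__":
    corner_a = flow_point("10.100.1.0", "10.100.2.0")
    corner_b = flow_point("10.100.1.255", "10.100.2.255")  # opposite corner
    edge = flow_point("10.100.1.255", "10.100.2.0")        # adjacent corner
    print(chebyshev_distance(corner_a, corner_b))  # 255
    print(chebyshev_distance(corner_a, edge))      # 255
```

Both the opposite corner and the adjacent corner of the square lie at distance 255 from the first corner, so a single distance threshold covers the whole /24-to-/24 block evenly. A Euclidean distance would place the opposite corner about 1.41 times farther away.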
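As an optional embodiment, when the network traffic is transmitted in a cloud network that uses virtual machines (VMs), the agglomeration distance used in the HAC method is determined from a first distance, a second distance, and the respective weights of the first and second distances, where the first distance is the distance between the source VMs of the two network flows being clustered, and the second distance is the distance between the destination VMs of the two network flows being clustered. In a cloud network that uses VMs, the network environment is flat and VM IP addresses are assigned less systematically than in a traditional network environment, so defining the distance between two network flows by the Chebyshev distance does not give a good clustering effect, and clustering quality cannot be guaranteed in a flat cloud-platform network. To solve this problem, the embodiments of the present invention define the distance between two VMs using address books and joint probability. The distance between two VMs is described below by way of example according to an embodiment of the present invention.
Every east-west flow in the cloud goes from one VM to another, and the IP addresses of VMs are disorderly and hard to manage and maintain. The distance between two network flows can therefore be defined directly from the VMs and the address books they belong to. When a flow goes from VM αx to VM αy, it is denoted $F_\alpha: VM_{\alpha x} \sim VM_{\alpha y}$. Processing the traffic log yields multiple network flows $F_1$ to $F_n$. To cluster $F_1$ to $F_n$ into several Clusters with the HAC algorithm, a way of evaluating the distance from $F_\alpha$ to $F_\beta$ is still needed.
VMs on a cloud platform are generally divided by function, and each VM belongs to one or several address books.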
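Suppose there are N address books in total and VM α belongs to M of them, where M is less than or equal to N. Then define:
Figure PCTCN2020118964-appb-000001
For the source VMs of two network flows $F_\alpha$ and $F_\beta$, denoted VM αx and VM βx respectively, the conditional probability that an address book containing VM αx also contains VM βx is:
Figure PCTCN2020118964-appb-000002
Accordingly, the distance between VM αx and VM βx is defined as:
Figure PCTCN2020118964-appb-000003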
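where
Figure PCTCN2020118964-appb-000004
denotes the distance between VM αx and VM βx, and
Figure PCTCN2020118964-appb-000005
Figure PCTCN2020118964-appb-000006
are defined as above. Clearly, under this definition, the greater the probability that two VMs belong to the same address book, the smaller the distance between them, that is, the closer the two VMs are.
Using the same method, the distance between the destination VMs of the two network flows $F_\alpha$ and $F_\beta$ can be defined, denoted
Figure PCTCN2020118964-appb-000007
Figure PCTCN2020118964-appb-000008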
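The calculation above gives the distance between the source VMs and the distance between the destination VMs of two network flows. The distance between the two flows can then be defined from these two distances and their respective weights. Specifically:
Figure PCTCN2020118964-appb-000009
where
Figure PCTCN2020118964-appb-000010
is the distance from $F_\alpha$ to $F_\beta$,
Figure PCTCN2020118964-appb-000011
is the distance between the source VMs of $F_\alpha$ and $F_\beta$,
Figure PCTCN2020118964-appb-000012
is the distance between the destination VMs of $F_\alpha$ and $F_\beta$, and γ is the weight parameter for the distance between the source VMs and the distance between the destination VMs.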
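The method above characterizes the distance between two network flows transmitted in the cloud. With this distance, network traffic in a cloud network environment can be given a good initial clustering. After several rounds of clustering, multiple clusters are obtained, each containing multiple network flows, which means each cluster contains multiple source VMs and destination VMs. At that point, a multi-dimensional joint probability can be used to evaluate the distance between any two clusters, so that HAC clustering with address-book-based distances can proceed stably until the clusters not yet merged satisfy a preset termination condition. HAC clustering with this distance representation groups cloud traffic together well, makes it easy to output the clustering result to the user in the form of address books, and simplifies later maintenance and management.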
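The exact formulas for the VM distance and the flow distance appear only as figures in this document, so the sketch below is one possible reading and not the patented definition. It computes the conditional probability named in the text (the share of address books containing one VM that also contain the other). It then ASSUMES that the VM distance is one minus the average of the two conditional probabilities, and that the flow distance is γ times the source distance plus (1 - γ) times the destination distance. All function names and the sample address books are hypothetical.

```python
def conditional_membership(books, vm_a, vm_b):
    """Share of the address books containing vm_a that also contain vm_b."""
    with_a = [members for members in books.values() if vm_a in members]
    if not with_a:
        return 0.0
    return sum(vm_b in members for members in with_a) / len(with_a)


def vm_distance(books, vm_a, vm_b):
    # ASSUMPTION: distance = 1 - mean of the two conditional probabilities,
    # so VMs that always share address books are at distance 0.
    p = (conditional_membership(books, vm_a, vm_b)
         + conditional_membership(books, vm_b, vm_a)) / 2
    return 1.0 - p


def flow_distance(books, flow_a, flow_b, gamma=0.5):
    # ASSUMPTION: gamma weights the source distance and (1 - gamma) the destination distance.
    d_src = vm_distance(books, flow_a[0], flow_b[0])
    d_dst = vm_distance(books, flow_a[1], flow_b[1])
    return gamma * d_src + (1 - gamma) * d_dst


if __name__ == "__main__":
    books = {
        "web": {"vm1", "vm2"},
        "db": {"vm3", "vm4"},
        "all": {"vm1", "vm2", "vm3", "vm4"},
    }
    # Flows are (source VM, destination VM) pairs.
    print(flow_distance(books, ("vm1", "vm3"), ("vm2", "vm4")))  # 0.0
    print(flow_distance(books, ("vm1", "vm3"), ("vm3", "vm3")))  # 0.25
```

Whatever the exact formula is, the property stated in the text holds in this sketch: the more often two VMs appear in the same address books, the smaller their distance.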
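As an optional embodiment, the clustering result can be output differently depending on the type of network described above. For example, when the network traffic is transmitted in a network that uses physical machines, the clustering result is output in IP/Mask format; when the network traffic is transmitted in a cloud network that uses virtual machines, the clustering result is output in the form of address books. The two are described below.
All discrete source/destination IP addresses in a Cluster are output in two forms: IP/mask format output and address book matching output.
(1) IP/mask format output:
The source/destination IP addresses of all discrete objects in a Cluster are output as one or several address ranges in IP/mask format. For example, when the network traffic is transmitted in a network that uses physical machines, terminals access the Internet through a network built from multiple tiers of routers, so the source/destination IP addresses clustered into one Cluster are more likely to be in the same network segment. This output form is better suited to that network environment.
When outputting the clustering result, two IP/mask output modes can be offered for the user to choose from as needed: compact mode and strict mode. The difference between the two modes is explained below through the principle of the output algorithm.
Take a Cluster formed from IPv4 addresses as an example. FIG. 6 is a schematic diagram of a binary tree formed from a set of IP addresses, provided according to an embodiment of the present invention. As shown in FIG. 6, an IPv4 address has 32 bits; all IP addresses in the Cluster are built, from the most significant bit to the least significant bit, into a binary tree of depth 33 (the root node carries no meaning). A bit value of 0 goes to the left subtree, and a bit value of 1 goes to the right subtree.
Compact mode: the IP/mask address is output at the first fork point of the IP binary tree, such as point A in FIG. 6. The IP Cluster that forms the binary tree has already been clustered and therefore already has a degree of commonality. Outputting an IP/mask address at the position of point A therefore presents the network in as compact a form as possible while meeting most needs.
Strict mode: the IP/mask addresses are output at the root nodes of the maximal full binary subtrees. For point B in FIG. 6, the subtree rooted at B is a full binary tree, so all nodes of that subtree can be represented by a single IP/mask address range. When the smallest subtree consists of only a root node (point C in FIG. 6), the output mask is 32. The result output in strict mode matches the IP addresses in the Cluster exactly, no more and no fewer.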
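The two modes can be approximated with Python's standard `ipaddress` module without building the binary tree explicitly. This is a sketch under two assumptions: that the first fork point corresponds to the longest prefix shared by all addresses, and that the maximal full subtrees correspond to the minimal set of CIDR blocks covering exactly the given addresses. The function names `compact_prefix` and `strict_prefixes` are hypothetical.

```python
import ipaddress


def compact_prefix(addresses):
    """Compact mode: one network at the first fork, i.e. the longest common prefix."""
    values = [int(ipaddress.IPv4Address(a)) for a in addresses]
    differing_bits = 0
    for v in values:
        differing_bits |= v ^ values[0]
    # Bits above the highest differing bit are shared by every address.
    prefix_len = 32 - differing_bits.bit_length()
    return ipaddress.ip_network((values[0], prefix_len), strict=False)


def strict_prefixes(addresses):
    """Strict mode: the smallest set of networks covering exactly these addresses."""
    networks = (ipaddress.ip_network(a) for a in addresses)
    return list(ipaddress.collapse_addresses(networks))


if __name__ == "__main__":
    cluster = ["10.0.0.0", "10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.9"]
    print(compact_prefix(cluster))                     # 10.0.0.0/28
    print([str(n) for n in strict_prefixes(cluster)])  # ['10.0.0.0/30', '10.0.0.9/32']
```

Compact mode returns a single /28 that also covers addresses not present in the Cluster. Strict mode returns a /30 plus a /32 and covers exactly the five input addresses, which matches the trade-off described above.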
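(2) Address book matching output
Some users configure policies with address books, especially in the flat network environment of a cloud platform. Such users usually place terminals with similar functions in the same address book. In that case, using address books in the policy configuration is far more maintainable than using IP/mask address ranges.
The embodiments of the present invention can match the IP addresses in a Cluster to one or several address books and output the matched clustering result in the form of address books, for example:
from
Figure PCTCN2020118964-appb-000013
to
Figure PCTCN2020118964-appb-000014
The address book matching algorithm is implemented as follows:
S1: First check whether each IP address in the Cluster is contained in each address book. Table 1 illustrates, according to an embodiment of the present invention, which address books contain each IP address in the Cluster: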
Figure PCTCN2020118964-appb-000015
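Table 1
S2: Compute the containment rate: α = (number of Cluster IP addresses contained in the address book) / (total number of IP addresses in the address book).
S3: To guard against address books whose range is too broad, filter out address books whose α is too low, based on an empirical value or a user-specified value.
S4: Among the remaining address books, select the one that contains the most IP addresses as the output of the first round; if several address books contain the same number of IP addresses, select the one with the lower α.
S5: Remove the IP addresses of the selected address book from the Cluster, and repeat steps S1 to S5 with the remaining IP addresses until all IP addresses have been found or the remaining IP addresses are not contained in any address book.
When the user's address books are well planned, all IP addresses in the Cluster can usually be found after 1 or 2 rounds of calculation. With this algorithm, the clustering result can be presented to the user in the form of address books, together with recommended policy configurations.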
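Steps S1 to S5 amount to a greedy covering loop, sketched below. The function name `match_address_books`, the `min_rate` default of 0.5, and the sample address books are assumptions made for the example. The tie-break follows the text as written and prefers the lower α when several address books contain the same number of IP addresses.

```python
def match_address_books(cluster_ips, books, min_rate=0.5):
    """Greedily cover the cluster's IPs with address books (steps S1 to S5).

    Returns the chosen address book names and the IPs left unmatched.
    """
    remaining = set(cluster_ips)
    chosen = []
    while remaining:
        candidates = []
        for name, members in books.items():
            hits = len(remaining & members)   # S1: which cluster IPs the book contains
            if hits == 0:
                continue
            rate = hits / len(members)        # S2: containment rate alpha
            if rate < min_rate:               # S3: drop overly broad address books
                continue
            # S4: most hits first; on a tie, the lower alpha, as the text states.
            candidates.append((-hits, rate, name))
        if not candidates:
            break
        best = min(candidates)[2]
        chosen.append(best)
        remaining -= books[best]              # S5: repeat with the leftover IPs
    return chosen, remaining


if __name__ == "__main__":
    books = {
        "web": {"10.0.0.1", "10.0.0.2", "10.0.0.3"},
        "db": {"10.0.1.1", "10.0.1.2"},
        "everything": {f"10.0.{a}.{b}" for a in range(2) for b in range(50)},
    }
    cluster = {"10.0.0.1", "10.0.0.2", "10.0.1.1", "10.0.9.9"}
    print(match_address_books(cluster, books))
    # (['web', 'db'], {'10.0.9.9'})
```

In this run the broad `everything` address book is filtered out in every round because its containment rate is far below the threshold, and `10.0.9.9` is reported as unmatched because no address book contains it.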
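It should be noted that the network traffic processing solution provided by the embodiments of the present invention can be applied in many scenarios, for example traditional firewall devices, public clouds, private clouds, adaptive policy recommendation in data centers, and firewall policies in other complex network environments.
The above embodiment and optional embodiments use the following processing: extract the source/destination IP addresses from the traffic log and project the traffic log features onto a plane rectangular coordinate system; analyze the traffic log with an unsupervised clustering algorithm to provide policy configuration suggestions; use the Chebyshev distance as the evaluation of the agglomeration distance; use coordinate scaling to improve the clustering effect; use the discrete joint probability of virtual machines to compute the evaluation of the agglomeration distance; and, after matching the clustering result against the user's address books, output it in the form of address books. This provides intelligent policy suggestions and helps network administrators manage firewall policies more quickly and efficiently.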
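Embodiment 2
According to an embodiment of the present invention, an apparatus for implementing the above network traffic processing method is also provided. FIG. 7 is a structural block diagram of a network traffic processing apparatus provided according to an embodiment of the present invention. As shown in FIG. 7, the apparatus includes an acquisition module 72, a clustering module 74, and an output module 76, which are described below.
The acquisition module 72 is configured to acquire network traffic and treat the acquired network traffic as discrete objects. The clustering module 74 is connected to the acquisition module 72 and is configured to cluster the discrete objects to obtain a clustering result. The output module 76 is connected to the clustering module 74 and is configured to output the clustering result.
According to an optional embodiment of the present invention, a network policy clustering apparatus based on the above network traffic processing method is provided. Using the information in the traffic log and an unsupervised clustering algorithm, the apparatus clusters policies and outputs policy suggestions that are more compact and easier to act on, helping network administrators configure policies more quickly. FIG. 8 is a schematic structural diagram of a network policy clustering apparatus provided according to an embodiment of the present invention. As shown in FIG. 8, the network policy clustering apparatus includes a log information extraction module 82, a feature information mapping module 84, a clustering module 86, and a policy output module 88. The log information extraction module 82 and the feature information mapping module 84 implement the functions of the acquisition module 72, the clustering module 86 implements the functions of the clustering module 74, and the policy output module 88 corresponds to the output module 76. They are described below.
The log information extraction module 82 is configured to extract the source/destination IP addresses and source/destination port numbers of each flow from the traffic log and to deduplicate them. Processing yields multiple discrete, deduplicated objects to be clustered, each containing parameters in 4 dimensions.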
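The extraction and deduplication step can be sketched as follows. The traffic log format is not specified in this document, so the record layout (dictionaries with `src_ip`, `dst_ip`, `src_port`, and `dst_port` keys) and the function name `extract_objects` are hypothetical.

```python
def extract_objects(log_records):
    """Reduce raw log records to unique 4-tuples, preserving first-seen order."""
    keys = (
        (rec["src_ip"], rec["dst_ip"], rec["src_port"], rec["dst_port"])
        for rec in log_records
    )
    # dict.fromkeys drops repeats while keeping insertion order.
    return list(dict.fromkeys(keys))


if __name__ == "__main__":
    log = [
        {"src_ip": "10.100.1.5", "dst_ip": "10.100.2.9", "src_port": 51000, "dst_port": 80, "bytes": 1200},
        {"src_ip": "10.100.1.5", "dst_ip": "10.100.2.9", "src_port": 51000, "dst_port": 80, "bytes": 300},
        {"src_ip": "10.100.1.6", "dst_ip": "10.100.2.9", "src_port": 51001, "dst_port": 443, "bytes": 50},
    ]
    print(extract_objects(log))
```

Fields other than the four parameters, such as the byte count in this sample, are ignored, so two log lines for the same flow collapse into one object.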
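The feature information mapping module 84 is connected to the log information extraction module 82 and is configured to map each discrete object to a point in a plane rectangular coordinate system, using its source and destination IP addresses as the horizontal and vertical coordinates.
Taking the source IP address as an example, the minimum and maximum IP addresses over all objects are obtained first, and the IP address of each object is then normalized. The normalized value is mapped to an unsigned 32-bit number, which serves as the abscissa of the object in the plane rectangular coordinate system. The destination IP address is mapped to the ordinate of the plane rectangular coordinate system in the same way.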
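One plausible reading of this normalization is min-max scaling onto the full unsigned 32-bit range, sketched below for a single axis. The text does not give the exact formula, so the scaling rule, the integer arithmetic, and the function name `normalize_ips` are assumptions.

```python
import ipaddress

U32_MAX = 2**32 - 1


def normalize_ips(ips):
    """Min-max scale IPv4 addresses onto [0, 2**32 - 1] using integer arithmetic."""
    values = [int(ipaddress.IPv4Address(ip)) for ip in ips]
    lo, hi = min(values), max(values)
    if hi == lo:
        # All addresses are identical; place them at the origin of this axis.
        return [0 for _ in values]
    return [(v - lo) * U32_MAX // (hi - lo) for v in values]


if __name__ == "__main__":
    sources = ["10.0.0.0", "10.0.0.128", "10.0.1.0"]
    print(normalize_ips(sources))
    # [0, 2147483647, 4294967295]
```

The same function would be applied separately to the destination addresses to obtain the ordinates.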
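The clustering module 86 is connected to the feature information mapping module 84 and includes a clustering execution unit and a clustering mode control unit. The clustering execution unit is configured to cluster the mapped discrete objects using a hierarchical agglomerative clustering (HAC) algorithm. The clustering mode control unit is configured to receive user parameters and control the clustering execution unit so as to obtain a better clustering result.
The policy output module 88 is connected to the clustering module 86 and is configured to output the clustering result. For example, all discrete source/destination IP addresses in a Cluster can be output in two forms: IP/mask format output and address book matching output.
Embodiments of the present invention may provide a computer terminal, which may be any computer terminal device in a group of computer terminals. Optionally, in this embodiment, the computer terminal may also be replaced by a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computer terminal may be located in at least one of multiple network devices of a computer network.
Optionally, the computer terminal may include one or more processors (only one is shown in the figure), a memory, and the like.
The memory stores a computer program. For example, it can store software programs and modules, such as the program instructions/modules corresponding to the network traffic processing method and apparatus in the embodiments of the present invention. By running the software programs and modules stored in the memory, the processor performs various functional applications and data processing, that is, implements the network traffic processing method described above. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, and such remote memory may be connected to the computer terminal through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The processor may call the information and application programs stored in the memory through a transmission device to execute the computer program stored in the memory; when the computer program runs, it causes the processor to execute the network traffic processing method described in any one of the above.
Those of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments can be completed by a program instructing hardware related to a terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include a flash drive, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, or the like.
Embodiments of the present invention also provide a storage medium. Optionally, in this embodiment, the storage medium may be used to store the program code executed by the network traffic processing method provided in Embodiment 1; when the program runs, the device where the storage medium is located is controlled to execute the network traffic processing method described in any one of the above. The serial numbers of the embodiments of the present invention are for description only and do not indicate that one embodiment is better or worse than another.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the relevant descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The apparatus embodiments described above are only illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between units or modules through some interfaces, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated units may be implemented in the form of hardware or in the form of software functional units.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make further improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.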

Claims (15)

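  1. A network traffic processing method, characterized by comprising:
    acquiring network traffic and treating the acquired network traffic as discrete objects;
    clustering the discrete objects to obtain a clustering result;
    outputting the clustering result.
  2. The method according to claim 1, characterized in that clustering the discrete objects to obtain the clustering result comprises:
    clustering the discrete objects using a hierarchical agglomerative clustering (HAC) method to obtain the clustering result.
  3. The method according to claim 2, characterized in that clustering the discrete objects using the HAC method to obtain the clustering result comprises:
    clustering the discrete objects using the HAC method according to the source IP addresses and destination IP addresses of the network traffic serving as the discrete objects, to obtain the clustering result.
  4. The method according to claim 3, characterized in that, when the network traffic is transmitted in a network that uses physical machines, the agglomeration distance used in the HAC method comprises the Chebyshev distance.
  5. The method according to claim 3, characterized in that, when the network traffic is transmitted in a cloud network that uses virtual machines, the agglomeration distance used in the HAC method is determined from a first distance, a second distance, and the respective weights of the first and second distances, wherein the first distance is the distance between the source virtual machines of the two network flows being clustered, and the second distance is the distance between the destination virtual machines of the two network flows being clustered.
  6. The method according to claim 3, characterized in that outputting the clustering result comprises:
    when the network traffic is transmitted in a network that uses physical machines, outputting the clustering result in IP/Mask format;
    when the network traffic is transmitted in a cloud network that uses virtual machines, outputting the clustering result in the form of address books.
  7. The method according to claim 1, characterized in that clustering the discrete objects to obtain the clustering result comprises:
    determining a plurality of agglomeration-distance ladder values;
    clustering the discrete objects according to the plurality of agglomeration-distance ladder values to obtain a plurality of clustering results respectively corresponding to the plurality of agglomeration-distance ladder values.
  8. The method according to claim 1, characterized in that clustering the discrete objects to obtain the clustering result comprises:
    obtaining a clustering control parameter;
    filtering the discrete objects according to the clustering control parameter to obtain a filtering result;
    clustering the filtering result to obtain the clustering result.
  9. The method according to claim 8, characterized in that the clustering control parameter comprises a port of the network traffic.
  10. The method according to any one of claims 1 to 9, characterized in that treating the acquired network traffic as discrete objects comprises:
    extracting feature information of the network traffic;
    preprocessing the feature information to obtain a preprocessing result;
    mapping the preprocessing result to a point in a plane rectangular coordinate system and using that point as the discrete object.
  11. The method according to claim 10, characterized in that mapping the preprocessing result to a point in the plane rectangular coordinate system comprises:
    determining whether the degree of dispersion of the feature information used for clustering reaches a predetermined threshold;
    when the degree of dispersion of the feature information used for clustering reaches the predetermined threshold, adjusting the coordinates of the preprocessing result to obtain a coordinate-adjusted preprocessing result;
    mapping the coordinate-adjusted preprocessing result to a point in the plane rectangular coordinate system.
  12. The method according to claim 11, characterized in that the method is applied to traffic control of a firewall.
  13. A network traffic processing apparatus, characterized by comprising:
    an acquisition module, configured to acquire network traffic and treat the acquired network traffic as discrete objects;
    a clustering module, configured to cluster the discrete objects to obtain a clustering result;
    an output module, configured to output the clustering result.
  14. A computer-readable storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, the device where the storage medium is located is controlled to execute the network traffic processing method according to any one of claims 1 to 12.
  15. A computer device, characterized by comprising a memory and a processor,
    the memory storing a computer program;
    the processor being configured to execute the computer program stored in the memory, wherein the computer program, when run, causes the processor to execute the network traffic processing method according to any one of claims 1 to 12.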
PCT/CN2020/118964 2020-09-29 2020-09-29 Network traffic processing method and apparatus, storage medium, and computer device WO2022067539A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202080002208.0A CN112352412B (zh) 2020-09-29 2020-09-29 Network traffic processing method and apparatus, storage medium, and computer device
US17/043,714 US11874901B2 (en) 2020-09-29 2020-09-29 Method, device for processing network flow, storage medium and computer device
PCT/CN2020/118964 WO2022067539A1 (zh) 2020-09-29 2020-09-29 Network traffic processing method and apparatus, storage medium, and computer device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/118964 WO2022067539A1 (zh) 2020-09-29 2020-09-29 Network traffic processing method and apparatus, storage medium, and computer device

Publications (1)

Publication Number Publication Date
WO2022067539A1 true WO2022067539A1 (zh) 2022-04-07

Family

ID=74427562

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118964 WO2022067539A1 (zh) 2020-09-29 2020-09-29 Network traffic processing method and apparatus, storage medium, and computer device

Country Status (3)

Country Link
US (1) US11874901B2 (zh)
CN (1) CN112352412B (zh)
WO (1) WO2022067539A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113507447B (zh) * 2021-06-17 2022-09-13 北京邮电大学 Adaptive enhancement method and apparatus for network traffic data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107733937A (zh) * 2017-12-01 2018-02-23 广东奥飞数据科技股份有限公司 Abnormal network traffic detection method
CN108462675A (zh) * 2017-02-20 2018-08-28 沪江教育科技(上海)股份有限公司 Network access identification method and system
WO2019191666A1 (en) * 2018-03-29 2019-10-03 The United States Of America, As Represented By The Secretary, Department Of Health And Human Services Biomarker analysis for high-throughput diagnostic multiplex data
CN110650058A (zh) * 2019-10-08 2020-01-03 河南省云安大数据安全防护产业技术研究院有限公司 Network traffic analysis method, apparatus, storage medium, and device
CN111224990A (zh) * 2020-01-09 2020-06-02 武汉思普崚技术有限公司 Traffic steering method and system for a distributed micro-segmentation network

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7734629B2 (en) * 2006-04-29 2010-06-08 Yahoo! Inc. System and method using hierachical clustering for evolutionary clustering of sequential data sets
US8930365B2 (en) * 2006-04-29 2015-01-06 Yahoo! Inc. System and method for evolutionary clustering of sequential data sets
US10796243B2 (en) * 2014-04-28 2020-10-06 Hewlett Packard Enterprise Development Lp Network flow classification
US11057264B1 (en) * 2015-01-15 2021-07-06 Veritas Technologies Llc Discovery and configuration of disaster recovery information
KR101631242B1 (ko) * 2015-01-27 2016-06-16 한국전자통신연구원 Method and apparatus for automated identification of malicious traffic signatures using latent Dirichlet allocation
US20180131624A1 (en) * 2016-11-10 2018-05-10 Qualcomm Incorporated Managing Network Traffic
US10891148B2 (en) * 2018-08-15 2021-01-12 Vmware, Inc. Methods and systems for identifying application components in distributed computing facilities
CN109547349B (zh) * 2018-12-06 2021-07-06 郑州云海信息技术有限公司 Virtual-routing-based traffic management method, apparatus, terminal, and storage medium
US11436074B2 (en) * 2019-04-17 2022-09-06 Microsoft Technology Licensing, Llc Pruning and prioritizing event data for analysis
EP3893132A1 (en) * 2020-04-07 2021-10-13 Tata Consultancy Services Limited Method and system for hierarchical time-series clustering with auto encoded compact sequence (aecs)
US11550691B2 (en) * 2021-06-08 2023-01-10 Servicenow, Inc. Computing resources schedule recommendation

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
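CN115118466A (zh) * 2022-06-14 2022-09-27 深信服科技股份有限公司 Policy generation method, apparatus, electronic device, and storage medium
CN115118466B (zh) * 2022-06-14 2024-04-12 深信服科技股份有限公司 Policy generation method, apparatus, electronic device, and storage medium
CN115665286A (zh) * 2022-12-26 2023-01-31 深圳红途科技有限公司 Interface clustering method, apparatus, computer device, and storage medium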

Also Published As

Publication number Publication date
CN112352412B (zh) 2023-06-09
CN112352412A (zh) 2021-02-09
US20230252108A1 (en) 2023-08-10
US11874901B2 (en) 2024-01-16

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20955564

Country of ref document: EP

Kind code of ref document: A1