CN114726570A

CN114726570A - Host traffic anomaly detection method and device based on graph model

Info

Publication number: CN114726570A
Application number: CN202111667761.9A
Authority: CN
Inventors: 戴诗嘉; 陈晨
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2021-12-31
Filing date: 2021-12-31
Publication date: 2022-07-08
Anticipated expiration: 2041-12-31
Also published as: CN114726570B

Abstract

The application provides a host traffic anomaly detection method and device based on a graph model. The method generates a first traffic graph model from collected traffic records related to a target host, and reconstructs the first traffic graph model by establishing connection relationships between external hosts according to the similarity of the session features between each external host and the target host. External hosts with similar session behavior patterns in the reconstructed second traffic graph model are then clustered into groups using an improved clique percolation algorithm, and abnormal external hosts that are isolated from the normal external host groups are identified. The scheme is effective at discovering abnormal external hosts in traffic and can improve the accuracy of traffic anomaly detection.

Description

Host traffic anomaly detection method and device based on graph model

Technical Field

The application relates to the technical field of information security, and in particular to a host traffic anomaly detection method and device based on a graph model.

Background

With the continued growth of the internet, network applications provide users with high-quality and convenient services, while network security attacks of varying size, scale and type remain hidden in the network. How to accurately and rapidly identify and locate malicious attack behaviors that threaten network security has therefore become an urgent problem.

A server generates many traffic records during operation, and as data traffic increases, efficiently and accurately detecting the abnormal traffic it contains becomes more important. Traffic anomaly detection is an effective strategy for identifying network attacks; it aims to find connections and nodes in traffic that differ significantly from normal patterns. However, because anomaly detection in complex networks involves large, structurally complex and diverse data, prior-art traffic anomaly detection for complex networks faces the following challenges: 1) nodes and edges in the network carry rich attribute information, and this diverse attribute content is not well utilized; 2) traffic records lack connection relationships between external nodes, so the behavior characteristics of external abnormal traffic are difficult to analyze; 3) labels are lacking: supervised learning is usually more accurate than unsupervised learning, but labeled data for training a model is very difficult to obtain, and manual labeling is time-consuming and labor-intensive.

Disclosure of Invention

The application provides a host traffic anomaly detection method and device based on a graph model, which are used for improving the accuracy of host traffic anomaly detection.

In a first aspect, an embodiment of the present application provides a host traffic anomaly detection method based on a graph model, where the method is executable by an apparatus for performing traffic anomaly detection on a target host. The method comprises the following steps: generating a first traffic graph model according to collected traffic records related to a target host within a set time period, wherein the first traffic graph model comprises information of the target host, information of a plurality of external hosts and a first type connection relationship between the external hosts and the target host, and the first type connection relationship is used for indicating session characteristics between the external hosts and the target host; reconstructing the first traffic graph model according to similarity of session features between different external hosts and the target host in the first traffic graph model to generate a second traffic graph model, wherein the second traffic graph model comprises a second type of connection relationship between the external hosts, and the second type of connection relationship is used for indicating that the session behavior patterns of the two external hosts are similar; clustering and grouping the plurality of external nodes according to the similarity of the session behavior patterns of different external nodes in the second traffic graph model to obtain one or more external host groups, wherein each external host group comprises a plurality of external hosts; determining the external host not belonging to any of the external host groups as an abnormal external host communicating with the target host.

According to this technical scheme, connection relationships between the external hosts can be established by using the similarity of the session features between the external hosts and the target host in the first traffic graph model, thereby reconstructing the first traffic graph model. External hosts with similar session behavior patterns in the reconstructed second traffic graph model are then clustered into groups using an improved clique percolation algorithm, and abnormal external hosts that are isolated from the normal external host groups are identified. The scheme is effective at discovering abnormal external hosts in traffic and can improve the accuracy of traffic anomaly detection.

In one possible design, each of the traffic records includes quintuple information of a session associated with the traffic record, the quintuple information includes a source IP address, a source port number, a destination IP address, a destination port number, and a connection time, and the source IP address or the destination IP address in each of the traffic records is an IP address of the target host; the generating of the first traffic graph model according to the collected traffic records related to the target host within the set time period comprises the following steps: aggregating the traffic records that have the same four-tuple information among the traffic records related to the target host, and determining the total connection count and the connection time series corresponding to each piece of four-tuple information, wherein the four-tuple information comprises the source IP address, the source port number, the destination IP address and the destination port number; and generating the first traffic graph model according to the total connection count and the connection time series corresponding to each piece of four-tuple information.
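The aggregation step above can be illustrated with a minimal sketch. The `FlowRecord` type, the `aggregate_flows` function and the use of integer timestamps are assumptions made for illustration only; the application does not specify a data layout.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowRecord(NamedTuple):
    """Hypothetical record holding the quintuple information of one session."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    conn_time: int  # connection timestamp; the unit is an assumption


def aggregate_flows(records):
    """Group records by four-tuple (src_ip, src_port, dst_ip, dst_port).

    Returns {four_tuple: {"total": int, "series": [(timestamp, num), ...]}},
    where "series" is sorted by timestamp and num is the number of
    connections observed at that timestamp.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for r in records:
        key = (r.src_ip, r.src_port, r.dst_ip, r.dst_port)
        counts[key][r.conn_time] += 1
    return {
        key: {"total": sum(per_ts.values()), "series": sorted(per_ts.items())}
        for key, per_ts in counts.items()
    }
```

Each resulting entry corresponds to one candidate edge between an external host and the target host, carrying the total connection count and the connection time series as edge attributes.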

In one possible design, the information of the target host includes an IP address of the target host, and the information of the external host includes an IP address of the external host; the first type of connection relationship includes the total connection times and the connection time series corresponding to a set of the quadruple information between the external host and the target host.

In one possible design, the reconstructing the first traffic graph model according to the similarity of the session features between the different external hosts and the target host in the first traffic graph model to generate a second traffic graph model includes: determining session similarity distances between the plurality of external hosts according to the first type connection relationship between each external host and the target host, wherein the session similarity distances are used for indicating the similarity degree of the session behavior patterns of the two external hosts; and if the session similarity distance between the two external hosts is greater than the product of the average session similarity distance and a preset threshold, establishing the second type connection relationship between the two external hosts.
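The edge rule above can be sketched as follows. This is a minimal illustration, assuming the pairwise session similarity distances have already been computed and are passed in as a dictionary; the function name `build_similarity_edges` and the choice of averaging over all supplied pairs are assumptions. The comparison direction (greater than the cutoff) follows the text as stated.

```python
def build_similarity_edges(distances, threshold):
    """Select host pairs that receive a second-type connection.

    distances: {(host_a, host_b): session_similarity_distance}
    threshold: the preset threshold multiplied with the average distance.
    Returns a set of frozenset({host_a, host_b}) pairs.
    """
    if not distances:
        return set()
    # Average session similarity distance over all supplied pairs (assumption).
    mean = sum(distances.values()) / len(distances)
    cutoff = mean * threshold
    return {frozenset(pair) for pair, d in distances.items() if d > cutoff}
```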

In one possible design, one or more of the first type connection relationships exist between each of the external hosts and the target host; the determining a session similarity distance between two external hosts includes: determining session feature similarity of a first type mapping relationship pair composed of a first type mapping relationship between a first external host and the target host and a first type mapping relationship between a second external host and the target host, wherein each first type mapping relationship pair satisfies a set condition; and determining the session similarity distance between the first external host and the second external host according to the session feature similarity of all the first type mapping relation pairs meeting the set condition between the first external host and the second external host.

In one possible design, the first type of mapping relationship pair that satisfies the set condition is: the source IP addresses of the quadruple information corresponding to the two first-type mapping relationships included in the first-type mapping relationship pair are the same and the source port numbers are the same, or the destination IP addresses are the same and the destination port numbers are the same.
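The set condition above amounts to a simple predicate on two four-tuples. The sketch below assumes each four-tuple is ordered as (source IP, source port, destination IP, destination port); the function names are hypothetical.

```python
def satisfies_set_condition(t1, t2):
    """Return True if two four-tuples may be compared.

    Each tuple is (src_ip, src_port, dst_ip, dst_port). The pair qualifies
    when source IP and source port both match, or when destination IP and
    destination port both match.
    """
    same_source = t1[0] == t2[0] and t1[1] == t2[1]
    same_destination = t1[2] == t2[2] and t1[3] == t2[3]
    return same_source or same_destination


def comparable_pairs(tuples_a, tuples_b):
    """List every pair (a, b) across two hosts that meets the set condition."""
    return [(a, b) for a in tuples_a for b in tuples_b
            if satisfies_set_condition(a, b)]
```

In practice this restricts comparison to sessions that touch the same service endpoint on the target host, for example two external hosts that both connect to the same port of the target.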

In one possible design, the session feature similarity of the first type mapping relationship pair is: the similarity between the connection time series in the quadruple information respectively corresponding to the two first-type mapping relationships included in the first-type mapping relationship pair.

In one possible design, the first type mapping relationship pair includes an ith first type mapping relationship between the first external host and the target host and a jth first type mapping relationship between the second external host and the target host, a connection time sequence in quadruple information corresponding to the ith first type mapping relationship is a first connection time sequence, a connection time sequence in quadruple information corresponding to the jth first type mapping relationship is a second connection time sequence, and i and j are integers; each connection time sequence comprises one or more time stamps and connection times num corresponding to each time stamp; the determining the session feature similarity of the first type mapping relationship pair includes: if the length of the first connection time sequence is greater than that of the second connection time sequence and is less than or equal to twice of the length of the second connection time sequence, performing length matching on the first connection time sequence and the second connection time sequence; determining similarity between the first connection time series and the second connection time series after the length matching by the following formula:

wherein x is the first connection time series after length matching, y is the second connection time series after length matching, and similarity(x, y) is the similarity between the first connection time series and the second connection time series; the remaining quantities in the formula are the timestamps in the length-matched first and second connection time series, the connection counts corresponding to those timestamps, and the number of entries in the length-matched connection time series.

In one possible design, the determining the session feature similarity of the first type mapping relationship pair further includes: if the length of the first connection time series is greater than twice the length of the second connection time series, determining that the similarity between the first connection time series and the second connection time series is 0.
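The length rule in the two designs above can be sketched as a gate around the similarity computation. The length-matching procedure and the similarity formula are not reproduced in this text, so the sketch takes them as caller-supplied functions (`match_lengths` and `similarity`, both hypothetical). Treating the longer series as the "first" series and comparing equal-length series directly are also assumptions.

```python
def gated_similarity(series_a, series_b, match_lengths, similarity):
    """Apply the length rule before computing series similarity.

    series_a, series_b: lists of (timestamp, num) entries.
    match_lengths(first, second) -> (first, second): hypothetical callable.
    similarity(first, second) -> float: hypothetical callable.
    """
    # Assumption: the longer series plays the role of the "first" series.
    if len(series_a) >= len(series_b):
        first, second = series_a, series_b
    else:
        first, second = series_b, series_a
    # Guard for empty input, plus the rule: more than twice as long -> 0.
    if not second or len(first) > 2 * len(second):
        return 0.0
    # Longer, but at most twice as long: length-match before comparing.
    if len(first) > len(second):
        first, second = match_lengths(first, second)
    return similarity(first, second)
```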

In one possible design, the determining a session similarity distance between the first external host and the second external host according to the session feature similarity of all pairs of the first type mapping relations between the first external host and the second external host that satisfy the set condition includes: and determining the session similarity distance between the first external host and the second external host according to the session feature similarity and the corresponding weight of all the first type mapping relation pairs meeting the set condition between the first external host and the second external host.

In one possible design, the session similarity distance between the first external host and the second external host is determined by the following equation:

wherein v1 is the first external host, v2 is the second external host, distance(v1, v2) is the session similarity distance between the first external host and the second external host, n is the total number of session feature similarities between the first external host and the second external host, dis_i is the i-th session feature similarity between the first external host and the second external host, and weight_i is the weight corresponding to the i-th session feature similarity.

In one possible design, the weight corresponding to the session feature similarity of a first-type mapping relationship pair that satisfies the set condition is: the sum of the lengths of the connection time series in the quadruple information respectively corresponding to the two first-type mapping relationships included in the first-type mapping relationship pair.
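The weighting described above can be illustrated as follows. The exact equation is not reproduced in this text, so the sketch assumes a weighted average of the session feature similarities, normalized by the sum of the weights; that normalization is an assumption, not a statement of the claimed formula. The weight of each pair is the sum of the lengths of its two connection time series, as stated above.

```python
def session_similarity_distance(pair_similarities):
    """Combine per-pair similarities into one host-to-host distance.

    pair_similarities: list of (dis_i, len_first_i, len_second_i), where
    dis_i is the i-th session feature similarity and the two lengths are
    the lengths of the connection time series in that pair.
    Assumption: result = sum(dis_i * weight_i) / sum(weight_i).
    """
    weighted_sum = 0.0
    total_weight = 0
    for dis, len_first, len_second in pair_similarities:
        weight = len_first + len_second  # weight_i as described in the text
        weighted_sum += dis * weight
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0
```

Under this assumption, pairs backed by longer connection time series contribute more to the final distance than pairs with only a few observed connections.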

In one possible design, clustering and grouping the plurality of external nodes according to similarity of session behavior patterns of different external nodes in the second traffic graph model to obtain one or more external host groups includes: determining N cliques in the second traffic graph model according to an adjacency list of the second traffic graph model, wherein each clique comprises at most k external hosts, k is an integer greater than 3, and N is a positive integer; establishing, according to the N cliques, a clique relation matrix corresponding to the second traffic graph model, wherein the clique relation matrix is used for indicating the number of the external hosts included in each clique and the adjacency relation among different cliques; and traversing the clique relation matrix in a breadth-first manner, and merging the connected cliques to obtain the one or more external host groups.

In one possible design, the clique relation matrix has N rows and N columns, where each of the N rows corresponds to one of the N cliques and each of the N columns corresponds to one of the N cliques; each off-diagonal element in the clique relation matrix is equal to the number of external hosts shared by the clique corresponding to the row of that element and the clique corresponding to the column of that element, and each diagonal element is equal to the number of external hosts included in the clique corresponding to that diagonal element.
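The N-by-N relation matrix and the breadth-first merge described above can be sketched as follows. Each clique, that is, each fully connected subset of external hosts found in the second traffic graph model, is represented as a Python set of host identifiers and is taken as input; how the cliques are enumerated is outside this sketch. The `min_shared` parameter, which decides when two cliques count as connected, is an assumption (the standard clique percolation method uses k-1 shared nodes).

```python
from collections import deque


def clique_relation_matrix(cliques):
    """Build the N x N matrix: diagonal = clique size, off-diagonal = shared hosts."""
    n = len(cliques)
    return [[len(cliques[i] & cliques[j]) for j in range(n)] for i in range(n)]


def merge_cliques(cliques, min_shared=1):
    """Merge connected cliques into external host groups via breadth-first traversal."""
    matrix = clique_relation_matrix(cliques)
    seen = set()
    groups = []
    for start in range(len(cliques)):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        group = set()
        while queue:
            i = queue.popleft()
            group |= cliques[i]
            for j, shared in enumerate(matrix[i]):
                # Two cliques are treated as connected when they share
                # at least min_shared hosts (assumed criterion).
                if j != i and j not in seen and shared >= min_shared:
                    seen.add(j)
                    queue.append(j)
        groups.append(group)
    return groups
```

Hosts that appear in none of the returned groups are the candidates for the abnormal external hosts discussed next.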

In one possible design, the determining the external host that will not belong to any of the external host groups as an abnormal external host in communication with the target host includes: determining a set of said external hosts that do not belong to any of said external host groups; and filtering the external hosts in the set according to a preset IP address white list, and determining the external hosts of which the IP addresses in the set do not belong to the IP address white list as the abnormal external hosts.
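The final filtering step can be sketched with plain set operations. The sketch assumes the whitelist is a collection of exact IP address strings; support for address ranges, and the function name `find_abnormal_hosts`, are not taken from the application.

```python
def find_abnormal_hosts(all_external_hosts, groups, whitelist):
    """Return ungrouped external hosts that are not on the IP whitelist.

    all_external_hosts: iterable of external host IP strings.
    groups: list of sets, each one an external host group.
    whitelist: iterable of exact IP strings (assumed format).
    """
    grouped = set().union(*groups)
    ungrouped = set(all_external_hosts) - grouped
    return ungrouped - set(whitelist)
```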

In a second aspect, the present application provides a graph model-based host traffic anomaly detection apparatus, which may include modules/units for performing any one of the possible design methods of the first aspect. These modules/units may be implemented by hardware, or by hardware executing corresponding software.

Illustratively, the apparatus may include a communication module and a processing module; wherein:

the communication module is used for acquiring traffic records related to the target host within a set time period;

the processing module is used for generating a first traffic graph model according to the collected traffic records related to the target host within the set time period, wherein the first traffic graph model comprises information of the target host, information of a plurality of external hosts and a first type connection relationship between the external hosts and the target host, and the first type connection relationship is used for indicating session characteristics between the external hosts and the target host; reconstructing the first traffic graph model according to similarity of session features between different external hosts and the target host in the first traffic graph model to generate a second traffic graph model, wherein the second traffic graph model comprises a second type of connection relationship between the external hosts, and the second type of connection relationship is used for indicating that the session behavior patterns of the two external hosts are similar; clustering and grouping the plurality of external nodes according to the similarity of the session behavior patterns of different external nodes in the second traffic graph model to obtain one or more external host groups, wherein each external host group comprises a plurality of external hosts; determining the external host not belonging to any of the external host groups as an abnormal external host communicating with the target host.

In one possible design, each of the traffic records includes quintuple information of a session associated with the traffic record, the quintuple information includes a source IP address, a source port number, a destination IP address, a destination port number, and connection time, and the source IP address or the destination IP address in each of the traffic records is an IP address of the target host; the processing module is specifically configured to: aggregating the traffic records with the same four-tuple information in the traffic records related to the target host, and determining the total connection times and connection time sequence corresponding to each four-tuple information, wherein the four-tuple information comprises the source IP address, the source port number, the destination IP address and the destination port number; and generating the first traffic graph model according to the total connection times and the connection time sequence corresponding to the four-tuple information.

In one possible design, the information of the target host includes an IP address of the target host, and the information of the external host includes an IP address of the external host; the first type connection relationship includes the total connection times and the connection time sequence corresponding to a set of the quadruple information between the external host and the target host.

In one possible design, the processing module is specifically configured to: determining session similarity distances between the plurality of external hosts according to the first type connection relationship between each external host and the target host, wherein the session similarity distances are used for indicating the similarity degree of the session behavior patterns of the two external hosts; and if the session similarity distance between the two external hosts is greater than the product of the average session similarity distance and a preset threshold, establishing the second type connection relationship between the two external hosts.

In one possible design, one or more of the first type connection relationships exist between each of the external hosts and the target host; the processing module is specifically configured to: determining session feature similarity of a first type mapping relationship pair composed of a first type mapping relationship between a first external host and the target host and a first type mapping relationship between a second external host and the target host, wherein each first type mapping relationship pair satisfies a set condition; and determining the session similarity distance between the first external host and the second external host according to the session feature similarity of all the first type mapping relation pairs meeting the set condition between the first external host and the second external host.

In one possible design, the first type of mapping relationship pair that satisfies the set condition is: the source IP addresses and the source port numbers of the quadruple information corresponding to the two first-type mapping relationships included in the first-type mapping relationship pair are the same, or the destination IP addresses and the destination port numbers are the same.

In one possible design, the session feature similarity of the first type mapping relationship pair is: the similarity between the connection time series in the quadruple information respectively corresponding to the two first-type mapping relationships included in the first-type mapping relationship pair.

In one possible design, the first type mapping relationship pair includes an ith first type mapping relationship between the first external host and the target host and a jth first type mapping relationship between the second external host and the target host, a connection time sequence in quadruple information corresponding to the ith first type mapping relationship is a first connection time sequence, a connection time sequence in quadruple information corresponding to the jth first type mapping relationship is a second connection time sequence, and i and j are integers; each connection time sequence comprises one or more time stamps and connection times num corresponding to each time stamp; the processing module is specifically configured to: if the length of the first connection time sequence is greater than that of the second connection time sequence and is less than or equal to twice of the length of the second connection time sequence, performing length matching on the first connection time sequence and the second connection time sequence; determining similarity between the first connection time series and the second connection time series after the length matching by the following formula:

In one possible design, the processing module is specifically configured to: if the length of the first connection time series is greater than twice the length of the second connection time series, determining that the similarity between the first connection time series and the second connection time series is 0.

In one possible design, the processing module is specifically configured to: and determining the session similarity distance between the first external host and the second external host according to the session feature similarity and the corresponding weight of all the first type mapping relation pairs meeting the set condition between the first external host and the second external host.

In one possible design, the processing module is specifically configured to: determining a session similarity distance between the first external host and the second external host by:

In one possible design, the weight corresponding to the session feature similarity of a first-type mapping relationship pair that satisfies the set condition is: the sum of the lengths of the connection time series in the quadruple information respectively corresponding to the two first-type mapping relationships included in the first-type mapping relationship pair.

In one possible design, the processing module is specifically configured to: determining N cliques in the second traffic graph model according to an adjacency list of the second traffic graph model, wherein each clique comprises at most k external hosts, k is an integer greater than 3, and N is a positive integer; establishing, according to the N cliques, a clique relation matrix corresponding to the second traffic graph model, wherein the clique relation matrix is used for indicating the number of the external hosts included in each clique and the adjacency relation among different cliques; and traversing the clique relation matrix in a breadth-first manner, and merging the connected cliques to obtain the one or more external host groups.

In one possible design, the clique relation matrix has N rows and N columns, where each of the N rows corresponds to one of the N cliques and each of the N columns corresponds to one of the N cliques; each off-diagonal element in the clique relation matrix is equal to the number of external hosts shared by the clique corresponding to the row of that element and the clique corresponding to the column of that element, and each diagonal element is equal to the number of external hosts included in the clique corresponding to that diagonal element.

In one possible design, the processing module is specifically configured to: determining a set of said external hosts that do not belong to any of said external host groups; and filtering the external hosts in the set according to a preset IP address white list, and determining the external hosts of which the IP addresses in the set do not belong to the IP address white list as the abnormal external hosts.

In a third aspect, an embodiment of the present application further provides a computer device, including:

a memory for storing program instructions;

a processor for calling the program instructions stored in the memory and for executing the method as described in the various possible designs of the first aspect in accordance with the obtained program instructions.

In a fourth aspect, the present application further provides a computer-readable storage medium, in which computer-readable instructions are stored, and when the computer-readable instructions are read and executed by a computer, the method described in any one of the possible designs of the first aspect is implemented.

In a fifth aspect, this application further provides a computer program product including computer readable instructions that, when executed by a processor, cause the method described in any one of the possible designs of the first aspect to be implemented.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Evidently, the drawings described below illustrate only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic flowchart of a host traffic anomaly detection method based on a graph model according to an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a first traffic graph model according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a second traffic graph model according to an embodiment of the present application;

Fig. 4 is a schematic format diagram of a vertex file in a traffic graph model according to an embodiment of the present application;

Fig. 5 is a schematic format diagram of an edge file in a traffic graph model according to an embodiment of the present application;

Fig. 6 is a schematic flowchart of calculating a session similarity distance between two external hosts according to an embodiment of the present application;

Fig. 7 is a schematic flowchart of calculating the session feature similarity of two connection time series according to an embodiment of the present application;

Fig. 8 is a schematic general flowchart of host traffic anomaly detection according to an embodiment of the present application;

Fig. 9 is a schematic diagram illustrating the influence of different k values on the number of external host groups according to an embodiment of the present application;

Fig. 10 is a schematic diagram illustrating the influence of different k values on the modularity of the external host group structure according to an embodiment of the present application;

Fig. 11 is a schematic structural diagram of a host traffic anomaly detection apparatus based on a graph model according to an embodiment of the present application;

Fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In the embodiments of the present application, a plurality means two or more. The terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance, nor order.

To address the poor accuracy of traffic anomaly detection in the prior art, the application provides a traffic anomaly detection method based on a graph model. The method establishes connection relationships between external hosts in the traffic graph model by using the similarity of the session features between the external hosts and a target host, thereby reconstructing the traffic graph model. External hosts with similar session behavior patterns in the reconstructed traffic graph model are then clustered into groups using an improved clique percolation algorithm, and abnormal external hosts that are isolated from the normal external host groups are identified. The scheme is effective at discovering abnormal external hosts in traffic and can improve the accuracy of traffic anomaly detection.

In this application, the target host may be a server in an intranet, and the external host may be a device outside the intranet and having a connection relationship with the server in the intranet, which is not limited specifically.

In the following description of the present application, the technical solution for performing anomaly detection on host traffic provided in the present application is described by taking anomaly detection on traffic of one target host in an intranet as an example. It is understood that, in practical applications, the method in the present application may also be used to detect an anomaly of traffic of multiple target hosts, and the present application is not limited specifically.

The host traffic anomaly detection method in the application can comprise the following two stages:

1. Graph model construction stage.

In this stage, the traffic records of the target host are collected as the original data for constructing the first traffic graph model, and the quintuple information in each traffic record is extracted to establish the first traffic graph model. Similarity is then calculated over the session features of the external hosts in the first traffic graph model, and connection relationships are added between external hosts whose similarity satisfies a set threshold. This completes the reconstruction of the connection relationships between the external hosts in the first traffic graph model and forms the second traffic graph model.

In this application, a target host for collecting traffic may be referred to as an internal node, and an external host having a connection relationship with the target host for collecting traffic may be referred to as an external node.

The traffic graph model is a mathematical representation of the network formed by the complex connection relationships between the target host and the external hosts in the traffic. The traffic graph model may therefore also be referred to as a network relationship graph model. A traffic graph model comprises a plurality of nodes and edges connecting the nodes, where each node represents a host and each edge represents a connection relationship between hosts.

For example, the first traffic graph model may include an internal node representing the target host, one or more external nodes representing the external hosts, and edges representing the connection relationships between the target host and the external hosts. Since the collected traffic records are all related to the target host, the first traffic graph model does not include edges representing connection relationships between the external hosts. After the first traffic graph model is reconstructed based on the similarity of session features between different external hosts and the target host, the resulting second traffic graph model includes not only the internal node, the external nodes and the edges between the target host and the external hosts, but also edges representing connection relationships between the external hosts. In the present application, the first traffic graph model may also be referred to as the original traffic graph model, and the second traffic graph model as the reconstructed traffic graph model.

2. External abnormal node detection stage.

In this stage, the attributes and data structures of the cliques and external host groups in the reconstructed second traffic graph model are redefined, the external hosts with similar session behavior patterns in the second traffic graph model are clustered and grouped by using an improved clique filtering algorithm, and the external hosts that lie outside the normal groups are found. Optionally, an evaluation function may be used to further optimize the structure of the external host groups, and a single external host that lies outside the normal groups is marked as an abnormal traffic node in the network traffic.

The two stages are described in detail below with reference to the flow shown in fig. 1.

Fig. 1 is a schematic flow chart illustrating a host traffic anomaly detection method based on a graph model, where as shown in fig. 1, the method includes:

Step 101, generating a first traffic graph model according to the traffic records related to the target host collected within the set time period, where the first traffic graph model includes information of the target host, information of multiple external hosts, and first type connection relationships between the external hosts and the target host, and a first type connection relationship is used to indicate the session features between an external host and the target host.

In step 101, after collecting the traffic records related to the target host within the set time period, five-tuple information in each traffic record may be extracted first. The five-tuple information includes a source IP address, a source port number, a destination IP address, a destination port number, and connection time, which refers to session establishment time between the target host and the external host and may be represented by a timestamp. Because each flow record is a session connection record between the target host and the external host, the five-tuple information in each flow record includes the IP address and the port number of the target host, for example, it may be that the source IP address is the IP address of the target host, and the source port number is the port number of the target host; alternatively, it is also possible that the destination IP address is the IP address of the target host, while the destination port number is the port number of the target host.

Then, the traffic records related to the target host within the set time period that share the same four-tuple information are aggregated to determine the total connection count and the connection time sequence corresponding to each piece of four-tuple information, and the first traffic graph model is generated from these totals and sequences.

The four-tuple information comprises a source IP address, a source port number, a destination IP address and a destination port number. The total connection count corresponding to a piece of four-tuple information is the total number of traffic records carrying that four-tuple information within the set time period. Optionally, the total connection counts may be stored in a dictionary (data_dit), in which the key is the four-tuple information (source IP address, source port number, destination IP address, destination port number) and the value is the total number of traffic records carrying that four-tuple information within the set time period.

The connection time sequence corresponding to a piece of four-tuple information is the sequence of connection times recorded in the traffic records carrying that four-tuple information within the set time period, together with the number of connections at each time. The sequence may include one or more timestamps and the connection count corresponding to each timestamp, that is, a sequence of (timestamp, connection count) pairs. Optionally, the connection time sequences may also be stored in a dictionary (time_seq), in which the key is the four-tuple information (source IP address, source port number, destination IP address, destination port number) and the value is the sequence of connection times and connection counts recorded for that four-tuple information within the set time period.
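The aggregation described above can be sketched as follows. This is a minimal illustration, assuming each traffic record is already available as a (source IP, source port, destination IP, destination port, timestamp) tuple; the function name `aggregate_flows` and the sample addresses are hypothetical and not taken from the application.

```python
from collections import Counter, defaultdict

def aggregate_flows(records):
    """Aggregate five-tuple records by their four-tuple key.

    records: iterable of (src_ip, src_port, dst_ip, dst_port, timestamp).
    Returns (data_dit, time_seq) in the sense used by the description.
    """
    data_dit = Counter()              # four-tuple -> total connection count
    per_time = defaultdict(Counter)   # four-tuple -> {timestamp: count}
    for src_ip, src_port, dst_ip, dst_port, ts in records:
        key = (src_ip, src_port, dst_ip, dst_port)
        data_dit[key] += 1
        per_time[key][ts] += 1
    # Turn each per-timestamp counter into a time-ordered (timestamp, count) sequence.
    time_seq = {key: sorted(counts.items()) for key, counts in per_time.items()}
    return dict(data_dit), time_seq

# Illustrative records only (assumed values).
records = [
    ("10.0.0.5", 40000, "192.168.1.10", 443, 100),
    ("10.0.0.5", 40000, "192.168.1.10", 443, 100),
    ("10.0.0.5", 40000, "192.168.1.10", 443, 160),
    ("10.0.0.9", 51000, "192.168.1.10", 22, 130),
]
data_dit, time_seq = aggregate_flows(records)
```

Keeping the two dictionaries keyed by the same four-tuple means an edge of the first traffic graph model can later be filled from a single lookup in each.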

The first traffic graph model represents the connection relationships between the target host and the external hosts. It comprises information of the target host, information of a plurality of external hosts, and first type connection relationships between the external hosts and the target host. A first type connection relationship is a connection relationship between an external host and the target host for which session connection records exist in the collected traffic records; it corresponds to an edge connecting the two host nodes in the graph structure. It indicates the session features between the external host and the target host, which may be the time, number, frequency, etc. of session establishment.

Specifically, the information of the target host may include the IP address of the target host, and the information of an external host may include the IP address of the external host. A first type connection relationship may include the total connection count and the connection time sequence corresponding to one piece of four-tuple information between the external host and the target host, that is, the entries of the two dictionaries mentioned above. Because each first type connection relationship corresponds to one piece of four-tuple information, one or more first type connection relationships may exist between one external host and the target host, indicating connections established between them through different combinations of source IP address, source port number, destination IP address and destination port number; for example, the source and destination roles may differ, or the ports may differ.

Illustratively, the first traffic graph model is a graph structure, and as shown in fig. 2, the structure of the first traffic graph model may include an internal node, an external node, and an edge between the internal node and the external node. Each internal node represents a target host from which a traffic record is extracted, each external node represents an external host having a connection relationship (or having a session connection record) with the target host from which the traffic record is extracted, and an edge connecting each external node and the internal node represents a first type connection relationship between one external host and one target host. The structure of the second traffic graph model shown in fig. 3 below is similar to the above, except that the second traffic graph model further includes edges between external nodes, which are added according to the similarity of session features between the external nodes, and correspond to the second type of connection relationship mentioned below.

It should be noted that fig. 2 and fig. 3 are only schematic diagrams, and in this application, the number of internal nodes, the number of external nodes, and the number of edges included in a traffic graph model (e.g., a first traffic graph model or a second traffic graph model) are not specifically limited. That is, the first traffic graph model may be generated for one target host, or may be created and generated for a plurality of target hosts, for example, as shown in fig. 2, which is not limited in this application.

For example, the first traffic graph model may be described by G = (V, E). Each node in the node set V carries an IP address and a Boolean value indicating whether the node is the node from which traffic was extracted, that is, the target host. Each edge in the edge set E represents a connection relationship between two nodes and contains the following information: a source IP address, a source port number, a destination IP address, a destination port number, a total connection count, a connection time sequence, and a Boolean value that identifies whether the edge is original or manually added. Each entry of the connection time sequence has the format (time, num), where time is the connection time and num is the number of occurrences of that connection time. The connection time sequence reflects the behavior pattern of the session features between nodes, and the session feature similarity mentioned later in this application is calculated from it.
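One possible in-memory form of G = (V, E) is sketched below. The class names `Node`, `Edge` and `TrafficGraph` and the helper `add_edge` are hypothetical; only the listed fields come from the description.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    ip: str
    is_target: bool            # True if traffic was extracted from this host

@dataclass
class Edge:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    total_connections: int
    time_seq: list             # list of (time, num) pairs
    is_original: bool = True   # False for edges added during reconstruction

@dataclass
class TrafficGraph:
    nodes: dict = field(default_factory=dict)   # ip -> Node
    edges: list = field(default_factory=list)

    def add_edge(self, edge, target_ip):
        # Register both endpoints, flagging the one that is the target host.
        for ip in (edge.src_ip, edge.dst_ip):
            self.nodes.setdefault(ip, Node(ip, ip == target_ip))
        self.edges.append(edge)

g = TrafficGraph()
g.add_edge(
    Edge("10.0.0.5", 40000, "192.168.1.10", 443, 3, [(100, 2), (160, 1)]),
    target_ip="192.168.1.10",
)
```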

Specifically, the node set V may be stored in a node file, and the format of the node file is shown in fig. 4.

Each row has the form: vertex number, followed by vertex IP.

The edge set E may be stored in an edge file, the format of which is shown in fig. 5.

Each section has the form: source vertex number, destination vertex number, source port number, destination port number and connection count, followed by

#start

the time series in (time, count) format, with a line break after every ten entries

#end

It should be noted that the first traffic graph model corresponds to the set time period. The set time period may be, for example, a time slice of 5 minutes or 10 minutes. For example, traffic records may be collected continuously over a long period, the period may then be divided into time slices at a certain time granularity, and a first traffic graph model may be generated from the traffic records in each time slice for subsequent anomaly detection. Dividing the traffic into time slices makes it possible to detect changes in the network connection relationships and to screen out abnormal external nodes. In the embodiments of the present application, the process of abnormal traffic detection is introduced by taking one time slice as an example.
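The time-slice division can be illustrated with a short sketch. It assumes that timestamps are integer seconds and that the timestamp is the fifth field of each record; `slice_records` is a hypothetical helper name.

```python
from collections import defaultdict

def slice_records(records, slice_seconds=300):
    """Group records into consecutive time slices (300 s = 5 minutes)."""
    slices = defaultdict(list)
    for rec in records:
        timestamp = rec[4]
        slices[timestamp // slice_seconds].append(rec)
    return dict(slices)

# Illustrative records (assumed values): only the timestamp differs.
demo = [("10.0.0.5", 40000, "192.168.1.10", 443, t) for t in (0, 299, 300, 905)]
slices = slice_records(demo)
```

A first traffic graph model would then be built separately from the records of each slice.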

Step 102, reconstructing the first traffic graph model according to the similarity of session features between different external hosts and the target host in the first traffic graph model to generate a second traffic graph model, where the second traffic graph model includes second type connection relationships between the external hosts, and a second type connection relationship is used to indicate that the session behavior patterns of the two connected external hosts are similar.

In this step 102, a session similarity distance between two external hosts may be first determined according to a first type of connection relationship between each external host and the target host, where the session similarity distance is used to indicate a similarity degree of session behavior patterns of the two external hosts. The larger the value of the session similarity distance between the two external hosts is, the more similar the session behavior patterns of the two external hosts are.

Taking the first external host and the second external host as an example, determining the session similarity distance between the two external hosts may include two steps shown in fig. 6:

step 601, determining session feature similarity of each first-type mapping relationship pair which is composed of a first-type mapping relationship between the first external host and the target host and a first-type mapping relationship between the second external host and the target host and meets set conditions.

Step 602, determining a session similarity distance between the first external host and the second external host according to the session feature similarity of all the first type mapping relationship pairs satisfying the set condition between the first external host and the second external host.

In step 601, the setting condition means that the source IP addresses of the quadruple information corresponding to the two first type mappings included in the first type mapping pair are the same and the source port numbers are the same, or the destination IP addresses are the same and the destination port numbers are the same. That is, the condition for calculating the session feature similarity of the first-type mapping relationship pair is that the connection relationships between the two external hosts and the target host are both communicated with the same port of the same target host. The session feature similarity of the first type mapping relationship pair refers to a similarity between connection time sequences in quadruple information respectively corresponding to two first type mapping relationships included in the first type mapping relationship pair.
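The set condition can be written as a small predicate. This is a sketch that assumes a four-tuple is ordered as (source IP, source port, destination IP, destination port); the name `satisfies_set_condition` is hypothetical.

```python
def satisfies_set_condition(quad_a, quad_b):
    """True if both four-tuples share the same source (IP, port)
    or the same destination (IP, port)."""
    src_a, dst_a = quad_a[:2], quad_a[2:]
    src_b, dst_b = quad_b[:2], quad_b[2:]
    return src_a == src_b or dst_a == dst_b

# Two external hosts talking to the same port of the same target host.
pair_ok = satisfies_set_condition(
    ("10.0.0.5", 40000, "192.168.1.10", 443),
    ("10.0.0.9", 51000, "192.168.1.10", 443),
)
# Same target host but different ports: the pair is skipped.
pair_skip = satisfies_set_condition(
    ("10.0.0.5", 40000, "192.168.1.10", 443),
    ("10.0.0.9", 51000, "192.168.1.10", 22),
)
```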

For example, assume that a first type mapping relationship pair includes the i-th first type mapping relationship between the first external host and the target host and the j-th first type mapping relationship between the second external host and the target host, that the connection time sequence in the quadruple information corresponding to the i-th first type mapping relationship is the first connection time sequence, and that the connection time sequence in the quadruple information corresponding to the j-th first type mapping relationship is the second connection time sequence, where i and j are integer loop indices. Each connection time sequence can be represented as a sequence of one or more (timestamp, connection count num) pairs.

In this case, determining the session feature similarity of the first type mapping relationship pair may be:

First, the lengths of the two connection time sequences are determined; the length of a connection time sequence is the number of timestamps it contains. If the length of the first connection time sequence is greater than twice the length of the second connection time sequence, the similarity between the two sequences may be determined to be 0. That is, the session feature similarity of the first type mapping relationship pair is 0; in other words, the session similarity between the i-th first type mapping relationship (between the first external host and the target host) and the j-th first type mapping relationship (between the second external host and the target host) is 0.

If the length of the first connection time series is greater than the length of the second connection time series but less than or equal to twice the length of the second connection time series, the first connection time series and the second connection time series may be length-matched, and the similarity between the first connection time series and the second connection time series may be calculated by the following formula:

In the formula, x is the first connection time sequence after length matching, y is the second connection time sequence after length matching, and similarity(x, y) is the similarity between the first connection time sequence and the second connection time sequence. $x_i.time$ is the i-th timestamp in the length-matched first connection time sequence, $x_i.num$ is the connection count corresponding to that timestamp, $y_i.time$ is the i-th timestamp in the length-matched second connection time sequence, $y_i.num$ is the connection count corresponding to that timestamp, and n is the number of timestamps in each of the length-matched first and second connection time sequences.

In a possible design, the performing length matching refers to matching a first connection time sequence with a longer length with a second connection time sequence with a shorter length according to a principle that start times are closest to each other, and clipping the first connection time sequence so that the two connection time sequences have the same length.

It should be noted that, when the length of the second connection time sequence is greater than the length of the first connection time sequence, the similarity between the two connection time sequences is calculated in the same way, with the roles of the two sequences interchanged. In other words, the method first checks whether the length of the longer connection time sequence is greater than twice the length of the shorter one. If so, the similarity is 0; if not, the two connection time sequences are length-matched and the similarity is then calculated by the above formula.

It can be understood that the process of calculating the session feature similarity in step 601 is equivalent to traversing every combination of a first type mapping relationship between the first external host and the target host with a first type mapping relationship between the second external host and the target host, where each such combination may be referred to as a first type mapping relationship pair. All possible combinations are then screened: when a combination satisfies the set condition, the session similarity between its two first type mapping relationships is calculated; otherwise the session similarity is not calculated, or is regarded as 0.

Fig. 7 is a schematic flow chart of calculating the similarity between two connection time sequences (time_seq) in an embodiment of the present application. The algorithm corresponding to fig. 7 is described as follows:

1) the longer of the two time_seq is named longList and the shorter is named shortList; if the length of longList is greater than 2 times the length of shortList, then similarity(longList, shortList) = 0, meaning that the two are not similar; otherwise, go to 2);

2) according to the principle that the start times are closest, match longList against shortList and clip longList so that its length equals that of shortList;

3) calculate similarity(longList, shortList) using the above formula.
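The three steps above can be sketched as follows. The similarity formula itself is not reproduced in this text, so the sketch takes it as a caller-supplied function `score`; `toy_score` is only a placeholder for demonstration and is not the formula of the application. The clipping rule (choose the window of longList whose first timestamp is closest to the first timestamp of shortList) is one reading of "start times are closest" and is an assumption, as are the names `crop_to_match` and `sequence_similarity`.

```python
def crop_to_match(long_seq, short_seq):
    """Clip long_seq to len(short_seq) entries, choosing the window whose
    first timestamp is closest to the first timestamp of short_seq."""
    n = len(short_seq)
    start_time = short_seq[0][0]
    offsets = range(len(long_seq) - n + 1)
    best = min(offsets, key=lambda i: abs(long_seq[i][0] - start_time))
    return long_seq[best:best + n]

def sequence_similarity(seq_a, seq_b, score):
    """Steps 1) to 3): length check, length matching, then scoring."""
    if len(seq_a) >= len(seq_b):
        long_seq, short_seq = seq_a, seq_b
    else:
        long_seq, short_seq = seq_b, seq_a
    # Step 1: an empty or much shorter sequence gives similarity 0.
    if not short_seq or len(long_seq) > 2 * len(short_seq):
        return 0.0
    # Steps 2 and 3: clip the longer sequence, then apply the formula.
    return score(crop_to_match(long_seq, short_seq), short_seq)

def toy_score(x, y):
    # Placeholder only: fraction of positions with equal connection counts.
    return sum(1 for a, b in zip(x, y) if a[1] == b[1]) / len(x)

long_seq = [(0, 1), (10, 2), (20, 1), (30, 4)]
short_seq = [(19, 1), (31, 4)]
result = sequence_similarity(long_seq, short_seq, toy_score)
```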

In step 602, specifically, the session similarity distance between the first external host and the second external host may be determined according to the session feature similarities of all the first type mapping relationship pairs between the two hosts that satisfy the set condition, together with their corresponding weights. The weight corresponding to each session feature similarity is the sum of the lengths of the connection time sequences in the quadruple information corresponding to the two first type mapping relationships of the pair from which that session feature similarity was calculated.

For example, the session similarity distance between the first external host and the second external host may be calculated by the following formula:

In the formula, v1 is the first external host, v2 is the second external host, distance(v1, v2) is the session similarity distance between the first external host and the second external host, n is the total number of session feature similarities calculated between the first external host and the second external host, $dis_i$ is the i-th session feature similarity between the first external host and the second external host, and $weight_i$ is the weight corresponding to the i-th session feature similarity, which may be equal to the sum of the lengths of the two connection time sequences.

In one possible design, the session feature similarities calculated in step 601 for all the first type mapping relationship pairs between the first external host and the second external host that satisfy the set condition may be stored in an array dis, in which case n in the above formula is the number of elements in the array. Similarly, the weight corresponding to each session feature similarity may be stored in another array weight, which has the same number of elements as the array dis.
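The distance formula is not reproduced in this text. The sketch below therefore assumes a weighted mean of the entries of dis using the entries of weight, which is only one plausible way of combining the two arrays and should not be read as the formula of the application; `session_distance` is a hypothetical name.

```python
def session_distance(dis, weight):
    """ASSUMPTION: weighted mean of the session feature similarities.

    dis[i] is the i-th session feature similarity and weight[i] the sum of
    the lengths of the two connection time sequences it was computed from.
    """
    if len(dis) != len(weight):
        raise ValueError("dis and weight must have the same number of elements")
    total = sum(weight)
    if total == 0:
        return 0.0   # no comparable pairs: treat the hosts as dissimilar
    return sum(d * w for d, w in zip(dis, weight)) / total

distance = session_distance([1.0, 0.5], [4, 4])
```

Under this assumption, pairs computed from longer connection time sequences contribute more to the distance than pairs computed from short ones.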

Further, in step 102, if the session similarity distance between the two external hosts is greater than the product of the average session similarity distance and the preset threshold, a second type connection relationship between the two external hosts may be established, so as to generate a second traffic graph model. Wherein, the average session similarity distance can be obtained by calculating an average value of the session similarity distances between all external hosts in the first traffic graph model. The preset threshold is used for restricting the number of the second type of connection relations between the external hosts in the second traffic graph model, namely the number of edges between the external nodes in the second traffic graph model. The preset threshold value can be specifically designed according to actual needs.
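The thresholding rule can be sketched as follows, assuming the pairwise session similarity distances have already been computed and are held in a dictionary keyed by host pair; `select_second_type_edges` and the sample values are hypothetical.

```python
def select_second_type_edges(distances, threshold):
    """Return the host pairs whose distance exceeds mean distance * threshold.

    distances: {(host_a, host_b): session similarity distance}
    """
    if not distances:
        return []
    mean_distance = sum(distances.values()) / len(distances)
    cutoff = mean_distance * threshold
    return [pair for pair, d in distances.items() if d > cutoff]

distances = {("a", "b"): 0.9, ("a", "c"): 0.1, ("b", "c"): 0.2}
edges = select_second_type_edges(distances, threshold=1.5)
```

A larger threshold yields fewer edges between external nodes in the second traffic graph model; a smaller one yields a denser graph.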

Step 103, clustering and grouping the plurality of external nodes according to the similarity of the session behavior patterns of different external nodes in the second traffic graph model to obtain one or more external host groups, where each external host group includes a plurality of external hosts.

In step 103, all the cliques in the second traffic graph model can be determined according to the adjacency list of the second traffic graph model, so as to obtain a clique list. Each clique comprises at most k external hosts and at least 3 external hosts, where k is an integer greater than 3; the value of k constrains the size of the cliques and can be adjusted according to actual needs.

Then, according to all the cliques (i.e. the clique list) found in the second traffic graph model, a clique relationship matrix corresponding to the second traffic graph model is established, where the clique relationship matrix indicates the number of external hosts included in each clique and the adjacency relationships between different cliques.
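How the fully connected host sets are enumerated is not detailed here. As a stand-in, the sketch below uses the standard Bron-Kerbosch algorithm (without pivoting) to list maximal fully connected sets from an adjacency mapping and keeps those with between 3 and k members. Both the choice of algorithm and the handling of the size bounds are assumptions; `maximal_cliques` and `clique_list` are hypothetical names.

```python
def maximal_cliques(adj):
    """Yield maximal fully connected node sets (Bron-Kerbosch, no pivoting).

    adj: {node: set of neighbouring nodes}
    """
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        for v in list(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())

def clique_list(adj, k):
    """Keep only the sets with at least 3 and at most k members."""
    return sorted(sorted(c) for c in maximal_cliques(adj) if 3 <= len(c) <= k)

# Illustrative adjacency list of external nodes (assumed values).
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
    "g": set(),
}
found = clique_list(adj, k=4)
```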

Specifically, assuming that N cliques are found in the second traffic graph model, the clique relationship matrix $C$ is a matrix of N rows and N columns, in which each row corresponds to one of the N cliques and each column also corresponds to one of the N cliques. An off-diagonal element of the matrix equals the number of external hosts common to the clique represented by the row of the element and the clique represented by its column, that is, $C_{ij} = |cliques_i \cap cliques_j|$. A diagonal element equals the number of external hosts included in the corresponding clique, that is, $C_{ii} = |cliques_i|$. It will be appreciated that, for a diagonal element, the clique represented by its row is the same as the clique represented by its column.

Optionally, after the clique relationship matrix is obtained, it may be normalized so that every element is 0 or 1 before subsequent processing. For example, the initial clique relationship matrix may be traversed as follows: when $i = j$, the element is set to 0 if $C_{ii} < k$ and to 1 otherwise; when $i \neq j$, the element is set to 0 if $C_{ij} < k - 1$ and to 1 otherwise.
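The relationship matrix and its normalization can be sketched directly from the definitions above. The host sets are assumed to be given as lists of host identifiers; `relation_matrix` and `normalize` are hypothetical names.

```python
def relation_matrix(cliques):
    """C[i][j] = number of hosts shared by set i and set j.
    The diagonal C[i][i] is therefore the size of set i."""
    sets = [set(c) for c in cliques]
    n = len(sets)
    return [[len(sets[i] & sets[j]) for j in range(n)] for i in range(n)]

def normalize(matrix, k):
    """Diagonal: 1 if C[i][i] >= k, else 0.
    Off-diagonal: 1 if C[i][j] >= k - 1, else 0."""
    n = len(matrix)
    return [
        [int(matrix[i][j] >= (k if i == j else k - 1)) for j in range(n)]
        for i in range(n)
    ]

cliques = [["a", "b", "c"], ["b", "c", "d"], ["e", "f", "g"]]
C = relation_matrix(cliques)
B = normalize(C, k=3)
```

In the example, the first two sets share two hosts (k - 1 for k = 3) and are therefore marked as adjacent, while the third set shares none.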

Further, the clique relationship matrix can be traversed in a breadth-first manner, and the cliques that are connected to one another are merged to obtain one or more external host groups.

Optionally, after one or more external host groups are obtained according to the second traffic graph model, in a possible design, the present application may further evaluate the structure of the obtained external host groups. For example, an EQ value may be calculated and output by using an evaluation function, the EQ function, which evaluates the degree of node overlap between groups.
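The breadth-first merge can be sketched as follows, taking as input a list of host sets and a normalized 0/1 relationship matrix over them. Skipping sets whose diagonal entry is 0 is an assumption based on the normalization rule; `merge_connected` is a hypothetical name.

```python
from collections import deque

def merge_connected(cliques, binary):
    """Merge host sets that are linked (directly or transitively) in the
    0/1 relationship matrix, using breadth-first traversal."""
    n = len(cliques)
    seen = [False] * n
    groups = []
    for start in range(n):
        if seen[start] or not binary[start][start]:
            continue
        seen[start] = True
        queue = deque([start])
        members = set()
        while queue:
            i = queue.popleft()
            members |= set(cliques[i])
            for j in range(n):
                if not seen[j] and binary[j][j] and binary[i][j]:
                    seen[j] = True
                    queue.append(j)
        groups.append(members)
    return groups

cliques = [["a", "b", "c"], ["b", "c", "d"], ["e", "f", "g"]]
binary = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
groups = merge_connected(cliques, binary)
```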

Step 104, determining an external host which does not belong to any external host group as an abnormal external host communicating with the target host.

In this step 104, an external host that does not belong to any external host group may be directly determined to be an abnormal external host communicating with the target host. Alternatively, the external hosts that do not belong to any external host group may be collected into a set and regarded as suspicious external hosts; the external hosts in the set are then screened in one or more ways, and the external hosts remaining after screening are determined to be abnormal external hosts communicating with the target host.

Exemplary screening methods may include the following:

1) Filtering the suspicious external hosts in the set according to a preset IP address whitelist. The hosts whose IP addresses are in the whitelist are removed from the set, and the suspicious external hosts in the set whose IP addresses are not in the whitelist are then determined to be abnormal external hosts.

2) Filtering according to the number of time slices in which each suspicious external host in the set appears. External hosts that appear in more than a certain percentage (e.g., 80%) of the time slices, and external hosts whose IP addresses belong to local network nodes, are removed from the set.

3) Filtering according to the port of the target host to which each suspicious external host in the set connects. If the ports of the target host to which the external host connects are one or more normally used ports, the external host is removed from the set.
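The selection of ungrouped hosts and the three screening rules can be combined in one sketch. All parameter names are hypothetical; rule 3 is read here as "every target-host port used by the host is a normally used port", which is an assumption, and the removal of local network addresses in rule 2 is not modelled.

```python
def find_anomalous_hosts(external_hosts, groups, whitelist, presence,
                         total_slices, ports_used, normal_ports, max_ratio=0.8):
    """Return the ungrouped external hosts that survive the three screens.

    presence:   {host: number of time slices in which the host appears}
    ports_used: {host: set of target-host ports the host connected to}
    """
    grouped = set().union(*groups)
    suspects = set(external_hosts) - grouped
    # 1) Drop whitelisted IP addresses.
    suspects -= set(whitelist)
    # 2) Drop hosts present in more than max_ratio of all time slices.
    suspects = {h for h in suspects
                if presence.get(h, 0) <= max_ratio * total_slices}
    # 3) Drop hosts that only touch normally used ports (assumed reading).
    normal = set(normal_ports)
    suspects = {h for h in suspects
                if not (ports_used.get(h) and ports_used[h] <= normal)}
    return suspects

anomalous = find_anomalous_hosts(
    external_hosts=["h1", "h2", "h3", "w", "f", "p", "x"],
    groups=[{"h1", "h2", "h3"}],
    whitelist={"w"},
    presence={"f": 9, "p": 2, "x": 1},
    total_slices=10,
    ports_used={"f": {443}, "p": {443}, "x": {4444}},
    normal_ports={80, 443},
)
```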

As can be seen from the above, the abnormal node detection method of the present application reconstructs the constructed graph model according to the session feature similarity between the external nodes and the server, and then detects abnormal nodes based on the improved clique filtering algorithm.

Compared with the prior art, the method has the following advantages and technical effects:

Because the traffic data captured from the server host contains only the session connection records between the server node and the external nodes, and not the session connection records between the external nodes, a graph model built only from the original data is a bipartite graph from the perspective of a single host, and it is difficult to find a uniform connection behavior pattern among different external nodes. The session feature similarity calculation method of the present application reconstructs the connection relationships between the external nodes of the traffic graph model by using the session feature similarity between the external nodes and the server. The application also designs an improved clique filtering algorithm that clusters and groups the external nodes with similar session behavior patterns and, through the group structure, finds isolated nodes or groups with few external nodes, which indicates that those external nodes or groups are abnormal. This helps improve the accuracy of host traffic anomaly detection.

The technical solution of the present application is described in detail below through a specific example, whose general flow is shown in fig. 8.

1. Graph model construction stage.

1.1 Server traffic data is extracted. The extracted five-tuple entries are source IP, source port, destination IP, destination port and session establishment time. Entries with the same source IP, source port, destination IP and destination port are aggregated, and two dictionaries are used to store the connection count information and the connection time information respectively. The key in data_dit is (source IP, destination IP, source port, destination port), and the value is the total number of occurrences of records with this key as the connection four-tuple. The key in time_dit is (source IP, destination IP, source port, destination port), and the value is time_seq, the connection time and count information of the records with this key as the connection four-tuple; time_seq is itself a dictionary in which the key is a timestamp and the value is the number of occurrences of that timestamp.

The traffic data is sliced with a time granularity of 5 minutes, and an original traffic graph model is generated for each slice and described by G = (V, E). Each node in the node set V carries the IP address of the node and a Boolean value indicating whether the node is the server node from which traffic was extracted. Each edge in the edge set E carries 5 pieces of information: the connection count, the source port, the destination port, the dictionary of connection times and connection counts, and a Boolean value that distinguishes whether the edge is original or manually added.

In specific implementation, the attributes of the nodes and edges of the traffic graph model and the data structures thereof can be defined in the module, so that a method for analyzing the node files and the edge files and a method for establishing the graph are realized.

The data structure of the vertex attributes is defined as follows:

the data structure of the edge attribute is defined as follows:

The module can adopt the Spark platform and the GraphX framework. The code is implemented in the Scala language and processes the vertex file and the edge file separately to finally obtain the graph structure. The specific steps of the graph building module are as follows:

(1) Spark is started, and the Spark SQL entry point is initialized.

(2) The vertex file is read and parsed to generate the resilient distributed dataset (RDD) of vertices.

(3) The edge file is read and parsed to generate the RDD of edges.

(4) The graph structure of the traffic graph model is generated from the vertex RDD and the edge RDD by using the constructor Graph(vertexRDD, edgeRDD), where vertexRDD and edgeRDD are the vertex data and edge data passed to the Graph constructor.

The pseudo code of the RDD method for generating the node is as follows:

the pseudo code to generate the edge RDD is as follows:

the data structure of the node V attribute comprises: IP Address of vertex, IP of extracted traffic Server or not

The data structure of the edge E attribute comprises: the number of connections, source port, destination port, time series distinguish whether the edge is an original or a manually added boolean value.
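The node and edge attributes listed above can be modelled as small record types. The sketch below is written in plain Python rather than Scala/GraphX, purely for illustration; the class names Vertex and Edge, the field names, and the build_graph helper are assumptions, not the definitions used in the application.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Vertex:
    ip: str            # IP address of the vertex
    is_server: bool    # True if this is the server whose traffic was extracted


@dataclass
class Edge:
    src_ip: str
    dst_ip: str
    count: int                     # total number of connections
    src_port: int
    dst_port: int
    time_seq: Dict[int, int] = field(default_factory=dict)  # timestamp -> count
    is_original: bool = True       # False for edges added during reconstruction


def build_graph(data_dit, time_dit, server_ips):
    """Build vertex and edge collections from the two aggregation dictionaries."""
    vertices, edges = {}, []
    for key, count in data_dit.items():
        src_ip, dst_ip, src_port, dst_port = key
        for ip in (src_ip, dst_ip):
            vertices.setdefault(ip, Vertex(ip, ip in server_ips))
        edges.append(Edge(src_ip, dst_ip, count, src_port, dst_port,
                          dict(time_dit[key]), True))
    return vertices, edges
```

In a Spark deployment the same two collections would be parallelized into the vertex and edge RDDs that are handed to the graph constructor.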

1.2 The connection time and count information in the connection time series time_seq reflects the session behavior pattern between an external node and the server. The session feature similarity between different time_seq can be measured by defining a session feature similarity formula.

Each element of time_seq has the format (time, num), where time is the time of connection establishment and num is the number of connections established at that time. In the session feature similarity calculation method provided by the invention, for two time_seq sequences x and y of the same length, the similarity is calculated by the following formula:

The algorithm for computing the similarity of two time_seq is as follows:

(1) The longer of the two time_seq is named longList and the shorter is named shortList. If the length of longList is greater than twice the length of shortList, similarity(longList, shortList) = 0, that is, the similarity is 0; otherwise, go to (2).

(2) longList is matched against shortList on the principle that the starting times are the closest, and longList is trimmed so that its length equals that of shortList.

(3) similarity(longList, shortList) is then calculated. The calculation flow is shown in the figure.
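Steps (1) to (3) can be sketched as below. The exact similarity formula is not reproduced in this text, so the sketch substitutes plain cosine similarity over the num values as a stand-in; that substitution, the contiguous-window interpretation of the trimming step, and the helper names trim_to_match and cosine are all assumptions.

```python
import math


def trim_to_match(long_list, short_list):
    """Cut long_list to len(short_list), choosing the contiguous window whose
    first timestamp is closest to the first timestamp of short_list."""
    n = len(short_list)
    start_time = short_list[0][0]
    best = min(range(len(long_list) - n + 1),
               key=lambda i: abs(long_list[i][0] - start_time))
    return long_list[best:best + n]


def cosine(xs, ys):
    """Plain cosine similarity (stand-in for the application's formula)."""
    dot = sum(a * b for a, b in zip(xs, ys))
    norm = math.sqrt(sum(a * a for a in xs)) * math.sqrt(sum(b * b for b in ys))
    return dot / norm if norm else 0.0


def similarity(seq_a, seq_b):
    """seq_a, seq_b: time_seq dictionaries mapping timestamp -> num."""
    a, b = sorted(seq_a.items()), sorted(seq_b.items())
    if not a or not b:
        return 0.0
    long_list, short_list = (a, b) if len(a) >= len(b) else (b, a)
    if len(long_list) > 2 * len(short_list):   # step (1): too different in length
        return 0.0
    long_list = trim_to_match(long_list, short_list)   # step (2)
    return cosine([n for _, n in long_list],           # step (3)
                  [n for _, n in short_list])
```

The length check in step (1) acts as a cheap early exit: sequences whose lengths differ by more than a factor of two are treated as dissimilar without any alignment work.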

1.3 The complete original graph model is a bipartite graph. To make the graph model better suited to the anomaly detection algorithm, some edges need to be added between external nodes, and whether an edge is added depends on the distance between the two external nodes. The "distance" here is not a distance in the conventional sense, but a measure of the session similarity between two nodes, i.e., a session similarity distance. For two external nodes v1 and v2, distance(v1, v2) is defined to measure the session similarity between them. The larger distance(v1, v2) is, the more similar their sessions are.

Assuming that the two internal server nodes are denoted in_server1 and in_server2, the distance between two external vertices v1 and v2 is calculated as follows:

(1) A double-type array dis stores the calculated similarities, and an int-type array weight stores the weight corresponding to each similarity; weight(i) is the weight of dis(i).

(2) For server node in_server1, the edges whose source IP is in_server1 and destination IP is v1, together with the edges whose source IP is v1 and destination IP is in_server1, are extracted and stored in the list edgeList1. Likewise, the edges whose source IP is in_server1 and destination IP is v2, together with the edges whose source IP is v2 and destination IP is in_server1, are extracted and stored in the list edgeList2.

(3) edgeList1 is traversed. For each edge e1, if there is an edge e2 in edgeList2 such that the source IP and source port of e1 equal those of e2, or the destination IP and destination port of e1 equal those of e2, the similarity of the two edges is calculated as s = similarity(time series of e1, time series of e2), with weight w = length of the time series of e1 + length of the time series of e2; s and w are appended to the arrays dis and weight respectively. The greater the weight, the more this similarity contributes to the session similarity distance.

(4) The operations for internal server node in_server2 are the same as (2) and (3) and are not described again.

(5) The distance between the two external nodes v1 and v2 is calculated by the following formula:

where n is the length of the dis and weight arrays.

(6) The distance between v1 and v2 is output.
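The procedure above might be sketched as follows. The distance formula itself is not reproduced in this text; the sketch assumes a weighted mean, sum(dis_i × weight_i) / sum(weight_i), which is one natural reading of the dis and weight arrays but is not confirmed by the source. Edges are represented as dictionaries with hypothetical keys, every matching (e1, e2) pair is scored, and the similarity function is passed in by the caller.

```python
def matching_pairs(edge_list1, edge_list2):
    """Yield (e1, e2) pairs that share the server-side endpoint, as in step (3)."""
    for e1 in edge_list1:
        for e2 in edge_list2:
            same_src = (e1["src_ip"], e1["src_port"]) == (e2["src_ip"], e2["src_port"])
            same_dst = (e1["dst_ip"], e1["dst_port"]) == (e2["dst_ip"], e2["dst_port"])
            if same_src or same_dst:
                yield e1, e2


def session_distance(edge_list1, edge_list2, similarity):
    """Weighted mean of per-pair similarities (assumed form of the formula).

    edge_list1 / edge_list2: edges between the server(s) and v1 / v2.
    similarity: callable taking two time_seq dictionaries.
    """
    dis, weight = [], []
    for e1, e2 in matching_pairs(edge_list1, edge_list2):
        dis.append(similarity(e1["time_seq"], e2["time_seq"]))
        weight.append(len(e1["time_seq"]) + len(e2["time_seq"]))
    total = sum(weight)
    if total == 0:
        return 0.0
    return sum(d * w for d, w in zip(dis, weight)) / total
```

With two internal servers, the edge lists for in_server1 and in_server2 can simply be concatenated before the call, since every pair contributes to the same dis and weight arrays.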

1.4 Adding edges between external nodes.

For two external nodes v1 and v2, whether an edge is added between them depends on their distance distance(v1, v2) and a set threshold. By setting the value of threshold, the number of added edges can be kept within a suitable range, so that the structure of the network connection relation graph better fits the anomaly detection algorithm.

For the established original graph g, the algorithm flow for adding edges is as follows:

(1) For every pair of external nodes v1 and v2 of the original graph, the distance between them is calculated, and the average value avg of the external node distances over the whole graph is obtained.

(2) For any two external nodes v1 and v2 of g, the distance d between them is calculated. If d is greater than or equal to the product of the average external node distance and the threshold, i.e., d ≥ avg × threshold, an edge is established between them, as shown in FIG. 3.

(3) The graph g with the newly added edges is output.
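The edge-adding flow can be illustrated with the short sketch below. It assumes the pairwise distance is supplied as a callable and that avg is the plain mean over all external node pairs; the function name add_similarity_edges is hypothetical.

```python
from itertools import combinations


def add_similarity_edges(external_nodes, distance, threshold):
    """Return the node pairs that receive a new (non-original) edge.

    distance: callable (v1, v2) -> session similarity distance.
    A pair is connected when its distance d satisfies d >= avg * threshold.
    """
    pairs = list(combinations(sorted(external_nodes), 2))
    if not pairs:
        return []
    dist = {pair: distance(*pair) for pair in pairs}   # step (1)
    avg = sum(dist.values()) / len(dist)
    return [pair for pair in pairs if dist[pair] >= avg * threshold]  # step (2)
```

Because the cut-off scales with the average distance of the slice, the same threshold value tends to add a comparable share of edges in busy and quiet slices; raising threshold makes the reconstructed graph sparser.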

2. External abnormal node detection phase.

2.1 Improved clique percolation algorithm.

The attributes and data structures of cliques and of the external node groups obtained by clustering are defined as follows:

The data structure of a clique comprises the size of the clique, the first node of the clique, and the members of the clique.

The attributes of an external node group include: the original graph, the adjacency list of the graph, the clique relationship matrix, and a list of connected cliques, where the list represents an external node group.

The algorithm flow of the improved clique percolation algorithm is as follows:

(1) The adjacency table am of the new graph g constructed in the previous stage is calculated. Each key of am is the node number v of a node, and the corresponding value is the list of all vertices adjacent to v, arranged in ascending order of node number.

(2) The adjacency table am is used to find all cliques in graph g. A clique contains at least 3 nodes and at most k nodes, where k is a parameter defined by the algorithm and can be adjusted dynamically. In this process, if there is an m-clique (m < k) and a node v outside the clique that is adjacent to all nodes in the clique, v is added to form an (m+1)-clique, and so on. Finally, a list cliques consisting of the cliques is obtained.

(3) A clique relationship matrix $C$ is established based on cliques. Each off-diagonal element of the matrix equals the number of nodes shared by the two cliques, i.e., $C_{ij} = |cliques_i \cap cliques_j|$; each diagonal element equals the number of nodes in the corresponding clique, i.e., $C_{ii} = |cliques_i|$.

(4) The clique relationship matrix $C$ is normalized. When $i = j$, $C_{ii}$ is set to 0 if $C_{ii} < k$ and to 1 otherwise; when $i \neq j$, $C_{ij}$ is set to 0 if $C_{ij} < k - 1$ and to 1 otherwise.

(5) At this point, a diagonal element of 1 indicates that the corresponding clique satisfies the condition, and an off-diagonal element of 1 indicates that the two corresponding cliques are adjacent.

(6) The clique relationship matrix is traversed breadth-first, and connected cliques are merged to obtain maximal connected clique sets, i.e., the external node groups. All group structures are output.
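Steps (1) to (6) can be sketched in a few functions. This is an unoptimized illustration for small graphs, assuming an undirected graph given as a list of node pairs and using sets instead of sorted lists in the adjacency table; the function names are hypothetical and the clique search revisits cliques redundantly, relying on a set to deduplicate them.

```python
from collections import deque


def adjacency(edge_list):
    """Step (1): adjacency table, node -> set of neighbours."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def find_cliques(adj, k):
    """Step (2): grow cliques one node at a time; keep a clique when it has
    k nodes, or when it has at least 3 nodes and cannot be extended."""
    found = set()

    def grow(clique, candidates):
        if len(clique) == k or not candidates:
            if len(clique) >= 3:
                found.add(frozenset(clique))
            return
        for v in sorted(candidates):
            grow(clique | {v}, candidates & adj[v])

    for v in adj:
        grow({v}, set(adj[v]))
    return sorted(found, key=sorted)


def clique_matrix(cliques, k):
    """Steps (3)-(4): overlap counts thresholded to 0/1."""
    n = len(cliques)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            overlap = len(cliques[i] & cliques[j])  # equals clique size when i == j
            limit = k if i == j else k - 1
            c[i][j] = 1 if overlap >= limit else 0
    return c


def merge_cliques(cliques, c):
    """Steps (5)-(6): breadth-first merge of adjacent qualifying cliques."""
    groups, seen = [], set()
    for start in range(len(cliques)):
        if start in seen or c[start][start] == 0:
            continue
        seen.add(start)
        queue, members = deque([start]), set()
        while queue:
            i = queue.popleft()
            members |= cliques[i]
            for j in range(len(cliques)):
                if j not in seen and c[j][j] == 1 and c[i][j] == 1:
                    seen.add(j)
                    queue.append(j)
        groups.append(members)
    return groups
```

For example, two triangles {a, b, c} and {b, c, d} that share the edge b-c merge into one group at k = 3, while a node attached to the graph by a single edge belongs to no clique and is therefore left free.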

2.2 External node group evaluation.

The external node group evaluation flow comprises the following steps: calculating the number M of edges in the reconstructed graph model; traversing the adjacency list of the reconstructed graph model, calculating the degree of each node and storing the degrees in an array k, where $k_i$ is the degree of node i; traversing the clique list in the matrix C and recording, for each node v, the number num of group structures to which it belongs; and finally calculating and outputting the extended modularity (EQ) value.

The EQ value is an evaluation index for the division result of overlapping groups (also called communities). Its value ranges from 0 to 1; the larger the value, the better the overlapping structure and the more similar the session characteristics of the nodes within a group.

The external node group evaluation uses an evaluation function, the EQ function, to evaluate the degree of node overlap among groups. Its calculation formula is as follows:

where M is the number of edges in the reconstructed graph model, $O_v$ is the number of external node groups to which node v belongs, matrix A is the adjacency matrix of the graph, matrix C represents the grouping structure of the external nodes, and $k_v$ is the degree of node v.
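The EQ formula is not reproduced in this text. The sketch below assumes the commonly used extended modularity for overlapping communities, EQ = (1 / 2M) · Σ over groups c, Σ over v, w in c of (1 / (O_v · O_w)) · (A_vw − k_v · k_w / 2M), which uses exactly the symbols defined above; whether the application uses this exact form is an assumption. The function name and the adjacency representation (node -> set of neighbours) are also illustrative.

```python
def extended_modularity(adj, groups):
    """Extended modularity (EQ) of a possibly overlapping grouping.

    adj: dict mapping node -> set of neighbours (undirected graph).
    groups: list of node sets.
    """
    m = sum(len(nbrs) for nbrs in adj.values()) / 2   # number of edges M
    if m == 0:
        return 0.0
    belongs = {}                                       # O_v per node
    for group in groups:
        for v in group:
            belongs[v] = belongs.get(v, 0) + 1
    total = 0.0
    for group in groups:
        for v in group:
            for w in group:
                a_vw = 1.0 if w in adj[v] else 0.0
                expected = len(adj[v]) * len(adj[w]) / (2 * m)
                total += (a_vw - expected) / (belongs[v] * belongs[w])
    return total / (2 * m)
```

Dividing each term by O_v · O_w spreads the contribution of a node that sits in several groups across those groups, so overlapping memberships are not counted multiple times.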

2.3 Abnormal node marking.

For each time slice in the selected time period, the external node grouping structure and the corresponding modularity evaluation index (EQ function value) are obtained from the corresponding network connection relation graph model.

In the experiment, the clique percolation algorithm is applied to community discovery on the graph structure, yielding the number of communities and the modularity index (EQ function value) of the resulting community structure. Different k values are set for the algorithm, and the influence of each k value on the community structure is compared by observing the EQ function value, so that a suitable k value can be selected.

In each time slice, the free nodes, i.e., nodes that do not belong to any group, are marked as suspicious abnormal nodes. The suspicious nodes are then checked against the external white-list IP nodes and white-listed nodes are eliminated; the remaining abnormal nodes can be used as a reference for locating abnormal traffic. If an abnormal node appears many times across all time slices, it is regarded as a node with abnormal traffic.

The flow of abnormal node marking is as follows:

(1) For the external node group structure of each time slice, find the set s of free nodes that do not belong to any group, and take the difference between s and the IP white list, i.e., remove the nodes in the white list.

(2) m is used to record the number of times each such IP appears across all time slices.

(3) The node IPs in m whose occurrence count is greater than or equal to 80% of the number of time slices are taken out, the white-list node IPs are filtered out, and the result is stored as the set abnormalIP.

(4) For each IP address in abnormalIP, the edges connecting it to the internal server are found in the graph model, and the port used by the internal server is checked against the normal application port numbers; if it is among them, the IP address is removed from abnormalIP.

(5) The abnormalIP set is output; the nodes it contains are abnormal.
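The marking flow can be sketched as follows. The input layout (one (external_nodes, groups) pair per time slice, plus a mapping from external IP to the server-side ports it used) is an assumption of this sketch, as is the reading of step (4): an IP is dropped only when it has recorded ports and all of them are normal application ports.

```python
from collections import Counter


def mark_abnormal(slices, whitelist, server_ports, normal_ports, ratio=0.8):
    """Return the set of external IPs marked abnormal.

    slices: list of (external_nodes, groups), one entry per time slice,
            where groups is a list of node sets.
    server_ports: dict ip -> set of server-side ports used by that ip
                  (hypothetical input derived from the graph edges).
    """
    whitelist, normal_ports = set(whitelist), set(normal_ports)
    m = Counter()                                    # step (2)
    for external_nodes, groups in slices:
        grouped = set().union(*groups)
        m.update(set(external_nodes) - grouped - whitelist)   # step (1)
    needed = ratio * len(slices)
    abnormal_ip = {ip for ip, n in m.items() if n >= needed} - whitelist  # step (3)
    result = set()
    for ip in abnormal_ip:                           # step (4)
        ports = server_ports.get(ip, set())
        if ports and ports <= normal_ports:
            continue                                 # only normal ports: drop
        result.add(ip)
    return result                                    # step (5)
```

Requiring a node to be free in at least 80% of the slices filters out hosts that fall outside a group only occasionally, for instance because they were briefly inactive.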

2.4 Community structure testing.

With the clique percolation algorithm, different k values lead to different community discovery results. The influence of different k values on the community structure therefore needs to be compared experimentally in order to select a suitable k value.

Fig. 9 and 10 show, for the same network connection relation graph structure, the number of communities and the modularity index (EQ function value) of the community structure obtained by the clique percolation algorithm under different k values. As k increases, the number of communities obtained by the clique percolation algorithm decreases continuously, and the EQ function value also decreases and gradually approaches 0. When the EQ function value is too low, the modularity of the community structure is close to 0 and community discovery is ineffective; in practical applications, k is therefore generally chosen between 4 and 6. On this data set the modularity of the resulting community structure is best when k = 4, so k = 4 is selected as the actual parameter of the CPM algorithm.

After the k value is selected, community discovery is performed on multiple slices within a period of time: 10 consecutive slices are selected, each with a time granularity of 5 minutes. The resulting number of communities and modularity index (EQ function value) of the community structure are shown in fig. 9 and 10. In the network connection relation graph structure, the number of communities and the modularity fluctuate within a certain range over a short period; a slice that deviates excessively indicates abnormal traffic in that slice.

In summary, the present application provides a graph model construction algorithm that takes traffic captured from a server as input and parses its connection information and time series. It provides a time series similarity calculation method improved from cosine similarity, and uses the time series features to calculate the similarity between external nodes, from which the distance between external vertices is obtained. By setting different thresholds, edges are added between external nodes with higher similarity to complete the graph model. The application further redefines the attributes and data structures of the cliques and external node groups of the reconstructed traffic graph model, clusters external nodes with similar session behavior patterns into groups using an improved clique percolation algorithm, finds the external abnormal nodes that are detached from the normal groups, optimizes the external node grouping result using an evaluation function, and marks single external nodes, or small groups of nodes, detached from the groups as abnormal traffic nodes in the network traffic.

Unlike the prior art, the application designs and implements a session feature similarity calculation method that can calculate the similarity between unequal-length vectors whose elements are two-tuples, and uses it to reconstruct the connection relationships among the external nodes of the traffic graph model. It also designs an improved clique percolation algorithm that discovers the internal relevance among external nodes, clusters external nodes with similar session behavior patterns into groups, and finds the external abnormal nodes detached from the normal groups, thereby improving the accuracy of abnormal external node detection.

Based on the same inventive concept, the application also provides a host traffic anomaly detection device based on the graph model, and the device is used for realizing the host traffic anomaly detection method based on the graph model in the method embodiment.

As shown in fig. 11, the apparatus 1100 includes: a communication module 1110 and a processing module 1120.

The communication module 1110 is configured to collect traffic records related to the target host within a set time period;

The processing module 1120 is configured to: generate a first traffic graph model according to the collected traffic records related to the target host within the set time period, where the first traffic graph model includes information of the target host, information of a plurality of external hosts, and a first type connection relationship between the external host and the target host, and the first type connection relationship is used to indicate a session feature between the external host and the target host; reconstruct the first traffic graph model according to similarity of session features between different external hosts and the target host in the first traffic graph model to generate a second traffic graph model, wherein the second traffic graph model comprises a second type of connection relationship between the external hosts, and the second type of connection relationship is used for indicating that the session behavior patterns of the two external hosts are similar; cluster and group the plurality of external nodes according to the similarity of the session behavior patterns of different external nodes in the second traffic graph model to obtain one or more external host groups, wherein each external host group comprises a plurality of external hosts; and determine the external hosts that do not belong to any of the external host groups as abnormal external hosts communicating with the target host.

In one possible design, each of the traffic records includes quintuple information of a session associated with the traffic record, the quintuple information includes a source IP address, a source port number, a destination IP address, a destination port number, and connection time, and the source IP address or the destination IP address in each of the traffic records is an IP address of the target host; the processing module 1120 is specifically configured to: aggregating the traffic records with the same four-tuple information in the traffic records related to the target host, and determining the total connection times and connection time sequence corresponding to each four-tuple information, wherein the four-tuple information comprises the source IP address, the source port number, the destination IP address and the destination port number; and generating the first traffic graph model according to the total connection times and the connection time sequence corresponding to the four-tuple information.

In one possible design, the processing module 1120 is specifically configured to: determining session similarity distances between the plurality of external hosts according to the first type connection relationship between each external host and the target host, wherein the session similarity distances are used for indicating the similarity degree of the session behavior patterns of the two external hosts; and if the session similarity distance between the two external hosts is greater than the product of the average session similarity distance and a preset threshold, establishing the second type connection relationship between the two external hosts.

In one possible design, one or more of the first type connection relationships exist between each of the external hosts and the target host; the processing module 1120 is specifically configured to: determining session feature similarity of a first type mapping relationship pair composed of a first type mapping relationship between a first external host and the target host and a first type mapping relationship between a second external host and the target host, wherein each first type mapping relationship pair satisfies a set condition; and determining the session similarity distance between the first external host and the second external host according to the session feature similarity of all the first type mapping relation pairs meeting the set condition between the first external host and the second external host.

In one possible design, the first type mapping relationship pair includes an ith first type mapping relationship between the first external host and the target host and a jth first type mapping relationship between the second external host and the target host, a connection time sequence in quadruple information corresponding to the ith first type mapping relationship is a first connection time sequence, a connection time sequence in quadruple information corresponding to the jth first type mapping relationship is a second connection time sequence, and i and j are integers; each connection time sequence comprises one or more time stamps and connection times num corresponding to each time stamp; the processing module 1120 is specifically configured to: if the length of the first connection time sequence is greater than that of the second connection time sequence and is less than or equal to twice of the length of the second connection time sequence, performing length matching on the first connection time sequence and the second connection time sequence; determining similarity between the first connection time series and the second connection time series after the length matching by the following formula:

In one possible design, the processing module 1120 is specifically configured to: if the length of the first connection time series is greater than twice the length of the second connection time series, determining that the similarity between the first connection time series and the second connection time series is 0.

In one possible design, the processing module 1120 is specifically configured to: and determining the session similarity distance between the first external host and the second external host according to the session feature similarity and the corresponding weight of all the first type mapping relation pairs meeting the set condition between the first external host and the second external host.

In one possible design, the processing module 1120 is specifically configured to: determining a session similarity distance between the first external host and the second external host by:

where v1 is the first external host, v2 is the second external host, distance(v1, v2) is the session similarity distance between the first external host and the second external host, n is the total number of session feature similarities between the first external host and the second external host, $dis_i$ is the i-th session feature similarity between the first external host and the second external host, and $weight_i$ is the weight corresponding to the i-th session feature similarity.

In one possible design, the processing module 1120 is specifically configured to: determining N cliques in the second traffic graph model according to an adjacency list of the second traffic graph model, wherein each clique comprises at most k external hosts, k is an integer greater than 3, and N is a positive integer; establishing, according to the N cliques, a clique relationship matrix corresponding to the second traffic graph model, wherein the clique relationship matrix is used for indicating the number of the external hosts included in each clique and the adjacency relation among different cliques; and traversing the clique relationship matrix in a breadth-first traversal mode, and merging the connected cliques to obtain the one or more external host groups.

In one possible design, the processing module 1120 is specifically configured to: determining a set of said external hosts that do not belong to any of said external host groups; and filtering the external hosts in the set according to a preset IP address white list, and determining the external hosts of which the IP addresses in the set do not belong to the IP address white list as the abnormal external hosts.

Based on the same technical concept, the embodiment of the present application further provides a computer device, as shown in fig. 12, including at least one processor 1201 and a memory 1202 connected to the at least one processor, where a specific connection medium between the processor 1201 and the memory 1202 is not limited in this embodiment, and the processor 1201 and the memory 1202 in fig. 12 are connected through a bus as an example. The bus may be divided into an address bus, a data bus, a control bus, etc.

In this embodiment, the memory 1202 stores instructions executable by the at least one processor 1201, and the at least one processor 1201 may implement the steps of the host traffic anomaly detection method based on the graph model by executing the instructions stored in the memory 1202.

The processor 1201 is the control center of the computer device. It can connect the various parts of the computer device by using various interfaces and lines, and performs resource setting by running or executing the instructions stored in the memory 1202 and calling the data stored in the memory 1202. Optionally, the processor 1201 may include one or more processing units, and the processor 1201 may integrate an application processor and a modem processor, wherein the application processor mainly handles an operating system, a user interface, an application program, and the like, and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 1201. In some embodiments, the processor 1201 and the memory 1202 may be implemented on the same chip, or in some embodiments, they may be implemented separately on separate chips.

The processor 1201 may be a general-purpose processor, such as a Central Processing Unit (CPU), a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, configured to implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present Application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.

Memory 1202, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 1202 may include at least one type of storage medium, for example, a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disk, and so on. The memory 1202 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1202 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.

Based on the same technical concept, embodiments of the present application further provide a computer-readable storage medium, where computer-readable instructions are stored, and when the computer reads and executes the computer-readable instructions, the method in the foregoing method embodiments is implemented.

Based on the same technical concept, the embodiment of the present application further provides a computer program product, which includes computer readable instructions, and when the computer readable instructions are executed by a processor, the method in the above method embodiment is implemented.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A host flow abnormity detection method based on a graph model is characterized by comprising the following steps:

generating a first traffic graph model according to collected traffic records related to a target host within a set time period, wherein the first traffic graph model comprises information of the target host, information of a plurality of external hosts and a first type connection relationship between the external hosts and the target host, and the first type connection relationship is used for indicating session characteristics between the external hosts and the target host;

reconstructing the first traffic graph model according to similarity of session features between different external hosts and the target host in the first traffic graph model to generate a second traffic graph model, wherein the second traffic graph model comprises a second type of connection relationship between the external hosts, and the second type of connection relationship is used for indicating that the session behavior patterns of the two external hosts are similar;

clustering and grouping the plurality of external hosts according to the similarity of the session behavior patterns of different external hosts in the second traffic graph model to obtain one or more external host groups, wherein each external host group comprises a plurality of external hosts;

determining the external host not belonging to any of the external host groups as an abnormal external host communicating with the target host.

2. The method of claim 1, wherein each of the traffic records includes five-tuple information of a session associated with the traffic record, the five-tuple information including a source IP address, a source port number, a destination IP address, a destination port number, and a connection time, and the source IP address or the destination IP address in each of the traffic records is an IP address of the target host;

the generating of the first traffic graph model according to the collected traffic records related to the target host within the set time period comprises the following steps:

aggregating, among the traffic records related to the target host, the traffic records having the same four-tuple information, and determining the total number of connections and the connection time series corresponding to each piece of four-tuple information, wherein the four-tuple information comprises the source IP address, the source port number, the destination IP address and the destination port number;

and generating the first traffic graph model according to the total number of connections and the connection time series corresponding to each piece of four-tuple information.
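The aggregation step of claim 2 can be illustrated with a minimal Python sketch. The record layout (a plain tuple), the function name aggregate_records and the representation of a connection time series as sorted (timestamp, count) pairs are assumptions made for illustration only; they are not specified by the application.

```python
from collections import Counter, defaultdict


def aggregate_records(records):
    """Group traffic records by four-tuple (hypothetical helper).

    Each record is assumed to be a tuple:
        (src_ip, src_port, dst_ip, dst_port, connection_time)

    Returns a dict mapping each four-tuple to
        (total_number_of_connections, connection_time_series)
    where the time series is a sorted list of (timestamp, count) pairs.
    """
    times_by_tuple = defaultdict(list)
    for src_ip, src_port, dst_ip, dst_port, conn_time in records:
        times_by_tuple[(src_ip, src_port, dst_ip, dst_port)].append(conn_time)

    aggregated = {}
    for four_tuple, times in times_by_tuple.items():
        # Count how many connections fall on each timestamp.
        series = sorted(Counter(times).items())
        aggregated[four_tuple] = (len(times), series)
    return aggregated
```

Each entry of the returned dictionary corresponds to one candidate edge between an external host and the target host in the first traffic graph model.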

3. The method of claim 2, wherein the information of the target host comprises an IP address of the target host, and the information of the external host comprises an IP address of the external host;

the first type connection relationship comprises the total number of connections and the connection time series corresponding to one piece of the four-tuple information between the external host and the target host.

4. The method according to claim 3, wherein reconstructing the first traffic graph model according to similarity of session features between different external hosts and the target host in the first traffic graph model to generate a second traffic graph model comprises:

determining session similarity distances between the plurality of external hosts according to the first type connection relationship between each external host and the target host, wherein the session similarity distances are used for indicating the similarity degree of the session behavior patterns of the two external hosts;

and if the session similarity distance between the two external hosts is greater than the product of the average session similarity distance and a preset threshold, establishing the second type connection relationship between the two external hosts.
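The thresholding rule of claim 4 can be sketched as follows. This is an illustrative sketch only: the function name build_similarity_edges is hypothetical, and the average session similarity distance is assumed here to be the arithmetic mean over all host pairs supplied, which the claim does not state explicitly.

```python
def build_similarity_edges(distances, threshold):
    """Select host pairs that receive a second type connection relationship.

    distances: dict mapping an unordered host pair (host_a, host_b) to its
               session similarity distance, one entry per pair.
    threshold: the preset threshold that scales the average distance.

    A pair is kept when its distance is greater than mean * threshold.
    """
    if not distances:
        return set()
    # Assumption: the average is taken over every pair in `distances`.
    mean_distance = sum(distances.values()) / len(distances)
    cutoff = mean_distance * threshold
    return {pair for pair, dist in distances.items() if dist > cutoff}
```

The returned pairs are the edges of the second traffic graph model; hosts that appear in no pair remain isolated vertices.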

5. The method of claim 4, wherein one or more of the first type connection relationships exist between each of the external hosts and the target host;

the determining the session similarity distance between the two external hosts comprises:

determining session feature similarity of a first type mapping relationship pair composed of a first type mapping relationship between a first external host and the target host and a first type mapping relationship between a second external host and the target host, wherein each first type mapping relationship pair satisfies a set condition;

and determining the session similarity distance between the first external host and the second external host according to the session feature similarity of all the first type mapping relation pairs meeting the set condition between the first external host and the second external host.

6. The method according to claim 5, wherein the first type of mapping relationship pair satisfying the set condition is:

in the four-tuple information corresponding to the two first type mapping relationships included in the first type mapping relationship pair, the source IP addresses are the same and the source port numbers are the same, or the destination IP addresses are the same and the destination port numbers are the same.
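The set condition of claim 6 amounts to a simple predicate on two four-tuples. The sketch below assumes each four-tuple is ordered as (source IP, source port, destination IP, destination port); the function name satisfies_set_condition is hypothetical.

```python
def satisfies_set_condition(tuple_a, tuple_b):
    """Return True when two four-tuples may form a comparable pair.

    The pair qualifies when both the source IP address and source port
    match, or when both the destination IP address and destination port
    match.
    """
    src_ip_a, src_port_a, dst_ip_a, dst_port_a = tuple_a
    src_ip_b, src_port_b, dst_ip_b, dst_port_b = tuple_b
    same_source = src_ip_a == src_ip_b and src_port_a == src_port_b
    same_destination = dst_ip_a == dst_ip_b and dst_port_a == dst_port_b
    return same_source or same_destination
```

For example, two external hosts that both connect to port 443 of the target host yield four-tuples with the same destination IP address and destination port, so their connection time series are compared.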

7. The method of claim 5, wherein the session feature similarity of the first type mapping relationship pair is:

the similarity between the connection time series in the four-tuple information respectively corresponding to the two first type mapping relationships included in the first type mapping relationship pair.

8. The method according to claim 7, wherein the first type mapping relationship pair includes an i-th first type mapping relationship between the first external host and the target host and a j-th first type mapping relationship between the second external host and the target host, the connection time series in the four-tuple information corresponding to the i-th first type mapping relationship is a first connection time series, the connection time series in the four-tuple information corresponding to the j-th first type mapping relationship is a second connection time series, and i and j are integers; each connection time series comprises one or more timestamps and a number of connections num corresponding to each timestamp;

the determining the session feature similarity of the first type mapping relationship pair includes:

if the length of the first connection time sequence is greater than that of the second connection time sequence and is less than or equal to twice the length of the second connection time sequence, matching the lengths of the first connection time sequence and the second connection time sequence;

determining similarity between the first connection time series and the second connection time series after the length matching by the following formula:

9. The method of claim 8, wherein the determining the session feature similarity of the first-type mapping relationship pair further comprises:

if the length of the first connection time series is greater than twice the length of the second connection time series, determining that the similarity between the first connection time series and the second connection time series is 0.

10. The method according to claim 5, wherein the determining a session similarity distance between the first external host and the second external host according to the session feature similarity of all the first type mapping relationship pairs satisfying the set condition between the first external host and the second external host comprises:

and determining the session similarity distance between the first external host and the second external host according to the session feature similarity and the corresponding weight of all the first type mapping relation pairs meeting the set condition between the first external host and the second external host.

11. The method of claim 10, wherein the session similarity distance between the first external host and the second external host is determined by the following formula:

12. The method according to claim 10 or 11, wherein the weight corresponding to the session feature similarity of a first type mapping relationship pair satisfying the set condition is:

the sum of the lengths of the connection time series in the four-tuple information respectively corresponding to the two first type mapping relationships included in the first type mapping relationship pair.
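The weighting of claims 10 and 12 can be illustrated with a short sketch. The formula of claim 11 is not reproduced in this text, so the normalized weighted average used below is an assumption and should not be read as that formula; only the weight itself (the sum of the two series lengths) is taken from claim 12. The function name weighted_session_distance is hypothetical.

```python
def weighted_session_distance(pairs):
    """Combine per-pair similarities into one value for two external hosts.

    pairs: list of (similarity, len_first_series, len_second_series), one
           entry per first type mapping relationship pair that satisfies
           the set condition.

    Weight of each pair = len_first_series + len_second_series (claim 12).
    Assumption: the combination is a weighted average; the actual formula
    of claim 11 is not available here.
    """
    total_weight = sum(len_a + len_b for _, len_a, len_b in pairs)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(sim * (len_a + len_b) for sim, len_a, len_b in pairs)
    return weighted_sum / total_weight
```

Weighting by series length lets pairs backed by more observed connections contribute more to the result than pairs observed only a few times.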

13. The method of claim 1, wherein clustering and grouping the plurality of external hosts according to the similarity of the session behavior patterns of different external hosts in the second traffic graph model to obtain one or more external host groups comprises:

determining N cliques in the second traffic graph model according to an adjacency list of the second traffic graph model, wherein each clique comprises at most k external hosts, k is an integer greater than 3, and N is a positive integer;

establishing, according to the N cliques, a clique relation matrix corresponding to the second traffic graph model, wherein the clique relation matrix is used for indicating the number of the external hosts included in each clique and the adjacency relation among different cliques;

and traversing the clique relation matrix in a breadth-first manner, and merging the different cliques that are connected to one another to obtain the one or more external host groups.

14. The method of claim 13, wherein the clique relation matrix has N rows and N columns, wherein each of the N rows corresponds to one of the N cliques, and each of the N columns corresponds to one of the N cliques;

each off-diagonal element in the clique relation matrix is equal to the number of external hosts shared by the clique corresponding to the row in which the element is located and the clique corresponding to the column in which the element is located, and each diagonal element is equal to the number of external hosts included in the clique corresponding to that diagonal element.
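The relation matrix of claim 14 and the breadth-first merging of claim 13 can be sketched as below. The sketch assumes the N groups of claim 13 have already been found and are given as Python sets of host identifiers. The min_overlap parameter, which decides when two groups count as connected, is a hypothetical parameter introduced here; the claims do not state the overlap required.

```python
from collections import deque


def clique_relation_matrix(cliques):
    """N x N matrix: diagonal = clique size, off-diagonal = shared hosts."""
    n = len(cliques)
    return [[len(cliques[i] & cliques[j]) for j in range(n)] for i in range(n)]


def merge_cliques(cliques, min_overlap=1):
    """Merge connected cliques into host groups by breadth-first traversal.

    cliques: list of sets of host identifiers.
    min_overlap: assumed minimum number of shared hosts for two cliques
                 to be treated as adjacent (hypothetical parameter).
    """
    matrix = clique_relation_matrix(cliques)
    n = len(cliques)
    visited = [False] * n
    groups = []
    for start in range(n):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        members = set()
        while queue:
            i = queue.popleft()
            members |= cliques[i]
            for j in range(n):
                if not visited[j] and matrix[i][j] >= min_overlap:
                    visited[j] = True
                    queue.append(j)
        groups.append(members)
    return groups
```

Each returned set is one external host group; an external host that appears in none of the returned sets is a candidate abnormal host under claim 1.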

15. The method of claim 1, wherein determining the external host not belonging to any of the external host groups as an abnormal external host in communication with the target host comprises:

determining a set of said external hosts that do not belong to any of said external host groups;

and filtering the external hosts in the set according to a preset IP address white list, and determining the external hosts of which the IP addresses in the set do not belong to the IP address white list as the abnormal external hosts.
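The filtering of claim 15 reduces to two set differences. The following sketch assumes hosts are identified by IP address strings; the function name find_abnormal_hosts is hypothetical.

```python
def find_abnormal_hosts(all_hosts, host_groups, whitelist):
    """Return external hosts that are ungrouped and not whitelisted.

    all_hosts:   iterable of external host IP addresses.
    host_groups: iterable of sets, one per external host group.
    whitelist:   iterable of IP addresses treated as known-good.
    """
    grouped = set().union(*host_groups)
    ungrouped = set(all_hosts) - grouped
    return ungrouped - set(whitelist)
```

The whitelist step removes ungrouped hosts that are already known to be benign, so that only the remaining ungrouped hosts are reported as abnormal.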

16. A host traffic anomaly detection device based on a graph model, characterized by comprising:

a processing module configured to: generate a first traffic graph model according to collected traffic records related to a target host within a set time period, wherein the first traffic graph model comprises information of the target host, information of a plurality of external hosts and a first type connection relationship between the external hosts and the target host, and the first type connection relationship is used for indicating session characteristics between the external hosts and the target host; reconstruct the first traffic graph model according to similarity of session features between different external hosts and the target host in the first traffic graph model to generate a second traffic graph model, wherein the second traffic graph model comprises a second type connection relationship between the external hosts, and the second type connection relationship is used for indicating that the session behavior patterns of the two external hosts are similar; cluster and group the plurality of external hosts according to the similarity of the session behavior patterns of different external hosts in the second traffic graph model to obtain one or more external host groups, wherein each external host group comprises a plurality of external hosts; and determine the external host which does not belong to any external host group as an abnormal external host which communicates with the target host.

17. A computer device, comprising:

a memory for storing program instructions;

a processor for calling program instructions stored in said memory and for executing the method of any one of claims 1 to 15 in accordance with the obtained program instructions.

18. A computer-readable storage medium comprising computer-readable instructions which, when read and executed by a computer, cause the method of any one of claims 1 to 15 to be carried out.

19. A computer program product comprising computer readable instructions which, when executed by a processor, cause the method of any of claims 1 to 15 to be carried out.