CN114186232A

CN114186232A - Network attack team identification method and device, electronic equipment and storage medium

Info

Publication number: CN114186232A
Application number: CN202111519490.2A
Authority: CN
Inventors: 蒙家晓; 蒋屹新; 匡晓云; 陈晓; 许爱东; 关泽武; 陈霖; 杜金燃; 洪超; 戴涛; 徐传懋; 赖博宇; 黄建理
Original assignee: CSG Electric Power Research Institute; China Southern Power Grid Co Ltd
Current assignee: CSG Electric Power Research Institute; China Southern Power Grid Co Ltd
Priority date: 2021-12-13
Filing date: 2021-12-13
Publication date: 2022-03-15

Abstract

The invention discloses a network attack team identification method and device, electronic equipment and a storage medium, which address the technical problem that network attacks cannot be effectively and accurately identified from massive alarm information. The method comprises the following steps: extracting network attack log data from a preset database; standardizing the network attack log data to obtain a standardized log data set; clustering data objects in the standardized log data set to obtain a plurality of network attack teams; generating a team portrait of each network attack team; and, when network abnormal information is received, matching the network abnormal information against the team portraits to determine the target network attack team corresponding to the network abnormal information.

Description

Network attack team identification method and device, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of network attack identification, in particular to a network attack team identification method and device, electronic equipment and a storage medium.

Background

A cyber attack refers to any type of offensive action directed at a computer information system, infrastructure, computer network, or personal computer device. For computers and computer networks, destroying, revealing, modifying, or disabling software or services, or stealing or accessing data on any computer without authorization, is considered an attack.

Intrusion detection is a reasonable supplement of a firewall, helps a system to deal with network attacks, expands the security management capability of a system administrator (including security audit, monitoring, attack identification and response), and improves the integrity of an information security infrastructure. It collects information from several key points in the computer network system and analyzes the information to see if there is a breach of security policy and evidence of an attack in the network. Intrusion detection is considered as a second security gate behind a firewall and can monitor the network without affecting the network performance, thereby providing real-time protection against internal attacks, external attacks and misoperations.

Current intrusion detection systems mainly monitor network traffic, compare the characteristics of that traffic against a rule base, and raise alarms on abnormal traffic. Because security vendors and hackers scan the internet continuously, an intrusion detection system generates a large volume of alarm information; security personnel are overwhelmed by processing many meaningless alarms, and real attack alarms may be buried among them. As a result, network attacks cannot be defended against effectively and accurately.

Disclosure of Invention

The invention provides a network attack team identification method, a network attack team identification device, electronic equipment and a storage medium, which are used for solving the technical problem that network attacks cannot be effectively and accurately identified from massive alarm information.

The invention provides a network attack team identification method, which comprises the following steps:

extracting network attack log data from a preset database;

standardizing the network attack log data to obtain a standardized log data set;

clustering data objects in the standardized log data set to obtain a plurality of network attack teams;

generating a team portrait of each of the network attack teams;

when network abnormal information is received, matching the network abnormal information against the team portraits, and determining a target network attack team corresponding to the network abnormal information.

Optionally, the step of clustering the data objects in the standardized log data set to obtain a plurality of network attack teams includes:

calculating a first Euclidean distance for each of said data objects in said standardized log data set and an average distance for all of said data objects;

extracting a first sample from the standardized log data set according to the first Euclidean distance and the average distance, generating a first sample data set;

counting a total number of data of a first sample in the first sample dataset;

calculating a first clustering number according to the total data number;

clustering first samples in the first sample data set based on the first clustering number to obtain a second sample data set; the second sample data set includes a plurality of first clusters, the number of which corresponds to the first clustering number; each first cluster corresponds to one second sample;

calculating a second clustering number according to the first clustering number;

extracting two second samples with the minimum second Euclidean distance from the second sample data set, generating a second cluster, and adding the second cluster into a preset third sample data set;

judging whether the number of second clusters in the third sample data set is equal to the second clustering number;

if yes, respectively calculating the arithmetic mean of the first samples in each second cluster;

judging whether the differences between the arithmetic means of every pair of second clusters are all larger than a preset threshold value;

and if so, determining each second cluster as a network attack team.

Optionally, the method further comprises:

and if the number of second clusters in the third sample data set is not equal to the second clustering number, returning to the step of extracting two second samples with the minimum second Euclidean distance from the second sample data set, generating a second cluster, and adding the second cluster into a preset third sample data set.

Optionally, the method further comprises:

and if the difference between the arithmetic means of any two second clusters is not larger than the preset threshold value, setting the second clustering number as the first clustering number, setting the third sample data set as the second sample data set, and returning to the step of calculating the second clustering number according to the first clustering number.

Optionally, the step of extracting a first sample from the standardized log data set according to the first Euclidean distance and the average distance to generate a first sample data set includes:

and extracting data objects whose first Euclidean distance is not greater than the average distance from the standardized log data set as first samples, and generating a first sample data set.

The invention also provides a network attack team identification device, which comprises:

the extraction module is used for extracting the network attack log data from a preset database;

the standardized processing module is used for carrying out standardized processing on the network attack log data to obtain a standardized log data set;

the clustering module is used for clustering data objects in the standardized log data set to obtain a plurality of network attack teams;

the team portrait generation module is used for generating a team portrait of each network attack team;

and the target network attack team determining module is used for matching the network abnormal information against the team portraits when the network abnormal information is received, and determining a target network attack team corresponding to the network abnormal information.

Optionally, the clustering module includes:

a first Euclidean distance and average distance calculation submodule, configured to calculate a first Euclidean distance for each of said data objects in said standardized log data set and an average distance for all of said data objects;

a first sample data set generation submodule, configured to extract a first sample from the standardized log data set according to the first Euclidean distance and the average distance, and generate a first sample data set;

a data total generation submodule, configured to count a data total of a first sample in the first sample data set;

the first clustering number calculating submodule is used for calculating a first clustering number according to the total data number;

a second sample data set generation submodule, configured to cluster first samples in the first sample data set based on the first clustering number, so as to obtain a second sample data set; the second sample data set includes a plurality of first clusters, the number of which corresponds to the first clustering number; each first cluster corresponds to one second sample;

the second clustering number calculating submodule is used for calculating a second clustering number according to the first clustering number;

a second cluster generation submodule, configured to extract two second samples with a minimum second Euclidean distance from the second sample data set, generate a second cluster, and add the second cluster into a preset third sample data set;

a first judgment submodule, configured to judge whether the number of second clusters in the third sample data set is equal to the second clustering number;

an arithmetic mean calculation submodule, configured to calculate, if so, the arithmetic mean of the first samples in each second cluster;

a second judgment submodule, configured to judge whether the differences between the arithmetic means of every pair of second clusters are all larger than a preset threshold value;

and a network attack team determining submodule, configured to determine, if so, each second cluster as a network attack team.

Optionally, the clustering module further comprises:

and a first returning submodule, configured to, if the number of second clusters in the third sample data set is not equal to the second clustering number, return to the step of extracting two second samples with a minimum second Euclidean distance from the second sample data set, generating a second cluster, and adding the second cluster to a preset third sample data set.

The invention also provides an electronic device comprising a processor and a memory:

the memory is used for storing program codes and transmitting the program codes to the processor;

the processor is configured to execute, according to instructions in the program code, the network attack team identification method described in any one of the above.

The present invention also provides a computer-readable storage medium for storing program code for executing the network attack team identification method as described in any one of the above.

According to the technical scheme, the invention has the following advantages: network attack log data is clustered to obtain a plurality of network attack teams, each sharing the same attack behaviors, and network abnormal information received in real time is then matched against these network attack teams to quickly judge whether it is a network attack launched by one of them, so that network attacks are effectively and accurately identified.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.

FIG. 1 is a flowchart illustrating steps of a network attack team identification method according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating steps of a network attack team identification method according to another embodiment of the present invention;

FIG. 3 is a block diagram of a network attack team identification apparatus according to an embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a network attack team identification method, a network attack team identification device, electronic equipment and a storage medium, which are used for solving the technical problem that network attacks cannot be effectively and accurately identified from massive alarm information.

In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, fig. 1 is a flowchart illustrating steps of a network attack team identification method according to an embodiment of the present invention.

The network attack team identification method provided by the invention specifically comprises the following steps:

step 101, extracting network attack log data from a preset database;

in the embodiment of the invention, a database storing the network behavior log data can be connected, and the network attack log data needing clustering can be selected from the database.

The network attack log data mainly comprises field information such as source IP, destination IP, generation time, log abstract and the like.
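As an illustration of step 101, the following is a minimal sketch that reads log records carrying the four fields named above from a database. The table name `attack_log`, the column names, the `AttackLogRecord` class and the use of SQLite are all assumptions made for this example; the text does not specify a database or schema.

```python
import sqlite3
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AttackLogRecord:
    """One network attack log entry (hypothetical field names)."""
    source_ip: str
    destination_ip: str
    generated_at: str  # generation time, kept as text in this sketch
    summary: str       # log abstract


def extract_attack_logs(conn: sqlite3.Connection) -> List[AttackLogRecord]:
    """Read every row of the hypothetical attack_log table, oldest first."""
    rows = conn.execute(
        "SELECT source_ip, destination_ip, generated_at, summary "
        "FROM attack_log ORDER BY generated_at"
    ).fetchall()
    return [AttackLogRecord(*row) for row in rows]


# Build a tiny in-memory database so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attack_log ("
    "source_ip TEXT, destination_ip TEXT, generated_at TEXT, summary TEXT)"
)
conn.executemany(
    "INSERT INTO attack_log VALUES (?, ?, ?, ?)",
    [
        ("120.23.44.55", "10.0.0.8", "2021-12-01T08:00:00", "SQL injection attempt"),
        ("120.23.44.56", "10.0.0.8", "2021-12-01T08:05:00", "port scan"),
    ],
)
records = extract_attack_logs(conn)
```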

Step 102, standardizing the network attack log data to obtain a standardized log data set;

in the embodiment of the present invention, the normalization processing is to determine whether the IP address format of the cyber attack log data is the standard IP address format, and if not, modify the IP address format of the cyber attack log data into the standard IP address format, such as 120.23.44.55, so as to obtain a normalized log data set.
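The IP address standardization described above can be sketched as follows. The text does not say which non-standard formats occur, so this example assumes the deviations are surrounding whitespace and zero-padded octets; the function name `normalize_ipv4` is hypothetical, and treating padded octets as decimal (rather than octal) is also an assumption.

```python
def normalize_ipv4(raw: str) -> str:
    """Return the address in plain dotted-decimal form, e.g. 120.23.44.55.

    Raises ValueError when the input cannot be read as four octets.
    """
    parts = raw.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"not an IPv4 address: {raw!r}")
    octets = []
    for part in parts:
        # Accept only ASCII decimal digits; this also rejects empty octets.
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"invalid octet {part!r} in {raw!r}")
        value = int(part)  # assumption: zero-padded octets are decimal
        if value > 255:
            raise ValueError(f"octet out of range in {raw!r}")
        octets.append(str(value))
    return ".".join(octets)
```

For example, `normalize_ipv4(" 120.023.044.055 ")` yields the standard form used in the text, while an address with only three octets is rejected.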

Step 103, clustering data objects in the standardized log data set to obtain a plurality of network attack teams;

after the standardized log data set is obtained, the data objects in the standardized log data set can be clustered to obtain a plurality of abnormal data sets, wherein each abnormal data set corresponds to one network attack team.

It is to be understood that clustering can group together attacks that share the same behavior patterns, preferred attack methods, and characteristics. Generally speaking, attacks with the same behavior patterns, preferred attack methods, and characteristics are likely to originate from the same network attack team. Thus, clustering partitions the standardized log data set by network attack team.

In one example, for clustering of standardized log data sets, a K-means algorithm may be employed.

The K-means clustering algorithm is an iterative clustering algorithm. The data is to be divided into K groups: K objects are randomly selected as initial cluster centers, the distance between each object and each cluster center is calculated, and each object is assigned to the nearest cluster center. A cluster center together with the objects assigned to it represents a cluster.
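The K-means procedure described above can be sketched in plain Python as follows. This is a generic textbook version, not necessarily the variant used in the invention; it assumes the data objects have already been encoded as numeric feature vectors, and the names `k_means`, `euclidean` and `mean_point` are chosen for this example only. The center update and the stop-when-unchanged rule are standard K-means steps added to complete the iteration.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points: Sequence[Point]) -> Point:
    """Component-wise arithmetic mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def k_means(points: Sequence[Point], k: int,
            iterations: int = 100, seed: int = 0) -> List[List[Point]]:
    """Partition points into k clusters; returns the clusters as lists."""
    rng = random.Random(seed)
    centers = rng.sample(list(points), k)  # k random objects as initial centers
    clusters: List[List[Point]] = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each object to its nearest cluster center.
            nearest = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return clusters
```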

Step 104, generating a team portrait of each network attack team;

after the data objects in the standardized log data set are clustered to obtain a plurality of network attack teams, a team portrait of each network attack team can be generated. The team portrait characterizes the behavior patterns, preferences, attack methods and features of each network attack team.

Step 105, when network abnormal information is received, matching the network abnormal information against the plurality of team portraits, and determining a target network attack team corresponding to the network abnormal information.

After the team portrait of each network attack team is formed, the network attacks of each network attack team can be accurately located within massive network abnormal information, so that real network attacks can be effectively and accurately identified in it.
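The text does not define how a team portrait is represented or how matching is scored. As one possible reading, the sketch below assumes each portrait is a numeric feature vector (for example the mean of the team's clustered data objects), that incoming abnormal information is encoded the same way, and that matching means picking the nearest portrait within a cutoff. The function `match_team` and the `max_distance` parameter are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def match_team(anomaly: Sequence[float],
               portraits: Dict[str, Sequence[float]],
               max_distance: float) -> Optional[str]:
    """Return the team whose portrait is nearest to the anomaly.

    Returns None when even the nearest portrait is farther away than
    max_distance, i.e. the anomaly matches no known team.
    """
    best_team: Optional[str] = None
    best_distance = math.inf
    for team, portrait in portraits.items():
        distance = math.dist(anomaly, portrait)  # Euclidean distance
        if distance < best_distance:
            best_team, best_distance = team, distance
    return best_team if best_distance <= max_distance else None
```

Returning `None` for distant inputs reflects the stated goal of separating real team attacks from the bulk of unrelated alarms; a different similarity measure could be substituted without changing the overall flow.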

In the embodiment of the invention, network attack log data is clustered to obtain a plurality of network attack teams, each sharing the same attack behaviors, and network abnormal information received in real time is then matched against these network attack teams to quickly judge whether it is a network attack launched by one of them, so that network attacks are effectively and accurately identified.

Referring to fig. 2, fig. 2 is a flowchart illustrating steps of a network attack team identification method according to another embodiment of the present invention. The method specifically comprises the following steps:

step 201, extracting network attack log data from a preset database;

step 202, carrying out standardization processing on the network attack log data to obtain a standardized log data set;

the steps 201-202 are similar to the steps 101-102, and for details, reference may be made to the description of the steps 101-102, which is not repeated herein.

Step 203, calculating a first Euclidean distance of each data object in the standardized log data set and an average distance of all data objects;

step 204, extracting a first sample from the standardized log data set according to the first Euclidean distance and the average distance to generate a first sample data set;

in the embodiment of the invention, before clustering, isolated points can be found out from the standardized log data set to eliminate the interference of the isolated points on clustering, so as to obtain a first sample data set without the isolated points.

In a specific implementation, the first Euclidean distance of each data object in the standardized log data set and the average distance of all data objects can be calculated, and isolated points can then be filtered out by comparing each data object's first Euclidean distance with the average distance.

In one example, the step of extracting a first sample from the standardized log data set according to the first Euclidean distance and the average distance to generate a first sample data set may comprise:

extracting data objects whose first Euclidean distance is not greater than the average distance from the standardized log data set as first samples, and generating a first sample data set.

In an embodiment of the present invention, when the first Euclidean distance of a data object is greater than the average distance, the data object may be regarded as an isolated point in the standardized log data set and should be deleted. The data objects whose first Euclidean distance is not greater than the average distance are taken as first samples to generate the first sample data set.
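The isolated-point filter can be sketched as below. The text does not state what the first distance of a data object is measured against; this example assumes it is the Euclidean distance from the object to the centroid of the whole data set, and that the average distance is the mean of those distances. Both are assumptions, as is the function name `filter_isolated_points`.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def filter_isolated_points(points: Sequence[Point]) -> List[Point]:
    """Keep only the points whose distance does not exceed the average."""
    # Assumption: distances are measured to the centroid of all points.
    centroid = tuple(sum(col) / len(points) for col in zip(*points))
    distances = [math.dist(p, centroid) for p in points]
    average = sum(distances) / len(distances)
    # Points farther away than the average are treated as isolated points.
    return [p for p, d in zip(points, distances) if d <= average]
```

Under this reading, a single far-away object raises the average enough that the remaining objects pass the filter. When no such object exists, a cutoff at the average can still remove a considerable share of the data, which is worth checking on real logs.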

Step 205, counting the total data number of the first sample in the first sample data set;

step 206, calculating a first cluster number according to the total data number;

after the isolated points are eliminated, a total number n of data for the first sample in the first sample data set may be calculated, and a first clustering number k may be calculated from the total number n of data:

k = n^0.5
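Because n^0.5 is generally not an integer while a cluster count must be one, some rounding is needed. The text does not say which; the sketch below assumes rounding to the nearest integer with a minimum of 1, and the function name `first_cluster_number` is hypothetical.

```python
import math


def first_cluster_number(n: int) -> int:
    """Compute k = n^0.5, rounded to the nearest integer (assumed rule)."""
    if n < 1:
        raise ValueError("the first sample data set must not be empty")
    return max(1, round(math.sqrt(n)))
```

With 100 first samples this gives k = 10; with 10 first samples, n^0.5 is about 3.16 and the assumed rounding gives k = 3.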

step 207, clustering the first samples in the first sample data set based on the first clustering number to obtain a second sample data set; the second sample data set contains a plurality of first clusters, the number of which corresponds to the first clustering number; each first cluster corresponds to one second sample;

the first sample data set is then input into the K-means algorithm, which yields k first clusters, and each first cluster is taken as a second sample for the next round of clustering.

Step 208, calculating a second clustering number according to the first clustering number;

after the first sample data set is clustered to obtain k first clusters, the second clustering number K for re-clustering can be calculated based on the first clustering number k:

step 209, extracting two second samples with the minimum second Euclidean distance from the second sample data set, generating a second cluster, and adding the second cluster into a preset third sample data set;

the two second samples with the minimum second Euclidean distance in the second sample data set are added to the third sample data set as one second cluster, and the two second samples are deleted from the second sample data set.
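One pass of step 209 can be sketched as follows. Each second sample is represented here as the list of first samples in its first cluster, and the distance between two second samples is assumed to be the Euclidean distance between their centroids; the text does not fix either choice. The names `merge_closest_pair` and `centroid` are hypothetical.

```python
import itertools
import math
from typing import List, Tuple

Point = Tuple[float, ...]
Cluster = List[Point]


def centroid(cluster: Cluster) -> Point:
    """Component-wise arithmetic mean of the points in a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def merge_closest_pair(second_samples: List[Cluster],
                       third_samples: List[Cluster]) -> None:
    """Move the two closest second samples into third_samples as one cluster.

    Both lists are modified in place.
    """
    if len(second_samples) < 2:
        raise ValueError("need at least two second samples to merge")
    # Find the index pair (i < j) with the smallest centroid distance.
    i, j = min(
        itertools.combinations(range(len(second_samples)), 2),
        key=lambda pair: math.dist(centroid(second_samples[pair[0]]),
                                   centroid(second_samples[pair[1]])),
    )
    third_samples.append(second_samples[i] + second_samples[j])
    # Delete the larger index first so the smaller one stays valid.
    del second_samples[j]
    del second_samples[i]
```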

Step 210, judging whether the number of second clusters in the third sample data set is equal to the second clustering number;

step 211, if yes, calculating the arithmetic mean of the first samples in each second cluster respectively;

it is then judged whether the number of second clusters in the third sample data set is equal to the second clustering number. If so, the current round of clustering is complete, and the arithmetic mean of the first samples in each second cluster is calculated so that whether to end clustering can be decided from these arithmetic means.

It should be noted that, in the embodiment of the present invention, the method further includes: if the number of second clusters in the third sample data set is not equal to the second clustering number, returning to the step of extracting two second samples with the minimum second Euclidean distance from the second sample data set, generating a second cluster, and adding the second cluster into the preset third sample data set.

Specifically, when the number of second clusters in the third sample data set is not equal to the second clustering number, the clustering is not yet complete, and step 209 may be repeated until the number of second clusters in the third sample data set is equal to the second clustering number.

Step 212, judging whether the differences between the arithmetic means of every pair of second clusters are all larger than a preset threshold value;

step 213, if yes, determining each second cluster as a network attack team;

in the embodiment of the present invention, whether to end clustering may be decided by judging whether the differences between the arithmetic means of every pair of second clusters are all greater than the preset threshold. If so, no two second clusters are similar; the data in each second cluster then forms an abnormal data set, and the data in one abnormal data set is regarded as originating from the same network attack team.

It is noted that, in the embodiment of the present invention, the method further includes:

and if the difference between the arithmetic means of any two second clusters is not larger than the preset threshold value, setting the second clustering number as the first clustering number, setting the third sample data set as the second sample data set, and returning to the step of calculating the second clustering number according to the first clustering number.

Specifically, when the difference between the arithmetic means of two second clusters is not greater than the preset threshold, the two second clusters are similar and re-clustering can be performed: the second clustering number is set as the first clustering number, the third sample data set is set as the second sample data set, and the process returns to step 208. This is repeated until the differences between the arithmetic means of every pair of second clusters are all greater than the preset threshold. It should be noted that the size of the preset threshold may be set flexibly according to the actual application, and this is not specifically limited in the embodiment of the present invention.
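The termination test of steps 212 and 213 can be sketched as below. The text does not define how the difference between two arithmetic means is measured for multi-dimensional data; this example assumes the Euclidean distance between the mean vectors. The function name `clusters_are_distinct` is hypothetical, and only the check itself is shown because the formula for the second clustering number does not appear in this text.

```python
import itertools
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def clusters_are_distinct(clusters: Sequence[List[Point]],
                          threshold: float) -> bool:
    """True when every pair of cluster means differs by more than threshold."""
    means = [tuple(sum(col) / len(c) for col in zip(*c)) for c in clusters]
    # Assumption: "difference" is the Euclidean distance between mean vectors.
    return all(math.dist(a, b) > threshold
               for a, b in itertools.combinations(means, 2))
```

When the function returns `True`, each second cluster would be taken as a network attack team; when it returns `False`, at least one pair is still similar and the procedure goes back to step 208.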

Step 214, generating a team representation of each network attack team;

Step 215, when network abnormal information is received, matching the network abnormal information against the team portraits, and determining a target network attack team corresponding to the network abnormal information.

Referring to fig. 3, fig. 3 is a block diagram illustrating a network attack team identification apparatus according to an embodiment of the present invention.

The embodiment of the invention provides a network attack team identification device, which comprises:

an extracting module 301, configured to extract network attack log data from a preset database;

the standardization processing module 302 is used for standardizing the network attack log data to obtain a standardization log data set;

the clustering module 303 is configured to cluster data objects in the standardized log data set to obtain a plurality of network attack teams;

a team portrait generation module 304, configured to generate a team portrait of each network attack team;

and the target network attack team determining module 305 is used for matching the network abnormal information against the plurality of team portraits when the network abnormal information is received, and determining a target network attack team corresponding to the network abnormal information.

In this embodiment of the present invention, the clustering module 303 includes:

a first Euclidean distance and average distance calculation submodule, used for calculating a first Euclidean distance of each data object and an average distance of all data objects in the standardized log data set;

the first sample data set generation submodule is used for extracting a first sample from the standardized log data set according to the first Euclidean distance and the average distance and generating a first sample data set;

the data total generation submodule is used for counting the data total of the first sample in the first sample data set;

the second sample data set generation submodule is used for clustering the first samples in the first sample data set based on the first clustering number to obtain a second sample data set; the second sample data set contains a plurality of first clusters, the number of which corresponds to the first clustering number; each first cluster corresponds to one second sample;

the second cluster generation submodule is used for extracting two second samples with the minimum second Euclidean distance from the second sample data set, generating a second cluster and adding the second cluster into a preset third sample data set;

the first judgment submodule is used for judging whether the number of the second clustering clusters in the third sample data set is equal to the second clustering number or not;

the arithmetic mean calculating submodule is used for calculating, if so, the arithmetic mean of the first samples in each second cluster;

In this embodiment of the present invention, the clustering module 303 further includes:

the first returning submodule is used for returning to the step of extracting two second samples with the minimum second Euclidean distance from the second sample data set, generating a second cluster and adding the second cluster into a preset third sample data set, if the number of second clusters in the third sample data set is not equal to the second clustering number;

and the second returning submodule is used for setting the second clustering number as the first clustering number, setting the third sample data set as the second sample data set and returning to the step of calculating the second clustering number according to the first clustering number, if the difference between the arithmetic means of any two second clusters is not larger than the preset threshold value.

In an embodiment of the present invention, the first sample data set generation submodule includes:

a first sample data set generating unit, used for extracting data objects whose first Euclidean distance is not greater than the average distance from the standardized log data set as first samples, and generating a first sample data set.

An embodiment of the present invention further provides an electronic device, where the device includes a processor and a memory:

the memory is used for storing the program codes and transmitting the program codes to the processor;

the processor is used for executing the network attack team identification method according to the embodiment of the invention according to the instructions in the program codes.

The embodiment of the invention also provides a computer-readable storage medium which is used for storing the program codes, and the program codes are used for executing the network attack team identification method of the embodiment of the invention.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A network attack team identification method, characterized by comprising the following steps:

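extracting network attack log data from a preset database;

standardizing the network attack log data to obtain a standardized log data set;

clustering data objects in the standardized log data set to obtain a plurality of network attack teams;

generating a team representation of each of the network attack teams;

when network abnormal information is received, matching the network abnormal information against the team representations, and determining a target network attack team corresponding to the network abnormal information.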
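As an illustration only, the overall flow summarized in the abstract (extract, standardize, cluster, generate a team representation, match) can be sketched in Python. This is a minimal sketch under stated assumptions, not the claimed implementation: the record fields (`src_net`, `attack_type`, `tool`), the sample logs, and all function names are hypothetical, and grouping by a shared field value merely stands in for the clustering step, whose details are recited in the dependent claims.

```python
from collections import Counter, defaultdict


def standardize(record):
    """Trim whitespace and lowercase every field so equal values compare equal."""
    return {key: str(value).strip().lower() for key, value in record.items()}


def cluster(records, key="tool"):
    """Stand-in for the clustering step: group records that share one field value."""
    teams = defaultdict(list)
    for record in records:
        teams[record[key]].append(record)
    return dict(teams)


def build_profile(team_records):
    """Summarize a team by the most frequent value of each field."""
    fields = team_records[0].keys()
    return {f: Counter(r[f] for r in team_records).most_common(1)[0][0] for f in fields}


def match(anomaly, profiles):
    """Return the team whose profile agrees with the anomaly on the most fields."""
    anomaly = standardize(anomaly)

    def score(team):
        return sum(1 for k, v in anomaly.items() if profiles[team].get(k) == v)

    return max(profiles, key=score)


# Hypothetical attack log records standing in for the preset database.
raw_logs = [
    {"src_net": "10.0.1", "attack_type": " SQL_Injection", "tool": "sqlmap"},
    {"src_net": "10.0.1", "attack_type": "sql_injection ", "tool": "SQLMap"},
    {"src_net": "172.16.5", "attack_type": "brute_force", "tool": "hydra"},
    {"src_net": "172.16.5", "attack_type": "Brute_Force", "tool": "Hydra "},
]

standardized = [standardize(r) for r in raw_logs]
teams = cluster(standardized)
profiles = {name: build_profile(members) for name, members in teams.items()}
target = match({"attack_type": "BRUTE_FORCE", "src_net": "172.16.5"}, profiles)
print(target)
```

In this sketch the team representation is just the most common value per field, and matching counts agreeing fields; a real system would likely use richer features and a similarity threshold.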

2. The network attack team identification method according to claim 1, wherein the step of clustering the data objects in the standardized log data set to obtain a plurality of network attack teams comprises:

counting the total number of first samples in the first sample data set;

calculating a first number of clusters according to the total number;

and if so, determining each second cluster as a network attack team.
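Claim 2 recites counting the samples and deriving a cluster number from that total, but the text here does not give the formula or the clustering algorithm. The following Python sketch is therefore an assumption throughout: it uses the common rule of thumb k = sqrt(n / 2) for the cluster number and a plain k-means with Euclidean distance, with hypothetical function names and sample points.

```python
import math


def first_cluster_count(total):
    """Rule-of-thumb cluster number k = sqrt(n / 2), at least 1 (an assumption)."""
    return max(1, round(math.sqrt(total / 2)))


def k_means(points, k, iterations=10):
    """Plain k-means with Euclidean distance and evenly spaced initial centroids."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members; keep it if empty.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return clusters


# Hypothetical two-dimensional feature vectors for eight standardized log samples.
samples = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]

total = len(samples)            # count the total number of samples
k = first_cluster_count(total)  # derive the cluster number from the total
clusters = k_means(samples, k)
print(k, [len(c) for c in clusters])
```

The condition tested before each cluster is accepted as a team ("if so") is not reproduced in the claim text above, so it is left out of the sketch.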

3. The network attack team identification method of claim 2, further comprising:

4. The network attack team identification method of claim 2, further comprising:

5. The network attack team identification method according to claim 2, wherein the step of extracting a first sample from the standardized log data set according to the first Euclidean distance and the average distance to generate a first sample data set comprises:

6. A network attack team identification device, comprising:

7. The network attack team identification device according to claim 6, wherein the clustering module comprises:

8. The network attack team identification device according to claim 7, further comprising:

9. An electronic device, comprising a processor and a memory:

the processor is configured to execute the network attack team identification method of any of claims 1-5 according to instructions in the program code.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store program code for executing the network attack team identification method according to any one of claims 1 to 5.