CN111049839A

CN111049839A - Abnormity detection method and device, storage medium and electronic equipment

Info

Publication number: CN111049839A
Application number: CN201911299455.7A
Authority: CN
Inventors: 董叶豪
Original assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Current assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Priority date: 2019-12-16
Filing date: 2019-12-16
Publication date: 2020-04-21
Anticipated expiration: 2039-12-16
Also published as: CN111049839B

Abstract

The application relates to the technical field of network security and provides an anomaly detection method, an anomaly detection device, a storage medium and electronic equipment. The anomaly detection method includes: sampling records in a security alarm data set to obtain a sample data set; preprocessing each record of the sample data set to convert the values of those attributes, among the plurality of attributes of the record, that cannot be compared in size into values that can be compared in size; and constructing an isolation forest based on the preprocessed sample data set, and determining, by using the isolation forest, whether records in the security alarm data set are abnormal. Even if the security alarm data set contains massive, high-dimensional records, the method can still complete anomaly detection quickly and effectively, with high detection precision and a wide application range.

Description

Abnormity detection method and device, storage medium and electronic equipment

Technical Field

The invention relates to the technical field of network security, in particular to an anomaly detection method, an anomaly detection device, a storage medium and electronic equipment.

Background

Anomalies are data patterns with different data characteristics than normal instances, and the detection of anomalies is of great significance and often provides important operational information in the field of network security. For example, an abnormal computer network traffic pattern may indicate unauthorized access, an abnormal URL link may indicate an illegal intrusion into a server, and so forth.

Most existing model-based anomaly detection methods construct a profile of normal instances and then identify instances that do not conform to the normal profile as anomalous. This approach has two drawbacks: first, the anomaly detector is optimized to describe normal instances rather than to find abnormal ones, which results in excessive false positives; second, the method has high computational complexity, so its application is limited to low-dimensional, small-scale data.

Disclosure of Invention

An object of the embodiments of the present application is to provide an abnormality detection method, an abnormality detection apparatus, a storage medium, and an electronic device, so as to solve the above technical problems.

In order to achieve the above purpose, the present application provides the following technical solutions:

In a first aspect, an embodiment of the present application provides an anomaly detection method, including: sampling records in a security alarm data set to obtain a sample data set; preprocessing each record of the sample data set to convert the values of those attributes, among the plurality of attributes of the record, that cannot be compared in size into values that can be compared in size; and constructing an isolation forest based on the preprocessed sample data set, and determining, by using the isolation forest, whether records in the security alarm data set are abnormal.

The method has at least the following beneficial effects:

Firstly, the records in the security alarm data set are sampled, and the isolation forest (an anomaly detection model) is constructed from the resulting sample data set rather than from all records in the security alarm data set. Because the sample data set can contain far fewer records than the security alarm data set, the model can be constructed quickly while consuming few storage resources, so detection remains fast and effective even when the security alarm data set contains a large number of records.

Secondly, before the isolation forest is constructed, the values of attributes in the records that cannot be compared in size (such as text attributes) are preprocessed so that they can be compared in size. The processed records thus meet the requirements of the isolation forest construction algorithm, which improves the practicability and application range of the scheme.

Thirdly, each node of each tree in the isolation forest uses only one attribute (i.e., one dimension) to partition the records, so the method remains effective even if the records contain many attributes (i.e., many dimensions).

Fourthly, according to the principle of the isolation forest algorithm, abnormal data lie closer to the root node and normal data lie farther from it. The model therefore focuses directly on abnormal instances: the isolation forest detects abnormal records without modelling the data characteristics of normal records, which reduces the false alarm rate and improves detection precision.

In an implementation manner of the first aspect, preprocessing each record of the sample data set to convert the value of any attribute in the record that is not comparable in size into a value that is comparable in size includes: selecting one of the characterization values of the attribute across all records of the sample data set as a reference characterization value, where the characterization value of an attribute is either the value of the attribute or a quantized representation of that value; and calculating a similarity measure between the characterization value of the attribute in each record of the sample data set and the reference characterization value, and replacing the value of the attribute in the record with the calculated similarity measure.

The above implementation provides a general method for preprocessing attributes: an attribute that is not comparable in size is converted into a numerical value, and numerical values can obviously be compared in size. The conversion of an attribute in each record takes into account not only the attribute's own characteristics (i.e., its characterization value) but also its relationship (i.e., the similarity measure) with the same attribute in other records of the sample data set, so that a reasonable numerical value is obtained.

In an implementation manner of the first aspect, the plurality of attributes include a description information attribute, and preprocessing each record of the sample data set to convert the value of the description information attribute in the record into a value that can be compared in size includes: converting the value of the description information attribute in each record of the sample data set into a vector by using a bag-of-words model and a TF-IDF weighting algorithm; selecting one of the resulting vectors as a reference vector; and calculating the cosine value between the vector converted from the description information attribute in each record of the sample data set and the reference vector, and replacing the value of the description information attribute in the record with the calculated cosine value.

The content of the description information attribute may be a description of the record in text form, which the above implementation converts into numerical form.

In an implementation manner of the first aspect, the plurality of attributes include a time attribute, and preprocessing each record of the sample data set to convert the value of the time attribute in the record into a value that can be compared in size includes: converting the value of the time attribute in each record of the sample data set into a corresponding accumulated duration; selecting one of the resulting accumulated durations as a reference accumulated duration; and calculating the difference between the accumulated duration converted from the time attribute in each record of the sample data set and the reference accumulated duration, dividing that difference by the difference between two different accumulated durations selected from all the resulting accumulated durations, and replacing the value of the time attribute in the record with the calculated ratio.

The content of the time attribute may be the time at which the record was generated, typically a formatted string. Although the time attribute could be processed in the same way as the description information attribute, the inventor found that the textual difference between two time strings does not fully and faithfully reflect the difference between the two moments they represent, so a different approach is adopted for the time attribute. The accumulated duration may be defined as the time elapsed from a preset time point to the moment represented by the string of the time attribute, and may be measured in seconds, milliseconds, and so on.

In an implementation manner of the first aspect, preprocessing each record of the sample data set to convert the value of an IP address attribute in the record into a value that can be compared in size includes: selecting one value from each of the values of the source IP address attribute and the values of the destination IP address attribute of all records of the sample data set as a reference IP address; calculating, in a first classification tree, the shortest distance between the node corresponding to each selected reference IP address and the node corresponding to the value of the IP address attribute of the same type in each record of the sample data set, and replacing the value of that IP address attribute in the record with the calculated shortest distance; where the first classification tree is a tree data structure for hierarchically classifying IP addresses, and the value of any IP address attribute can be classified to one leaf node in the first classification tree.

In an implementation manner of the first aspect, preprocessing each record of the sample data set to convert the value of a port attribute in the record into a value that can be compared in size includes: selecting one value from each of the values of the source port attribute and the values of the destination port attribute of all records of the sample data set as a reference port; calculating, in a second classification tree, the shortest distance between the node corresponding to each selected reference port and the node corresponding to the value of the port attribute of the same type in each record of the sample data set, and replacing the value of that port attribute in the record with the calculated shortest distance; where the second classification tree is a tree data structure for hierarchically classifying ports, and the value of any port attribute can be classified to one leaf node in the second classification tree.

In an implementation manner of the first aspect, preprocessing each record of the sample data set to convert the value of the warning information category attribute in the record into a value that can be compared in size includes: selecting one of the values of the warning information category attribute of all records of the sample data set as a reference warning information category; calculating, in a third classification tree, the shortest distance between the node corresponding to the value of the warning information category attribute in each record of the sample data set and the node corresponding to the reference warning information category, and replacing the value of the warning information category attribute in the record with the calculated shortest distance; where the third classification tree is a tree data structure for hierarchically classifying warning information categories, and the value of any warning information category attribute can be classified to one leaf node in the third classification tree.

The preprocessing of the IP address attribute, the port attribute and the warning information category attribute in the above three implementations is similar: in each case the shortest distance between leaf nodes is calculated on a pre-constructed classification tree, and only the classification tree differs from attribute to attribute.

In one implementation of the first aspect, determining whether a record in the security alarm data set is abnormal by using the isolation forest includes: preprocessing each record in the security alarm data set to convert the values of those attributes, among the plurality of attributes of the record, that cannot be compared in size into values that can be compared in size; and calculating an anomaly score for each preprocessed record by using the isolation forest, and determining whether the record is abnormal according to the calculated anomaly score and a preset judgment rule.

The anomaly score represents the likelihood that the corresponding record is abnormal. Whether a record is abnormal can be determined quickly by checking its anomaly score against the preset judgment rule; this way of locating anomalies is simple and fast, and all records in the security alarm data set can be judged in linear time.

In one implementation manner of the first aspect, the anomaly score is calculated by the following formula:

s(x, n) = 2^{-E(h(x)) / c(n)}

where s(x, n) represents the anomaly score, x represents the record being scored, n represents the number of records in the sample data set, h(x) represents the path length of record x in a tree of the isolation forest, E(h(x)) represents the average path length of record x over all trees of the isolation forest, and c(n) represents the average path length of a binary search tree constructed from n records. The judgment rule includes: if s(x, n) is greater than a first threshold T1, x is determined to be an abnormal record, and if s(x, n) is less than a second threshold T2, x is determined to be a normal record, where 0.5 < T1 < 1 and 0 < T2 < 0.5.

If the calculated value of s(x, n) lies between T2 and T1, the record can be considered to show no obvious abnormality, and whether it is abnormal can be determined further in other ways, for example by manual review.
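The scoring and judgment rule can be sketched as follows. This is only an illustration under stated assumptions: the score is taken to be the standard Isolation Forest expression s(x, n) = 2^(-E(h(x)) / c(n)); the closed form used for c(n) comes from the general Isolation Forest literature rather than from this application; and the thresholds 0.6 and 0.4 are hypothetical values chosen only to satisfy 0.5 < T1 < 1 and 0 < T2 < 0.5.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def c(n):
    """Average path length of a binary search tree built from n records.

    Assumption: the usual Isolation Forest approximation
    c(n) = 2*H(n-1) - 2*(n-1)/n with H(i) ~ ln(i) + gamma.
    """
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """s(x, n) for a record whose average path length is E(h(x)); requires n > 1."""
    return 2.0 ** (-mean_path_length / c(n))


def judge(score, t1=0.6, t2=0.4):
    """Apply the two-threshold rule (t1 and t2 are hypothetical values)."""
    if score > t1:
        return "abnormal"
    if score < t2:
        return "normal"
    return "undetermined"  # left to other means, e.g. manual review


print(judge(anomaly_score(2.0, 256)))   # short average path
print(judge(anomaly_score(25.0, 256)))  # long average path
```

A record whose average path length equals c(n) scores exactly 0.5, which is why the two thresholds sit on either side of 0.5.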

In a second aspect, an embodiment of the present application provides an anomaly detection device, including: a sampling module configured to sample records in a security alarm data set to obtain a sample data set; a preprocessing module configured to preprocess each record of the sample data set to convert the values of those attributes, among the plurality of attributes of the record, that cannot be compared in size into values that can be compared in size; and a detection module configured to construct an isolation forest based on the preprocessed sample data set and to determine, by using the isolation forest, whether records in the security alarm data set are abnormal.

In a third aspect, an embodiment of the present application provides a computer-readable storage medium, where computer program instructions are stored on the computer-readable storage medium, and when the computer program instructions are read and executed by a processor, the computer program instructions perform the method provided by the first aspect or any one of the possible implementation manners of the first aspect.

In a fourth aspect, an embodiment of the present application provides an electronic device, including: a memory in which computer program instructions are stored, and a processor, where the computer program instructions are read and executed by the processor to perform the method provided by the first aspect or any one of the possible implementation manners of the first aspect.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.

Fig. 1 is a flowchart of an anomaly detection method provided in an embodiment of the present application;

Fig. 2 is a structural diagram of a first classification tree provided in an embodiment of the present application;

Fig. 3 is a structural diagram of a second classification tree provided in an embodiment of the present application;

Fig. 4 is a structural diagram of a third classification tree provided in an embodiment of the present application;

Fig. 5 is a functional block diagram of an anomaly detection device provided in an embodiment of the present application;

Fig. 6 is a schematic diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The application aims to provide an anomaly detection method based on the Isolation Forest model (iForest for short) that detects anomalies in the records of a security alarm data set quickly and accurately. The records in the security alarm data set may be derived from security logs generated by network security systems or devices, such as the alarm logs generated by the Snort intrusion detection system.

Fig. 1 shows a flowchart of an anomaly detection method provided in an embodiment of the present application. The steps of the method may be performed by an electronic device, one possible structure of which is described below with reference to Fig. 6. Referring to Fig. 1, the method includes:

Step S100: sampling records in the security alarm data set to obtain a sample data set.

The security alarm data set can be regarded as a collection of a large number of records, each of which can be understood as one piece of alarm information or data. Sampling means selecting some of the records from the security alarm data set; the set formed by the selected records is called the sample data set and is used to construct the isolation forest in the subsequent steps.

Sampling can be implemented in many ways. For example, after the sample size n (i.e., the number of records in the sample data set) has been determined, n records may be selected at random from the security alarm data set, or n records may be selected at equal intervals, and so on. In a common implementation, the number of records in the security alarm data set is much larger than the number of records in the sample data set, for example by several orders of magnitude.
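The two sampling options mentioned above (random selection and equal-interval selection) can be sketched as below. The function name `sample_records`, its parameters, and the stand-in data set are hypothetical and serve only as an illustration.

```python
import random


def sample_records(records, n, method="random", seed=None):
    """Select n records from `records`.

    method="random":   uniform sampling without replacement.
    method="interval": records taken at (approximately) equal intervals.
    """
    if n >= len(records):
        return list(records)
    if method == "random":
        rng = random.Random(seed)
        return rng.sample(list(records), n)
    if method == "interval":
        step = len(records) / n
        return [records[int(i * step)] for i in range(n)]
    raise ValueError(f"unknown sampling method: {method}")


# Stand-in for a large security alarm data set (integers instead of real records).
alarm_dataset = list(range(100_000))
sample = sample_records(alarm_dataset, 256, method="interval")
print(len(sample))
```

Here the sample holds 256 records drawn from 100,000, so the forest is built from a small fraction of the full data set.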

Each record contains multiple attributes, including, but not limited to, description information attribute, time attribute, source IP address attribute, destination IP address attribute, source port attribute, destination port attribute, warning information category attribute, priority attribute, etc., and a record containing multiple attributes may also be regarded as an item of high-dimensional data, where each attribute corresponds to a dimension.

Step S110: preprocessing each record of the sample data set, and converting the value of the attribute which cannot be subjected to size comparison in the plurality of attributes of the record into the value which can be subjected to size comparison.

As mentioned above, the records in the sample data set are used to construct an isolation forest in the subsequent steps. According to the principle of the general isolation forest algorithm, constructing a tree of the isolation forest (iTree for short) involves comparing the feature values of the samples (which correspond to the attribute values of the records in the present application), so the value of every attribute in a record must be comparable in size. Most commonly, numerical values are comparable in size, and attributes with such values can be used directly to construct an iTree. Some attributes, however, have values that are not comparable in size, for example text values; such attributes cannot be used directly to construct an iTree and are first processed in step S110, which assigns each of them a reasonable value that is comparable in size. "Reasonable" here means that the new value of an attribute has characteristics or a meaning that carry over from the old value to some degree.
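To illustrate why every attribute value must be comparable in size, the following is a minimal sketch of iTree construction as it is commonly described for the general Isolation Forest algorithm; it is not taken from this application. Each internal node picks one attribute at random, picks a split value between that attribute's minimum and maximum, and compares each record's value against the split. The dictionary layout, the function names and the toy data are assumptions, and refinements of the full algorithm (such as adjusting the path length for leaves that still hold several records) are omitted.

```python
import random


def build_itree(rows, depth, max_depth, rng):
    """Build one isolation tree from rows (equal-length tuples of numbers)."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}            # leaf node
    q = rng.randrange(len(rows[0]))           # randomly chosen attribute index
    lo = min(r[q] for r in rows)
    hi = max(r[q] for r in rows)
    if lo == hi:
        return {"size": len(rows)}            # cannot split on this attribute
    p = rng.uniform(lo, hi)                   # random split value
    return {
        "attr": q,
        "split": p,
        # The "<" comparison below is why attribute values must be orderable.
        "left": build_itree([r for r in rows if r[q] < p], depth + 1, max_depth, rng),
        "right": build_itree([r for r in rows if r[q] >= p], depth + 1, max_depth, rng),
    }


def path_length(row, node):
    """Number of edges traversed from the root until a leaf is reached."""
    depth = 0
    while "attr" in node:
        node = node["left"] if row[node["attr"]] < node["split"] else node["right"]
        depth += 1
    return depth


rng = random.Random(0)
rows = [(i * 0.01,) for i in range(50)] + [(100.0,)]   # 50 clustered values, 1 outlier
trees = [build_itree(rows, 0, 8, rng) for _ in range(50)]


def mean_depth(row):
    return sum(path_length(row, t) for t in trees) / len(trees)


print(mean_depth((100.0,)), mean_depth((0.25,)))
```

In this toy data the isolated value 100.0 is separated after far fewer comparisons on average than a value inside the cluster, which is the property the anomaly score builds on.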

Different attributes are preprocessed in different ways. A general processing method is given first; it can be used to convert any attribute in a record that is not comparable in size into a numerical value (so that size comparison becomes possible). For simplicity, and without loss of generality, let the attribute to be preprocessed be attribute Attr:

The first step: select one of the characterization values of attribute Attr across all records of the sample data set as the reference characterization value. The characterization value of attribute Attr may be the value of the attribute itself, or a quantized representation of that value (in the form of a number, a vector, a matrix, or the like); in the latter case, the value of attribute Attr is first quantized to obtain the corresponding characterization value. The way in which the reference characterization value is selected is not limited; common options include random selection and fixed selection of a particular characterization value (e.g., the characterization value of attribute Attr in the first record of the sample data set).

The second step: calculate a similarity measure between the characterization value of attribute Attr in each record of the sample data set and the reference characterization value, and replace the value of attribute Attr in the record with the calculated similarity measure. The similarity measure between two characterization values is a quantitative representation of how similar they are (possible forms include a correlation coefficient, the cosine of an included angle, a distance, and the like).

As a specific example, assume that the sample data set contains 256 records, whose attribute Attr takes the values a1, a2, …, a256, with corresponding characterization values a1', a2', …, a256'. One of the characterization values is selected as the reference characterization value, say a100'. The similarity measures between each of a1', a2', …, a256' and a100' are then calculated, giving c1, c2, …, c256. Finally, the old values a1, a2, …, a256 of attribute Attr in the 256 records are replaced with the new values c1, c2, …, c256.
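The general two-step procedure can be sketched as a small helper. The function `replace_with_similarity` and its parameters are hypothetical names; the toy characterization (string length) and the toy similarity measure (absolute difference) are placeholders chosen only to keep the example short.

```python
def replace_with_similarity(records, attr, characterize, similarity, ref_index=0):
    """Replace records[i][attr] with similarity(char_i, char_ref), in place.

    characterize: maps a raw attribute value to its characterization value.
    similarity:   maps two characterization values to a number.
    ref_index:    index of the record supplying the reference characterization value.
    """
    chars = [characterize(r[attr]) for r in records]
    reference = chars[ref_index]
    for record, value in zip(records, chars):
        record[attr] = similarity(value, reference)
    return records


# Toy example: three records, reference taken from the second record.
records = [{"Attr": "aa"}, {"Attr": "abcd"}, {"Attr": "a"}]
replace_with_similarity(
    records,
    "Attr",
    characterize=len,
    similarity=lambda a, b: abs(a - b),
    ref_index=1,
)
print([r["Attr"] for r in records])
```

The attribute-specific methods described next differ only in what is plugged in for the characterization and the similarity measure.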

When converting an attribute of each record, the preprocessing method above considers not only the characteristics of the attribute itself (i.e., its characterization value) but also its relationship (i.e., the similarity measure) with the same attribute in other records of the sample data set, so that a reasonable numerical value is obtained after conversion. The specific form of the characterization value and of the similarity measure depends on the characteristics of each attribute. The preprocessing of several specific attributes is described in detail below; if a record contains other attributes whose values are not comparable in size, they may be processed by reference to these:

(1) Description information attribute

The value of the description information attribute may be a description of the record in text form. It is preprocessed as follows:

First, the value of the description information attribute in each record of the sample data set is converted into a vector (i.e., the characterization value mentioned above) by using the bag-of-words model and the TF-IDF weighting algorithm, both of which are prior art and are not described in detail here.

Then, one of the resulting vectors is selected as the reference vector (i.e., the reference characterization value mentioned above); the selection may be random, by fixed position, and so on.

Finally, the cosine value (i.e., the similarity measure mentioned above) between the vector converted from the description information attribute in each record of the sample data set and the reference vector is calculated, and the value of the description information attribute in the record is replaced with the calculated cosine value.
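The three steps above can be sketched in plain Python as follows. The tokenization (lower-casing and splitting on whitespace), the smoothed IDF formula and the sample descriptions are assumptions made for this illustration; a real implementation would more likely rely on an existing text-vectorization library.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Bag-of-words TF-IDF vectors, returned as dicts {term: weight}."""
    docs = [Counter(t.lower().split()) for t in texts]
    n_docs = len(docs)
    doc_freq = Counter(term for d in docs for term in d)  # documents containing each term
    vectors = []
    for d in docs:
        total = sum(d.values())
        vectors.append({
            # term frequency times a smoothed inverse document frequency (assumed variant)
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in d.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


descriptions = [
    "ICMP ping sweep detected",
    "ICMP ping flood detected",
    "SQL injection attempt",
]
vectors = tfidf_vectors(descriptions)
reference = vectors[0]                       # fixed-position choice of reference vector
new_values = [cosine(v, reference) for v in vectors]
print(new_values)
```

The second description shares most of its words with the reference and so receives a cosine value close to 1, while the third shares none and receives 0.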

(2) Time attribute

The value of the time attribute may be the time at which the record was generated, typically a formatted string, e.g., "2019-12-02 09:05:20". Although the value of the time attribute could also be treated as text and preprocessed in the same way as the description information attribute, the inventor found that the textual difference between two time strings does not fully and faithfully reflect the difference between the two moments they represent. Consider the strings S1 "08:59:59", S2 "09:00:00" and S3 "09:01:00". The moments represented by S1 and S2 differ by only 1 second (high similarity in time), yet the strings differ in 5 characters (low similarity as text); the moments represented by S2 and S3 differ by 1 minute (lower similarity in time), yet the strings differ in only 1 character (high similarity as text). Converting the text directly into a vector and calculating the similarity measure from it therefore does not accurately reflect the true similarity between the values of two time attributes. For this reason, the present application proposes the following method for handling the time attribute:

First, the value of the time attribute in each record of the sample data set is converted into a corresponding accumulated duration (i.e., the characterization value mentioned above). The accumulated duration may be defined as the time elapsed from a preset time point to the moment represented by the string of the time attribute, and may be measured in seconds, milliseconds, and so on. For example, if the preset time point is 08:00:00 on January 1, 1970 (GMT+8 time zone) and the value of the time attribute is "2019-12-02 09:05:20" (GMT+8 time zone), the accumulated duration is 1575248720 (in seconds). In some programming environments the accumulated duration is also called a timestamp, and methods are provided for converting between formatted time strings and timestamps, which are not described in detail here.

Secondly, one of the resulting accumulated durations is selected as the reference accumulated duration (i.e., the reference characterization value mentioned above); the selection may be random, by fixed position, and so on.

Finally, the difference (which may be taken as an absolute value) between the accumulated duration converted from the time attribute in each record of the sample data set and the reference accumulated duration is calculated, this difference is divided by the difference (which may also be taken as an absolute value) between two different accumulated durations selected from all the resulting accumulated durations, and the value of the time attribute in the record is replaced with the calculated ratio (i.e., the similarity measure mentioned above). The two accumulated durations can be selected according to a preset rule: for example, those converted from the time attributes of the first and last records in the sample data set; or, as another example, the maximum and minimum of the accumulated durations converted from all records in the sample data set; and so on.
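A minimal sketch of this time preprocessing follows, assuming the time strings use the format "YYYY-MM-DD HH:MM:SS" in the GMT+8 time zone, that the preset time point is the Unix epoch, that the reference is the first record, and that the divisor is the difference between the maximum and minimum accumulated durations. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

TZ_GMT8 = timezone(timedelta(hours=8))  # assumed time zone of the alarm logs


def to_epoch_seconds(text, fmt="%Y-%m-%d %H:%M:%S", tz=TZ_GMT8):
    """Accumulated duration in seconds since 1970-01-01 00:00:00 UTC."""
    return datetime.strptime(text, fmt).replace(tzinfo=tz).timestamp()


def normalize_times(time_strings, ref_index=0):
    """Replace each time string by |duration - reference| / (max - min)."""
    durations = [to_epoch_seconds(t) for t in time_strings]
    reference = durations[ref_index]
    span = max(durations) - min(durations)
    if span == 0:
        return [0.0 for _ in durations]  # all records share the same time
    return [abs(d - reference) / span for d in durations]


times = ["2019-12-02 09:05:20", "2019-12-02 09:05:21", "2019-12-02 09:06:20"]
print(to_epoch_seconds(times[0]))
ratios = normalize_times(times)
print(ratios)
```

With the maximum and minimum as the divisor pair, every ratio falls in the range 0 to 1, and a record one second away from the reference is much closer to it than a record one minute away.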

(3) Source IP address attribute, destination IP address attribute

The two attributes are processed in essentially the same way, so the source IP address attribute is taken as the example below. The value of the source IP address attribute is an IP address, typically a formatted string, e.g., "10.23.0.6". Although the value of the source IP address attribute could also be treated as text and preprocessed in a way similar to the description information attribute, the inventor found that the textual difference between two IP address strings does not fully and faithfully reflect the similarity between the IP addresses they represent. To better reflect the meaning of the IP addresses, the present application proposes the following method for processing the source IP address attribute:

First, one of the values of the source IP address attribute (i.e., the characterization values mentioned above) of all records of the sample data set is selected as the reference source IP address (i.e., the reference characterization value mentioned above).

Then, the shortest distance between the value of the source IP address attribute in each record of the sample data set and the corresponding node of the reference source IP address in the first classification tree (i.e., the similarity measure mentioned above) is calculated, and the calculated shortest distance is used to replace the value of the source IP address attribute in the record.

The first classification tree is a tree data structure for hierarchically classifying IP addresses, and the value of any IP address attribute can be classified to one leaf node in the first classification tree. The structure of the first classification tree can be set as required; Fig. 2 shows one possible structure. The tree in Fig. 2 classifies the IP addresses in the intranet of an organization and is constructed with reference to the network topology of that intranet: the root node is the core router of the intranet, the second-layer nodes are the departments of the organization (which may also be understood as the switches used by the departments), and the third-layer nodes (leaf nodes) are the IP addresses of the terminal devices in each department. If the first classification tree in Fig. 2 is modified with reference to the topology formed by the network devices in the public network, it can equally be used to classify IP addresses in the public network, which is not described in detail here.

The shortest distance between two nodes A and B in the first classification tree can be calculated as follows:

the first step: obtaining the path P1 from node A to the root node;

the second step: obtaining the path P2 from node B to the root node;

the third step: determining the first common ancestor node F that appears on both P1 and P2 (i.e., the common ancestor closest to A and B);

the fourth step: calculating the distance between node A and node F and the distance between node B and node F, and adding the two distances to obtain the shortest distance between node A and node B.

According to the above principle, in fig. 2, the shortest distance between two nodes of IP1 and IP4 is the distance from IP1 to department 1 plus the distance from IP4 to department 1, and the result is 2. The shortest distance between two nodes, IP1 and IP8, is the distance of IP1 to the core router plus the distance of IP8 to the core router, resulting in 4.
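The four steps above can be sketched in Python as follows. This is only a minimal illustration: the tree layout (two departments with four IP addresses each) is an assumption loosely modelled on the description of fig. 2, and the names `PARENT`, `path_to_root` and `shortest_distance` are hypothetical.

```python
# Hypothetical child -> parent map, loosely modelled on fig. 2.
PARENT = {
    "dept1": "core_router",
    "dept2": "core_router",
    "IP1": "dept1", "IP2": "dept1", "IP3": "dept1", "IP4": "dept1",
    "IP5": "dept2", "IP6": "dept2", "IP7": "dept2", "IP8": "dept2",
}


def path_to_root(node):
    """Return the nodes visited when walking from `node` up to the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def shortest_distance(a, b):
    """Number of edges on the tree path between nodes a and b."""
    p1 = path_to_root(a)                      # first step: path P1
    p2 = path_to_root(b)                      # second step: path P2
    steps_in_p2 = {node: i for i, node in enumerate(p2)}
    for steps_in_p1, node in enumerate(p1):   # third step: first common node F
        if node in steps_in_p2:
            # fourth step: distance(A, F) + distance(B, F)
            return steps_in_p1 + steps_in_p2[node]
    raise ValueError("nodes do not belong to the same tree")
```

Under the assumed layout, `shortest_distance("IP1", "IP4")` gives 2 and `shortest_distance("IP1", "IP8")` gives 4, matching the two results stated for fig. 2.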

The first classification tree may also be used for preprocessing the destination IP address attribute, since the tree data structure only classifies IP addresses and does not distinguish between source IP addresses and destination IP addresses.

(4) Source port attribute, destination port attribute

The two attributes are processed in essentially the same way, so the following description takes the source port attribute as an example. The value of the source port attribute is a port number, e.g., 8080; although it is numerical, comparing the magnitudes of two port numbers is not meaningful. Although the value of the source port attribute could also be treated as text and preprocessed in the same way as the description information attribute, the inventor has found through research that the textual difference between two port numbers does not fully and faithfully reflect the similarity between the ports they represent. To better reflect the meaning of the ports, the following method is proposed for processing the source port attribute:

First, one of the values of the source port attribute (i.e., the characterization value mentioned above) of all records of the sample data set is selected as a reference source port (i.e., the reference characterization value mentioned above).

Then, for each record of the sample data set, the shortest distance in the second classification tree between the node corresponding to the value of the source port attribute in the record and the node corresponding to the reference source port (i.e., the similarity measure mentioned above) is calculated, and the value of the source port attribute in the record is replaced with the calculated shortest distance.

The second classification tree is a tree data structure for hierarchically classifying ports, and the value of any port attribute can be classified to one leaf node in the second classification tree. The structure of the second classification tree can be set as required, and fig. 3 shows one possible structure. Referring to fig. 3, the second classification tree can classify port numbers between 0 and 65535. Port numbers 0 to 1023 are reserved port numbers generally used by system processes, and port numbers 1024 to 65535 are generally used by application software so as to avoid conflicts with system processes. Each leaf node of the second classification tree corresponds to a range or set of port numbers. Under the node for port numbers 0 to 1023, the system common ports may be fixed port numbers used by the operating system or by common network services (for example, the DNS service uses port number 53, the FTP service uses port number 21, the SSH service uses port number 22, and the HTTP service uses port number 80), the malware common ports may be determined from actual detection experience, and all other port numbers between 0 and 1023 are classified as private ports. Under the node for port numbers 1024 to 65535, the application software common ports are mostly fixed port numbers used by application software (for example, the chat software QQ uses port number 4000), although they may also include fixed port numbers used by some network services (for example, port number 8080); the malware common ports may be determined from actual detection experience, and all other port numbers between 1024 and 65535 are classified as common ports. For example, using the second classification tree, a source port attribute with the value 22 is readily located at the system common port node.

The calculation of the shortest distance between two nodes in the second classification tree is similar to the calculation of the shortest distance in the first classification tree, and a description thereof will not be repeated. For the pre-processing of destination port attributes, a second classification tree may also be used, since the tree data structure is only used to classify ports and does not distinguish between source ports and destination ports.
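As a rough illustration of how a port number might be located on a leaf of the second classification tree and compared with a reference port, consider the sketch below. It assumes a three-level tree (root, two range nodes, three leaves under each range node) as described for fig. 3. The malware port sets are hypothetical placeholders, since the text leaves them to actual detection experience, and the function names are illustrative only.

```python
# Port sets taken from the examples in the text.
SYSTEM_COMMON_PORTS = {21, 22, 53, 80}   # FTP, SSH, DNS, HTTP
APP_COMMON_PORTS = {4000, 8080}
# Hypothetical placeholders: real lists would come from detection experience.
SYSTEM_MALWARE_PORTS = {666}
APP_MALWARE_PORTS = {4444, 31337}


def classify_port(port):
    """Return (range_node, leaf_node) for a port number."""
    if not 0 <= port <= 65535:
        raise ValueError(f"invalid port number: {port}")
    if port <= 1023:
        if port in SYSTEM_COMMON_PORTS:
            leaf = "system common port"
        elif port in SYSTEM_MALWARE_PORTS:
            leaf = "malware common port"
        else:
            leaf = "private port"
        return ("0-1023", leaf)
    if port in APP_COMMON_PORTS:
        leaf = "application software common port"
    elif port in APP_MALWARE_PORTS:
        leaf = "malware common port"
    else:
        leaf = "common port"
    return ("1024-65535", leaf)


def port_distance(port_a, port_b):
    """Shortest distance between the leaf nodes of two port numbers."""
    range_a, leaf_a = classify_port(port_a)
    range_b, leaf_b = classify_port(port_b)
    if range_a != range_b:
        return 4          # the path passes through the root node
    return 0 if leaf_a == leaf_b else 2
```

With port 22 as the reference source port, a record whose source port is 80 would receive the value 0, one whose source port is 1000 would receive 2, and one whose source port is 8080 would receive 4.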

(5) Alert information category attributes

The value of the warning information category attribute may be a code string generated by a particular network security device or system that characterizes the category of the warning information (i.e., the record); the meaning of the code can be found by consulting the manual or database of that network security device or system. The application proposes the following method for processing the warning information category attribute:

First, one of the values of the warning information category attribute (i.e., the characterization value mentioned above) of all records of the sample data set is selected as the reference warning information category (i.e., the reference characterization value mentioned above).

Then, for each record of the sample data set, the shortest distance in the third classification tree between the node corresponding to the value of the warning information category attribute in the record and the node corresponding to the reference warning information category (i.e., the similarity measure mentioned above) is calculated, and the value of the warning information category attribute in the record is replaced with the calculated shortest distance.

The third classification tree is a tree data structure for hierarchically classifying warning information categories, and the value of any warning information category attribute can be classified to one leaf node in the third classification tree. The structure of the third classification tree can be set as required, and fig. 4 shows one possible structure. Referring to fig. 4, the third classification tree first divides the warning information categories into two classes: normal behavior and abnormal behavior. Normal behavior is further divided into host-based behavior and network-based behavior; for example, opening a document on a host is a host-based normal behavior, and sending a request to a certain server is a network-based normal behavior. Abnormal behavior is likewise divided into host-based behavior and network-based behavior; for example, executing a script of unknown origin on a host is a host-based abnormal behavior, and remotely brute-forcing another party's credentials is a network-based abnormal behavior. In short, once the warning reason represented by the code of the warning information category attribute has been looked up in the manual or database, the warning information category can be located on one of the 4 leaf nodes of the third classification tree. A behavior that is both host-based and network-based may preferentially be classified as a network-based behavior.

The calculation of the shortest distance between two nodes in the third classification tree is similar to that in the first classification tree and is not described again. It should be noted that classifying the value of a record's warning information category attribute as abnormal behavior in the third classification tree shown in fig. 4 does not mean that the record will be determined to be abnormal in step S120. Abnormal behavior in fig. 4 refers to behavior carrying a certain security risk, whereas a determination of abnormality in step S120 indicates that the record is likely to cause, or has already caused, a security problem that requires the user's attention. The former is much broader than the latter: in practice, most behaviors with a security risk carry only an extremely low level of risk, do not require special attention from the user, and are not determined to be abnormal in step S120.
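The replacement of category codes by tree distances can be sketched as below. The category codes and the mapping `CATEGORY_PATH` are purely hypothetical (a real mapping would come from the manual or database of the security device), and each leaf is represented by its path from the root so that the distance follows from the length of the shared prefix.

```python
# Hypothetical codes mapped to their leaf in the four-leaf tree of fig. 4,
# written as the path from the root to the leaf.
CATEGORY_PATH = {
    "A-100": ("normal behavior", "host-based"),
    "A-200": ("normal behavior", "network-based"),
    "B-100": ("abnormal behavior", "host-based"),
    "B-200": ("abnormal behavior", "network-based"),
}


def leaf_distance(path_a, path_b):
    """Edges between two nodes given their root-to-node paths."""
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


def preprocess_category(records, reference_code, key="category"):
    """Replace each record's category code by its distance to the reference."""
    ref_path = CATEGORY_PATH[reference_code]
    return [
        {**rec, key: leaf_distance(CATEGORY_PATH[rec[key]], ref_path)}
        for rec in records
    ]
```

For instance, with `"A-100"` as the reference category, records carrying the codes `"A-100"`, `"A-200"` and `"B-200"` would be rewritten with the values 0, 2 and 4 respectively.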

(6) Priority attributes

The value of the priority attribute may be the priority of the warning information (i.e., the record). Since priorities are inherently ordered from high to low, i.e., they can already be compared in size, the priority attribute does not need to be preprocessed.

Step S120: constructing an isolated forest based on the preprocessed sample data set, and determining whether the records in the safety warning data set are abnormal by using the isolated forest.

Constructing the isolated forest means constructing all the iTrees that form it; the number of iTrees can be preset, for example to 100. Each iTree is constructed in the same way, with the following steps:

the first step is as follows: randomly selecting an attribute Attr from the records;

the second step: randomly selecting a Value within the value range (between the minimum value and the maximum value) of the attribute Attr;

the third step: classifying the records in the sample data set according to the attribute Attr, placing the records of which the Value of the attribute Attr is smaller than Value into a left sub-tree, and placing the records of which the Value of the attribute Attr is not smaller than Value into a right sub-tree; wherein, because the records are preprocessed, the value of any attribute (including Attr attribute) can be compared in size;

the fourth step: recursively constructing the left and right subtrees (i.e., recursively repeating the procedure from the first step) until either of the following conditions is satisfied:

1) the subtree contains only one record or only identical records;

2) the height of the tree reaches the height limit log₂(n), where n is the size of the sample data set.
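The construction steps can be sketched as follows. This is a minimal sketch under several assumptions: records are tuples of already-preprocessed numbers, the height limit is rounded up to an integer, and the random attribute is drawn only from attributes whose values actually differ within the current subtree (a small deviation from the plain description that avoids useless splits). The name `build_itree` and the dictionary layout of the nodes are hypothetical.

```python
import math
import random


def build_itree(records, height=0, limit=None, rng=random):
    """Recursively build one iTree from a list of numeric tuples."""
    if limit is None:
        # Height limit log2(n), rounded up; n is the sample size.
        limit = math.ceil(math.log2(max(len(records), 2)))
    # Stop conditions: height limit reached, or one record / identical records.
    if height >= limit or len(records) <= 1 or len(set(records)) == 1:
        return {"size": len(records)}
    # First step: randomly select an attribute (assumption: only among
    # attributes that still have more than one distinct value).
    candidates = [
        i for i in range(len(records[0]))
        if min(r[i] for r in records) < max(r[i] for r in records)
    ]
    attr = rng.choice(candidates)
    # Second step: randomly select a value between the minimum and maximum.
    lo = min(r[attr] for r in records)
    hi = max(r[attr] for r in records)
    value = rng.uniform(lo, hi)
    # Third step: smaller values go left, the rest go right.
    left = [r for r in records if r[attr] < value]
    right = [r for r in records if r[attr] >= value]
    # Fourth step: recurse on both subtrees.
    return {
        "attr": attr,
        "value": value,
        "left": build_itree(left, height + 1, limit, rng),
        "right": build_itree(right, height + 1, limit, rng),
    }
```

A forest would then simply be a list such as `[build_itree(sample) for _ in range(100)]`, where `sample` is the preprocessed sample data set.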

The height of the tree is limited in order to improve the efficiency of the algorithm, although leaving the height unlimited is of course also possible. According to the premise on which the isolated forest algorithm is effective, abnormal records make up only a small proportion of all records, and their paths to the root node are short (because the values of the attributes of an abnormal record differ considerably from the values of the same attributes of normal records). Therefore, as long as the constructed tree can effectively distinguish normal records from abnormal records, the tree does not need to be very high, and construction can stop once the height reaches log₂(n), the average height of a binary tree with n nodes.

It can be seen that, because the iTree construction algorithm is randomized, the probability that two iTrees are identical is low even when multiple iTrees are constructed repeatedly from the same records of the sample data set.

After the isolated forest is constructed, it can be used to determine whether the records in the safety warning data set are abnormal. One specific method is as follows. First, each record in the safety warning data set is preprocessed (in the same way as described above). Then, an abnormal score is calculated for each preprocessed record by using the isolated forest; the abnormal score represents the likelihood that the corresponding record is abnormal, and whether the record is abnormal can be quickly determined by applying a preset judgment rule to the calculated abnormal score. This way of locating anomalies is simple, and, owing to the properties of outlier detection with an isolated forest, judging all records in the safety warning data set requires only linear time complexity. Because attribute values must be compared when calculating the abnormal score, the attribute values need to be preprocessed in advance. One possible calculation formula for the abnormal score and one possible definition of the judgment rule are given below for reference; other calculation formulas or judgment rules are not excluded:

The calculation formula of the abnormal score is as follows:

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

where s(x, n) represents the abnormal score, x represents the record corresponding to the abnormal score, and n represents the number of records in the sample data set. h(x) represents the path length of record x in a given iTree: since each node of the iTree classifies records according to the value of a certain attribute, record x is necessarily classified into one leaf node of the iTree, and the distance from that leaf node to the root node is h(x); record x has already been preprocessed when it is classified. E(h(x)) represents the average path length of record x over all the iTrees forming the isolated forest. c(n) represents the average path length of a binary search tree constructed from n records and is used to normalize E(h(x)). c(n) can be estimated by the formula c(n) = 2H(n-1) - 2(n-1)/n, where H(k) is the harmonic number, which can be estimated as ln(k) + 0.5772156649 (Euler's constant).
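A small numerical sketch of the score may help. It assumes the standard isolation-forest form of the formula, s(x, n) = 2^(-E(h(x))/c(n)) with c(n) = 2H(n-1) - 2(n-1)/n and H(k) approximated by ln(k) plus Euler's constant. The function names are hypothetical, and the per-tree path lengths are taken as given inputs.

```python
import math

EULER_GAMMA = 0.5772156649


def harmonic(k):
    """Approximate the harmonic number H(k) as ln(k) + Euler's constant."""
    return math.log(k) + EULER_GAMMA


def c(n):
    """Average path length of a binary search tree built from n records."""
    if n < 2:
        raise ValueError("c(n) is only evaluated here for n >= 2")
    return 2 * harmonic(n - 1) - 2 * (n - 1) / n


def anomaly_score(path_lengths, n):
    """Score from the path lengths h(x) of one record in every iTree."""
    mean_path = sum(path_lengths) / len(path_lengths)   # E(h(x))
    return 2 ** (-mean_path / c(n))
```

A record whose average path length equals c(n) scores 0.5, a record isolated at the root of every tree scores 1, and longer average paths push the score towards 0.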

The judgment rule may include:

if s(x, n) is larger than a first threshold T1, x is determined to be an abnormal record; if s(x, n) is smaller than a second threshold T2, x is determined to be a normal record; if s(x, n) lies between T2 and T1, the record can be considered to have no obvious abnormality, and whether it is abnormal can be further determined in other ways, for example by manual judgment. Here 0.5 < T1 < 1 and 0 < T2 < 0.5; for example, T1 may be 0.9, 0.8, etc., and T2 may be 0.3, 0.2, etc., as required. Corresponding information can be output for each judgment result, and for a judgment result of abnormal, an alarm can be raised to draw the user's attention.
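The judgment rule amounts to a three-way comparison, which might be written as below. The default thresholds are simply two of the example values from the text, and the function name and return labels are illustrative.

```python
def judge(score, t1=0.8, t2=0.3):
    """Map an abnormal score to a judgment using thresholds T1 and T2."""
    if not (0.5 < t1 < 1 and 0 < t2 < 0.5):
        raise ValueError("expected 0.5 < T1 < 1 and 0 < T2 < 0.5")
    if score > t1:
        return "abnormal"
    if score < t2:
        return "normal"
    # Between T2 and T1: no obvious abnormality, defer to another method
    # such as manual judgment.
    return "undetermined"
```

For example, `judge(0.93)` yields `"abnormal"`, `judge(0.12)` yields `"normal"`, and `judge(0.5)` yields `"undetermined"`.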

To sum up, the anomaly detection method provided by the embodiment of the present application can effectively detect network data anomalies, and has the following advantages:

Firstly, the records in the safety warning data set are sampled, and the isolated forest used for detection is constructed from the resulting sample data set rather than from all the records in the safety warning data set. Since the number of records in the sample data set can be far smaller than the number of records in the safety warning data set, model construction can be completed quickly while consuming few storage resources (only the records in the sample data set, not all records, need to be loaded into memory during modeling), and fast and effective detection remains possible even when the safety warning data set contains a massive number of records.

Secondly, a set of methods is provided for preprocessing, before the isolated forest is constructed, the values of attributes in the records that cannot be compared in size. The processed records satisfy the requirements of the algorithm for constructing the isolated forest, which improves the practicability and the application range of the scheme: even if the records contain diverse attributes, anomalies can be effectively detected after preprocessing.

Thirdly, each node of each iTree classifies the records by using only one attribute (i.e., one dimension), so the method can effectively process records containing many attributes (i.e., many dimensions) without incurring excessive algorithm complexity, thereby alleviating the problem that many existing algorithms cannot process high-dimensional data.

Fourthly, according to the principle of the isolated forest algorithm, abnormal data lie closer to the root node and normal data lie farther from it, so the model is essentially one that focuses directly on abnormal instances: the isolated forest detects abnormal records directly without having to characterize the data features of normal records, which helps to reduce the false alarm rate and improve the detection precision.

Fig. 5 is a functional block diagram of an abnormality detection apparatus 200 according to an embodiment of the present application. Referring to fig. 5, the abnormality detection apparatus 200 includes:

the sampling module 210 is configured to sample records in the safety warning data set to obtain a sample data set;

a preprocessing module 220, configured to preprocess each record of the sample data set, and convert a value of an attribute that cannot be subjected to size comparison among multiple attributes of the record into a value that can be subjected to size comparison;

a detecting module 230, configured to construct an isolated forest based on the preprocessed sample data set, and determine whether a record in the safety warning data set is abnormal by using the isolated forest.

In an implementation manner of the anomaly detection apparatus 200, the preprocessing module 220 preprocessing each record of the sample data set and converting the value of any attribute in the record that cannot be compared in size into a value that can be compared in size includes: selecting one of the characterization values of the attribute across all records of the sample data set as a reference characterization value, wherein the characterization value of an attribute is the value of the attribute or a quantitative representation of the value of the attribute; and calculating a similarity measure between the characterization value of the attribute in each record of the sample data set and the reference characterization value, and replacing the value of the attribute in the record with the calculated similarity measure.

In an implementation manner of the anomaly detection apparatus 200, the plurality of attributes includes a description information attribute, and the preprocessing module 220 preprocesses each record of the sample data set, and converts a value of the description information attribute in the record into a value capable of performing size comparison, including: converting the value of the description information attribute in each record of the sample data set into a vector by using a bag-of-words model and a TF-IDF weighting algorithm; selecting one vector from all vectors obtained by conversion as a reference vector; and calculating a cosine value between a vector obtained by converting the attribute of the description information in each record of the sample data set and the reference vector, and replacing the value of the attribute of the description information in the record with the calculated cosine value.
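The description-information preprocessing can be illustrated with a small pure-Python sketch. It makes several assumptions that the text does not fix: whitespace tokenization (which would not suit unsegmented Chinese text), term frequency as a relative count, and a smoothed inverse document frequency log((1+N)/(1+df)) + 1, which is only one common TF-IDF variant. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Convert texts to bag-of-words vectors with TF-IDF weights."""
    docs = [t.lower().split() for t in texts]
    vocab = sorted({w for d in docs for w in d})
    n_docs = len(docs)
    # Document frequency: in how many texts each word appears.
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = len(d) or 1
        vectors.append([counts[w] / total * idf[w] for w in vocab])
    return vectors


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def preprocess_descriptions(texts, ref_index=0):
    """Replace each description by its cosine to the reference vector."""
    vectors = tfidf_vectors(texts)
    reference = vectors[ref_index]
    return [cosine(v, reference) for v in vectors]
```

Descriptions identical to the reference map to a value close to 1, while descriptions sharing no words with it map to 0.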

In an implementation manner of the anomaly detection apparatus 200, the plurality of attributes includes a time attribute, and the preprocessing module 220 performs preprocessing on each record of the sample data set, and converts a value of the time attribute in the record into a value capable of performing size comparison, including: converting the value of the time attribute in each record of the sample data set into a corresponding accumulated duration; selecting one of all the converted accumulated time lengths as a reference accumulated time length; calculating the difference between the accumulated time length obtained by converting the time attribute in each record of the sample data set and the reference accumulated time length, dividing the difference by the difference between two different accumulated time lengths selected from all the converted accumulated time lengths, and replacing the value of the time attribute in the record with the calculated ratio.
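The time conversion might look like the sketch below. The timestamp format, the use of seconds since the Unix epoch as the accumulated duration, and the choice of the largest and smallest accumulated durations as the "two different accumulated durations" in the denominator are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def to_seconds(timestamp):
    """Accumulated duration: seconds since the Unix epoch (assumed UTC)."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc).timestamp()


def preprocess_times(timestamps, ref_index=0):
    """Replace each time by (duration - reference) / (max - min)."""
    durations = [to_seconds(t) for t in timestamps]
    reference = durations[ref_index]
    span = max(durations) - min(durations)
    if span == 0:
        raise ValueError("need two different accumulated durations")
    return [(d - reference) / span for d in durations]
```

With the earliest record as the reference, the resulting ratios fall between 0 and 1 and preserve the chronological order of the records.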

In an implementation manner of the anomaly detection apparatus 200, the plurality of attributes includes a source IP address attribute and a destination IP address attribute, and the preprocessing module 220 preprocessing each record of the sample data set and converting the value of an IP address attribute in the record into a value that can be compared in size includes: selecting one value from the values of the source IP address attribute and one value from the values of the destination IP address attribute of all records of the sample data set as reference IP addresses; calculating the shortest distance in the first classification tree between the node corresponding to each selected reference IP address and the node corresponding to the value of the IP address attribute of the same type in each record of the sample data set, and replacing the value of the IP address attribute of the same type in the record with the calculated shortest distance; the first classification tree is a tree data structure for hierarchically classifying IP addresses, and the value of any IP address attribute can be classified to one leaf node in the first classification tree.

In an implementation manner of the anomaly detection apparatus 200, the plurality of attributes includes a source port attribute and a destination port attribute, and the preprocessing module 220 preprocessing each record of the sample data set and converting the value of a port attribute in the record into a value that can be compared in size includes: selecting one value from the values of the source port attribute and one value from the values of the destination port attribute of all records of the sample data set as reference ports; calculating the shortest distance in the second classification tree between the node corresponding to each selected reference port and the node corresponding to the value of the port attribute of the same type in each record of the sample data set, and replacing the value of the port attribute of the same type in the record with the calculated shortest distance; the second classification tree is a tree data structure for hierarchically classifying ports, and the value of any port attribute can be classified to one leaf node in the second classification tree.

In an implementation manner of the anomaly detection apparatus 200, the plurality of attributes includes a warning information category attribute, and the preprocessing module 220 performs preprocessing on each record of the sample data set, and converts a value of the warning information category attribute in the record into a value that can be compared in size, including: selecting one from the values of the warning information category attributes of all records of the sample data set as a reference warning information category; calculating the shortest distance between the value of the attribute of the warning information category in each record of the sample data set and the corresponding node of the reference warning information category in the third classification tree, and replacing the value of the attribute of the warning information category in the record with the calculated shortest distance; the third classification tree is a tree data structure for hierarchically classifying the warning information categories, and the value of any warning information category attribute can be classified to one leaf node in the third classification tree.

In one implementation of the anomaly detection apparatus 200, the detection module 230 determining whether a record in the safety warning data set is abnormal by using the isolated forest includes: preprocessing each record in the safety warning data set, and converting the value of any attribute that cannot be compared in size, among the plurality of attributes of the record, into a value that can be compared in size; and calculating an abnormal score for each preprocessed record by using the isolated forest, and determining whether the record is abnormal according to the calculated abnormal score and a preset judgment rule.

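In one implementation of the anomaly detection apparatus 200, the anomaly score is calculated by the formula:

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

where s(x, n) is the anomaly score of record x, n is the number of records in the sample data set, E(h(x)) is the average path length of record x over all the iTrees forming the isolated forest, and c(n) is the average path length of a binary search tree constructed from n records.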

The implementation principle and the resulting technical effects of the anomaly detection apparatus 200 provided in the embodiment of the present application have been introduced in the foregoing method embodiments. For brevity, for anything not mentioned in the apparatus embodiments, reference may be made to the corresponding content in the method embodiments.

Fig. 6 shows a schematic diagram of an electronic device provided in an embodiment of the present application. Referring to fig. 6, the electronic device 300 includes: a processor 310, a memory 320, and a communication interface 330, which are interconnected and in communication with each other via a communication bus 340 and/or other form of connection mechanism (not shown).

There may be one or more memories 320 (only one is shown in the figure), each of which may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The processor 310, as well as possibly other components, may access the memory 320 and read data from and/or write data to it.

There may be one or more processors 310 (only one is shown in the figure), each of which may be an integrated circuit chip having signal processing capability. The processor 310 may be a general-purpose processor, including a Central Processing Unit (CPU), a Microcontroller Unit (MCU), a Network Processor (NP), or another conventional processor; or a special-purpose processor, including a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

There may be one or more communication interfaces 330 (only one is shown in the figure), which may be used to communicate directly or indirectly with other devices for the purpose of data interaction. For example, the communication interface 330 may be an Ethernet interface, a high-speed network interface (such as an InfiniBand interface), a mobile communication network interface (such as an interface to a 3G, 4G, or 5G network), or another type of interface having data transceiving functions.

One or more computer program instructions may be stored in the memory 320 and read and executed by the processor 310 to implement the anomaly detection methods provided by the embodiments of the present application, as well as other desired functions.

It will be appreciated that the configuration shown in fig. 6 is merely illustrative and that electronic device 300 may include more or fewer components than shown in fig. 6 or have a different configuration than shown in fig. 6. The components shown in fig. 6 may be implemented in hardware, software, or a combination thereof. For example, when implemented in hardware, the electronic device 300 may be a personal computer, a mobile phone, a tablet computer, a server, or the like; when implemented in software, the electronic device 300 may be a virtual machine. The electronic device 300 is not limited to a single device, and may be a combination of a plurality of devices or a cluster including a large number of devices.

The embodiment of the present application further provides a computer-readable storage medium, where computer program instructions are stored on the computer-readable storage medium, and when the computer program instructions are read and executed by a processor of a computer, the method for detecting an abnormality provided in the embodiment of the present application is executed. The computer-readable storage medium may be implemented as, for example, memory 320 in electronic device 300 in FIG. 6.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. An abnormality detection method characterized by comprising:

sampling records in the safety warning data set to obtain a sample data set;

preprocessing each record of the sample data set, and converting the value of any attribute that cannot be compared in size, among the plurality of attributes of the record, into a value that can be compared in size;

and constructing an isolated forest based on the preprocessed sample data set, and determining whether the record in the safety warning data set is abnormal or not by using the isolated forest.

2. The anomaly detection method according to claim 1, characterized in that preprocessing each record of said sample data set to convert the value of any attribute of the records that is not comparable in size into a value that is comparable in size comprises:

selecting one of the characterization values of the attribute across all records of the sample data set as a reference characterization value; wherein the characterization value of an attribute is the value of the attribute or a quantitative representation of the value of the attribute;

and calculating a similarity measure between the characterization value of the attribute in each record of the sample data set and the reference characterization value, and replacing the value of the attribute in the record with the calculated similarity measure.

3. The anomaly detection method according to claim 2, wherein the plurality of attributes includes a description information attribute, and wherein preprocessing each record of the sample data set to convert the value of the description information attribute in the record into a value that can be compared in size comprises:

converting the value of the description information attribute in each record of the sample data set into a vector by using a bag-of-words model and a TF-IDF weighting algorithm;

selecting one vector from all vectors obtained by conversion as a reference vector;

and calculating a cosine value between the reference vector and the vector obtained by converting the description information attribute in each record of the sample data set, and replacing the value of the description information attribute in the record with the calculated cosine value.
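As a non-limiting illustration, the conversion of claim 3 might be sketched as follows using only the Python standard library. The whitespace tokenization, the smoothed IDF formula, the default choice of the first record as the reference vector, and the function names are assumptions made for this sketch.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Bag-of-words vectors with TF-IDF weights, one sparse dict per text."""
    docs = [Counter(text.lower().split()) for text in texts]
    n = len(docs)
    # Document frequency: number of texts in which each term occurs.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (an assumption): always positive, even for ubiquitous terms.
    idf = {term: math.log((1 + n) / (1 + df[term])) + 1.0 for term in df}
    return [
        {term: (count / sum(doc.values())) * idf[term] for term, count in doc.items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def description_to_cosine(descriptions, reference_index=0):
    """Replace each description with its cosine against the reference vector."""
    vectors = tfidf_vectors(descriptions)
    reference = vectors[reference_index]
    return [cosine(vector, reference) for vector in vectors]
```

Under these assumptions, a description identical to the reference maps to a value of about 1, and a description sharing no terms with the reference maps to 0, so the replaced values can be compared in size.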

4. The anomaly detection method according to claim 2, wherein the plurality of attributes include a time attribute, and preprocessing each record of the sample data set to convert the value of the time attribute in the record into a value that can be compared in size comprises:

converting the value of the time attribute in each record of the sample data set into a corresponding accumulated duration;

selecting one of all the converted accumulated durations as a reference accumulated duration;

calculating the difference between the accumulated duration obtained by converting the time attribute in each record of the sample data set and the reference accumulated duration, dividing that difference by the difference between two distinct accumulated durations selected from all the converted accumulated durations, and replacing the value of the time attribute in the record with the calculated ratio.
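As a non-limiting illustration, the conversion of claim 4 might be sketched as follows. Interpreting the accumulated duration as seconds since the Unix epoch, treating timestamps as UTC, choosing the largest and smallest accumulated durations as the two distinct durations forming the divisor, and the function name `time_to_ratio` are all assumptions made for this sketch.

```python
from datetime import datetime, timezone


def time_to_ratio(timestamps, reference_index=0, fmt="%Y-%m-%d %H:%M:%S"):
    """Replace each timestamp with (duration - reference) / (max - min)."""
    # Accumulated duration: seconds elapsed since the Unix epoch (assumption).
    seconds = [
        datetime.strptime(text, fmt).replace(tzinfo=timezone.utc).timestamp()
        for text in timestamps
    ]
    reference = seconds[reference_index]
    # Divisor: difference between two distinct accumulated durations,
    # here the largest and the smallest (assumption).
    span = max(seconds) - min(seconds)
    if span == 0:
        raise ValueError("at least two distinct timestamps are required")
    return [(value - reference) / span for value in seconds]
```

With the earliest record as the reference and this choice of divisor, every ratio falls between 0 and 1, which keeps the time attribute on a scale similar to the other converted attributes.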

5. The anomaly detection method of claim 2, wherein said plurality of attributes includes a source IP address attribute and a destination IP address attribute, and wherein preprocessing each record of said sample data set to convert the value of the IP address attribute in the record into a value that can be compared in size comprises:

selecting one from the values of the source IP address attribute and one from the values of the destination IP address attribute of all records of the sample data set, each as a reference IP address;

calculating the shortest distance, in the first classification tree, between the node corresponding to each selected reference IP address and the node corresponding to the value of the same type of IP address attribute in each record of the sample data set, and replacing the value of that IP address attribute in the record with the calculated shortest distance;

the first classification tree is a tree data structure for hierarchically classifying IP addresses, and a value of any IP address attribute can be classified to one leaf node in the first classification tree.
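As a non-limiting illustration, the shortest distance between two leaf nodes of a classification tree might be computed as follows. The tree used here, which classifies an IPv4 address by successively longer dotted prefixes, is a hypothetical stand-in for the first classification tree, whose actual hierarchy is not specified in this claim; the function names are likewise assumptions.

```python
def ip_path(ip):
    """Root-to-leaf path of an IPv4 address in a hypothetical prefix tree.

    Level 1 holds the first octet, level 2 the first two octets, and so on,
    so the full address is a leaf at depth 4.
    """
    octets = ip.split(".")
    return [".".join(octets[:i]) for i in range(1, len(octets) + 1)]


def tree_distance(path_a, path_b):
    """Number of edges on the shortest path between two nodes of a tree.

    Each node is given by its path from the root; the shortest path climbs
    from one node to the lowest common ancestor and descends to the other.
    """
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    return len(path_a) + len(path_b) - 2 * common


def ip_to_distance(values, reference):
    """Replace each IP address with its tree distance to the reference."""
    reference_path = ip_path(reference)
    return [tree_distance(ip_path(value), reference_path) for value in values]
```

Because `tree_distance` only needs root-to-leaf paths, the same helper could in principle serve any tree-shaped classification, provided a function that maps a value to its path is supplied.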

6. The anomaly detection method of claim 2, wherein the plurality of attributes include a source port attribute and a destination port attribute, and preprocessing each record of the sample data set to convert the value of the port attribute in the record into a value that can be compared in size comprises:

selecting one from the values of the source port attribute and one from the values of the destination port attribute of all records of the sample data set, each as a reference port;

calculating the shortest distance, in the second classification tree, between the node corresponding to each selected reference port and the node corresponding to the value of the same type of port attribute in each record of the sample data set, and replacing the value of that port attribute in the record with the calculated shortest distance;

the second classification tree is a tree data structure for performing hierarchical classification on the ports, and the value of any port attribute can be classified to one leaf node in the second classification tree.

7. The anomaly detection method according to claim 2, wherein the plurality of attributes include a warning information category attribute, and preprocessing each record of the sample data set to convert the value of the warning information category attribute in the record into a value that can be compared in size comprises:

selecting one from the values of the warning information category attributes of all records of the sample data set as a reference warning information category;

calculating the shortest distance, in the third classification tree, between the node corresponding to the reference warning information category and the node corresponding to the value of the warning information category attribute in each record of the sample data set, and replacing the value of the warning information category attribute in the record with the calculated shortest distance;

the third classification tree is a tree data structure for hierarchically classifying the warning information categories, and the value of any warning information category attribute can be classified to one leaf node in the third classification tree.

8. The anomaly detection method according to any one of claims 1-7, wherein said determining whether a record in the safety warning data set is abnormal by using the isolated forest comprises:

preprocessing each record in the safety warning data set to convert the value of any attribute, among the plurality of attributes of the record, that cannot be compared in size into a value that can be compared in size;

and calculating an anomaly score for each preprocessed record by using the isolated forest, and determining whether the record is abnormal according to the calculated anomaly score and a preset judgment rule.

9. The abnormality detection method according to claim 8, characterized in that the anomaly score is calculated by the formula:

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

wherein s(x, n) represents the anomaly score, x represents the record corresponding to the anomaly score, n represents the number of records in the sample data set, h(x) represents the path length of record x in a tree of the isolated forest, E(h(x)) represents the average path length of record x over all trees making up the isolated forest, and c(n) represents the average path length of a binary search tree constructed from n records;

the judgment rule includes: if s(x, n) is greater than a first threshold T1, determining x to be an abnormal record, and if s(x, n) is less than a second threshold T2, determining x to be a normal record; wherein 0.5 < T1 < 1 and 0 < T2 < 0.5.
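As a non-limiting illustration, the score and judgment rule of claim 9 might be sketched as follows, assuming the score takes the standard isolation forest form s(x, n) = 2^(-E(h(x)) / c(n)) and that c(n) is approximated with the harmonic number estimate H(i) ≈ ln(i) + 0.5772156649. The threshold values 0.6 and 0.4 are hypothetical values chosen within the claimed ranges, and the "undetermined" outcome for scores between the two thresholds is an assumption, since the claim does not state how such records are handled.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def c(n):
    """Average path length of a binary search tree built from n records."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0  # usual convention for a two-record tree
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n)), defined for n >= 2."""
    if n < 2:
        raise ValueError("the sample data set must contain at least two records")
    return 2.0 ** (-mean_path_length / c(n))


def judge(score, t1=0.6, t2=0.4):
    """Apply the judgment rule with 0.5 < t1 < 1 and 0 < t2 < 0.5."""
    if score > t1:
        return "abnormal"
    if score < t2:
        return "normal"
    return "undetermined"  # assumption: the claim leaves this band open
```

A record whose average path length equals c(n) scores exactly 0.5, which is why the two thresholds sit on either side of that value: shorter paths push the score toward 1, longer paths toward 0.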

10. An abnormality detection device characterized by comprising:

the sampling module is used for sampling records in the safety warning data set to obtain a sample data set;

the preprocessing module is used for preprocessing each record of the sample data set to convert the value of any attribute, among the plurality of attributes of the record, that cannot be compared in size into a value that can be compared in size;

and the detection module is used for constructing an isolated forest based on the preprocessed sample data set and determining whether a record in the safety warning data set is abnormal by using the isolated forest.

11. A computer-readable storage medium having computer program instructions stored thereon, which when read and executed by a processor, perform the method of any one of claims 1-9.

12. An electronic device, comprising a memory and a processor, the memory having stored therein computer program instructions which, when read and executed by the processor, perform the method of any one of claims 1-9.