CN113746780B

CN113746780B - Abnormal host detection method, device, medium and equipment based on host image

Info

Publication number: CN113746780B
Application number: CN202010463538.1A
Authority: CN
Inventors: Name withheld at the inventors' request
Original assignee: Jike Xin'an Beijing Technology Co ltd
Current assignee: Jike Xin'an Beijing Technology Co ltd
Priority date: 2020-05-27
Filing date: 2020-05-27
Publication date: 2023-06-20
Anticipated expiration: 2040-05-27
Also published as: CN113746780A

Abstract

The invention provides a method, device, medium and equipment for detecting abnormal hosts based on host images. The method comprises the following steps: collecting, based on an unsupervised learning method, first flow data of the IPs of a plurality of hosts to be tested within a first set time segment; extracting flow characteristic values and host associated characteristic values from the flow data; based on a graph segmentation method, clustering the IPs of the hosts to be tested by the similarity of the flow characteristic values and the relevance of the host associated characteristic values to form a plurality of groups of hosts to be tested; vectorizing the flow characteristic values and host associated characteristic values collected for each group to form feature vectors; normalizing the feature vectors to form a feature vector set for each group of hosts to be tested; training on the feature vector sets to construct a detection model for each group of hosts to be tested; and detecting behavior abnormalities of the hosts to be tested based on the detection models. The invention does not need an extensive training set; it defines features along the two dimensions of time and space, so the detection dimensions are more comprehensive.

Description

Abnormal host detection method, device, medium and equipment based on host image

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a medium, and a device for detecting an abnormal host based on a host image.

Background

With the development of Internet technology, network anomalies have become commonplace, and illegitimate activity such as hacking can cause them.

Therefore, when a network anomaly occurs, it becomes necessary to monitor network anomaly hosts, that is, hosts exhibiting abnormal network behavior. Examples include a host that suddenly starts scanning external addresses, a host that opens an abnormal service port, or a host that attacks other hosts. A host with abnormal network behavior is often one that has been invaded or is controlled by an attacker, so discovering such hosts is important for tracking network attackers and eliminating malicious network behavior.

Existing approaches to discovering abnormal hosts rely mainly on detecting abnormal traffic. Rule-based approaches filter and detect abnormal traffic against a set of rules, then trace the malicious traffic that matches the rules back to the corresponding abnormal host. For scenarios where features are difficult to extract, such as encrypted traffic, machine learning and similar methods are used to train a detection model that detects abnormal traffic; this approach still traces back to the abnormal host from the flagged traffic.

On the one hand, for rule-based methods, effective rule features are becoming scarce now that encrypted protocols such as TLS are widely used, which severely challenges detection effectiveness. On the other hand, for model-based methods, training requires a large amount of malicious and benign ("black and white") traffic for comparison, and the model's effectiveness depends heavily on the training method, the quality of the sample set, and similar factors. Existing methods therefore essentially detect traffic anomalies and infer the abnormal host from them, rather than defining and detecting abnormality of the host itself, which is a significant weakness in accuracy.

Disclosure of Invention

The invention aims to provide an abnormal host detection method, device, medium and equipment based on host image, which can solve at least one technical problem. The specific scheme is as follows:

According to a first aspect of the present invention, there is provided a method for detecting abnormal hosts based on host behavior. The method forms a representation of daily host behavior through long-term unsupervised learning and performs fine-grained clustering. Host behavior is modeled from multiple features in both the temporal and spatial dimensions. During detection, the host image models formed by unsupervised learning are used to determine whether a host is abnormal. The method comprises the following steps:

collecting, based on an unsupervised learning method, first flow data of the IPs of a plurality of hosts to be tested within a first set time segment;

extracting a flow characteristic value and a host associated characteristic value from the flow data, wherein the flow characteristic value comprises the uplink and downlink quantity of data and the uplink and downlink quantity of data; the host association characteristic value includes: access port sequence, access IP sequence, access domain name sequence, IP access breadth, accessed IP breadth, access domain name set, and digital certificate set;

based on a graph segmentation method, carrying out IP clustering on the to-be-detected hosts to form a plurality of groups of to-be-detected hosts;

in the first set time segment, carrying out vectorization processing on the collected flow characteristic values and host correlation characteristic values of each group of hosts to be tested to form characteristic vectors;

combining the feature vectors, and carrying out normalization processing on the combined feature vectors to respectively form feature vector sets of each group of hosts to be tested;

training on the feature vector set of each group of hosts to be tested against the feature vector sets of the other groups of hosts to be tested, and constructing a detection model for each group of hosts to be tested;

and detecting behavior abnormalities of the hosts to be tested based on the detection models.

Optionally, the extracting a flow characteristic value and a host association characteristic value in the flow data includes:

collecting first flow data of a plurality of hosts to be tested IP in the first set time segment;

recombining the flow data and extracting a flow characteristic value in the flow data;

and based on the flow characteristic value extracted from the flow data, carrying out statistics, and extracting the host associated characteristic value.

Optionally, the graph cut method includes:

based on the flow data, obtaining the association relation of the host association characteristic values of the IP of the host to be detected;

based on the association relation of the host association characteristic values, respectively and independently clustering the client IP and the server IP;

based on the individual clustering results, a bipartite graph is formed.

Optionally, the training the feature vector set corresponding to each group of hosts to be tested through the feature vector sets of other groups of hosts to be tested includes:

setting the characteristic vector set of each group of hosts to be tested as a positive sample set;

setting the feature vector sets of the other host groups as negative sample sets for training;

and performing supervised training on the positive sample set against the negative sample sets to obtain data describing the normal IP behavior of the corresponding hosts.

Optionally, the detecting the behavior abnormality of the host to be detected based on the detection model includes:

after the detection model is constructed, collecting second flow data of any host to be detected in a second set time segment;

detecting the second flow data by using the detection model;

and when the characteristic values of the second flow data differ greatly in behavior from the host IP data in the detection model, raising an alarm that the IP is abnormal.

Optionally, after the detecting model detects the behavior abnormality of the host to be detected, the method further includes:

adding the characteristic values which are acquired in the second set time segment and are not alarmed into the characteristic vector set of the corresponding host set;

and training and correcting the detection model again by using the characteristic vector set added to the corresponding host group.

Optionally, said constructing said detection model of each said set of hosts further includes:

selecting the number m of decision trees, and sampling and combining the feature vector and the training set data;

Constructing a data set based on the result of each sampling combination, wherein each data set comprises m training subsets;

T = {(X1, Y1), (X2, Y2), (X3, Y3), …, (Xm, Ym)};

wherein X is the set of all feature vectors of the host group to be detected, and Y is the set of feature vectors of the other host groups;

based on each data set, establishing a corresponding decision tree:

selecting features in the training subset at each split node in a top-down recursion manner by taking the training subset as a root node;

calculating an information gain ratio for each of the features;

based on the result of calculating the information gain rate, selecting the characteristics with large gain to split until all training subsets reach leaf nodes, and establishing the corresponding decision tree, wherein the calculation formula of the decision tree is as follows:

where zi is the voting result of the decision tree and Pi is the probability of occurrence of a certain feature value;

generating a final result of a random forest model by combining votes of a plurality of decision trees, wherein the random forest model is a training model trained by using feature vector sets of different groups of hosts;

where m is the number of decision trees (a natural number greater than 1), wi is the weight of the i-th decision tree, and RF is the voting result of the random forest model.
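The sampling step that produces the data set T can be sketched as follows. This is a minimal illustration under stated assumptions rather than the claimed implementation: the function name `sample_training_subsets`, the use of sampling with replacement (bootstrap sampling), and the fixed seed are all hypothetical choices, and the sketch samples rows only, whereas the source also mentions sampling the feature vectors.

```python
import random


def sample_training_subsets(positive, negative, m, seed=0):
    """Draw m training subsets (X_i, Y_i) by sampling with replacement.

    positive: feature vectors of the host group to be detected (X).
    negative: feature vectors of the other host groups (Y).
    Bootstrap sampling is an assumption; the source only says the data
    are "sampled and combined".
    """
    if m <= 1:
        raise ValueError("m must be a natural number greater than 1")
    rng = random.Random(seed)
    subsets = []
    for _ in range(m):
        x_i = rng.choices(positive, k=len(positive))
        y_i = rng.choices(negative, k=len(negative))
        subsets.append((x_i, y_i))
    return subsets


# Toy feature vectors (made-up values for illustration only).
X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25]]
Y = [[0.9, 0.8], [0.7, 0.95]]
T = sample_training_subsets(X, Y, m=3)
```

Each element of `T` then serves as the training subset for one decision tree.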

According to a second aspect of the present invention, there is provided an abnormal host detecting apparatus based on a host image, comprising: the device comprises an acquisition unit, an extraction unit, a clustering unit, a processing unit, a merging unit, a training unit and a detection unit;

the acquisition unit is used for acquiring first flow data of a plurality of hosts to be tested IP based on an unsupervised learning method in a first set time segment;

the extraction unit is used for extracting a flow characteristic value and a host associated characteristic value in the flow data, wherein the flow characteristic value comprises the uplink and downlink data quantity and the uplink and downlink data quantity; the host association characteristic value includes: access port sequence, access IP sequence, access domain name sequence, IP access breadth, accessed IP breadth, access domain name set, and digital certificate set;

the clustering unit is used for carrying out IP clustering on the hosts to be detected on the basis of the graph segmentation method, and forming a plurality of groups of hosts to be detected;

the processing unit is used for carrying out vectorization processing on the collected flow characteristic values and host associated characteristic values of each group of hosts to be detected in the first set time segment to form characteristic vectors;

The merging unit is used for merging the feature vectors, and normalizing the merged feature vectors to form a feature vector set;

the training unit is used for training each group of hosts to be tested through the feature vector set and other groups of host feature vector sets, and constructing a corresponding detection model of each group of hosts to be tested;

the detection unit is used for detecting the behavior abnormality of the host to be detected based on the detection model.

According to a third aspect of the present invention, there is provided an electronic device comprising: one or more processors; storage means for storing one or more programs which when executed by the one or more processors cause the one or more processors to implement the method of any of claims 1 to 7.

According to a fourth aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the abnormal host detection method described in any of the preceding claims.

Compared with the prior art, the scheme provided by the embodiment of the invention has at least the following beneficial effects:

The invention establishes a normal-mode host portrait based on an unsupervised learning method, and extracts flow characteristic values and host associated characteristic values from the flow data; based on a graph segmentation method, the IPs of the hosts to be tested are clustered by the similarity of the flow characteristic values and the relevance of the host associated characteristic values to form a plurality of groups of hosts to be tested; the flow characteristic values and host associated characteristic values of each group of hosts to be tested are vectorized to form feature vectors; the feature vectors are merged and normalized to form a feature vector set for each group of hosts to be tested; each group is trained against the feature vector sets of the other groups of hosts to be tested to construct a detection model for each group; and behavior abnormalities of the hosts to be tested are detected based on the detection models. No extensive training set is needed, the features are defined along the two dimensions of time and space, and the detection dimensions are more comprehensive;

the method learns and updates in real time as the characteristics of the monitored network change, so its accuracy decays more slowly;

the characteristic values of the feature dimensions are not defined by experts but are obtained through unsupervised learning, which makes them more effective.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is evident that the drawings in the following description are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:

FIG. 1 is a flowchart of an abnormal host detection method based on host images according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of extracting flow characteristic values and host associated characteristic values according to an embodiment of the present invention;

FIG. 3 illustrates a schematic diagram of forming multiple sets of hosts under test according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of constructing the detection model for each group of hosts under test according to an embodiment of the present invention;

FIG. 5 is a schematic diagram showing the detection of behavior anomalies of a host under test based on the detection model according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of an abnormal host detection device based on host image according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The term "plurality" generally means at least two.

It should be understood that the term "and/or" as used herein merely describes an association between the associated objects and covers three possible relationships. For example, "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.

It should be understood that although the terms first, second, third, etc. may be used to describe … … in embodiments of the present invention, these … … should not be limited to these terms. These terms are only used to distinguish … …. For example, the first … … may also be referred to as the second … …, and similarly the second … … may also be referred to as the first … …, without departing from the scope of embodiments of the present invention.

The word "if", as used herein, may be interpreted as "at … …" or "at … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrase "if determined" or "if detected (stated condition or event)" may be interpreted as "when determined" or "in response to determination" or "when detected (stated condition or event)" or "in response to detection (stated condition or event)", depending on the context.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such product or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a commodity or device comprising such element.

Alternative embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

Example 1

As shown in fig. 1, according to a specific embodiment of the present invention, in a first aspect, the present invention provides a method for detecting an abnormal host based on a host image, including:

step S101, collecting first flow data of a plurality of hosts to be tested IP based on an unsupervised learning method in a first set time segment;

Unsupervised learning is a class of methods that solve pattern recognition problems using training samples whose categories are unknown (unlabeled). In this embodiment, the data acquisition device autonomously collects the IP traffic data of the hosts to be tested.

The first set time segment is a manually set time period of variable length, for example (but not limited to) daytime, nighttime, a workday, a holiday, a business peak period, or a fixed period each morning or afternoon. Each time segment may be set according to the period of typical behavior of the hosts in the network under test, for example the time within which a series of associated behaviors of an IP can usually be completed, such as 10 minutes. Collecting flow data of the IPs of a plurality of hosts to be tested means collecting flow data for each of those hosts in at least one typical time segment.
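Time segmentation of this kind can be sketched as follows. This is a minimal sketch, assuming fixed-length segments of 600 seconds (the 10-minute example above) and a hypothetical `FlowRecord` type whose field names are illustrative and not taken from the source.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowRecord(NamedTuple):
    """Hypothetical flow record; the field names are assumptions."""
    timestamp: float  # seconds since the epoch
    src_ip: str
    dst_ip: str
    dst_port: int
    bytes_up: int
    bytes_down: int


def segment_flows(records, segment_seconds=600):
    """Group flow records by (host IP, time-segment index)."""
    segments = defaultdict(list)
    for rec in records:
        index = int(rec.timestamp // segment_seconds)
        segments[(rec.src_ip, index)].append(rec)
    return dict(segments)


records = [
    FlowRecord(0.0, "10.0.0.1", "10.0.0.9", 443, 1200, 5400),
    FlowRecord(599.0, "10.0.0.1", "10.0.0.9", 443, 300, 900),
    FlowRecord(600.0, "10.0.0.1", "10.0.0.8", 22, 80, 60),
]
segments = segment_flows(records)
```

Calendar-based segments (workday, holiday, peak period) would need a different key function, but the grouping idea is the same.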

Step S102, extracting a flow characteristic value and a host associated characteristic value from the flow data, as shown in FIG. 2;

the flow characteristic value comprises the uplink and downlink quantity of data and the uplink and downlink quantity of data; the host association characteristic value includes: access port sequence, access IP sequence, access domain name sequence, IP access breadth, accessed IP breadth, access domain name set, and digital certificate set; the specific characteristic values are shown in the table.

Step S1021: collecting first flow data of a plurality of hosts to be tested IP in the first set time segment;

step S1022: recombining the flow data and extracting a flow characteristic value in the flow data;

Flow reorganization means grouping all data in the collected first flow data of the IPs of the plurality of hosts to be tested according to the first set time segment.

Step S1023: and based on the flow characteristic value extracted from the flow data, carrying out statistics, and extracting the host associated characteristic value.

That is, statistics are computed over the flow characteristic values in the flow data to derive the host associated characteristic values.
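Steps S1022 and S1023 can be illustrated with a short sketch. It assumes that each flow is a dictionary with the hypothetical keys `src_ip`, `dst_ip`, `dst_port`, `domain`, `bytes_up` and `bytes_down`, and it interprets "IP access breadth" and "accessed IP breadth" as counts of distinct peers; both the keys and that interpretation are assumptions, since the source does not define them precisely.

```python
from collections import defaultdict


def extract_features(flows):
    """Compute per-host features for the flows of one time segment."""
    features = {}
    accessed_by = defaultdict(set)  # dst_ip -> source IPs that contacted it
    for flow in flows:
        host = features.setdefault(flow["src_ip"], {
            "bytes_up": 0, "bytes_down": 0,
            "port_sequence": [], "ip_sequence": [], "domain_sequence": [],
        })
        # Flow characteristic values: traffic volume in each direction.
        host["bytes_up"] += flow["bytes_up"]
        host["bytes_down"] += flow["bytes_down"]
        # Host associated characteristic values: ordered access sequences.
        host["port_sequence"].append(flow["dst_port"])
        host["ip_sequence"].append(flow["dst_ip"])
        if flow.get("domain"):
            host["domain_sequence"].append(flow["domain"])
        accessed_by[flow["dst_ip"]].add(flow["src_ip"])
    for ip, host in features.items():
        # Statistics derived from the sequences above.
        host["ip_access_breadth"] = len(set(host["ip_sequence"]))
        host["accessed_ip_breadth"] = len(accessed_by.get(ip, ()))
        host["domain_set"] = set(host["domain_sequence"])
    return features


flows = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "dst_port": 443,
     "domain": "a.example", "bytes_up": 100, "bytes_down": 900},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.8", "dst_port": 53,
     "domain": None, "bytes_up": 60, "bytes_down": 120},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.1", "dst_port": 22,
     "domain": None, "bytes_up": 40, "bytes_down": 40},
]
features = extract_features(flows)
```

The digital certificate set is omitted here because it would require TLS handshake parsing, which is outside the scope of this sketch.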

The specific flow characteristic value and the host associated characteristic value are shown in the following table:

Step S103, based on a graph segmentation method, carrying out IP clustering on the hosts to be tested to form a plurality of groups of hosts to be tested, as shown in FIG. 3;

IP clustering by the graph segmentation method proceeds as follows: a bipartite graph of server IPs and client IPs is formed from the characteristics of each host IP, and host IPs are then clustered according to the association relation of the host associated characteristics and the similarity of the characteristic values, forming a plurality of groups of hosts.

Step S1031: based on the flow data, obtaining an association relation of the host association characteristic values of the IP of the host to be detected, wherein the association relation comprises average lengths of a plurality of data streams, IP distribution of the plurality of data streams and domain name distribution of the plurality of data streams;

That is, the association relation between the host associated characteristic values of the host IPs is obtained from the collected flow data.

Step S1032: based on the association relation of the host association characteristic values, respectively and independently clustering the client IP and the server IP;

Graph segmentation clustering refers to clustering the IPs at two levels. First, based on the host IP association relations obtained from all the data, a bipartite graph is formed between client IPs and server IPs; the server IPs are clustered on their own, and the client IPs are clustered on their own;

On this basis, the host IPs are preliminarily clustered by degree of association using the graph segmentation method, that is, host IPs whose degree of association is below a certain threshold are placed in different groups;

Finally, the statistical features extracted in step S102 are used to further cluster the IPs within each group of hosts, so that IPs with similar features fall into the same cluster; the clustering algorithm may be DBSCAN, K-means, or a similar method.

Step S1033: based on the individual clustering results, a bipartite graph is formed.

The bipartite graph refers to a graph formed based on a client IP and a server IP.
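The preliminary, threshold-based grouping of client IPs can be sketched as follows. The source does not specify how the degree of association is measured or how the graph is cut, so this sketch makes two assumptions: association between two clients is the Jaccard similarity of the server sets they contact in the bipartite graph, and groups are the connected components that remain after dropping links below the threshold. The function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed association measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_clients(edges, threshold=0.5):
    """Group client IPs from bipartite (client_ip, server_ip) edges."""
    servers_of = {}
    for client, server in edges:
        servers_of.setdefault(client, set()).add(server)

    # Union-find over clients; clients stay linked only when their
    # association reaches the threshold.
    parent = {c: c for c in servers_of}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(servers_of, 2):
        if jaccard(servers_of[a], servers_of[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for c in servers_of:
        groups.setdefault(find(c), set()).add(c)
    return sorted(sorted(g) for g in groups.values())


edges = [("c1", "s1"), ("c1", "s2"), ("c2", "s1"), ("c2", "s2"), ("c3", "s9")]
groups = cluster_clients(edges)
```

Server IPs could be grouped the same way by swapping the roles of the two columns. The second-stage clustering by statistical features (for example with DBSCAN or K-means) would then run inside each group.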

Step S104, in the first set time segment, vectorizing the collected flow characteristic values and host correlation characteristic values of each group of hosts to be tested to form characteristic vectors;

That is, the characteristic values extracted in step S102 are vectorized, and the characteristics obtained for each group of hosts to be tested in each time segment form one feature vector.

Step S105, merging the feature vectors, and carrying out normalization processing on the merged feature vectors to respectively form feature vector sets of each group of hosts to be tested;

Each group of hosts to be tested thus forms a feature vector set over a plurality of time segments. The feature vector set is obtained by merging the feature vectors collected in the plurality of time segments and normalizing them, so that the feature data can be used in later training.
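Merging and normalization can be sketched as below. The source does not name a normalization method, so min-max scaling to the range [0, 1] is an assumption made for illustration, and the function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(vectors):
    """Scale each feature column of the merged vectors to [0, 1].

    Columns with a constant value are mapped to 0.0 to avoid division
    by zero.
    """
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for vec in vectors:
        normalized.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(vec, lows, highs)
        ])
    return normalized


# One feature vector per time segment for a single host group
# (made-up values: bytes sent, distinct IPs contacted, distinct ports).
segment_vectors = [
    [100.0, 2.0, 5.0],
    [300.0, 4.0, 5.0],
    [200.0, 3.0, 5.0],
]
feature_set = min_max_normalize(segment_vectors)
```

Normalization keeps large-valued features such as byte counts from dominating small-valued ones such as access breadth during training.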

Step S106: training on the feature vector set of each group of hosts to be tested against the feature vector sets of the other groups of hosts to be tested, and constructing a detection model for each group of hosts to be tested, as shown in FIG. 4;

That is, training uses the feature vector sets of different groups of hosts.

Step S1061: setting the characteristic vector set of each group of hosts to be tested as a positive sample set;

Step S1062: setting the feature vector sets of the other host groups as negative sample sets for training;

Each group of hosts uses its own feature vector set as the positive sample set and the feature vector sets of the other groups as negative sample sets.

Step S1063: performing supervised training on the positive sample set against the negative sample sets to obtain data describing the normal IP behavior of the corresponding hosts.

Here, supervised training means training with a preset training model on the positive and negative sample sets; it yields data describing the normal IP behavior of the corresponding hosts and forms the detection model of each group of hosts. The feature vector sets of the other host groups serve as the training (negative) sets.
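The construction of positive and negative sample sets in steps S1061 to S1063 can be sketched as a one-versus-rest labeling. The function name `build_one_vs_rest` and the label values 1 and 0 are assumptions for illustration.

```python
def build_one_vs_rest(group_vectors, target_group):
    """Label the target group's vectors 1 (positive), all others 0.

    group_vectors maps a group name to its list of feature vectors.
    """
    samples, labels = [], []
    for group, vectors in group_vectors.items():
        label = 1 if group == target_group else 0
        for vec in vectors:
            samples.append(vec)
            labels.append(label)
    return samples, labels


# Made-up normalized feature vectors for three host groups.
group_vectors = {
    "office": [[0.1, 0.2], [0.2, 0.1]],
    "servers": [[0.9, 0.8]],
    "printers": [[0.5, 0.0], [0.4, 0.1]],
}
samples, labels = build_one_vs_rest(group_vectors, "office")
```

Repeating this for every group yields one labeled training set, and therefore one detection model, per group.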

As an implementation mode, a random forest model can be adopted as the training model, and the specific method is as follows:

(1) After the number of decision trees m (a natural number greater than 1) is selected, the feature vectors and training set data are sampled and combined, and a data set is constructed from each sampling result.

T = {(X1, Y1), (X2, Y2), (X3, Y3), …, (Xm, Ym)}

For a host group A to be detected, the set of all feature vectors of A is defined as X, and the set of feature vectors of the other host groups is defined as Y.

(2) Based on each data set constructed in step (1), a corresponding decision tree is built. With the training subset as the root node, features in the training subset are selected at each split node in a top-down recursive manner: the information gain ratio is calculated for each feature, and the feature with the largest gain ratio is chosen as the splitting attribute, until all training subsets reach leaf nodes. Each training subset from step (1) yields one decision tree, calculated as follows:

where zi is the voting result of the decision tree and pi is the probability of occurrence of a given feature value. pi is computed during model training as the number of times the feature value occurs in a set of training samples divided by the total number of times the feature occurs in that set of samples;

(3) The final result of the training model is generated by combining votes from a number of decision trees, where m is the number of decision trees, wi is the weight of the tree, zi is the voting result of the tree, and RF is the voting result of the random forest:

(4) Through the above steps, a model is trained for each host group separately: assuming there are N host groups, the feature sets of the other N-1 host groups are used as the negative sample set and the feature set of the group itself is used as the positive sample set. During training, prior knowledge (such as host attribute planning or network planning) can be used to pair host groups with larger attribute differences for comparison training, which can speed up training.
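The split criterion in step (2) and the vote combination in step (3) can be illustrated with a short sketch. The formulas themselves are not reproduced in this text, so the code below uses the standard definition of the information gain ratio (information gain divided by split information, as in C4.5) and assumes that the forest result is the weighted sum of the tree votes. Both are assumptions and may differ from the patent's exact formulas.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, with p_i estimated as relative frequency."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain ratio of a discrete feature."""
    total = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    conditional = sum(len(part) / total * entropy(part)
                      for part in partitions.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


def weighted_vote(votes, weights):
    """Assumed forest result: RF = sum of w_i * z_i over the m trees."""
    return sum(w * z for w, z in zip(weights, votes))


labels = [1, 1, 0, 0]
informative = gain_ratio(["a", "a", "b", "b"], labels)
uninformative = gain_ratio(["a", "b", "a", "b"], labels)
rf = weighted_vote([1, 0, 1], [0.5, 0.25, 0.25])
```

A feature that separates the positive and negative samples perfectly gets the highest gain ratio and would be chosen as the splitting attribute; a feature that is independent of the labels gets a gain ratio of zero.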

Step S107, detecting behavior abnormalities of the hosts to be tested based on the detection models, as shown in FIG. 5;

Anomaly detection uses the detection model of each group of hosts as follows: after a detection model is generated, host IP data is collected for a period of time and the data of any host IP is checked with the detection model. When the behavior characteristics of a host IP differ greatly from those of the other host IPs in its group, that is, when newly collected flow data differs greatly in behavior from the data in the detection model, an alarm is raised marking it as an abnormal host IP.

Step S1071, after the detection model is constructed, collecting second flow data of any host to be tested in a second set time segment;

The second set time segment is likewise a manually set time period of variable length, for example (but not limited to) daytime, nighttime, a workday, a holiday, a business peak period, or a fixed period each morning or afternoon; the second flow data is the flow data of any host to be tested collected after the detection model is constructed;

Step S1072, detecting the second flow data by using the detection model;

Step S1073, when the characteristic values of the second flow data differ greatly in behavior from the host IP data in the detection model, raising an alarm that the IP is abnormal.
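The detection step can be sketched as follows. The source does not define what counts as a "large" behavior difference, so this sketch assumes that each tree votes 1 when a feature vector matches the group profile and 0 otherwise, and that an alarm is raised when the weighted vote falls below a hypothetical threshold of 0.5. The toy trees are stand-ins for trained decision trees.

```python
def detect(vector, forest, weights, threshold=0.5):
    """Score a feature vector against a group's detection model.

    forest: list of callables returning 1 (matches the group profile)
    or 0. The threshold is an assumed cut-off, not a value from the
    source.
    """
    score = sum(w * tree(vector) for w, tree in zip(weights, forest))
    return {"score": score, "abnormal": score < threshold}


# Toy stand-ins for trained decision trees (assumptions for illustration).
forest = [
    lambda v: 1 if v[0] < 0.8 else 0,
    lambda v: 1 if v[1] < 0.5 else 0,
    lambda v: 1 if v[0] + v[1] < 1.0 else 0,
]
weights = [0.4, 0.3, 0.3]

normal = detect([0.2, 0.1], forest, weights)
suspicious = detect([0.9, 0.9], forest, weights)
```

In this toy example the first vector matches all three trees and is not flagged, while the second matches none and is reported as abnormal.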

Flow statistics that raise no alarm during detection can be regarded as normal for that time period, and because they represent the latest normal behavior, the corresponding sequence can be added to the set to be trained. After the set to be trained has accumulated for a period of time, the historical data can be retrained by the method of steps S102 to S106, so that the detection model evolves continuously. That is, after host behavior abnormalities are detected using the detection model, the method further comprises: updating the training feature vector set with the non-alarmed features, adding the non-alarmed characteristic values collected within the second set time segment to the feature vector set of the corresponding host group, and retraining and correcting the detection model with the new feature vector set.

The invention establishes a normal-mode host portrait based on an unsupervised learning method, and extracts flow characteristic values and host associated characteristic values from the flow data; based on a graph segmentation method, the IPs of the hosts to be tested are clustered to form a plurality of groups of hosts to be tested; the flow characteristic values and host associated characteristic values of each group of hosts to be tested are vectorized to form feature vectors; the feature vectors are merged and normalized to form a feature vector set for each group of hosts to be tested; each group is trained against the feature vector sets of the other groups of hosts to be tested to construct a detection model for each group; and behavior abnormalities of the hosts to be tested are detected based on the detection models.

No extensive training set is needed, the features are defined along the two dimensions of time and space, and the detection dimensions are more comprehensive; the method learns and updates in real time as the characteristics of the monitored network change, so its accuracy decays more slowly; the characteristic values of the feature dimensions do not depend on expert definition but are obtained through unsupervised learning, which makes them more effective.
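The update loop described above can be sketched as follows. The class name `ProfileUpdater`, the `retrain_after` parameter, and the rule of retraining after a fixed number of non-alarmed vectors are assumptions; the source only says that retraining happens after the set to be trained has accumulated for a period of time.

```python
class ProfileUpdater:
    """Accumulate non-alarmed feature vectors and retrain periodically."""

    def __init__(self, train, retrain_after=3):
        self.train = train              # callable: feature set -> model
        self.retrain_after = retrain_after
        self.feature_set = []           # vectors used for training so far
        self.pending = []               # recent non-alarmed vectors
        self.model = None

    def observe(self, vector, alerted):
        """Record one detection result; return True if a retrain ran."""
        if alerted:
            return False                # alarmed data never enters training
        self.pending.append(vector)
        if len(self.pending) < self.retrain_after:
            return False
        self.feature_set.extend(self.pending)
        self.pending.clear()
        self.model = self.train(self.feature_set)
        return True


# `len` stands in for a real training function in this toy example.
updater = ProfileUpdater(train=len, retrain_after=2)
first = updater.observe([0.1, 0.2], alerted=False)
second = updater.observe([0.9, 0.9], alerted=True)
third = updater.observe([0.2, 0.1], alerted=False)
```

Excluding alarmed vectors keeps suspected attack traffic out of the normal profile, at the cost of adapting more slowly when legitimate behavior changes abruptly.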

Example 2

As shown in fig. 6, according to a second aspect of the present invention, there is provided an apparatus for detecting an abnormal host based on a host image, including: an acquisition unit 601, an extraction unit 602, a clustering unit 603, a processing unit 604, a merging unit 605, a training unit 606 and a detection unit 607;

the acquisition unit 601 is configured to collect, in a first set time slice, first flow data of a plurality of host IPs to be tested based on an unsupervised learning method;

the extracting unit 602 is configured to extract a flow characteristic value and a host-associated characteristic value in the flow data, where the flow characteristic value includes a number of data upstream and downstream and a number of upstream and downstream; the host association characteristic value includes: access port sequence, access IP sequence, access domain name sequence, IP access breadth, accessed IP breadth, access domain name set, and digital certificate set;

the clustering unit 603 is configured to perform IP clustering on the hosts to be tested based on a graph segmentation method, using the similarity of the flow characteristic values and the relevance of the host association characteristic values, to form multiple groups of hosts to be tested;

the processing unit 604 is configured to perform vectorization processing on the flow characteristic values and host association characteristic values collected for each group of hosts to be tested in the first set time slice, so as to form feature vectors;

the merging unit 605 is configured to merge the feature vectors and normalize the merged feature vectors to form a feature vector set for each group of hosts to be tested;

the training unit 606 is configured to train the feature vector set of each group of hosts to be tested against the feature vector sets of the other groups of hosts, and to construct a corresponding detection model for each group of hosts to be tested;

the detection unit 607 is configured to detect abnormal behavior of the hosts to be tested based on the detection model.
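
As an illustration of what the merging unit 605 does, the following minimal sketch concatenates the flow feature vector and the host association feature vector of each host in a group and then scales every column. Min-max scaling is an assumption: the text only says the merged vectors are normalized and does not name a scheme. The function name `merge_and_normalize` is likewise hypothetical.

```python
def merge_and_normalize(flow_vectors, host_vectors):
    """Concatenate per-host flow and host-association vectors, then min-max scale each column."""
    merged = [list(f) + list(h) for f, h in zip(flow_vectors, host_vectors)]
    if not merged:
        return []
    columns = list(zip(*merged))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in merged:
        normalized.append([
            # A constant column carries no information, so it is mapped to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return normalized


# Two hosts in one group: two flow features and one host-association feature each.
feature_set = merge_and_normalize([[10, 2], [30, 2]], [[1], [5]])
```

Scaling each column to a common range keeps features with large raw magnitudes, such as byte counts, from dominating features with small ones, such as the number of distinct ports.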

Optionally, the graph cut method includes:

Based on the individual clustering results, a bipartite graph is formed.

detecting the second flow data by using the detection model;

T = {(X1,Y1), (X2,Y2), (X3,Y3), …, (Xm,Ym)};

based on each data set, establishing a corresponding decision tree:

calculating an information gain ratio for each of the features;
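
The information gain ratio used to choose the splitting feature can be sketched as follows. This is the standard definition (information gain divided by split information) applied to discrete feature values; the text does not reproduce its own formula here, so treating it as the standard one is an assumption, and the function names are hypothetical. `samples` corresponds to the X values and `labels` to the Y values of the data set T.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(samples, labels, feature_index):
    """Information gain ratio of one discrete feature over a labeled data set."""
    total = len(labels)
    partitions = {}
    for x, y in zip(samples, labels):
        partitions.setdefault(x[feature_index], []).append(y)
    # Expected entropy of the labels after splitting on this feature.
    conditional = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy(labels) - conditional
    # Split information penalizes features that fragment the data into many values.
    split_info = -sum(
        (len(p) / total) * math.log2(len(p) / total) for p in partitions.values()
    )
    return gain / split_info if split_info > 0 else 0.0


def best_feature(samples, labels):
    """Index of the feature with the largest gain ratio, i.e. the one to split on."""
    return max(range(len(samples[0])), key=lambda i: gain_ratio(samples, labels, i))


samples = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = [0, 0, 1, 1]
split_on = best_feature(samples, labels)  # feature 0 separates the two classes
```

Continuous traffic features would first need to be discretized or split on thresholds, which this sketch does not cover.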

The invention establishes a normal-mode portrait of each host based on an unsupervised learning method: it extracts a flow characteristic value and a host association characteristic value from the flow data; performs IP clustering on the hosts to be tested, based on a graph segmentation method, to form a plurality of groups of hosts to be tested; vectorizes the flow characteristic values and the host association characteristic values of each group of hosts to be tested to form feature vectors; merges the feature vectors and normalizes the merged feature vectors to form a feature vector set for each group of hosts to be tested; trains each group's feature vector set against the feature vector sets of the other groups of hosts to be tested, constructing a detection model for each group; and detects behavior abnormality of the hosts to be tested based on the detection model. No extensive training set is needed; the features are defined along the two dimensions of time and space, so the detection dimension is more comprehensive;

the feature values of the feature dimensions do not depend on expert definition but are obtained by unsupervised learning, with the features selected according to their effectiveness.

Example 3

As shown in fig. 7, according to a third aspect of the present invention, an electronic device is provided for performing the method for detecting an abnormal host based on a host image, the electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method for detecting an abnormal host based on a host image.

Referring now to fig. 7, a schematic diagram of an electronic device 700 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 7 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.

As shown in fig. 7, the electronic device 700 may include a processing means (e.g., a central processor, a graphics processor, etc.) 701, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage means 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the electronic device 700 are also stored. The processing means 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

In general, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 shows an electronic device 700 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via communication device 709, or installed from storage 708, or installed from ROM 702. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 701.

Example 4

According to a fourth aspect of the present disclosure, there is provided a non-volatile computer storage medium storing computer-executable instructions for performing a method for detecting an abnormal host based on a host image in any of the above-described method embodiments.

It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.

The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, enable the electronic device to perform the method for detecting an abnormal host based on a host image.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The name of the unit does not in any way constitute a limitation of the unit itself, for example the first acquisition unit may also be described as "unit acquiring at least two internet protocol addresses".

Claims

1. An abnormal host detection method based on host image is characterized by comprising the following steps:

Detecting behavior abnormality of the host to be detected based on the detection model;

the constructing a corresponding detection model of each group of hosts to be detected comprises the following steps:

T = {(X1,Y1), (X2,Y2), (X3,Y3), …, (Xm,Ym)};

based on each data set, establishing a corresponding decision tree:

calculating an information gain ratio for each of the features;

based on the result of the calculation of the information gain ratio, selecting the characteristics with large gain to split until all training subsets reach leaf nodes, establishing the corresponding decision tree,

the graph segmentation method comprises the following steps:

based on the flow data, obtaining an association relation of the host association characteristic values of the IP of the host to be detected, wherein the association relation comprises average lengths of a plurality of data streams, IP distribution of the plurality of data streams and domain name distribution of the plurality of data streams;

forming a bipartite graph based on the independent clustering result,

the training the feature vector set corresponding to each group of the hosts to be tested through the feature vector sets of other groups of hosts to be tested respectively includes:

2. The method of claim 1, wherein the extracting the traffic characteristic value and the host-associated characteristic value in the traffic data comprises:

3. The method of claim 1, wherein detecting host behavior anomalies under test based on the detection model comprises:

detecting the second flow data by using the detection model;

4. The method of claim 1, wherein after the detecting model detects the behavior abnormality of the host under test, further comprising:

5. The method of claim 1, wherein the decision tree is calculated as follows:

6. An abnormal host detecting device based on host image, comprising: the device comprises an acquisition unit, an extraction unit, a clustering unit, a processing unit, a merging unit, a training unit and a detection unit;

The detection unit is used for detecting the behavior abnormality of the host to be detected based on the detection model;

T = {(X1,Y1), (X2,Y2), (X3,Y3), …, (Xm,Ym)};

based on each data set, establishing a corresponding decision tree:

calculating an information gain ratio for each of the features;

the graph segmentation method comprises the following steps:

forming a bipartite graph based on the independent clustering result,

7. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1 to 5.

8. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs which when executed by the one or more processors cause the one or more processors to implement the method of any of claims 1 to 5.