CN113746780A

CN113746780A - Abnormal host detection method, device, medium and equipment based on host image

Info

Publication number: CN113746780A
Application number: CN202010463538.1A
Authority: CN
Inventors: Not disclosed (inventor requested non-publication)
Original assignee: Jike Xin'an Beijing Technology Co ltd
Current assignee: Jike Xin'an Beijing Technology Co ltd
Priority date: 2020-05-27
Filing date: 2020-05-27
Publication date: 2021-12-03
Anticipated expiration: 2040-05-27
Also published as: CN113746780B

Abstract

The invention provides a method, a device, a medium and equipment for detecting an abnormal host based on a host image. The method comprises the following steps: collecting, within a first set time segment, first traffic data of the IPs of a plurality of hosts to be tested, based on an unsupervised learning method; extracting a traffic characteristic value and a host association characteristic value from the traffic data; based on a graph segmentation method, clustering the IPs of the hosts to be tested according to the similarity of the traffic characteristic values and the relevance of the host association characteristic values, to form a plurality of groups of hosts to be tested; vectorizing the traffic characteristic values and host association characteristic values collected for each group to form feature vectors; merging and normalizing the feature vectors to form a feature vector set for each group of hosts to be tested; training on the feature vector sets to construct a detection model for each group of hosts to be tested; and detecting abnormal behavior of a host to be tested based on the detection model. The invention does not require an extensive training set; it defines features along the two dimensions of time and space, so the detection covers more dimensions.

Description

Abnormal host detection method, device, medium and equipment based on host image

Technical Field

The invention relates to the technical field of computers, in particular to a method, a device, a medium and equipment for detecting an abnormal host based on a host image.

Background

With the development of internet technology, network anomalies have become common; illegitimate activities such as hacking give rise to abnormal network behavior.

Therefore, when a network anomaly occurs, it is necessary to monitor abnormal network hosts, that is, hosts exhibiting abnormal network behavior: for example, a host that suddenly starts scanning external addresses, opens an abnormal service port, or attacks other hosts. Such a host has often been invaded or is controlled by an attacker, so finding it is important for tracing the attacker and eliminating malicious network behavior.

Existing methods for discovering abnormal hosts rely mainly on detecting abnormal traffic. In the rule-based approach, traffic is filtered against a set of rules, and malicious traffic that matches a rule is traced back to the corresponding abnormal host. For scenarios in which features are hard to extract, such as encrypted traffic, a detection model is trained by machine learning or similar methods to detect abnormal traffic; this approach likewise traces flagged traffic back to the abnormal host.

On the one hand, for rule-based methods, the growing use of encryption protocols such as TLS leaves fewer and fewer usable rule features, which greatly challenges detection effectiveness. On the other hand, for model-based methods, training requires a large amount of black (malicious) and white (benign) traffic for contrastive training, and the effectiveness of the model depends heavily on the training method, the quality of the test set, and so on. Existing methods therefore detect anomalies in traffic and infer the abnormal host from them, rather than defining and detecting the abnormality of the host itself, which is a significant weakness in accuracy.

Disclosure of Invention

The present invention provides a method, an apparatus, a medium, and a device for detecting an abnormal host based on a host image, which can solve at least one of the above-mentioned technical problems. The specific scheme is as follows:

According to a specific embodiment of the invention, in a first aspect, the invention provides an abnormal host detection method based on a host image, which builds an image of everyday host behavior and performs fine-grained clustering based on long-term unsupervised learning. Host behavior is modeled from multiple features in both the temporal and spatial dimensions. During detection, the host image models formed by unsupervised learning are used to determine whether a host is abnormal. The method comprises the following steps:

collecting, within a first set time segment, first traffic data of the IPs of a plurality of hosts to be tested, based on an unsupervised learning method;

extracting a traffic characteristic value and a host association characteristic value from the traffic data, wherein the traffic characteristic value comprises the uplink and downlink data quantity and the uplink and downlink traffic quantity; the host association characteristic values include: an access port sequence, an access IP sequence, an access domain name sequence, an IP access extent, an access IP extent, an access domain name set and a digital certificate set;

based on a graph segmentation method, carrying out IP clustering on the hosts to be tested according to the similarity of the flow characteristic values and the relevance of the host correlation characteristic values to form a plurality of groups of hosts to be tested;

vectorizing the acquired flow characteristic values and host correlation characteristic values of each group of hosts to be tested in the first set time segment to form characteristic vectors;

merging the feature vectors, and performing normalization processing on the merged feature vectors to respectively form a feature vector set of each group of hosts to be tested;

for each group of hosts to be tested, training on its feature vector set against the feature vector sets of the other groups of hosts to be tested, to construct a detection model for each group;

and detecting the abnormal behavior of the host to be detected based on the detection model.

Optionally, the extracting the traffic characteristic value and the host association characteristic value in the traffic data includes:

acquiring first flow data of a plurality of hosts to be tested IP within the first set time segment;

recombining the flow data, and extracting a flow characteristic value in the flow data;

and counting based on the extracted flow characteristic values in the flow data, and extracting the host correlation characteristic value.

Optionally, the graph cut method includes:

acquiring, based on the traffic data, the association relationships of the host association characteristic values of the IPs of the hosts to be tested;

based on those association relationships, clustering the client IPs and the server IPs separately;

forming a bipartite graph based on the individual clustering results.

Optionally, the training the feature vector set corresponding to each group of hosts to be tested through the feature vector sets of other groups of hosts to be tested respectively includes:

setting the feature vector set of each group of hosts to be tested as a positive sample set;

setting the feature vector sets of the other host groups as negative sample sets;

and carrying out supervised training on the positive sample set against the negative sample set to obtain data characterizing the normal behavior of the corresponding host IPs.

Optionally, the detecting the abnormal behavior of the host to be detected based on the detection model includes:

after the detection model is constructed, collecting second flow data of any one host to be detected in a second set time slice;

detecting the second traffic data by using the detection model;

and when the behavior reflected in the second traffic data differs substantially from the characteristic values of the host IP data in the detection model, raising an alarm for an abnormal IP.

Optionally, after the detecting model detects the abnormal behavior of the host to be detected, the method further includes:

adding the characteristic values that were collected in the second set time segment and did not trigger an alarm to the feature vector set of the corresponding host group;

and retraining and correcting the detection model with the updated feature vector set of the corresponding host group.

Optionally, the constructing the detection model of each group of corresponding hosts further includes:

selecting the number m of decision trees, and sampling and combining the feature vectors and training set data;

constructing a data set based on the result of each sampling combination, wherein each data set comprises m training subsets;

$$T = \{(X_1, Y_1), (X_2, Y_2), (X_3, Y_3), \ldots, (X_m, Y_m)\};$$

wherein, X is a set of all feature vectors of the host group to be detected, and Y is a set of feature vectors of other host groups;

establishing, based on each of the data sets, the corresponding decision tree:

selecting features in the training subsets at each split node in a top-down recursive manner by taking the training subsets as root nodes;

calculating an information gain ratio for each of said features;

based on the result of calculating the information gain rate, selecting the features with large gain to split until all training subsets reach leaf nodes, and establishing the corresponding decision tree, wherein a calculation formula of the decision tree is as follows:

wherein zi is the voting result of the decision tree, and Pi is the probability of occurrence of a certain characteristic value;

generating a final result of a random forest model by combined voting of a plurality of decision trees, wherein the random forest model is a training model trained by using feature vector sets of different groups of hosts;

where m is the number of decision trees, m is a natural number greater than 1, wi is the weight of the trees, and RF is the voting result of the random forest model.

According to a second aspect of the present invention, there is provided an abnormal host detection apparatus based on a host image, comprising: an acquisition unit, an extraction unit, a clustering unit, a processing unit, a merging unit, a training unit and a detection unit;

the acquisition unit is used for collecting, within a first set time segment, first traffic data of the IPs of a plurality of hosts to be tested, based on an unsupervised learning method;

the extraction unit is used for extracting a traffic characteristic value and a host association characteristic value from the traffic data, wherein the traffic characteristic value comprises the uplink and downlink data quantity and the uplink and downlink traffic quantity; the host association characteristic values include: an access port sequence, an access IP sequence, an access domain name sequence, an IP access extent, an access IP extent, an access domain name set and a digital certificate set;

the clustering unit is used for carrying out IP clustering on the hosts to be tested according to the similarity of the flow characteristic values and the relevance of the host relevance characteristic values based on a graph segmentation method to form a plurality of groups of hosts to be tested;

the processing unit is used for vectorizing the acquired flow characteristic values and host correlation characteristic values of each group of hosts to be tested in the first set time segment to form characteristic vectors;

the merging unit is used for merging the feature vectors and normalizing the merged feature vectors to form a feature vector set;

the training unit is used for training each group of the host to be tested through the feature vector set and the feature vector sets of other groups of hosts to construct a corresponding detection model of each group of the host to be tested;

and the detection unit is used for detecting the abnormal behavior of the host to be detected based on the detection model.

According to a third aspect of the present invention, there is provided an electronic apparatus including: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method of any one of claims 1 to 7.

According to a fourth aspect, the present invention provides a computer readable storage medium on which a computer program is stored, which, when executed by a processor, implements the abnormal host detection method based on a host image as described in any one of the above.

Compared with the prior art, the scheme of the embodiment of the invention at least has the following beneficial effects:

a profile of normal host behavior (the host image) is established based on an unsupervised learning method: a traffic characteristic value and a host association characteristic value are extracted from the traffic data; based on a graph segmentation method, the IPs of the hosts to be tested are clustered according to the similarity of the traffic characteristic values and the relevance of the host association characteristic values, to form a plurality of groups of hosts to be tested; the traffic characteristic values and host association characteristic values of each group are vectorized to form feature vectors; the feature vectors are merged and normalized to form a feature vector set for each group; each group is trained against the feature vector sets of the other groups to construct its own detection model; and abnormal behavior of a host to be tested is detected based on the detection model. No extensive training set is needed, and features are defined along the two dimensions of time and space, so the detection covers more dimensions;

the invention learns and updates in real time as the characteristics of the monitored network change, so its accuracy decays more slowly;

the characteristic values of all characteristic dimensions are obtained by unsupervised learning rather than by expert definition, which makes them more effective.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:

FIG. 1 is a flow chart illustrating a method for detecting an abnormal host based on a host image according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating an extracted traffic eigenvalue and a host associated eigenvalue according to an embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating a method for forming multiple groups of hosts under test according to an embodiment of the invention;

FIG. 4 is a schematic diagram illustrating the construction of a respective detection model for each group of hosts to be tested according to an embodiment of the present invention;

FIG. 5 is a schematic diagram illustrating the detection of abnormal behavior of a host to be tested based on the detection model according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of an abnormal host detection apparatus based on a host image according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of an electronic device according to an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and "a plurality" typically includes at least two.

It should be understood that the term "and/or" as used herein merely describes an association between associated objects and means that three relationships may exist; e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.

It should be understood that although the terms first, second, third, etc. may be used to describe … … in embodiments of the present invention, these … … should not be limited to these terms. These terms are used only to distinguish … …. For example, the first … … can also be referred to as the second … … and similarly the second … … can also be referred to as the first … … without departing from the scope of embodiments of the present invention.

The word "if", as used herein, may be interpreted as "when" or "upon" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.

It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that an article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such article or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in the article or device in which the element is included.

Alternative embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Example 1

As shown in FIG. 1, according to a specific embodiment of the present invention, in a first aspect, the present invention provides a method for detecting an abnormal host based on a host image, including:

Step S101: collecting, within a first set time segment, first traffic data of the IPs of a plurality of hosts to be tested, based on an unsupervised learning method;

Here, an unsupervised learning method is a method that solves pattern recognition problems using training samples whose categories are unknown. In this embodiment, the data collection device autonomously collects the IP traffic data of the hosts to be tested.

The first set time segment is a manually configured time segment of no fixed length, including but not limited to: daytime, night, workdays, holidays, business peak hours, or a fixed morning or afternoon period each day. Each time segment can be set according to the typical behavior period of hosts in the monitored network, that is, the time within which a series of associated behaviors of an IP can generally be completed and its traffic data collected, for example 10 minutes. Collecting the traffic data of the IPs of the hosts to be tested means collecting traffic data for at least one typical time segment of those hosts.
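As a minimal sketch of how collected records might be assigned to fixed-length time segments, the following groups records per host IP and segment. The 10-minute length comes from the example in the text; the record layout `(timestamp, host_ip, payload)` and the function names are assumptions made for illustration only.

```python
from collections import defaultdict

# Assumed slice length in seconds; the text gives 10 minutes only as an example.
SEGMENT_SECONDS = 600


def segment_index(timestamp: float, segment_seconds: int = SEGMENT_SECONDS) -> int:
    """Map a Unix timestamp to the index of the fixed-length segment containing it."""
    return int(timestamp // segment_seconds)


def group_by_host_and_segment(records, segment_seconds: int = SEGMENT_SECONDS):
    """Group (timestamp, host_ip, payload) tuples by (host_ip, segment index)."""
    groups = defaultdict(list)
    for timestamp, host_ip, payload in records:
        key = (host_ip, segment_index(timestamp, segment_seconds))
        groups[key].append(payload)
    return dict(groups)
```

Segments that follow a calendar (workdays, holidays, business peaks) would need a different mapping from timestamp to segment; the fixed-length case is shown only because it is the simplest.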

Step S102: extracting a traffic characteristic value and a host association characteristic value from the traffic data, as shown in FIG. 2;

The traffic characteristic value comprises the uplink and downlink data quantity and the uplink and downlink traffic quantity; the host association characteristic values include: an access port sequence, an access IP sequence, an access domain name sequence, an IP access extent, an access IP extent, an access domain name set and a digital certificate set; the specific characteristic values are shown in the table.

Step S1021: acquiring the first traffic data of the IPs of the plurality of hosts to be tested within the first set time segment;

Step S1022: recombining the traffic data, and extracting the traffic characteristic value from the traffic data;

Recombining the traffic means grouping the individual records in the first traffic data collected for the IPs of the plurality of hosts to be tested within the first set time segment.

Step S1023: computing statistics over the traffic characteristic values extracted from the traffic data, and extracting the host association characteristic value.

That is, the host association characteristic values are derived by aggregating the traffic characteristic values in the traffic data.
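Steps S1021 to S1023 can be sketched as follows for one host and one time segment. This is an illustration under stated assumptions, not the patented implementation: the `Flow` record and all of its field names are hypothetical, and the "IP access extent" and "access IP extent" are read here as the number of distinct peers contacted by the host and the number of distinct peers contacting it, which the source does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flow:
    """One reassembled flow; every field name here is hypothetical."""
    src_ip: str
    dst_ip: str
    dst_port: int
    bytes_up: int
    bytes_down: int
    domain: Optional[str] = None
    cert_fingerprint: Optional[str] = None


def extract_features(host_ip, flows):
    """Derive traffic and host association features for one host in one segment."""
    own = [f for f in flows if f.src_ip == host_ip]  # flows initiated by the host
    return {
        # Traffic characteristic values
        "flow_count": len(own),
        "bytes_up": sum(f.bytes_up for f in own),
        "bytes_down": sum(f.bytes_down for f in own),
        # Host association characteristic values
        "port_sequence": [f.dst_port for f in own],
        "ip_sequence": [f.dst_ip for f in own],
        "domain_sequence": [f.domain for f in own if f.domain],
        "accessed_ip_count": len({f.dst_ip for f in own}),
        "accessing_ip_count": len({f.src_ip for f in flows if f.dst_ip == host_ip}),
        "domain_set": {f.domain for f in own if f.domain},
        "certificate_set": {f.cert_fingerprint for f in own if f.cert_fingerprint},
    }
```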

The specific flow characteristic values and the host association characteristic values are shown in the following table:

Step S103: based on a graph segmentation method, clustering the IPs of the hosts to be tested according to the similarity of the traffic characteristic values and the relevance of the host association characteristic values, to form a plurality of groups of hosts to be tested, as shown in FIG. 3;

The IP clustering based on the graph segmentation method works as follows: a bipartite graph of server IPs and client IPs is formed from the characteristics of the host IPs, and the host IPs are clustered according to the association relationships and the characteristic value similarity of the host association characteristics, to form a plurality of groups of hosts.

Step S1031: based on the traffic data, obtaining the association relationships of the host association characteristic values of the IPs of the hosts to be tested, wherein the association relationships comprise the average length of a plurality of data streams, the IP distribution of the plurality of data streams, and the domain name distribution of the plurality of data streams;

That is, the association relationships between the host association characteristic values and the host IPs are obtained from the collected traffic data.

Step S1032: based on those association relationships, clustering the client IPs and the server IPs separately;

Graph-cut clustering here means clustering the IPs at two levels. First, a bipartite graph of client IPs and server IPs is formed from the host IP association relationships obtained from all the data; the server IPs are clustered on their own, and the client IPs are clustered on their own.

On this basis, the host IPs are preliminarily clustered by the graph segmentation method according to their degree of association, that is, host IPs whose degree of association is below a certain threshold are placed in different groups.

Finally, each group of host IPs is further clustered using the statistical characteristics extracted in step S102, so that IPs with similar flow behavior characteristics fall into one cluster; the clustering algorithm may be, for example, DBSCAN or K-means.

Step S1033: forming a bipartite graph based on the individual clustering results.

The bipartite graph refers to a graph formed based on the client IP and the server IP.
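The preliminary, threshold-based grouping of step S1032 can be sketched as below for the client side of the bipartite graph. The source does not define the "degree of association", so this sketch assumes the Jaccard similarity of the sets of servers two clients contact, and treats cutting every edge below the threshold as the graph segmentation; the function names and the default threshold are likewise illustrative.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (assumed association measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_clients(edges, threshold: float = 0.5):
    """Group client IPs from (client_ip, server_ip) edges of the bipartite graph.

    Clients whose association is below the threshold are not linked, so they
    end up in different groups unless a chain of stronger links connects them.
    """
    servers_of = defaultdict(set)
    for client, server in edges:
        servers_of[client].add(server)

    parent = {client: client for client in servers_of}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(sorted(servers_of), 2):
        if jaccard(servers_of[a], servers_of[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for client in servers_of:
        groups[find(client)].add(client)
    return sorted(groups.values(), key=lambda g: sorted(g))
```

The same routine could be applied to server IPs by swapping the two columns of each edge. The refinement by statistical features (for example with DBSCAN or K-means) would then run inside each resulting group.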

Step S104, vectorizing the acquired flow characteristic value and host correlation characteristic value of each group of hosts to be tested in the first set time segment to form a characteristic vector;

The characteristic values extracted in step S102 are vectorized, so that the characteristics obtained in each time segment by each group of hosts to be tested (the groups formed in step S103) form one feature vector.

Step S105, merging the feature vectors, and carrying out normalization processing on the merged feature vectors to respectively form a feature vector set of each group of hosts to be tested;

Each group of hosts to be tested forms a feature vector set over a plurality of time segments. The set is obtained by merging the feature vectors collected in those time segments and normalizing them, which makes the feature data easier to use in later training.
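The source does not say which normalization is used. As one common choice, the sketch below applies min-max scaling per feature dimension to the merged vectors of a group; the function name is an assumption.

```python
def min_max_normalize(vectors):
    """Scale each column of a list of equal-length numeric vectors into [0, 1].

    A constant column carries no information and is mapped to 0.0.
    """
    if not vectors:
        return []
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(vector, lows, highs)
        ]
        for vector in vectors
    ]
```

Normalizing after merging matters here: scaling each time segment separately would hide differences between segments that the detection model is meant to learn.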

Step S106: for each group of hosts to be tested, training on its feature vector set against the feature vector sets of the other groups of hosts to be tested, to construct a detection model for each group, as shown in FIG. 4;

That is, the feature vector sets are used to train the different groups of hosts against one another.

Step S1061: setting the feature vector set of each group of hosts to be tested as a positive sample set;

Step S1062: setting the feature vector sets of the other host groups as negative sample sets;

Each group of hosts thus uses its own feature vector set as the positive sample set and the feature vector sets of the other groups as the negative sample set.

Step S1063: carrying out supervised training on the positive sample set against the negative sample set to obtain data characterizing the normal behavior of the corresponding host IPs.

Supervised training here means training with a preset training model; it yields data characterizing the normal behavior of the corresponding host IPs and forms the detection model of each group of hosts. The feature vector sets of the other host groups serve as the training set.
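Steps S1061 and S1062 amount to a one-versus-rest labeling of the group feature vector sets. A minimal sketch, assuming the sets are held in a dictionary keyed by a hypothetical group identifier:

```python
def build_training_sets(group_vectors):
    """Build one labeled training set per host group.

    group_vectors maps a group id to its list of feature vectors. For each
    group, its own vectors are labeled 1 (positive) and the vectors of all
    other groups are labeled 0 (negative).
    """
    training_sets = {}
    for group_id, positives in group_vectors.items():
        negatives = [
            vector
            for other_id, vectors in group_vectors.items()
            if other_id != group_id
            for vector in vectors
        ]
        samples = list(positives) + negatives
        labels = [1] * len(positives) + [0] * len(negatives)
        training_sets[group_id] = (samples, labels)
    return training_sets
```

Because the labels come from the clustering itself, no manually labeled malicious or benign traffic is needed, which is the point the text makes about not requiring an extensive training set.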

As an implementation, the training model may adopt a random forest model, and the specific method is as follows:

(1) after the number m of decision trees (m is a natural number greater than 1) is selected, sampling combination is carried out on the feature vectors and the training set data, and a data set is constructed for each sampling result.

$$T = \{(X_1, Y_1), (X_2, Y_2), (X_3, Y_3), \ldots, (X_m, Y_m)\}$$

For the host group A needing to be detected, all feature vector sets of A are defined as X, and feature vector sets of other host groups are defined as Y.

(2) A decision tree is established for each data set constructed in step (1): taking the training subset as the root node, features in the training subset are selected at each split node in a top-down recursive manner. The information gain ratio of each feature is calculated, and the feature with the largest gain ratio is selected as the splitting attribute, until all features to be detected reach leaf nodes. A decision tree is thus constructed for each training subset from step (1), specifically as follows:

where zi is the voting result of the decision tree and Pi is the probability of occurrence of a given characteristic value. Pi is computed during model training as the number of occurrences of that characteristic value in a training sample set divided by the total number of occurrences of the feature in the sample set;

(3) The final result of the trained model is generated by the combined vote of the decision trees, where m is the number of decision trees, wi is the weight of the i-th tree, zi is the voting result of the i-th tree, and RF is the voting result of the random forest:

(4) Through the above steps, a model is trained for each host group separately: assuming there are N host groups, the feature sets of the other N-1 host groups are used as the negative sample set, and the feature set of the group itself is used as the positive sample set. During training, prior knowledge (such as host attribute planning or network planning) can be used to pair hosts whose attributes differ strongly for contrastive training, which can speed up training.
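The formulas for the decision tree and for the forest vote are not reproduced in this text, so the following is only a sketch of the two computations that steps (2) and (3) name. It assumes the standard information gain ratio (information gain divided by split information, as in C4.5) for categorical feature values, and a normalized weighted sum of binary tree votes with a 0.5 cutoff; neither is confirmed by the source as the exact form used.

```python
import math
from collections import Counter


def entropy(values) -> float:
    """Shannon entropy (base 2) of a sequence of discrete values."""
    total = len(values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(values).values()
    )


def gain_ratio(feature_values, labels) -> float:
    """Information gain ratio of one categorical feature with respect to labels."""
    total = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    conditional = sum(
        len(part) / total * entropy(part) for part in partitions.values()
    )
    split_info = entropy(feature_values)
    if split_info == 0:  # the feature takes a single value, so it cannot split
        return 0.0
    return (entropy(labels) - conditional) / split_info


def weighted_vote(votes, weights) -> int:
    """Combine binary tree votes z_i with weights w_i (assumed combination rule)."""
    score = sum(w * z for w, z in zip(weights, votes)) / sum(weights)
    return 1 if score >= 0.5 else 0
```

At each split node, the feature with the largest `gain_ratio` would be chosen; continuous features such as byte counts would first need to be discretized or split on a threshold.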

Step S107: detecting abnormal behavior of a host to be tested based on the detection model, as shown in FIG. 5;

Anomaly detection with the detection model of each host group proceeds as follows: after a detection model has been generated, host IP data are collected over a period of time, and the detection model is applied to the data of any host IP in that period. When the behavior characteristics of that host IP differ substantially from those of the other host IPs in its group, an alarm is raised for an abnormal IP. In other words, the data in the detection model are used to check traffic data newly collected over a period of time, and a host whose behavior differs substantially is reported as an abnormal host IP.

Step S1071: after the detection model is constructed, collecting second traffic data of any one host to be tested within a second set time segment;

The second set time segment is likewise a manually configured time segment of no fixed length, including but not limited to: daytime, night, workdays, holidays, business peak hours, or a fixed morning or afternoon period each day. The second traffic data are the traffic data collected from any host to be tested after the detection model has been constructed;

Step S1072: detecting the second traffic data by using the detection model;

Step S1073: when the behavior reflected in the second traffic data differs substantially from the characteristic values of the host IP data in the detection model, raising an alarm for an abnormal IP.
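Steps S1071 to S1073 can be sketched as a threshold test on the score that a group's model gives to a newly collected feature vector. The source does not quantify a "large behavior difference", so the score semantics, the 0.5 threshold, and the `group_of` and `models` structures below are all assumptions.

```python
def is_abnormal(host_ip, feature_vector, group_of, models, threshold=0.5):
    """Return True when the host should be reported as an abnormal IP.

    group_of maps a host IP to its group id. models maps a group id to a
    callable that returns, for one feature vector, the share of the vote in
    [0, 1] saying "this behavior belongs to the group" (an assumed interface).
    """
    score = models[group_of[host_ip]](feature_vector)
    return score < threshold


def scan(observations, group_of, models, threshold=0.5):
    """Split (host_ip, feature_vector) observations into alarms and normal ones."""
    alarms, normal = [], []
    for host_ip, vector in observations:
        if is_abnormal(host_ip, vector, group_of, models, threshold):
            alarms.append((host_ip, vector))
        else:
            normal.append((host_ip, vector))
    return alarms, normal
```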

Flow statistics that raise no alarm during detection can be regarded as normal for that time period, and, being the most recent normal behavior data stream, the corresponding sequence can be added to the set to be trained. Once the set to be trained has accumulated over a period of time, it can be retrained together with the historical data by the method of steps S102-S106, so that the detection model evolves continuously. That is, after the detection model is used to detect the abnormal behavior of the host, the method further comprises: updating the training feature vector set with the non-alarm features by adding the non-alarm feature values to the feature vector set of the corresponding host group, and retraining and correcting the detection model with the new feature vector set within a second set time.

This embodiment establishes a normal-mode portrait of each host based on an unsupervised learning method. A flow characteristic value and a host association characteristic value are extracted from the flow data; based on a graph segmentation method, the IPs of the hosts to be tested are clustered according to the similarity of the flow characteristic values and the relevance of the host association characteristic values, forming a plurality of groups of hosts to be tested; the flow characteristic value and the host association characteristic value of each group of hosts to be tested are vectorized to form feature vectors; the feature vectors are merged and normalized to form a feature vector set for each group of hosts to be tested; the feature vector set of each group is trained against the feature vector sets of the other groups of hosts to be tested to construct a detection model for each group; and the abnormal behavior of the hosts to be tested is detected based on the detection model.

An extensive training set is not needed, and the features are defined along the two dimensions of time and space, so the detection dimension is more comprehensive; the invention learns and updates in real time as the characteristics of the monitored network change, so its accuracy decays more slowly; and the feature values of all feature dimensions are obtained by unsupervised learning rather than from expert definitions, and are obtained according to their effectiveness.
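The retraining described in this embodiment (adding non-alarm feature values to the feature vector set of the corresponding host group and retraining once the second set time has elapsed) might be sketched as follows. This is a minimal illustration only: the class name `GroupModelUpdater`, the callable `train_fn`, and the use of plain numeric timestamps are hypothetical and are not taken from the patent.

```python
from collections import defaultdict


class GroupModelUpdater:
    """Accumulates non-alarm feature vectors per host group and retrains periodically."""

    def __init__(self, train_fn, retrain_interval):
        # train_fn is a hypothetical callable: list of feature vectors -> model.
        self.train_fn = train_fn
        # retrain_interval stands in for the "second set time" (assumed unit: seconds).
        self.retrain_interval = retrain_interval
        self.feature_sets = defaultdict(list)  # group id -> accumulated vectors
        self.models = {}                       # group id -> current model
        self.last_trained = {}                 # group id -> time of last (re)training

    def record(self, group_id, feature_vector, alarmed, now):
        """Store a non-alarm vector; return True if the group model was retrained."""
        if alarmed:
            # Alarmed windows are not treated as normal behavior and are skipped.
            return False
        self.feature_sets[group_id].append(list(feature_vector))
        # The first observation of a group only starts its retraining clock.
        last = self.last_trained.setdefault(group_id, now)
        if now - last >= self.retrain_interval:
            self.models[group_id] = self.train_fn(self.feature_sets[group_id])
            self.last_trained[group_id] = now
            return True
        return False


# Usage with a stand-in trainer that only records how many vectors it saw.
updater = GroupModelUpdater(lambda vectors: ("model", len(vectors)), retrain_interval=10)
first = updater.record("group-1", [0.1, 0.2], alarmed=False, now=0)
skipped = updater.record("group-1", [0.9, 0.9], alarmed=True, now=5)
retrained = updater.record("group-1", [0.2, 0.1], alarmed=False, now=12)
```

In this sketch the alarmed window is never added to the feature vector set, so the retrained model is built from the two non-alarm vectors only.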

Example 2

As shown in fig. 6, according to a second aspect of the present invention, there is provided an apparatus for abnormal host detection based on a host image, comprising: an acquisition unit 601, an extraction unit 602, a clustering unit 603, a processing unit 604, a merging unit 605, a training unit 606 and a detection unit 607;

the acquisition unit 601 is configured to acquire, based on an unsupervised learning method, first traffic data of the IPs of a plurality of hosts to be tested within a first set time slice;

the extracting unit 602 is configured to extract a traffic characteristic value and a host associated characteristic value in the traffic data, where the traffic characteristic value includes an uplink and downlink quantity of data and an uplink and downlink quantity; the host association feature values include: an access port sequence, an access IP sequence, an access domain name sequence, an IP access extent, an access IP extent, an access domain name set and a digital certificate set;

the clustering unit 603 is configured to perform, based on a graph segmentation method, IP clustering on the hosts to be tested according to the similarity of the traffic characteristic values and the relevance of the host association characteristic values, to form multiple groups of hosts to be tested;

the processing unit 604 is configured to perform vectorization processing on the acquired traffic characteristic values and host association characteristic values of each group of hosts to be tested within the first set time slice to form a characteristic vector;

the merging unit 605 is configured to merge the feature vectors, and perform normalization processing on the merged feature vectors to form a feature vector set;

the training unit 606 is configured to train each group of hosts to be detected through the feature vector set and feature vector sets of other groups of hosts, and construct a corresponding detection model of each group of hosts to be detected;

the detection unit 607 is configured to detect a behavior anomaly of the host to be detected based on the detection model.
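The work of the processing unit 604 and the merging unit 605 (vectorizing, merging and normalizing the feature values of one group of hosts) might be sketched as follows. The patent does not state which normalization is used; min-max scaling is assumed here purely for illustration, and the function names and sample numbers are hypothetical.

```python
def min_max_normalize(vectors):
    """Scale each feature dimension of the merged vectors to the range [0, 1]."""
    if not vectors:
        return []
    dims = len(vectors[0])
    lows = [min(v[d] for v in vectors) for d in range(dims)]
    highs = [max(v[d] for v in vectors) for d in range(dims)]
    normalized = []
    for v in vectors:
        row = []
        for d in range(dims):
            span = highs[d] - lows[d]
            # A constant dimension carries no information; map it to 0.0.
            row.append((v[d] - lows[d]) / span if span else 0.0)
        normalized.append(row)
    return normalized


def build_feature_set(traffic_features, association_features):
    """Merge per-host traffic and association vectors, then normalize them."""
    merged = [list(t) + list(a) for t, a in zip(traffic_features, association_features)]
    return min_max_normalize(merged)


# Hypothetical group of three hosts: [uplink bytes, downlink bytes] and
# [distinct ports accessed, distinct IPs accessed].
traffic = [[100, 2000], [300, 1000], [200, 1500]]
association = [[2, 10], [4, 30], [2, 20]]
feature_set = build_feature_set(traffic, association)
```

Normalizing after merging keeps large-valued dimensions such as byte counts from dominating small-valued ones such as port counts when the group model is trained.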

Optionally, the graph cut method includes:

forming a bipartite graph based on the individual clustering results.

detecting the second streaming data by using the detection model;

T = {(X1, Y1), (X2, Y2), (X3, Y3), …, (Xm, Ym)};

establishing, based on each of the data sets, the corresponding decision tree:

calculating an information gain ratio for each of said features;
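As a worked illustration of the information gain ratio of a feature, the following sketch uses the standard C4.5 definition (information gain divided by split information). The patent's own formula is not reproduced in this passage, so this definition, the function names and the sample data are all assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of the feature divided by its split information."""
    total = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    # Expected label entropy after splitting on the feature.
    conditional = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    # A feature with a single value cannot split the data.
    return info_gain / split_info if split_info else 0.0


# Hypothetical samples: the first feature separates the labels perfectly,
# the second is independent of them.
labels = [1, 1, 0, 0]
informative = gain_ratio(["high", "high", "low", "low"], labels)
uninformative = gain_ratio(["a", "b", "a", "b"], labels)
```

Under this definition, the feature with the highest gain ratio would be the natural candidate for the next split when building each decision tree.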

This embodiment establishes a normal-mode portrait of each host based on an unsupervised learning method. A flow characteristic value and a host association characteristic value are extracted from the flow data; based on a graph segmentation method, the IPs of the hosts to be tested are clustered according to the similarity of the flow characteristic values and the relevance of the host association characteristic values, forming a plurality of groups of hosts to be tested; the flow characteristic value and the host association characteristic value of each group of hosts to be tested are vectorized to form feature vectors; the feature vectors are merged and normalized to form a feature vector set for each group of hosts to be tested; the feature vector set of each group is trained against the feature vector sets of the other groups of hosts to be tested to construct a detection model for each group; and the abnormal behavior of the hosts to be tested is detected based on the detection model. An extensive training set is not needed, and the features are defined along the two dimensions of time and space, so the detection dimension is more comprehensive;

the feature values of all feature dimensions are obtained by unsupervised learning rather than from expert definitions, and are obtained according to their effectiveness.

Example 3

As shown in fig. 7, according to a third aspect of the present invention, this embodiment provides an electronic device for the method of abnormal host detection based on a host image, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, and the instructions cause the at least one processor to perform the method of abnormal host detection based on a host image.

Referring now to FIG. 7, shown is a schematic diagram of an electronic device 700 suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 7, the electronic device 700 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage device 708 into a Random Access Memory (RAM) 703. The RAM 703 also stores various programs and data necessary for the operation of the electronic device 700. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of the embodiments of the present disclosure.

Example 4

According to a fourth aspect, embodiments of the present disclosure provide a non-volatile computer storage medium storing computer-executable instructions that, when executed, perform the method for abnormal host detection based on a host image in any of the above method embodiments.

It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the method of abnormal host detection based on a host image.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or any combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or by hardware. The name of a unit does not, in some cases, constitute a limitation on the unit itself; for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".

Claims

1. A method for abnormal host detection based on a host image, characterized by comprising the following steps:

2. The method of claim 1, wherein extracting the traffic characteristic values and the host-associated characteristic values from the traffic data comprises:

3. The method of claim 1, wherein the graph cut method comprises:

based on the flow data, obtaining the association relation of the host association characteristic values of the IPs of the hosts to be tested, wherein the association relation comprises the average length of a plurality of data streams, the IP distribution of the plurality of data streams and the domain name distribution of the plurality of data streams;

forming a bipartite graph based on the individual clustering results.

4. The method according to claim 1, wherein the training the feature vector set corresponding to each group of hosts under test through the feature vector sets of other groups of hosts under test respectively comprises:

5. The method according to claim 1, wherein the detecting the abnormal behavior of the host under test based on the detection model comprises:

detecting the second streaming data by using the detection model;

6. The method of claim 1, wherein after the detection model is used to detect the abnormal behavior of the host under test, the method further comprises:

7. The method of claim 4, wherein said constructing the detection model for each respective group of hosts further comprises:

T = {(X1, Y1), (X2, Y2), (X3, Y3), …, (Xm, Ym)};

establishing, based on each of the data sets, the corresponding decision tree:

calculating an information gain ratio for each of said features;

wherein zi is the voting result of the decision tree; pi is the probability of occurrence of a certain characteristic value;

8. An abnormal host detection device based on a host image, comprising: an acquisition unit, an extraction unit, a clustering unit, a processing unit, a merging unit, a training unit and a detection unit;

9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1 to 7.

10. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method of any one of claims 1 to 7.