CN110138745B - Abnormal host detection method, device, equipment and medium based on data stream sequence - Google Patents

Abnormal host detection method, device, equipment and medium based on data stream sequence

Info

Publication number
CN110138745B
CN110138745B (application CN201910326907.XA)
Authority
CN
China
Prior art keywords
host
feature vector
data
sequence
vector set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910326907.XA
Other languages
Chinese (zh)
Other versions
CN110138745A (en)
Inventor
Jiang Bin (江斌)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Geek Xin'an (Chengdu) Technology Co.,Ltd.
Original Assignee
Jike Xin'an Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jike Xin'an (Beijing) Technology Co., Ltd.
Priority to CN201910326907.XA
Publication of CN110138745A
Application granted
Publication of CN110138745B
Legal status: Active (current)
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0631: Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14: Network analysis or design
    • H04L41/145: Network analysis or design involving simulating, designing, planning or modelling of a network
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416: Event detection, e.g. attack signature detection
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425: Traffic logging, e.g. anomaly detection

Abstract

The embodiments of the disclosure provide a method, an apparatus, a device and a medium for detecting an abnormal host based on data flow sequences, wherein the method comprises the following steps: collecting, for each host, various data flows over a plurality of time slices, performing stream reassembly, and building data stream sequences; extracting the self features of each data stream sequence and the associated features between data stream sequences; forming a feature vector set for each host; training a detection model for each host, using that host's feature vector set as positive samples and the feature vector sets of the other N-1 hosts as the negative sample set; running detection with the detection model and alarming on any feature vector set with a large difference, flagging it as an abnormal feature vector set; and retraining and correcting the model with the feature vectors that did not trigger an alarm to form a final detection model, which is then used to detect abnormal hosts.

Description

Abnormal host detection method, device, equipment and medium based on data stream sequence
Technical Field
The disclosure relates to the technical field of flow data detection, in particular to a data flow sequence-based abnormal host detection method and device, electronic equipment and a storage medium.
Background
Network communication is now part of the daily activity of almost all enterprises and individuals. As enterprises and individual users attach ever greater importance to information security, encryption is used in more and more network communication scenarios: with encryption, communication content cannot be read by anyone on the network other than the two communicating parties. This creates a problem: normal encrypted traffic cannot be distinguished from malicious encrypted traffic, which poses a great challenge to network security detection.
In existing schemes, apart from decrypting the encrypted traffic outright, the detection of encrypted malicious traffic mainly relies on artificial intelligence methods such as machine learning. Current methods mostly judge whether a single data stream is malicious traffic, and can be divided into two types:
(1) supervised learning: train a detection model on known malicious traffic, then use the model for detection;
(2) unsupervised learning: define a set of clustering rules, perform cluster analysis on the traffic, and screen out the malicious traffic clusters.
Problems with the existing schemes:
(1) supervised model training relies on a large number of black (malicious) samples; if the samples are too few, the trained detection model is likely to be inaccurate;
(2) unsupervised cluster analysis cannot determine exactly which traffic is malicious; it can only make inferences from, for example, the size ratios between clusters, so its accuracy is low;
(3) taking a single data stream as the object of analysis loses the association relationships between data streams for some malicious behaviors, reducing detection accuracy;
(4) an attacker can easily bypass detection based on a fixed feature set: once the attacker discovers which features are used for detection, those features can be evaded by technical means.
Disclosure of Invention
The present disclosure provides an abnormal host detection method, apparatus, electronic device and storage medium based on data flow sequences, which can quickly detect malicious encrypted traffic in traffic information.
In a first aspect, the present disclosure provides a method for detecting an abnormal host based on a data stream sequence, including the following steps:
step S101: selecting N sampling hosts; for each host, collecting the various data flows of a plurality of time slices, performing stream reassembly, and building the data stream sequences, wherein N is a natural number greater than 3;
step S102: extracting the self features of each data stream sequence and the associated features between data stream sequences; vectorizing the self features and the associated features to form a feature vector set for each host;
step S103: using the feature vector set of each host as positive samples and the feature vector sets of the other N-1 hosts as the negative sample set, and training a detection model for each host;
step S104: running detection with the detection model, and alarming on any feature vector set with a large difference, flagging it as an abnormal feature vector set;
step S105: adding the feature vectors that did not trigger an alarm into the feature vector set of the corresponding host, retraining and correcting the detection model with the new feature vector set within a certain period of time to form the final detection model, and using the final detection model to detect abnormal hosts.
Optionally, the self features include:
a port of a flow destination, a flow destination IP, a flow destination domain name, a DNS query domain name, a flow length, or a digital certificate.
Optionally, the associated features include:
access port sequence, access IP sequence, access domain name sequence, IP access breadth, accessed IP breadth, access domain name set, uplink and downlink data volume, uplink and downlink flow count, digital certificate set, open service port set, or connection initiation frequency.
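Purely as an illustration of how the self features and associated features listed above might be gathered into one record per host and time slice, the following Python sketch shows a possible layout; every field name and value in it is a hypothetical example, not part of the disclosed embodiments.

# Hypothetical per-host, per-time-slice feature record (illustrative values only).
example_feature_record = {
    "host": "10.0.0.5",
    "time_slice": "2019-04-23T09:00 / 10 min",
    # self features of each individual data stream
    "flows": [
        {"dst_port": 443, "dst_ip": "203.0.113.7", "dst_domain": "example.com",
         "dns_query": "example.com", "flow_length": 18432, "certificate": "sha256:ab12..."},
    ],
    # associated features across all streams of the slice
    "access_port_sequence": [53, 443, 443],
    "access_ip_sequence": ["198.51.100.2", "203.0.113.7", "203.0.113.7"],
    "ip_access_breadth": 2,          # distinct IPs this host contacted
    "accessed_ip_breadth": 1,        # distinct IPs that contacted this host
    "uplink_downlink_bytes": (20480, 96500),
    "uplink_downlink_flow_count": (3, 3),
    "digital_certificate_set": ["sha256:ab12..."],
    "open_service_port_set": [22],
    "connection_initiation_frequency": 0.3,   # new connections per second
}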
Optionally, the detection model is a random forest model, and the specific training method includes:
step S1031: define the feature vector set of the host A under detection as X and the feature vector sets of the other hosts as Y; after selecting the number m of decision trees, randomly sample and combine the feature vector set X and the feature vector set Y to construct a data set T:
T={(x1,y1),(x2,y2),(x3,y3)…(xm,ym)};
step S1032: establish the corresponding decision trees based on the data set T:
[Equation: voting result H(T) of a single decision tree, expressed in terms of the probabilities Pi; provided only as an image in the original publication]
where H(T) denotes the voting result of any single decision tree, and Pi denotes the probability that any data vector passes detection by (xi, yi);
step S1033: construct the final result of the random forest model by combining the votes of the plurality of decision trees:
RF = Σ_{i=1}^{m} wi·zi
where m is the number of decision trees, wi is the weight of the i-th tree, zi is the voting result of the i-th tree, and RF is the voting result of the random forest.
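As a purely numerical illustration of this weighted vote (the numbers are invented for illustration): with m = 3 trees, tree weights w1 = 0.5, w2 = 0.3, w3 = 0.2 and per-tree voting results z1 = 1, z2 = 1, z3 = -1 (where 1 means the feature vector matches host A's profile and -1 means it does not), the forest result is RF = 0.5·1 + 0.3·1 + 0.2·(-1) = 0.6 > 0, so the vector would be judged consistent with host A and would not raise an alarm.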
In a second aspect, the present disclosure provides an abnormal host detection apparatus based on data flow sequences, including:
a sequence construction unit, configured to select N sampling hosts, collect, for each host, the various data flows of a plurality of time slices, perform stream reassembly, and build the data stream sequences, where N is a natural number greater than 3;
a vector extraction unit, configured to extract the self features of each data stream sequence and the associated features between data stream sequences, and vectorize the self features and the associated features to form a feature vector set for each host;
a model construction unit, configured to train a detection model for each host using that host's feature vector set as positive samples and the feature vector sets of the other N-1 hosts as the negative sample set;
a model detection unit, configured to run detection with the detection model and alarm on any feature vector set with a large difference, flagging it as an abnormal feature vector set;
a result output unit, configured to add the feature vectors that did not trigger an alarm into the feature vector set of the corresponding host, retrain and correct the detection model with the new feature vector set within a certain period of time to form the final detection model, and use the final detection model to detect abnormal hosts.
Optionally, the self features include:
a port of a flow destination, a flow destination IP, a flow destination domain name, a DNS query domain name, a flow length, or a digital certificate.
Optionally, the associated features include:
access port sequence, access IP sequence, access domain name sequence, IP access breadth, accessed IP breadth, access domain name set, uplink and downlink data volume, uplink and downlink flow count, digital certificate set, open service port set, or connection initiation frequency.
Optionally, the detection model is a random forest model, and specifically includes:
a definition unit, configured to define the feature vector set of the host A under detection as X and the feature vector sets of the other hosts as Y, and, after selecting the number m of decision trees, randomly sample and combine the feature vector set X and the feature vector set Y to construct a data set T:
T={(x1,y1),(x2,y2),(x3,y3)…(xm,ym)};
an establishing unit, configured to establish the corresponding decision trees based on the data set T:
[Equation: voting result H(T) of a single decision tree, expressed in terms of the probabilities Pi; provided only as an image in the original publication]
where H(T) denotes the voting result of any single decision tree, and Pi denotes the probability that any data vector passes detection by (xi, yi);
a result unit, configured to construct the final result of the random forest model by combining the votes of the plurality of decision trees:
RF = Σ_{i=1}^{m} wi·zi
where m is the number of decision trees, wi is the weight of the i-th tree, zi is the voting result of the i-th tree, and RF is the voting result of the random forest.
In a third aspect, the present disclosure provides an electronic device comprising a processor and a memory, wherein the memory stores computer program instructions executable by the processor, and the processor implements the method steps of any one of the first aspect when executing the computer program instructions.
In a fourth aspect, the present disclosure provides a computer readable storage medium storing computer program instructions which, when invoked and executed by a processor, implement the method steps of any of the first aspects.
Compared with the prior art, the beneficial effects of the embodiment of the disclosure are that:
the method and the device can more comprehensively utilize the host network behavior information to discover abnormal behaviors therein by utilizing the detection model based on the behavior sequence, and can utilize normal samples discovered in the later stage to correct the detection model, thereby reducing the attenuation speed of the detection model, adapting to different detection scenes and improving the stability of the model.
Drawings
To illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show some embodiments of the present disclosure, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of an abnormal host detection method based on a data flow sequence according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an abnormal host detection apparatus based on a data flow sequence according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a training model provided in the embodiments of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
The terminology used in the embodiments of the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in the presently disclosed embodiments and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and "a plurality" typically includes at least two, but does not exclude the presence of at least one.
It should be understood that the term "and/or" as used herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
It should be understood that although the terms first, second, third, etc. may be used to describe technical names in embodiments of the present disclosure, the technical names should not be limited to the terms. These terms are only used to distinguish between technical names. For example, a first check signature may also be referred to as a second check signature, and similarly, a second check signature may also be referred to as a first check signature, without departing from the scope of embodiments of the present disclosure.
The word "if", as used herein, may be interpreted as "when", "upon", "in response to determining" or "in response to detecting", depending on the context. Similarly, the phrases "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (a stated condition or event) is detected" or "in response to detecting (a stated condition or event)", depending on the context.
It is also noted that the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a commodity or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such commodity or system. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other like elements in a commodity or system that includes the element.
In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.
This scheme proposes a machine learning method based on behavior sequences (data flow sequences) to detect attacked hosts (including clients and servers). First, a normal-state profile is built for each host based on the typical network behavior sequence state of the monitored hosts in the current network; then the behavior sequence features observed later are continuously compared against this profile, and the difference is used to judge whether the host has been attacked or compromised.
It has the advantages that:
(1) because the detection is based on a multi-stream behavior feature set with context, it is difficult for an attack to bypass the detection strategy;
(2) the method does not depend on expert knowledge to extract features, but automatically establishes a host feature model based on the current host state;
(3) the model can learn and update in real time as the characteristics of the monitored network change, so its accuracy decays over a longer period.
Example 1
Referring to fig. 1, in a first aspect, the present disclosure provides a method for detecting an abnormal host based on a data stream sequence, including the following steps:
step S101: select N sampling hosts; for each host, collect the various data flows of a plurality of time slices, perform stream reassembly, and build the data stream sequences, where N is a natural number greater than 3.
Traffic is collected for a number of typical time segments, including but not limited to daytime, nighttime, working days, holidays, peak hours, etc. The length of each time segment may be defined according to the general behavior cycle of the hosts in the monitored network, i.e. the time in which a series of associated behaviors is usually completed, for example 10 minutes. The collected data are reassembled into data streams; for connectionless UDP data, packets are merged according to their request-response relationship.
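A minimal Python sketch of this collection and reassembly step follows; the packet record format (a dict with src_ip, dst_ip, src_port, dst_port, proto, ts and size) and the choice of a direction-insensitive 5-tuple as the stream key are assumptions made for illustration, not requirements of the embodiment.

from collections import defaultdict

SLICE_SECONDS = 600  # one time slice, e.g. the 10 minutes suggested above

def flow_key(p):
    # Direction-insensitive 5-tuple, so that request and response packets
    # (including connectionless UDP exchanges) fall into the same stream.
    a, b = (p["src_ip"], p["src_port"]), (p["dst_ip"], p["dst_port"])
    lo, hi = sorted([a, b])
    return (lo, hi, p["proto"])

def reassemble_streams(packets, monitored_hosts):
    # Group packets into data streams per monitored host and per time slice.
    # Returns {(host_ip, slice_index): {flow_key: [packet, ...]}}.
    sequences = defaultdict(lambda: defaultdict(list))
    for p in packets:
        slice_idx = int(p["ts"] // SLICE_SECONDS)
        for host in (p["src_ip"], p["dst_ip"]):
            if host in monitored_hosts:  # a packet may belong to two monitored hosts
                sequences[(host, slice_idx)][flow_key(p)].append(p)
    return sequences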
Step S102: extract the self features of each data stream sequence and the associated features between data stream sequences; vectorize the self features and the associated features to form a feature vector set for each host.
The feature sequence is obtained by extracting features from all data streams of a host within one time slice, and needs to cover two main kinds of features: first, the attribute (self) features of individual data streams; second, the association features between data streams, as shown in the following table.
Data stream attribute features: port of the flow destination, flow destination IP, flow destination domain name, DNS query domain name, flow length, digital certificate.
Data stream association features: access port sequence, access IP sequence, access domain name sequence, IP access breadth, accessed IP breadth, access domain name set, uplink and downlink data volume, uplink and downlink flow count, digital certificate set, open service port set, connection initiation frequency.
(The original publication presents this feature table as images.)
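Continuing the sketch above, the streams of one host in one time slice could be turned into a numeric feature vector as shown below; the particular encodings chosen for the sequences and sets (counts, breadths, byte totals) are illustrative assumptions only.

def extract_feature_vector(streams, host):
    # streams: {flow_key: [packet, ...]} for one host and one time slice,
    # as produced by reassemble_streams(); encodings are illustrative only.
    peer_ips, peer_ports, lengths = set(), set(), []
    up_bytes = down_bytes = 0
    for (end_a, end_b, _proto), pkts in streams.items():
        peer = end_b if end_a[0] == host else end_a   # the remote endpoint
        peer_ips.add(peer[0])
        peer_ports.add(peer[1])
        lengths.append(sum(p["size"] for p in pkts))
        up_bytes += sum(p["size"] for p in pkts if p["src_ip"] == host)
        down_bytes += sum(p["size"] for p in pkts if p["dst_ip"] == host)
    return [
        len(streams),          # number of flows in the slice
        len(peer_ips),         # IP access breadth
        len(peer_ports),       # breadth of contacted service ports
        sum(lengths),          # total flow length
        up_bytes,              # uplink data volume
        down_bytes,            # downlink data volume
    ]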
Step S103: use the feature vector set of each host as positive samples and the feature vector sets of the other N-1 hosts as the negative sample set, and train a detection model for each host.
Step S104: run detection with the detection model, and alarm on any feature vector set with a large difference, flagging it as an abnormal feature vector set.
The feature vector set is obtained by merging the feature vectors collected over a plurality of time segments and normalizing them, so that the feature data are convenient to use in later training.
Step S105: add the feature vectors that did not trigger an alarm into the feature vector set of the corresponding host, retrain and correct the detection model with the new feature vector set within a certain period of time to form the final detection model, and use the final detection model to detect abnormal hosts.
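A sketch of this correction step, assuming scikit-learn's RandomForestClassifier as the underlying model (an assumption; the patent does not prescribe any particular library): feature vectors that did not trigger an alarm during the retraining window are folded into the host's positive set before the model is refitted.

from sklearn.ensemble import RandomForestClassifier

def correct_host_model(host_vectors, other_host_vectors, non_alarmed_vectors, n_trees=100):
    # All inputs are lists of equal-length, normalized feature vectors.
    # Labels: 1 = belongs to this host's profile, 0 = other hosts.
    positives = list(host_vectors) + list(non_alarmed_vectors)
    X = positives + list(other_host_vectors)
    y = [1] * len(positives) + [0] * len(other_host_vectors)
    model = RandomForestClassifier(n_estimators=n_trees)
    model.fit(X, y)
    return model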
Optionally, the detection model is a random forest model, and the specific training method includes:
step S1031: define the feature vector set of the host A under detection as X and the feature vector sets of the other hosts as Y; after selecting the number m of decision trees, randomly sample and combine the feature vector set X and the feature vector set Y to construct a data set T:
T={(x1,y1),(x2,y2),(x3,y3)…(xm,ym)};
step S1032: establish the corresponding decision trees based on the data set T:
[Equation: voting result H(T) of a single decision tree, expressed in terms of the probabilities Pi; provided only as an image in the original publication]
where H(T) denotes the voting result of any single decision tree, and Pi denotes the probability that any data vector passes detection by (xi, yi);
step S1033: construct the final result of the random forest model by combining the votes of the plurality of decision trees:
RF = Σ_{i=1}^{m} wi·zi
where m is the number of decision trees, wi is the weight of the i-th tree, zi is the voting result of the i-th tree, and RF is the voting result of the random forest.
In this way a model is trained for each host: assuming N hosts, the feature sets of the other N-1 hosts are used as negative samples and the feature set of the host itself is used as positive samples for training. During training, hosts with large attribute differences can be grouped together for contrastive training based on prior knowledge (such as host attribute planning, network planning, etc.), which speeds up training.
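The per-host one-vs-rest training loop described above might look as follows, again assuming scikit-learn (an assumption, not a requirement of the embodiment); the optional peer_groups argument illustrates the prior-knowledge grouping mentioned in the previous paragraph, and is_abnormal() shows one simple way an alarm could be raised when the difference from the host's own profile is large.

from sklearn.ensemble import RandomForestClassifier

def train_host_models(feature_sets, peer_groups=None, n_trees=100):
    # feature_sets: {host: [feature_vector, ...]} of normalized vectors.
    # peer_groups: optional {host: [other_host, ...]} chosen from prior knowledge
    # (host attribute planning, network planning); by default all other N-1
    # hosts supply the negative samples.
    models = {}
    for host, positives in feature_sets.items():
        peers = (peer_groups or {}).get(host, [h for h in feature_sets if h != host])
        negatives = [v for h in peers for v in feature_sets[h]]
        X = list(positives) + negatives
        y = [1] * len(positives) + [0] * len(negatives)
        models[host] = RandomForestClassifier(n_estimators=n_trees).fit(X, y)
    return models

def is_abnormal(model, feature_vector, threshold=0.5):
    # Alarm when the forest's probability that the vector matches the host's
    # own profile falls below the threshold, i.e. the difference is large.
    p_own = model.predict_proba([feature_vector])[0][list(model.classes_).index(1)]
    return p_own < threshold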
By using a detection model based on behavior sequences, the method and apparatus can make fuller use of the host's network behavior information to discover abnormal behaviors in it, and can use the normal samples discovered at a later stage to correct the detection model, thereby slowing the decay of the detection model, adapting to different detection scenarios, and improving the stability of the model.
Example 2
As shown in fig. 2, in a second aspect, the present disclosure provides an abnormal host detection apparatus based on data flow sequences, including:
sequence construction unit 202, configured to select N sampling hosts, collect, for each host, the various data flows of a plurality of time slices, perform stream reassembly, and build the data stream sequences, where N is a natural number greater than 3;
vector extraction unit 204, configured to extract the self features of each data stream sequence and the associated features between data stream sequences, and vectorize the self features and the associated features to form a feature vector set for each host;
model construction unit 206, configured to train a detection model for each host using that host's feature vector set as positive samples and the feature vector sets of the other N-1 hosts as the negative sample set;
model detection unit 208, configured to run detection with the detection model and alarm on any feature vector set with a large difference, flagging it as an abnormal feature vector set;
result output unit 210, configured to add the feature vectors that did not trigger an alarm into the feature vector set of the corresponding host, retrain and correct the detection model with the new feature vector set within a certain period of time to form the final detection model, and use the final detection model to detect abnormal hosts.
Optionally, the self features include:
a port of a flow destination, a flow destination IP, a flow destination domain name, a DNS query domain name, a flow length, or a digital certificate.
Optionally, the associated features include:
access port sequence, access IP sequence, access domain name sequence, IP access breadth, accessed IP breadth, access domain name set, uplink and downlink data volume, uplink and downlink flow count, digital certificate set, open service port set, or connection initiation frequency.
As shown in fig. 3, the detection model is a random forest model, and specifically includes:
definition unit 302, configured to define the feature vector set of the host A under detection as X and the feature vector sets of the other hosts as Y, and, after selecting the number m of decision trees, randomly sample and combine the feature vector set X and the feature vector set Y to construct a data set T:
T={(x1,y1),(x2,y2),(x3,y3)…(xm,ym)};
establishing unit 304, configured to establish the corresponding decision trees based on the data set T:
[Equation: voting result H(T) of a single decision tree, expressed in terms of the probabilities Pi; provided only as an image in the original publication]
where H(T) denotes the voting result of any single decision tree, and Pi denotes the probability that any data vector passes detection by (xi, yi);
result unit 306, configured to construct the final result of the random forest model by combining the votes of the plurality of decision trees:
RF = Σ_{i=1}^{m} wi·zi
where m is the number of decision trees, wi is the weight of the i-th tree, zi is the voting result of the i-th tree, and RF is the voting result of the random forest.
By using a detection model based on behavior sequences, the method and apparatus can make fuller use of the host's network behavior information to discover abnormal behaviors in it, and can use the normal samples discovered at a later stage to correct the detection model, thereby slowing the decay of the detection model, adapting to different detection scenarios, and improving the stability of the model.
Example 3
The present disclosure provides a computer readable storage medium storing computer program instructions which, when invoked and executed by a processor, implement the method steps of any of the first aspects.
The storage medium has the advantage that, by applying the behavior-sequence-based detection model, the host's network behavior information can be used more comprehensively to discover abnormal behaviors, and normal samples discovered at a later stage can be used to correct the detection model, thereby slowing the decay of the detection model and improving its stability.
Example 4
As shown in fig. 4, the present disclosure provides an electronic device comprising a processor and a memory, wherein the memory stores computer program instructions executable by the processor, and the processor implements the method steps of any one of the first aspect when executing the computer program instructions.
Referring now to FIG. 4, a block diagram of an electronic device 400 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 4, electronic device 400 may include a processing device (e.g., central processing unit, graphics processor, etc.) 401 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)402 or a program loaded from a storage device 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the electronic apparatus 400 are also stored. The processing device 401, the ROM 402, and the RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
Generally, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 407 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 408 including, for example, tape, hard disk, etc.; and a communication device 409. The communication means 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data. While fig. 4 illustrates an electronic device 400 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 409, or from the storage device 408, or from the ROM 402. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing device 401.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring at least two internet protocol addresses; sending a node evaluation request comprising the at least two internet protocol addresses to node evaluation equipment, wherein the node evaluation equipment selects the internet protocol addresses from the at least two internet protocol addresses and returns the internet protocol addresses; receiving an internet protocol address returned by the node evaluation equipment; wherein the obtained internet protocol address indicates an edge node in the content distribution network.
Alternatively, the computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: receiving a node evaluation request comprising at least two internet protocol addresses; selecting an internet protocol address from the at least two internet protocol addresses; returning the selected internet protocol address; wherein the received internet protocol address indicates an edge node in the content distribution network.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".

Claims (8)

1. An abnormal host detection method based on data flow sequences, characterized by comprising the following steps:
step S101: selecting N sampling hosts; for each host, collecting the various data flows of a plurality of time slices, performing stream reassembly, and building the data stream sequences, wherein N is a natural number greater than 3;
step S102: extracting the self features of each data stream sequence and the associated features between data stream sequences; vectorizing the self features and the associated features to form a feature vector set for each host, wherein the associated features include: an access port sequence, an access IP sequence, an access domain name sequence, an IP access breadth, an accessed IP breadth, an access domain name set, uplink and downlink data volume, uplink and downlink flow count, a digital certificate set, an open service port set, or connection initiation frequency;
step S103: using the feature vector set of each host as positive samples and the feature vector sets of the other N-1 hosts as the negative sample set, and training a detection model for each host;
step S104: running detection with the detection model, and alarming on any feature vector set with a large difference, flagging it as an abnormal feature vector set;
step S105: adding the feature vectors that did not trigger an alarm into the feature vector set of the corresponding host, retraining and correcting the detection model with the new feature vector set within a certain period of time to form the final detection model, and using the final detection model to detect abnormal hosts.
2. The method of claim 1, wherein the self features comprise:
a port of a flow destination, a flow destination IP, a flow destination domain name, a DNS query domain name, a flow length, or a digital certificate.
3. The method as claimed in claim 2, wherein the detection model is a random forest model, and the specific training method comprises:
step S1031: defining the feature vector set of the host A under detection as X and the feature vector sets of the other hosts as Y, and, after selecting the number m of decision trees, randomly sampling and combining the feature vector set X and the feature vector set Y to construct a data set T:
T={(x1,y1),(x2,y2),(x3,y3)…(xm,ym)};
step S1032: establishing the corresponding decision trees based on the data set T:
[Equation: voting result H(T) of a single decision tree, expressed in terms of the probabilities Pi; provided only as an image in the original publication]
wherein H(T) represents the voting result of any single decision tree, and Pi represents the probability that any data vector passes detection by (xi, yi);
step S1033: constructing the final result of the random forest model by combining the votes of the plurality of decision trees:
RF = Σ_{i=1}^{m} wi·zi
wherein m is the number of decision trees, wi is the weight of the i-th tree, zi is the voting result of the i-th tree, and RF is the voting result of the random forest.
4. An abnormal host detection device based on data flow sequences, comprising:
a sequence construction unit, configured to select N sampling hosts, collect, for each host, the various data flows of a plurality of time slices, perform stream reassembly, and build the data stream sequences, wherein N is a natural number greater than 3;
a vector extraction unit, configured to extract the self features of each data stream sequence and the associated features between data stream sequences, and vectorize the self features and the associated features to form a feature vector set for each host, wherein the associated features include: an access port sequence, an access IP sequence, an access domain name sequence, an IP access breadth, an accessed IP breadth, an access domain name set, uplink and downlink data volume, uplink and downlink flow count, a digital certificate set, an open service port set, or connection initiation frequency;
a model construction unit, configured to train a detection model for each host using that host's feature vector set as positive samples and the feature vector sets of the other N-1 hosts as the negative sample set;
a model detection unit, configured to run detection with the detection model and alarm on any feature vector set with a large difference, flagging it as an abnormal feature vector set;
a result output unit, configured to add the feature vectors that did not trigger an alarm into the feature vector set of the corresponding host, retrain and correct the detection model with the new feature vector set within a certain period of time to form the final detection model, and use the final detection model to detect abnormal hosts.
5. The apparatus of claim 4, wherein the self features comprise:
a port of a flow destination, a flow destination IP, a flow destination domain name, a DNS query domain name, a flow length, or a digital certificate.
6. The apparatus according to claim 5, wherein the detection model is a random forest model, and specifically comprises:
a definition unit, configured to define the feature vector set of the host A under detection as X and the feature vector sets of the other hosts as Y, and, after selecting the number m of decision trees, randomly sample and combine the feature vector set X and the feature vector set Y to construct a data set T:
T={(x1,y1),(x2,y2),(x3,y3)…(xm,ym)};
an establishing unit, configured to establish the corresponding decision trees based on the data set T:
[Equation: voting result H(T) of a single decision tree, expressed in terms of the probabilities Pi; provided only as an image in the original publication]
wherein H(T) represents the voting result of any single decision tree, and Pi represents the probability that any data vector passes detection by (xi, yi);
a result unit, configured to construct the final result of the random forest model by combining the votes of the plurality of decision trees:
RF = Σ_{i=1}^{m} wi·zi
wherein m is the number of decision trees, wi is the weight of the i-th tree, zi is the voting result of the i-th tree, and RF is the voting result of the random forest.
7. An electronic device comprising a processor and a memory, the memory storing computer program instructions executable by the processor, the processor implementing the method steps of any of claims 1-3 when executing the computer program instructions.
8. A computer-readable storage medium, characterized in that computer program instructions are stored which, when called and executed by a processor, implement the method steps of any of claims 1-3.
CN201910326907.XA 2019-04-23 2019-04-23 Abnormal host detection method, device, equipment and medium based on data stream sequence Active CN110138745B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910326907.XA CN110138745B (en) 2019-04-23 2019-04-23 Abnormal host detection method, device, equipment and medium based on data stream sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910326907.XA CN110138745B (en) 2019-04-23 2019-04-23 Abnormal host detection method, device, equipment and medium based on data stream sequence

Publications (2)

Publication Number Publication Date
CN110138745A CN110138745A (en) 2019-08-16
CN110138745B (en) 2021-08-24

Family

ID=67570749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910326907.XA Active CN110138745B (en) 2019-04-23 2019-04-23 Abnormal host detection method, device, equipment and medium based on data stream sequence

Country Status (1)

Country Link
CN (1) CN110138745B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110830328B (en) * 2019-11-27 2021-08-03 厦门网宿有限公司 Method and device for detecting abnormity of network link
CN113746780B (en) * 2020-05-27 2023-06-20 极客信安(北京)科技有限公司 Abnormal host detection method, device, medium and equipment based on host image
CN113839912B (en) * 2020-06-24 2023-08-22 极客信安(北京)科技有限公司 Method, device, medium and equipment for analyzing abnormal host by active and passive combination
CN114205095B (en) * 2020-08-27 2023-08-18 极客信安(北京)科技有限公司 Method and device for detecting encrypted malicious traffic
CN112671551B (en) * 2020-11-23 2022-11-18 中国船舶重工集团公司第七0九研究所 Network traffic prediction method and system based on event correlation
CN112822167B (en) * 2020-12-31 2023-04-07 杭州中电安科现代科技有限公司 Abnormal TLS encrypted traffic detection method and system
CN112905671A (en) * 2021-03-24 2021-06-04 北京必示科技有限公司 Time series exception handling method and device, electronic equipment and storage medium
CN113271292B (en) * 2021-04-07 2022-05-10 中国科学院信息工程研究所 Malicious domain name cluster detection method and device based on word vectors

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103856370A (en) * 2014-02-25 2014-06-11 中国科学院计算技术研究所 Application flow recognition method and system
CN107153789A (en) * 2017-04-24 2017-09-12 西安电子科技大学 The method for detecting Android Malware in real time using random forest grader
CN107992746A (en) * 2017-12-14 2018-05-04 华中师范大学 Malicious act method for digging and device
CN109379377A (en) * 2018-11-30 2019-02-22 极客信安(北京)科技有限公司 Encrypt malicious traffic stream detection method, device, electronic equipment and storage medium
CN109495513A (en) * 2018-12-29 2019-03-19 极客信安(北京)科技有限公司 Unsupervised encryption malicious traffic stream detection method, device, equipment and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10050892B2 (en) * 2014-01-14 2018-08-14 Marvell International Ltd. Method and apparatus for packet classification

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103856370A (en) * 2014-02-25 2014-06-11 中国科学院计算技术研究所 Application flow recognition method and system
CN107153789A (en) * 2017-04-24 2017-09-12 西安电子科技大学 The method for detecting Android Malware in real time using random forest grader
CN107992746A (en) * 2017-12-14 2018-05-04 华中师范大学 Malicious act method for digging and device
CN109379377A (en) * 2018-11-30 2019-02-22 极客信安(北京)科技有限公司 Encrypt malicious traffic stream detection method, device, electronic equipment and storage medium
CN109495513A (en) * 2018-12-29 2019-03-19 极客信安(北京)科技有限公司 Unsupervised encryption malicious traffic stream detection method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of a Network User Identification System Based on Behavioral Similarity; Zeng Siyuan; China Master's Theses Full-text Database; 2018-10-15; full text *

Also Published As

Publication number Publication date
CN110138745A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
CN110138745B (en) Abnormal host detection method, device, equipment and medium based on data stream sequence
CN109379377B (en) Encrypted malicious traffic detection method and device, electronic equipment and storage medium
CN109495513B (en) Unsupervised encrypted malicious traffic detection method, unsupervised encrypted malicious traffic detection device, unsupervised encrypted malicious traffic detection equipment and unsupervised encrypted malicious traffic detection medium
CN106982230B (en) Flow detection method and system
US20180248879A1 (en) Method and apparatus for setting access privilege, server and storage medium
JP7120350B2 (en) SECURITY INFORMATION ANALYSIS METHOD, SECURITY INFORMATION ANALYSIS SYSTEM AND PROGRAM
US9491186B2 (en) Method and apparatus for providing hierarchical pattern recognition of communication network data
JP2016091549A (en) Systems, devices, and methods for separating malware and background events
CN110046297A (en) Recognition methods, device and the storage medium of O&M violation operation
CN113746780B (en) Abnormal host detection method, device, medium and equipment based on host image
CN114422271B (en) Data processing method, device, equipment and readable storage medium
CN115514558A (en) Intrusion detection method, device, equipment and medium
US11158315B2 (en) Secure speech recognition
US11165779B2 (en) Generating a custom blacklist for a listening device based on usage
CN114840634B (en) Information storage method and device, electronic equipment and computer readable medium
CN116595523A (en) Multi-engine file detection method, system, equipment and medium based on dynamic arrangement
CN115906064A (en) Detection method, detection device, electronic equipment and computer readable medium
CN113794731B (en) Method, device, equipment and medium for identifying CDN (content delivery network) -based traffic masquerading attack
CN113452810B (en) Traffic classification method, device, equipment and medium
CN115470489A (en) Detection model training method, detection method, device and computer readable medium
CN114726823A (en) Domain name generation method, device and equipment based on generation countermeasure network
CN111432080A (en) Ticket data processing method, electronic equipment and computer readable storage medium
CN114205095B (en) Method and device for detecting encrypted malicious traffic
CN113572768B (en) Analysis method for abnormal change of number of botnet family propagation sources
US20230289227A1 (en) Multi-Computer System for Forecasting Data Surges

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20211208

Address after: 610000 No. 1, floor 1, No. 109, hongdoushu street, Jinjiang District, Chengdu, Sichuan

Patentee after: Geek Xin'an (Chengdu) Technology Co.,Ltd.

Address before: 100080 room 61306, 3 / F, Beijing Friendship Hotel, 1 Zhongguancun South Street, Haidian District, Beijing

Patentee before: JIKE XIN'AN (BEIJING) TECHNOLOGY Co.,Ltd.
