CN115514537A

CN115514537A - Method and system for judging suspicious traffic in encrypted traffic

Info

Publication number: CN115514537A
Application number: CN202211070466.XA
Authority: CN
Inventors: 卢国鸣
Original assignee: Shanghai Xingrong Information Technology Co ltd
Current assignee: Shanghai Xingrong Information Technology Co ltd
Priority date: 2022-09-02
Filing date: 2022-09-02
Publication date: 2022-12-23

Abstract

Embodiments of this specification provide a method and a system for determining suspicious traffic in encrypted traffic. The method includes: collecting encrypted traffic to be detected and extracting encrypted traffic characteristics of the encrypted traffic to be detected, where the encrypted traffic characteristics comprise first traffic characteristics, and the first traffic characteristics comprise access characteristic information, protocol characteristic information, and transfer characteristic information; and determining the traffic type of the encrypted traffic to be detected based on its encrypted traffic characteristics, where the traffic type comprises normal traffic and suspicious traffic, and the suspicious traffic is used for subsequent decryption analysis of the encrypted traffic to be detected.

Description

Method and system for judging suspicious traffic in encrypted traffic

Technical Field

The present disclosure relates to the field of network security, and in particular, to a method and a system for determining suspicious traffic in encrypted traffic.

Background

With the development of the internet, people have become increasingly aware of privacy, and the demand for traffic encryption has grown accordingly. However, while encryption protects privacy, it also makes it easier to hide malicious traffic, and encrypted malicious traffic conceals many known and unknown threats. It is therefore desirable to provide a method for determining suspicious traffic in encrypted traffic that improves network security protection capability.

Disclosure of Invention

One or more embodiments of this specification provide a method for determining suspicious traffic in encrypted traffic. The method comprises: collecting encrypted traffic to be detected and extracting encrypted traffic characteristics of the encrypted traffic to be detected, wherein the encrypted traffic characteristics include first traffic characteristics, and the first traffic characteristics include access characteristic information, protocol characteristic information, and transfer characteristic information; and determining the traffic type of the encrypted traffic to be detected based on its encrypted traffic characteristics, wherein the traffic type comprises normal traffic and suspicious traffic, and the suspicious traffic is used for subsequent decryption analysis of the encrypted traffic to be detected.

One or more embodiments of this specification provide a system for determining suspicious traffic in encrypted traffic. The system includes: a traffic characteristic acquisition module, configured to collect encrypted traffic to be detected and extract encrypted traffic characteristics of the encrypted traffic to be detected, wherein the encrypted traffic characteristics include first traffic characteristics, and the first traffic characteristics include access characteristic information, protocol characteristic information, and transfer characteristic information; and a traffic type determination module, configured to determine the traffic type of the encrypted traffic to be detected based on its encrypted traffic characteristics, wherein the traffic type comprises normal traffic and suspicious traffic, and the suspicious traffic is used for subsequent decryption analysis of the encrypted traffic to be detected.

One or more embodiments of the present specification provide an apparatus for determining suspicious traffic in encrypted traffic, including a processor, where the processor is configured to execute at least a part of the computer instructions to implement a method for determining suspicious traffic in encrypted traffic.

One or more embodiments of the present description provide a computer-readable storage medium storing computer instructions that, when executed by a processor, implement a method for determining suspicious traffic in encrypted traffic.

Drawings

The present description will be further explained by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. These embodiments are not intended to be limiting, and in these embodiments like numerals are used to indicate like structures, wherein:

Fig. 1 is a schematic view of an application scenario of a system for determining suspicious traffic in encrypted traffic according to some embodiments of the present description;

Fig. 2 is an exemplary block diagram of a system for determining suspicious traffic in encrypted traffic according to some embodiments of the present description;

Fig. 3 is an exemplary flow diagram of a method for determining suspicious traffic in encrypted traffic according to some embodiments of the present description;

Fig. 4 is an exemplary illustration of obtaining a second traffic characteristic through a relationship graph according to some embodiments of the present description;

Fig. 5 is a schematic diagram of a suspicious traffic identification model according to some embodiments of the present description.

Detailed Description

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, the present description can also be applied to other similar scenarios on the basis of these drawings without inventive effort. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.

It should be understood that "system", "apparatus", "unit" and/or "module" as used herein are terms for distinguishing different components, elements, parts, portions, or assemblies at different levels. However, these words may be replaced by other expressions if those expressions accomplish the same purpose.

As used in this specification and the appended claims, the terms "a," "an," and/or "the" do not refer specifically to the singular and may also include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; these steps and elements do not form an exclusive list, and a method or apparatus may include other steps or elements.

Flow charts are used in this description to illustrate operations performed by a system according to embodiments of the present description. It should be understood that the operations are not necessarily performed in the exact order shown. Rather, the steps may be processed in reverse order or simultaneously. Other operations may also be added to these processes, or one or more steps may be removed from them.

Fig. 1 is a schematic application scenario of a system for determining suspicious traffic in encrypted traffic according to some embodiments of the present disclosure.

DPI (Deep Packet Inspection) means that a device inspects and analyzes traffic and packet content at a key point of the network and can filter and control the inspected traffic according to predefined policies. On the link where the device is located, this supports functions such as fine-grained service identification, traffic direction analysis, traffic share statistics, traffic share shaping, defense against application-layer denial-of-service attacks, filtering of viruses and trojans, and control of P2P abuse. For example, the decryption DPI module may perform subsequent decryption analysis on the encrypted traffic to be detected when the traffic type of that traffic is determined to be suspicious traffic.

As shown in Fig. 1, the application scenario 100 may include a network 110, a router 120, a processor 130, encrypted traffic 140, and a traffic determination result 150. Router 120 may obtain the encrypted traffic 140 to be detected from network 110, and processor 130 may copy the encrypted traffic 140 from router 120 in order to collect it and generate the traffic determination result 150.

The network 110 may include any suitable network capable of facilitating the exchange of information and/or data in the application scenario 100. The router 120 of the application scenario 100 may exchange information and/or data with the network 110. For example, network 110 may send user-generated traffic information to router 120. In some embodiments, the network 110 may be any one or more of a wired network or a wireless network. In some embodiments, network 110 may include one or more network access points, for example wired or wireless network access points. In some embodiments, the network may have a point-to-point, shared, or centralized topology, or a combination of several topologies.

Router 120 may be a network device that reads the address in a packet and then stores, groups, and forwards the packet. In some embodiments, router 120 may be used to connect two or more networks 110. In some embodiments, router 120 receives encrypted traffic 140 from network 110 and forwards the encrypted traffic 140 stored in router 120 to processor 130. The router 120 may be local or remote.

Processor 130 may include an execution device for performing a method for determining suspicious traffic in encrypted traffic 140, and may process data and/or information obtained from router 120, and perform the method for determining suspicious traffic in encrypted traffic provided in this specification according to the related data, so as to generate traffic determination result 150. For example, processor 130 may determine a traffic characteristic according to the encrypted traffic information received by router 120, and determine whether encrypted traffic 140 is suspicious traffic based on the traffic characteristic, thereby generating traffic determination result 150. In some embodiments, processor 130 may be a single server or a group of servers. In some embodiments, processor 130 may be integrated with the suspicious traffic determination system (e.g., integrated within router 120). The processor 130 may be local or remote. The processor 130 may be implemented on a cloud platform.

The traffic may be traffic generated by a user while using the internet. In some embodiments, the traffic may be encrypted or unencrypted. Traffic is encrypted to defend against eavesdropping and man-in-the-middle attacks, so that web pages essentially cannot be tampered with and the user's internet access remains secure. However, malicious traffic may still be hidden in encrypted traffic 140. In some embodiments, the determination of malicious traffic is performed by processor 130. For example, encrypted traffic 140 containing malicious traffic is transmitted by network 110 to router 120 and then determined to be suspicious by processor 130.

The traffic determination result 150 may indicate that encrypted traffic 140 is suspicious traffic or that encrypted traffic 140 is normal traffic. In some embodiments, the traffic determination result 150 is generated by the processor 130.

It should be noted that the application scenario 100 is provided for illustrative purposes only and is not intended to limit the scope of the present application. It will be apparent to those skilled in the art that various modifications and variations can be made in light of the description herein. For example, the application scenario 100 may also include an information source. However, such changes and modifications do not depart from the scope of the present application.

Fig. 2 is a block diagram of a system for determining suspicious traffic in encrypted traffic according to some embodiments described herein.

As shown in fig. 2, in some embodiments, the suspicious traffic determination system 200 may include a traffic characteristic acquisition module 210 and a traffic type determination module 220.

The traffic characteristic acquisition module 210 may be configured to collect encrypted traffic to be detected and extract encrypted traffic characteristics of the encrypted traffic to be detected. In some embodiments, the encrypted traffic characteristics may include first traffic characteristics, which may include access characteristic information, protocol characteristic information, and transfer characteristic information. For details regarding the traffic characteristics, see step 310 and its associated description.

The traffic type determination module 220 may be configured to determine the traffic type of the encrypted traffic to be detected based on its encrypted traffic characteristics. In some embodiments, the traffic types may include normal traffic and suspicious traffic, where suspicious traffic is used for subsequent decryption analysis of the encrypted traffic to be detected. For details regarding the traffic type determination, see step 320 and its associated description.

In some embodiments, the traffic type determination module 220 may be further configured to process the encrypted traffic characteristics of the encrypted traffic to be detected with a suspicious traffic identification model and thereby determine the traffic type of the encrypted traffic to be detected, where the suspicious traffic identification model is a machine learning model. Details of the suspicious traffic identification model can be found in Fig. 5 and its associated description.

In some embodiments, the suspicious traffic determination system 200 may further include a decryption DPI module 230, configured to perform subsequent decryption analysis on the encrypted traffic to be detected when the traffic type of that traffic is suspicious traffic.

It should be noted that the above descriptions of the suspicious traffic determination system 200 and its modules are merely for convenience of description and are not intended to limit the present disclosure to the scope of the illustrated embodiments. Those skilled in the art will appreciate that, given an understanding of the principle of the system, the modules may be combined arbitrarily, or a subsystem may be formed and connected to other modules, without departing from that principle. In some embodiments, the traffic characteristic acquisition module 210 and the traffic type determination module 220 disclosed in Fig. 2 may be different modules in one system, or a single module may implement the functions of two or more modules. For example, the modules may share one memory module, or each module may have its own memory module. Such variations are within the scope of the present disclosure.

Fig. 3 is an exemplary flow diagram of a method for determining suspicious traffic in encrypted traffic according to some embodiments described herein. In some embodiments, the process 300 may be performed by the processor 130. As shown in fig. 3, the process 300 includes the following steps.

Step 310: collect the encrypted traffic to be detected and extract the encrypted traffic characteristics of the encrypted traffic to be detected.

Encrypted traffic refers to traffic in the network traffic that has been encrypted. The encrypted traffic may be encrypted by the user or by the service provider to protect privacy. For example, for users who need to process traffic online based on internet communication, encryption mechanisms can be relied on in mobile applications, cloud applications, and Web applications, through data encryption processes, using keys and certificates, etc., to ensure security and establish trust.

The basic process of data encryption is to process a file or data (traffic) that was originally plaintext according to an algorithm so that it becomes an unreadable piece of code, generally called "ciphertext". Encrypting data in this way protects it from being illegally stolen and read. In some embodiments, encrypted traffic may include encrypted normal traffic and encrypted suspicious traffic. Encrypted suspicious traffic often disguises or hides the characteristics of malicious traffic. For example, it often disguises or hides trojan horses, infectious viruses, worms, malicious downloaders, and other software with attack behavior, and launches attacks on a server, causing the server to crash or suffer other problems.

The encrypted traffic characteristics are traffic characteristics associated with the encrypted traffic to be detected. They may include five-tuple information, encryption protocol information, and statistics such as the average packet size and the average packet transmission interval. The five-tuple information includes the source IP address, source port, destination IP address, destination port, and transport layer protocol. The encryption protocol information refers to the messages of the secure communication protocol that are exchanged between the server and the client during the authentication process. The authentication process includes: the client sends a message to the server; the server responds to the client with a message authenticating itself; and the client and the server complete the key exchange, which ends the authentication process. In some embodiments, the encryption protocol information may include the TLS/SSL protocol version, extension fields, and the like. The average packet size is the average length of the data in several packets, expressed in bytes; for example, the average length of ten IP packets is 1000 bytes. The average packet transmission interval is the average time between the transmission of one data frame and the transmission of the next during data transmission; for example, a data frame is transmitted every 2 seconds on average.
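
As a minimal sketch of the statistics described above, the following Python code derives a five-tuple, the average packet size, and the average packet transmission interval from a list of captured packets. The `Packet` record, its field names, and the sample addresses are assumptions made for illustration; the specification does not prescribe any data structure.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Packet:
    """One captured packet; all field names are illustrative assumptions."""
    timestamp: float  # capture time in seconds
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str     # transport layer protocol, e.g. "TCP"
    length: int       # packet length in bytes


def five_tuple(pkt):
    """Return (source IP, source port, destination IP, destination port, protocol)."""
    return (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port, pkt.protocol)


def flow_statistics(packets):
    """Return the average packet size and the average gap between packets."""
    if not packets:
        raise ValueError("at least one packet is required")
    ordered = sorted(packets, key=lambda p: p.timestamp)
    avg_size = mean(p.length for p in ordered)
    # Gaps between consecutive packets; a single packet has no gap.
    gaps = [b.timestamp - a.timestamp for a, b in zip(ordered, ordered[1:])]
    avg_interval = mean(gaps) if gaps else 0.0
    return {"avg_packet_size": avg_size, "avg_interval": avg_interval}


packets = [
    Packet(0.0, "10.0.0.5", 50000, "203.0.113.7", 443, "TCP", 900),
    Packet(2.0, "10.0.0.5", 50000, "203.0.113.7", 443, "TCP", 1000),
    Packet(4.0, "10.0.0.5", 50000, "203.0.113.7", 443, "TCP", 1100),
]
stats = flow_statistics(packets)
```

With these three sample packets the average size is 1000 bytes and the average interval is 2 seconds, matching the magnitudes used in the examples above.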

In some embodiments, the encrypted traffic characteristics may include a first traffic characteristic, the first traffic characteristic being a characteristic related to content contained in the encrypted traffic to be tested. In some embodiments, the first traffic characteristics may include access characteristic information, protocol characteristic information, and transfer characteristic information.

The access characteristic information refers to characteristic information related to access. The access may refer to a process in which a visitor actively searches for a specific purpose using a network platform, and traffic may be generated during the access process. For example, the encrypted traffic may be traffic generated by a visitor clicking a website URL collected in a bookmark or traffic generated by a visitor directly entering a website in a browser address bar. The access characteristic information may be used to distinguish between different sessions, e.g. communication between different users. In some embodiments, the access characteristic information may include source IP address, source port, destination IP address, destination port, and the like.

Whether the encrypted traffic is suspicious traffic may be determined based on the access characteristic information. For example, if the number of accesses from the IP address corresponding to a certain browser within one day is 500% higher than before, it is necessary to check whether the increase in access data is caused by suspicious traffic.

The protocol feature information refers to feature information related to a protocol. The protocol characteristic information may be used to distinguish the manner in which network traffic is transmitted, e.g., whether it is encrypted or not. In some embodiments, the protocol characteristic information may include transmission protocol information, encryption protocol information, and the like.

The probability that the encrypted traffic is suspicious traffic can be determined based on the protocol characteristic information. For example, statistics on historical malicious traffic may show that it tends to choose more covert encrypted transmission protocols. By identifying the encryption protocol used by the encrypted traffic, such as Secure Sockets Layer (SSL), it can be determined whether that protocol is one that malicious traffic is more likely to use.

The transfer characteristic information refers to characteristic information related to information transfer. In some embodiments, the transfer characteristic information may include an average packet size, an average packet transmission interval, and the like.

Whether the encrypted traffic is suspicious traffic may be determined based on the transfer characteristic information. For example, suppose a certain network platform is normally accessed between 08:00 and 18:00, with an average packet size of 512 bytes and an average packet transmission interval of 20 ms. If on a certain day a large number of dense accesses occur between 02:00 and 04:00, with an average packet size of 1500 bytes and an average packet transmission interval that drops abnormally to 6 ms, this abnormal traffic can be determined to be suspicious traffic.
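
The comparison in this example can be sketched as a simple rule check. This is only an illustration, not the determination method of the specification (which may use a machine learning model): the `TransferProfile` record, the function name, and the two ratio thresholds are all assumptions chosen so that the numbers from the example trigger the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferProfile:
    """Baseline for one platform; fields are illustrative assumptions."""
    active_hours: range       # hours of the day with normal access
    avg_packet_size: float    # bytes
    avg_interval_ms: float    # milliseconds


def looks_suspicious(profile, hour, avg_packet_size, avg_interval_ms,
                     size_ratio=2.0, interval_ratio=0.5):
    """Flag traffic that is off-hours, unusually large, and unusually dense.

    size_ratio and interval_ratio are hypothetical thresholds, not values
    taken from the specification.
    """
    off_hours = hour not in profile.active_hours
    large_packets = avg_packet_size >= size_ratio * profile.avg_packet_size
    dense_sending = avg_interval_ms <= interval_ratio * profile.avg_interval_ms
    return off_hours and large_packets and dense_sending


# Baseline from the example: 08:00 to 18:00, 512 bytes, 20 ms.
baseline = TransferProfile(range(8, 18), 512.0, 20.0)
night_burst = looks_suspicious(baseline, hour=3,
                               avg_packet_size=1500, avg_interval_ms=6)
daytime = looks_suspicious(baseline, hour=10,
                           avg_packet_size=520, avg_interval_ms=19)
```

Here `night_burst` is `True` and `daytime` is `False`. In practice the thresholds would need to be tuned per platform.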

In some embodiments, the first traffic characteristics may further include a byte distribution probability vector.

In the field of computer security, data is transmitted from a sender to a receiver in the form of data packets. A data packet consists of a header and the data actually being sent, which is called the payload; the receiver can determine the payload size by subtracting the length of the IP header from the total length of the IP data packet. The header is attached to the payload for transmission and discarded once the packet has reached its destination. The payload is the main vehicle by which malicious traffic spreads viruses. A malicious payload may, for example, corrupt data, carry messages with offensive text, or send bulk email to a large number of recipients. Byte distribution refers to the count of each byte value in the packet payload. For example, the byte distribution of a packet may be: in the payload, the byte value "00000001" appears 10 times, the byte value "00000011" appears 15 times, …, and the byte value "11111111" appears 5 times. The byte distribution probability refers to the probability of occurrence of each byte value in the payload; in some embodiments it may be approximated by the frequency of occurrence of that byte value. The byte distribution probability vector is the vector formed by the probabilities with which each of the 256 values a byte can take appears in the data stream. It carries a large amount of information about data encoding and data padding, in which the illegitimate behavior of much malicious traffic is often hidden. In some embodiments, the count for each byte value may be divided by the total number of bytes in the payload to obtain the byte distribution frequency, which is used to represent the byte distribution probability; the feature is ultimately represented as a 1 × 256 byte distribution probability vector.

For example, malicious traffic may use certain fields of the HTTP header (e.g., Content-Type, Server) to initiate malicious activity, so these HTTP fields are good indicators of such activity. An HTTP context flow refers to all HTTP flows issued by the same source IP address within a 5-minute window of a flow that uses the secure transport protocol TLS (Transport Layer Security). All observed HTTP header information is represented by a feature vector of binary variables: if any HTTP flow in the context has a specific header value (i.e., a header value associated with malicious traffic), the corresponding variable is 1 regardless of the other HTTP flows. For a byte distribution probability vector P1, the processor 130 may count the flows in the network traffic whose vector is P1 within a preset time period, for example 100 flows. If the HTTP header feature is 1 for 60 of them, that is, the HTTP header features indicate that these 60 flows are malicious traffic, then the frequency with which traffic whose byte distribution probability vector is P1 is malicious is 60%.
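
The byte distribution probability vector and the HTTP context feature described above can be sketched in Python as follows. The dictionary layout of a flow, the flagged header value `hypothetical-bad-server`, and the reading of "within a 5-minute window" as an absolute time difference of at most 300 seconds are assumptions made for illustration only.

```python
from collections import Counter

WINDOW_SECONDS = 5 * 60  # the 5-minute HTTP context window from the text


def byte_distribution_probability(payload):
    """Return a 256-element list: frequency of each byte value in payload."""
    if not payload:
        return [0.0] * 256  # assumption: an empty payload maps to all zeros
    counts = Counter(payload)  # iterating over bytes yields integers 0..255
    total = len(payload)
    return [counts[value] / total for value in range(256)]


def http_context_flag(tls_flow, http_flows, flagged_headers):
    """Return 1 if any HTTP flow in the context has a flagged header value."""
    for flow in http_flows:
        if flow["src_ip"] != tls_flow["src_ip"]:
            continue  # a different source IP is not part of the context
        if abs(flow["timestamp"] - tls_flow["timestamp"]) > WINDOW_SECONDS:
            continue  # outside the assumed 5-minute window
        if any(flow["headers"].get(name) == value
               for name, value in flagged_headers):
            return 1
    return 0


def malicious_frequency(flags):
    """Share of flows whose HTTP header feature is 1."""
    return sum(flags) / len(flags) if flags else 0.0


# Payload from the example: three byte values appearing 10, 15 and 5 times.
payload = bytes([0x01] * 10 + [0x03] * 15 + [0xFF] * 5)
vector = byte_distribution_probability(payload)

flagged = [("Server", "hypothetical-bad-server")]  # hypothetical value
tls_flow = {"src_ip": "10.0.0.5", "timestamp": 1000.0}
http_flows = [
    {"src_ip": "10.0.0.5", "timestamp": 1100.0,
     "headers": {"Content-Type": "text/html"}},
    {"src_ip": "10.0.0.5", "timestamp": 1200.0,
     "headers": {"Server": "hypothetical-bad-server"}},
]
flag = http_context_flag(tls_flow, http_flows, flagged)

# 60 of 100 flows sharing the same vector have the header feature set.
frequency = malicious_frequency([1] * 60 + [0] * 40)
```

In this sketch `vector[3]` is 0.5 (15 of 30 bytes), `flag` is 1, and `frequency` is 0.6, which corresponds to the 60% in the example.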

The encrypted traffic to be detected can be collected by different traffic collection methods, including but not limited to packet sniffing (Sniffer), SNMP (Simple Network Management Protocol), NetFlow, and sFlow.

In some embodiments, a sniffer may be employed to collect encrypted traffic. For example only, a data collection point may be set at a mirror port of a switch, and the encrypted traffic to be detected may be collected by copying all data in the network through the mirror port.

After the encrypted traffic to be detected is collected, its encrypted traffic characteristics are extracted. Encrypted traffic characteristics may be extracted in a variety of ways, such as with libraries for extracting basic information from encrypted traffic (e.g., flowcontainer), encrypted traffic feature extraction tools (e.g., Wireshark, QPA, Tstat), or other encrypted traffic extraction algorithms, machine learning models, and so forth.

In some embodiments, the encrypted traffic characteristics may also include a second traffic characteristic. The second traffic characteristic is derived from the content of the encrypted traffic to be detected rather than read directly from it. For example, after the destination domain name is extracted from the encrypted traffic to be detected, the second traffic characteristic may be determined from external content related to that domain name (for example, by querying the domain name on the internet or in a knowledge graph). In some embodiments, the second traffic characteristic may include domain name popularity (also referred to as domain name heat).

Domain name popularity refers to the degree to which malicious traffic tends to access a domain name. In some embodiments, the domain name popularity may include the number of times (or the frequency, probability, etc.) that malicious traffic accesses the domain name. The more frequently malicious traffic accesses a domain name, the higher its domain name popularity. In some embodiments, if the network traffic involves a domain name with high popularity, the traffic type determination module 220 may determine that the network traffic has a higher probability of being suspicious traffic.

In some embodiments, the traffic characteristic acquisition module 210 may obtain the second traffic characteristic of the encrypted traffic to be detected. In some embodiments, the second traffic characteristic may be expressed as a score. For example, when the second traffic characteristic is domain name popularity, a higher score indicates higher domain name popularity and a lower score indicates lower domain name popularity. The score can be obtained from the historical access record of the domain name, user reports concerning the domain name, and the like. In some embodiments, the second traffic characteristic may be used to determine whether the domain name is prone to being accessed by malicious traffic, and may further be used for traffic type determination.

In some embodiments, the second traffic characteristic may be obtained through a relationship graph.

Fig. 4 is an exemplary illustration of obtaining a second traffic characteristic through a relationship graph according to some embodiments of the present description.

The relationship graph may include domain name nodes, entity nodes, edges connecting entity nodes to each other, and edges connecting entity nodes to domain name nodes. The edge attributes of an edge may include communication-related data as well as the traffic type. For example, a domain name node may be a domain name ending in .com, and the entity node may be the IP address 207.46.197.101 corresponding to that domain name. One IP address may correspond to multiple domain names, whereas each domain name is treated here as corresponding to a single IP address. When a user types in a domain name, the domain name is first sent to the Domain Name System (DNS), which resolves it into the IP address of the corresponding website; this process is called domain name resolution. The access of a client host to a server is completed through the domain name node and the entity node.

In some embodiments, the traffic characteristic acquisition module 210 may construct a relationship graph based on the information of the encrypted traffic to be detected, and obtain the second traffic characteristic from the malicious proximity value determined from the relationship graph. The malicious proximity value of a node (e.g., node A) is the number of edges at that node that satisfy preset conditions. The preset conditions may include: the edge points to node A, that is, node A is its end point; and the traffic type of the edge is malicious traffic.
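
A minimal sketch of this count is given below, assuming the relationship graph is available as a list of directed edges of the form `(source, target, traffic_type)`. The tuple layout, the node labels, and the `"malicious"` and `"normal"` strings are illustrative assumptions.

```python
from collections import Counter


def malicious_proximity(edges):
    """Count, per node, the incoming edges whose traffic type is malicious.

    edges: iterable of (source, target, traffic_type) tuples, where the
    edge points from source to target.
    """
    counts = Counter()
    for source, target, traffic_type in edges:
        # Make sure every node appears in the result, even with a count of 0.
        counts.setdefault(source, 0)
        counts.setdefault(target, 0)
        if traffic_type == "malicious":
            counts[target] += 1  # only edges ending at the node are counted
    return dict(counts)


# Hypothetical graph: A, B, C are domain name nodes; 1, 2, 3 are entity nodes.
edges = [
    ("A", "1", "malicious"),
    ("A", "2", "malicious"),
    ("B", "2", "malicious"),
    ("C", "2", "malicious"),
    ("B", "3", "malicious"),
    ("C", "3", "malicious"),
    ("1", "B", "normal"),
    ("A", "3", "normal"),
]
proximity = malicious_proximity(edges)
```

For this sample graph, node 2 has a malicious proximity value of 3, node 3 a value of 2, and node 1 a value of 1; edges marked normal and edges leaving a node do not contribute.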

As shown in Fig. 4, the relationship graph 410 may include domain name nodes 420 (e.g., node A, node B, node C), entity nodes 430 (e.g., node 1, node 2, node 3), and edges 440 connecting the nodes. The edges are directed edges.

In some embodiments, the traffic characteristic acquisition module 210 may construct the edges of the relationship graph from the communications between the nodes. The communication represented by an edge is an abstract communication, which may comprise multiple information interactions within a short period of time. For example, a directed edge between node A and node B indicates that multiple information interactions took place between node A and node B within a short time, and these interactions can be regarded as one abstract communication. The direction of the edge may be determined by the initiator of the first information interaction. For example, if the first of these interactions was initiated from A to B, the direction of the edge is determined to be from A to B. In some embodiments, node A and node B may be connected by multiple directed edges when there are multiple communications between them (e.g., communications occurring at different times over a longer time span). In some embodiments, the processor 130 may count, based on the edge attributes, the number of edges between nodes whose traffic type is "malicious traffic". A malicious proximity value of 0 indicates that no edge with the traffic type malicious traffic exists between the two nodes, a value of 1 indicates that one such edge exists, a value of 2 indicates that two such edges exist, and so on. In some embodiments, the second traffic characteristic may be determined from the malicious proximity value. For more on determining the second traffic characteristic based on the malicious proximity value, see Fig. 4 and its associated description.
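
The grouping of information interactions into abstract communications can be sketched as follows. The specification does not say how long "a short period of time" is, so the `gap_seconds` parameter (60 seconds by default) is an assumption, as are the function name and the tuple and dictionary layouts.

```python
def build_edges(interactions, gap_seconds=60.0):
    """Merge interactions between the same two nodes into directed edges.

    interactions: iterable of (timestamp, initiator, responder) tuples.
    Interactions between the same pair of nodes that follow the previous
    one within gap_seconds belong to the same abstract communication.
    The edge direction follows the initiator of the first interaction.
    """
    edges = []
    latest_edge = {}  # unordered node pair -> index of its most recent edge
    for timestamp, initiator, responder in sorted(interactions):
        pair = frozenset((initiator, responder))
        index = latest_edge.get(pair)
        if index is not None and timestamp - edges[index]["last_seen"] <= gap_seconds:
            # Same abstract communication: extend the existing edge.
            edges[index]["last_seen"] = timestamp
            edges[index]["interactions"] += 1
        else:
            # New abstract communication: its first initiator sets the direction.
            edges.append({
                "source": initiator,
                "target": responder,
                "first_seen": timestamp,
                "last_seen": timestamp,
                "interactions": 1,
            })
            latest_edge[pair] = len(edges) - 1
    return edges


interactions = [
    (0.0, "A", "B"),     # A initiates the first interaction
    (5.0, "B", "A"),     # a reply shortly afterwards joins the same edge
    (12.0, "A", "B"),
    (3600.0, "B", "A"),  # an hour later: a separate communication
]
edges = build_edges(interactions)
```

The first three interactions collapse into one edge from A to B, while the interaction an hour later forms a second edge, this time from B to A, so the two nodes end up connected by two directed edges.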

The schematic flow 400 is an example of determining a second traffic characteristic through a relationship graph. Illustratively, the second traffic characteristic 460 in the schematic flow 400 is domain name popularity. Specifically, the traffic characteristic acquisition module 210 may find the node corresponding to the information of the encrypted traffic to be detected (for example, an IP address); the processor 130 may obtain all edges 440 connected to the node based on the relationship graph 410 and count the edges of that node whose traffic type in the edge attributes is malicious traffic; determine the malicious proximity value 450 based on the number of such edges; and determine the domain name popularity based on the malicious proximity value. As shown in Fig. 4, if the malicious proximity value of node 1 and node B is 1, the malicious proximity value of node 3 is 2, and the malicious proximity value of node 2 is 3, then the domain name popularity of node 2 is the highest.

In some embodiments, the traffic feature acquisition module 210 may determine the domain name heat based on the malicious proximity value 450 and a preset proximity rule. The preset proximity rule may be that the malicious proximity values corresponding to the nodes are sorted by size, and the higher a node ranks, the higher its domain name heat. The preset proximity rule can be set according to actual requirements. For example, the domain names corresponding to the three top-ranked nodes are output as the domain names with the highest domain name heat, and traffic corresponding to a domain name with high domain name heat can directly enter the subsequent decryption DPI analysis without traffic type classification.
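
As a minimal sketch of one possible preset proximity rule, the ranking could be implemented as below. The function name `rank_by_proximity`, the `top_k` parameter, and the extra "node 4" entry are hypothetical; the other values follow the fig. 4 example.

```python
def rank_by_proximity(proximity, top_k=3):
    """Sort nodes by malicious proximity value, highest first, and keep top_k.

    proximity: dict mapping a node name to its malicious proximity value.
    """
    ranked = sorted(proximity.items(), key=lambda item: item[1], reverse=True)
    return [node for node, _ in ranked[:top_k]]


# Values from the fig. 4 example, plus one hypothetical node with value 0.
proximity = {"node 1": 1, "node 2": 3, "node 3": 2, "node 4": 0}

hottest = rank_by_proximity(proximity)
print(hottest)  # ['node 2', 'node 3', 'node 1']
```

Traffic associated with the nodes returned here would be the traffic that bypasses classification and goes straight to decryption DPI analysis under the rule described above.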

In some embodiments, the edge attributes of the relationship graph may also include the number of times the communication data between two nodes has been reported by a user side. The user side is located at a node attacked by malicious traffic. When communication data between two nodes is reported by a user side, the traffic type corresponding to that communication is recorded as malicious traffic. The processor 130 may count the number of edges between nodes for which the traffic type is "malicious traffic" and, based on the number of edges, determine a second traffic characteristic. For example only, when the second traffic characteristic is domain name heat, the greater the number of edges whose traffic type is "malicious traffic", the higher the domain name heat of the domain name corresponding to the node.

In the embodiments of this specification, determining the second traffic characteristic through the relationship graph makes it possible to integrate the traffic types of network traffic effectively and to construct the associations between domain names, between entity IPs, and between domain names and entity IPs; that is, the edges between nodes are constructed according to the direction of traffic flow, which supports the mining and extraction of the second traffic characteristic more efficiently. Determining the second traffic characteristic can improve the accuracy of determining whether the encrypted traffic is malicious traffic.

Step 320, determining the traffic type of the encrypted traffic to be detected based on the encrypted traffic characteristics of the encrypted traffic to be detected.

In some embodiments, the traffic types may include normal traffic and suspicious traffic. The traffic type of the encrypted traffic to be detected can be determined in a number of ways. In some embodiments, the traffic type may be determined based on historical data, preset rules, a suspicious traffic identification model, or other means. In some embodiments, determining the traffic type based on historical data includes: the traffic type determination module 220 obtains historical suspicious traffic and compares its traffic characteristics with those of the encrypted traffic to be detected; when the similarity is greater than a certain threshold (for example, greater than 0.8), the traffic type of the encrypted traffic to be detected is determined to be suspicious traffic. In some embodiments, determining the traffic type based on a preset rule includes determining the traffic type of the encrypted traffic to be detected as suspicious traffic when the number of suspicious traffic characteristics of the encrypted traffic to be detected is greater than a certain value (e.g., greater than 1). In some embodiments, the suspicious traffic identification model may be a machine learning model. For details of the suspicious traffic identification model, reference may be made to fig. 5 and its associated description.
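
For illustration, the history-based and rule-based determinations might be sketched as follows. The embodiments do not name a similarity measure, so cosine similarity over numeric feature vectors is an assumption here, as are the function names and the representation of suspicious characteristics as boolean flags.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_by_history(features, historical_suspicious, threshold=0.8):
    """Suspicious if any historical suspicious sample is similar enough."""
    for past in historical_suspicious:
        if cosine_similarity(features, past) > threshold:
            return "suspicious traffic"
    return "normal traffic"


def classify_by_rule(suspicious_flags, min_count=1):
    """Suspicious if more than min_count characteristics are flagged."""
    if sum(suspicious_flags) > min_count:
        return "suspicious traffic"
    return "normal traffic"


history = [[1.0, 0.0, 1.0]]  # hypothetical historical suspicious sample

print(classify_by_history([1.0, 0.1, 0.9], history))  # suspicious traffic
print(classify_by_history([0.0, 1.0, 0.0], history))  # normal traffic
print(classify_by_rule([True, True, False]))          # suspicious traffic
print(classify_by_rule([True, False, False]))         # normal traffic
```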

Step 330, in response to the traffic type of the encrypted traffic to be tested being suspicious traffic, performing subsequent decryption analysis on the encrypted traffic to be tested.

In some embodiments, if the traffic type of the encrypted traffic to be detected is normal traffic, subsequent decryption analysis is not required.

The decryption analysis may include confirming the protocol type, segmenting protocol fields, SSL offloading, payload analysis, and identifying the negotiated protocol. Decryption analysis of the suspicious traffic further determines whether the suspicious traffic is malicious traffic. The decryption DPI module 230 may also mark the traffic characteristics corresponding to the suspicious traffic and store these suspicious traffic characteristics in the traffic type determination module 220, where they are used for identifying suspicious traffic in encrypted traffic to be detected and provide additional training samples for model training, so that the determination of suspicious traffic in encrypted traffic becomes more accurate.

In the embodiment of the description, only the suspicious traffic is subjected to subsequent decryption analysis by screening out the normal traffic and the suspicious traffic in the encrypted traffic, so that the load of subsequent analysis work is reduced, and the analysis efficiency is improved.
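
The screening described above can be sketched as a simple routing step. This is a minimal sketch under stated assumptions: `route_flows`, `toy_classifier`, and `toy_dpi` are hypothetical names, and the two toy functions are placeholders for the traffic type determination and the decryption DPI analysis.

```python
SUSPICIOUS = "suspicious traffic"
NORMAL = "normal traffic"


def route_flows(flows, classify, decrypt_and_inspect):
    """Send only suspicious flows to decryption analysis.

    flows: dict mapping a flow id to its extracted characteristics.
    classify: callable returning SUSPICIOUS or NORMAL for one flow.
    decrypt_and_inspect: callable standing in for the decryption DPI step.
    """
    results = {}
    for flow_id, features in flows.items():
        if classify(features) == SUSPICIOUS:
            results[flow_id] = decrypt_and_inspect(flow_id)
        # Normal traffic is skipped: no decryption analysis is performed.
    return results


def toy_classifier(features):
    # Placeholder decision standing in for step 320.
    return SUSPICIOUS if features["score"] > 0.5 else NORMAL


def toy_dpi(flow_id):
    # Placeholder standing in for decryption DPI analysis.
    return f"inspected {flow_id}"


flows = {"flow-1": {"score": 0.9}, "flow-2": {"score": 0.1}}
inspected = route_flows(flows, toy_classifier, toy_dpi)
print(inspected)  # {'flow-1': 'inspected flow-1'}
```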

In some embodiments, determining the traffic type of the encrypted traffic to be detected based on the encrypted traffic characteristics of the encrypted traffic to be detected includes: processing the encrypted traffic characteristics of the encrypted traffic to be detected based on a suspicious traffic identification model, and determining the traffic type of the encrypted traffic to be detected, wherein the suspicious traffic identification model is a machine learning model.

As shown in fig. 5, the initial suspicious traffic identification model 550 may be trained on a number of labeled training samples 540 to obtain the trained suspicious traffic identification model 520. Specifically, the labeled training samples 540 are input into the initial suspicious traffic identification model 550, and the initial model is trained based on the labels. In some embodiments, the training samples 540 may include normal traffic and suspicious traffic.

In some embodiments, the label of a training sample may indicate whether the training sample is suspicious traffic. For example, a training sample that is suspicious traffic is labeled 1; otherwise it is labeled 0.

In some embodiments, the initial suspicious traffic identification model 550 may be a classifier trained on suspicious traffic as positive samples and normal traffic as negative samples. In some embodiments, the classifier may be one of a logistic regression model, a support vector machine, a random forest, or other classification model.

In some embodiments, the suspicious traffic identification model 520 may be used to determine the traffic type corresponding to the input traffic characteristics. In some embodiments, the input of the suspicious traffic identification model 520 may include the first traffic characteristics 510-1 and/or the second traffic characteristics 510-2, and the output of the suspicious traffic identification model 520 may be one of suspicious traffic 530-1 and normal traffic 530-2.

In some embodiments, the training is ended when the trained suspicious traffic identification model satisfies the preset condition. The preset condition may be that the accuracy is greater than or equal to a preset threshold. The preset threshold may be specifically set according to actual requirements, for example, 90% or 95%.

In some embodiments, the accuracy of the trained suspicious traffic identification model may be determined by a plurality of test samples, the test samples containing a label of whether the traffic is suspicious. After a plurality of test samples are input into the trained suspicious flow identification model, corresponding prediction categories can be output, when the prediction categories are consistent with the labels, the prediction is correct, and otherwise, the prediction is wrong. The accuracy may be a value of the number of samples predicted to be correct divided by the total number of test samples.
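
The accuracy computation and the stopping condition described above can be sketched as follows. The function names and the sample labels are hypothetical; labels use 1 for suspicious traffic and 0 for normal traffic, as in the labeling example.

```python
def accuracy(predictions, labels):
    """Number of correct predictions divided by the number of test samples."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one test sample is required")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def training_done(predictions, labels, threshold=0.90):
    """Preset condition: accuracy is greater than or equal to the threshold."""
    return accuracy(predictions, labels) >= threshold


# Ten hypothetical test samples; the last prediction is wrong.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predictions = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]

print(accuracy(predictions, labels))             # 0.9
print(training_done(predictions, labels))        # True
print(training_done(predictions, labels, 0.95))  # False
```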

In the embodiment of the description, the traffic type is identified by using the machine learning model, and the internal characteristics of malicious traffic can be learned based on a large amount of historical traffic data, so that whether the encrypted traffic to be detected is suspicious traffic can be more accurately judged.

In some embodiments, the output of the suspicious traffic identification model may also include a classification vector 530-3, the classification vector 530-3 including confidences that the encrypted traffic under test belongs to different classes of suspicious traffic.

In some embodiments, before the suspicious traffic identification model is used to output the classification vector, the initial suspicious traffic identification model should be trained with a large number of multi-class training samples so that it acquires a certain multi-class classification capability. In some embodiments, the training samples may be normal traffic and different classes of malicious traffic; for example, the malicious traffic may belong to "privacy disclosure suspicious traffic", "malicious attack suspicious traffic", and so on. In some embodiments, the label of a training sample may be the category of the training sample. For example, malicious traffic labeled A belongs to the category "privacy disclosure suspicious traffic", and malicious traffic labeled B belongs to the category "malicious attack suspicious traffic". In some embodiments, the classification vector output by the suspicious traffic identification model may represent the confidence that suspicious traffic belongs to different malicious behaviors. In some embodiments, the classification vector output by the suspicious traffic identification model includes a plurality of values between 0 and 1, each of which represents the confidence that the sample belongs to the corresponding class. As an example, the suspicious traffic identification model may output the vector [0.2, 0.8, 0.1], where 0.2 is the confidence that the sample belongs to class A, 0.8 is the confidence that the sample belongs to class B, and 0.1 is the confidence that the sample belongs to class C; the sample may then be determined to belong to class B.
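
Reading a class off the classification vector amounts to taking the entry with the highest confidence. A minimal sketch, assuming the class order A, B, C of the example vector (the `CLASSES` list and function name are hypothetical):

```python
CLASSES = ["A", "B", "C"]  # assumed order of the entries in the vector


def predicted_class(vector, classes=CLASSES):
    """Return the class with the highest confidence and that confidence."""
    if len(vector) != len(classes):
        raise ValueError("vector length must match the number of classes")
    best = max(range(len(vector)), key=lambda i: vector[i])
    return classes[best], vector[best]


print(predicted_class([0.2, 0.8, 0.1]))  # ('B', 0.8)
```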

In some embodiments, the inputs to the suspicious traffic identification model may also include the reference malicious value 510-3 of the byte distribution probability vector. For the determination of the byte distribution probability vector, reference may be made to step 310 and its associated description.

The reference malicious value refers to the probability that traffic having the byte distribution probability vector is suspicious traffic.

In some embodiments, the reference malicious value of the byte distribution probability vector may be determined based on historical data or the like.

In some embodiments, determining the reference malicious value based on the historical data includes obtaining a byte distribution probability vector of the historical suspicious traffic by a suspicious traffic determination model, comparing the byte distribution probability vector of the historical suspicious traffic with a byte distribution probability vector corresponding to the encrypted traffic to be tested, and determining the malicious value of the byte distribution probability vector of the historical suspicious traffic as the reference malicious value of the current byte distribution probability vector when the similarity is greater than a certain threshold (e.g., greater than 0.8).

In some embodiments, the edge attributes of the relationship graph further include a byte distribution probability vector.

In some embodiments, the reference malicious value may be obtained based on the relationship graph, including: selecting the edges in the relationship graph that meet a preset condition, counting the frequency with which the traffic type in the edge attributes of those edges is malicious traffic, and determining the reference malicious value based on the frequency.

In some embodiments, the preset condition is that the similarity between the byte distribution probability vector in the edge attribute and the byte distribution probability vector of the encrypted traffic to be detected falls within a preset range. The preset range may be a system default value, an empirical value, a manually preset value, or the like. For example, for the byte distribution probability vector P2, the processor 130 may count 100 pieces of traffic with vector P2 in the network traffic within a preset time period, of which 40 are normal traffic and 60 are malicious traffic; that is, the frequency with which traffic having the byte distribution probability vector P2 of the encrypted traffic to be detected is malicious traffic is 60%.

The reference malicious value may be further calculated based on the aforementioned 60%; the greater the frequency, the greater the reference malicious value. In some embodiments, the byte distribution probability vector of the current traffic to be detected is P2, and all vectors similar to the vector P2 are searched for in the relationship graph. For example, the vectors similar to vector P2 are P3, P4 and P5, where the edges corresponding to P3 and P4 are malicious traffic and the edge corresponding to P5 is normal traffic. Two of these three edges are malicious, so the frequency of malicious traffic is 2/3, or about 67%, and a reference malicious value for P2 is calculated based on this frequency. The manner of determining the reference malicious value according to the frequency may include determining the reference malicious value according to a rule table. For example, if the frequency with which the byte distribution probability vector represents malicious traffic is 60%, the malicious value in the corresponding rule table is 80; if the frequency with which the byte distribution probability vector represents malicious traffic is 80%, the malicious value in the corresponding rule table is 90.
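
By way of example only, obtaining a reference malicious value from the relationship graph might be sketched as below. Several points are assumptions of this sketch: cosine similarity as the similarity measure, the 0.95 similarity bound as the preset range, treating the rule table entries (60% gives 80, 80% gives 90) as lower bounds of frequency bands, and a default of 0 below the lowest band. The vectors in the sample data are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def malicious_frequency(target, edges, min_similarity=0.95):
    """Fraction of malicious edges among edges with a similar vector.

    edges: list of (byte_distribution_vector, traffic_type) pairs taken
    from the edge attributes. Returns None if no edge is similar enough.
    """
    similar = [
        traffic_type
        for vector, traffic_type in edges
        if cosine_similarity(target, vector) >= min_similarity
    ]
    if not similar:
        return None
    malicious = sum(1 for t in similar if t == "malicious traffic")
    return malicious / len(similar)


# Rule table from the example, checked from the highest band downwards.
RULE_TABLE = [(0.8, 90), (0.6, 80)]


def reference_malicious_value(frequency, table=RULE_TABLE, default=0):
    """Look up the reference malicious value for a malicious frequency."""
    for min_frequency, value in table:
        if frequency >= min_frequency:
            return value
    return default  # assumed value below the lowest band


target = [0.50, 0.30, 0.20]
edges = [
    ([0.50, 0.30, 0.20], "malicious traffic"),
    ([0.48, 0.32, 0.20], "malicious traffic"),
    ([0.52, 0.28, 0.20], "malicious traffic"),
    ([0.49, 0.31, 0.20], "normal traffic"),
    ([0.51, 0.29, 0.20], "normal traffic"),
    ([0.05, 0.05, 0.90], "normal traffic"),  # too dissimilar, ignored
]

frequency = malicious_frequency(target, edges)
print(frequency)                             # 0.6
print(reference_malicious_value(frequency))  # 80
```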

In some embodiments of the present specification, the preset relationship graph may be updated according to the association between the information of the encrypted traffic to be detected and the reference malicious value. The updating includes: comparing the frequency with which the byte distribution probability vector corresponding to the encrypted traffic to be detected represents malicious traffic against the frequency corresponding to the current reference malicious value; if the former is greater than the latter, adding a child node to the encrypted traffic to be detected and associating the child node with the byte distribution probability vector as representing malicious traffic, so as to update the preset relationship graph; and if the former is smaller than the latter, adding a child node to the encrypted traffic to be detected and associating the child node with the byte distribution probability vector as representing normal traffic, so as to update the preset relationship graph.

In the embodiment of the specification, the reference malicious value is obtained through the relation graph, a more accurate reference malicious value can be obtained based on a large number of byte distribution probability vectors obtained through statistics, and meanwhile, the relation graph is updated in real time, so that the reference malicious value can be obtained more accurately and efficiently in real time.

Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be regarded as illustrative only and not as limiting the present specification. Various modifications, improvements and adaptations to the present description may occur to those skilled in the art, though not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present specification and thus fall within the spirit and scope of the exemplary embodiments of the present specification.

Also, this specification uses specific words to describe its embodiments. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic is related to at least one embodiment of the specification. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the specification may be combined as appropriate.

Additionally, the order in which elements and sequences are described in this specification, the use of numbers and letters, or the use of other designations is not intended to limit the order of the processes and methods described in this specification, unless explicitly stated in the claims. While certain presently contemplated useful embodiments of the invention have been discussed in the foregoing disclosure by way of various examples, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein described. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.

Similarly, it should be noted that in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to imply that the claimed subject matter requires more features than are expressly recited in a claim. Indeed, an embodiment may have fewer than all of the features of a single embodiment disclosed above.

Where numerals describing the number of components, attributes or the like are used in some embodiments, it is to be understood that such numerals are modified in some instances by the modifier "about", "approximately" or "substantially". Unless otherwise indicated, "about", "approximately" or "substantially" indicates that the number allows a variation of ± 20%. Accordingly, in some embodiments, the numerical parameters set forth in the specification and claims are approximations that may vary depending upon the desired properties sought to be obtained by a particular embodiment. In some embodiments, a numerical parameter should take into account the specified significant digits and apply an ordinary rounding technique. Although the numerical ranges and parameters used to define the broad scope in some embodiments are approximations, the numerical values in the specific examples are set forth as precisely as practicable.

For each patent, patent application publication, and other material, such as articles, books, specifications, publications, documents, etc., cited in this specification, the entire contents are hereby incorporated by reference into this specification. Excluded are application history documents that are inconsistent with or conflict with the contents of this specification, as well as documents (currently or later attached to this specification) that limit the broadest scope of the claims of this specification. It is to be understood that if the descriptions, definitions, and/or uses of terms in the materials accompanying this specification are inconsistent with or contrary to those in this specification, the descriptions, definitions, and/or uses of terms in this specification shall control.

Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present disclosure. Other variations are also possible within the scope of the present description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the specification can be considered consistent with the teachings of the specification. Accordingly, the embodiments of the present description are not limited to only those embodiments explicitly described and depicted herein.

Claims

1. A method for judging suspicious traffic in encrypted traffic is characterized by comprising the following steps:

acquiring encrypted traffic to be detected, and extracting encrypted traffic characteristics of the encrypted traffic to be detected; wherein the encrypted traffic characteristics include first traffic characteristics including access characteristic information, protocol characteristic information, and transfer characteristic information;

determining the traffic type of the encrypted traffic to be detected based on the encrypted traffic characteristics of the encrypted traffic to be detected, wherein the traffic type comprises normal traffic and suspicious traffic;

and in response to the traffic type of the encrypted traffic to be detected being the suspicious traffic, performing subsequent decryption analysis on the encrypted traffic to be detected through a decryption DPI module.

2. The method of claim 1, wherein the encrypted traffic characteristics further comprise a second traffic characteristic, the second traffic characteristic being obtained via a relationship graph.

3. The method of claim 1, wherein the determining the traffic type of the encrypted traffic to be detected based on the encrypted traffic characteristics of the encrypted traffic to be detected comprises:

processing the encrypted traffic characteristics of the encrypted traffic to be detected based on a suspicious traffic identification model, and determining the traffic type of the encrypted traffic to be detected, wherein the suspicious traffic identification model is a machine learning model.

4. The method of claim 3, wherein the inputs to the suspicious traffic identification model further comprise reference malicious values of byte distribution probability vectors, the reference malicious values being obtained based on the relationship graph.

5. A system for determining suspicious traffic in encrypted traffic, the system comprising:

a traffic feature acquisition module, configured to acquire encrypted traffic to be detected and extract encrypted traffic characteristics of the encrypted traffic to be detected; wherein the encrypted traffic characteristics include first traffic characteristics including access characteristic information, protocol characteristic information, and transfer characteristic information;

a traffic type determination module, configured to determine a traffic type of the encrypted traffic to be detected based on the encrypted traffic characteristics of the encrypted traffic to be detected, wherein the traffic type comprises normal traffic and suspicious traffic;

and a decryption DPI module, configured to perform subsequent decryption analysis on the encrypted traffic to be detected in response to the traffic type of the encrypted traffic to be detected being the suspicious traffic.

6. The system of claim 5, wherein the encrypted traffic characteristics further comprise a second traffic characteristic, the second traffic characteristic being obtained via a relationship graph.

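7. The system of claim 5, wherein the traffic type determination module is further configured to:

process the encrypted traffic characteristics of the encrypted traffic to be detected based on a suspicious traffic identification model, and determine the traffic type of the encrypted traffic to be detected, wherein the suspicious traffic identification model is a machine learning model.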

8. The system of claim 7, wherein the inputs to the suspicious traffic identification model further comprise reference malicious values of byte distribution probability vectors, the reference malicious values being obtained based on the relationship graph.

9. A device for judging suspicious traffic in encrypted traffic, comprising at least one processor and at least one memory; the at least one memory is for storing computer instructions; the at least one processor is configured to execute at least some of the computer instructions to implement the method of any of claims 1-4.

10. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the method of any one of claims 1 to 4.