CN116738329A - Malicious sample classification method and device, electronic equipment and storage medium

Info

Publication number: CN116738329A
Application number: CN202310544673.2A
Authority: CN
Inventors: 何清林; 何跃鹰; 罗冰
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2023-05-15
Filing date: 2023-05-15
Publication date: 2023-09-12

Abstract

The application discloses a malicious sample classification method, a malicious sample classification device, electronic equipment and a storage medium, which are used for solving the problem of low accuracy of the existing malicious sample classification method. The malicious sample classification method comprises the following steps: acquiring communication flow information of a malicious sample to be processed, wherein the communication flow information is data flow information flowing through each network node in the operation process of the malicious sample to be processed; respectively extracting session communication information of each session stage from the communication flow information of each malicious sample to be processed, and generating a corresponding session communication information sequence based on the session communication information of each session stage; according to session communication information sequences corresponding to every two malicious samples to be processed in the malicious samples to be processed, similarity of every two malicious samples to be processed is respectively determined; classifying the malicious samples to be processed according to the similarity of every two malicious samples to be processed, and obtaining a classification result.

Description

Malicious sample classification method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of network security technologies, and in particular, to a malicious sample classification method, a malicious sample classification device, an electronic device, and a storage medium.

Background

Malicious samples refer to malware (or malicious code) intended to attack a computer, server, client, Internet of Things device, computer network, or other intelligent device, or to steal user information. Malicious samples include viruses, trojans, worms, malicious advertising software (adware), malicious installation software (installers), spyware, malicious browser plug-ins, and the like. To facilitate the analysis of malicious samples, their classification has become a research hotspot in recent years.

In the related art, malicious samples are mainly classified by static analysis or dynamic analysis. Static analysis extracts features, referred to as static features, from information such as the code and file structure of a malicious sample, and classifies the sample based on those static features. However, the extraction of static features is subject to various restrictions, the dynamic behavior of a malicious sample is difficult to mine fully, and, because malicious attack modes and attack means are varied, static analysis can hardly establish a uniform classification basis for all types of malicious samples. Dynamic analysis runs a malicious sample and analyzes its behavior during operation to extract features, referred to as dynamic features, and classifies the sample based on those dynamic features. However, dynamic analysis requires a large investment of time and computing resources and is not suitable for large-scale data analysis; moreover, because malicious samples are of many types and the dynamic features of each sample differ, dynamic analysis can hardly guarantee classification accuracy across all malicious samples.

Disclosure of Invention

In order to solve the problems in the background art, the embodiment of the application provides a malicious sample classification method, a malicious sample classification device, electronic equipment and a storage medium.

In a first aspect, an embodiment of the present application provides a malicious sample classification method, including:

acquiring communication flow information of a malicious sample to be processed, wherein the communication flow information is data flow information flowing through each network node in the operation process of the malicious sample to be processed;

respectively extracting session communication information of each session stage from communication flow information of each malicious sample to be processed, and generating a corresponding session communication information sequence based on the session communication information of each session stage;

according to session communication information sequences corresponding to every two malicious samples to be processed in the malicious samples to be processed, similarity of every two malicious samples to be processed is respectively determined;

classifying the malicious samples to be processed according to the similarity of every two malicious samples to be processed, and obtaining a classification result.

In one possible implementation manner, session communication information of each session stage is extracted from communication traffic information of each malicious sample to be processed, which specifically includes:

for each malicious sample to be processed, determining each session stage in the communication flow information of the malicious sample to be processed according to the five-tuple (quintuple) information corresponding to the malicious sample to be processed;

extracting session information of each session stage from a preset field corresponding to each session stage;

and determining the session information of each session stage as session communication information of each session stage.
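By way of illustration only, the three steps above could be sketched as follows. The packet representation (a dictionary with the hypothetical keys `src_ip`, `src_port`, `dst_ip`, `dst_port`, `proto` and `payload`), the direction-independent grouping, and the use of the raw payload as the "preset field" are all assumptions made for this sketch; the application does not specify them.

```python
from collections import OrderedDict
from typing import Dict, Iterable, List, Tuple

FiveTuple = Tuple[str, int, str, int, str]


def five_tuple_key(pkt: Dict) -> FiveTuple:
    """Build a five-tuple key that is the same for both traffic directions (assumption)."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    lo, hi = sorted([a, b])
    return (lo[0], lo[1], hi[0], hi[1], pkt["proto"])


def build_session_sequence(packets: Iterable[Dict]) -> List[bytes]:
    """Group packets into sessions by five-tuple and return one payload blob per
    session, ordered by the first appearance of each session."""
    sessions: "OrderedDict[FiveTuple, bytearray]" = OrderedDict()
    for pkt in packets:
        key = five_tuple_key(pkt)
        # The "preset field" is assumed here to be the packet payload.
        sessions.setdefault(key, bytearray()).extend(pkt["payload"])
    return [bytes(buf) for buf in sessions.values()]


packets = [
    {"src_ip": "10.0.0.2", "src_port": 40000, "dst_ip": "1.2.3.4",
     "dst_port": 80, "proto": "TCP", "payload": b"GET /"},
    {"src_ip": "10.0.0.2", "src_port": 40001, "dst_ip": "1.2.3.4",
     "dst_port": 443, "proto": "TCP", "payload": b"\x16\x03"},
    {"src_ip": "1.2.3.4", "src_port": 80, "dst_ip": "10.0.0.2",
     "dst_port": 40000, "proto": "TCP", "payload": b"200 OK"},
]
sequence = build_session_sequence(packets)
```

In this sketch the first and third packets share a five-tuple and therefore fall into the same session stage, so `sequence` holds two entries.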

In a possible implementation manner, according to session communication information sequences corresponding to every two malicious samples in the malicious samples to be processed, the similarity of the every two malicious samples to be processed is determined respectively, which specifically includes:

for two session communication information of the same session stage in a session communication information sequence corresponding to any two malicious samples to be processed, respectively signing the two session communication information to generate signature information of each of the two session communication information;

determining similarity scores of the two session communication information according to the similarity of the signature information of the two session communication information;

and determining the similarity of the any two malicious samples to be processed according to the similarity scores of the two session communication information of the same session stage in the session communication information sequences corresponding to the any two malicious samples to be processed.

In one possible implementation manner, determining the similarity score of the two session communication information according to the similarity of the signature information of the two session communication information specifically includes:

determining the matching score of the signature information of the two session communication information according to the similarity between the signature information of the two session communication information, wherein the matching score characterizes the similarity matching degree of the two session communication information;

calculating the distance between the two session communication information;

and determining similarity scores of the two session communication information according to the matching scores of the signature information of the two session communication information and the distance between the two session communication information.

In a possible implementation manner, determining the similarity of the any two malicious samples according to the similarity scores of the two session communication information of the same session stage in the session communication information sequences corresponding to the any two malicious samples specifically includes:

and determining the similarity of the any two malicious samples according to the similarity scores of the two session communication information of the same session stage in the session communication information sequences corresponding to the any two malicious samples to be processed, a preset length penalty term and the length of the session communication information sequences corresponding to the any two malicious samples to be processed.

In one possible implementation manner, the two session communication information are signed respectively, and the signature information of each of the two session communication information is generated, which specifically includes:

for each session communication information, partitioning the session communication information to obtain session communication information blocks;

respectively carrying out hash calculation on each session communication information block one by one to obtain a hash value of each session communication information block;

and generating signature information of the session communication information according to the hash value of each session communication information block.
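A minimal sketch of such a block-wise signature is given below for illustration. The fixed block size, the choice of SHA-256, and the truncation of each block hash to two hexadecimal characters are assumptions of this sketch and are not taken from the application; the function name `sign_session` is likewise hypothetical.

```python
import hashlib


def sign_session(data: bytes, block_size: int = 64, chars_per_block: int = 2) -> str:
    """Split the session data into fixed-size blocks, hash each block, and
    concatenate a short prefix of every block hash into one signature string."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    parts = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        digest = hashlib.sha256(block).hexdigest()
        parts.append(digest[:chars_per_block])  # truncated hash of this block
    return "".join(parts)


signature = sign_session(b"A" * 64 + b"B" * 64 + b"A" * 64)
```

Because each block is hashed independently, identical blocks at different positions contribute identical signature fragments, which is what allows two signatures to be compared piece by piece afterwards.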

In one possible implementation manner, according to the similarity between the signature information of the two session communication information, determining the matching score of the signature information of the two session communication information specifically includes:

respectively partitioning signature information of the two session communication information to generate respective corresponding signature information blocks;

calculating the weighted editing distance between the signature information blocks at the same position of the signature information of the two session communication information one by one;

if the weighted editing distance between any two signature information blocks is smaller than a first preset threshold value, determining that the any two signature information blocks are similar;

and determining the matching score of the signature information of the two session communication information according to the number of the similar signature information block pairs and the number of the signature information blocks of the signature information of the two session communication information.
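The block-wise comparison described above might be sketched as follows. The operation weights, the block length, the value of the first preset threshold, and in particular the way the two counts are combined (similar pairs divided by the larger block count) are assumptions of this sketch; the application only states that the score depends on both counts.

```python
from typing import List


def weighted_edit_distance(a: str, b: str, w_ins: float = 1.0,
                           w_del: float = 1.0, w_sub: float = 2.0) -> float:
    """Edit distance in which insertion, deletion and substitution carry
    separate weights (the weights are illustrative)."""
    prev = [j * w_ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * w_del]
        for j, cb in enumerate(b, start=1):
            cost_sub = prev[j - 1] + (0.0 if ca == cb else w_sub)
            curr.append(min(prev[j] + w_del, curr[j - 1] + w_ins, cost_sub))
        prev = curr
    return prev[-1]


def split_blocks(sig: str, size: int) -> List[str]:
    """Partition a signature string into consecutive blocks of `size` characters."""
    return [sig[i:i + size] for i in range(0, len(sig), size)]


def match_score(sig_a: str, sig_b: str, block_len: int = 4,
                threshold: float = 3.0) -> float:
    """Fraction of same-position block pairs whose weighted edit distance is
    below the threshold, relative to the larger number of blocks (assumption)."""
    blocks_a = split_blocks(sig_a, block_len)
    blocks_b = split_blocks(sig_b, block_len)
    total = max(len(blocks_a), len(blocks_b))
    if total == 0:
        return 0.0
    similar = sum(1 for x, y in zip(blocks_a, blocks_b)
                  if weighted_edit_distance(x, y) < threshold)
    return similar / total
```

With these illustrative defaults, two blocks that differ by a single substitution (distance 2.0) still count as similar, whereas two entirely different four-character blocks (distance 8.0) do not.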

In a possible implementation manner, calculating the distance between the two session communication information specifically includes:

calculating an N-gram distance between the two session communication information based on an N-gram model; and

determining a similarity score of the two session communication information according to the matching score and the distance of the signature information of the two session communication information, wherein the similarity score specifically comprises the following steps:

and determining similarity scores of the two session communication information according to the matching scores of the signature information of the two session communication information and the N-gram distance between the two session communication information.
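The application does not define how the N-gram distance is computed. One common choice, used here purely as an assumption, is a Jaccard-style distance over multisets of byte N-grams; the function names and the default N = 3 are hypothetical.

```python
from collections import Counter


def ngram_counts(data: bytes, n: int = 3) -> Counter:
    """Count every contiguous n-byte substring of the input."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_distance(a: bytes, b: bytes, n: int = 3) -> float:
    """Return 1 - |shared n-grams| / |all n-grams| over the two multisets:
    0.0 for identical n-gram profiles, 1.0 when no n-gram is shared."""
    ca, cb = ngram_counts(a, n), ngram_counts(b, n)
    union = sum((ca | cb).values())   # multiset union: per-key maximum
    if union == 0:
        return 0.0                    # both inputs shorter than n
    shared = sum((ca & cb).values())  # multiset intersection: per-key minimum
    return 1.0 - shared / union
```

A distance of this kind is bounded between 0 and 1, which makes it straightforward to combine with a matching score in a weighted expression.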

In one possible implementation manner, determining the similarity score of the two session communication information according to the matching score of the signature information of the two session communication information and the N-gram distance between the two session communication information specifically includes:

calculating the similarity score of the two session communication information by the following formula:

wherein Score represents the similarity score of the two session communication information;

match_score represents the matching score of the signature information of the two session communication information;

d represents the N-gram distance between the two session communication information;

alpha and beta are constants.

In a possible implementation manner, determining the similarity of the any two malicious samples according to the similarity score of two session communication information of each same session stage in the session communication information sequences corresponding to the any two malicious samples to be processed, a preset length penalty term, and the length of the session communication information sequences corresponding to the any two malicious samples to be processed specifically includes:

calculating the similarity of any two malicious samples to be processed through the following formula:

wherein Similarity represents the similarity of the any two malicious samples to be processed;

Score_i represents the similarity score of the two session communication information of the i-th same session stage in the session communication information sequences corresponding to the any two malicious samples to be processed, and N represents the number of same session stages in the session communication information sequences corresponding to the any two malicious samples to be processed;

len1 and len2 respectively represent the lengths of the session communication information sequences corresponding to the any two malicious samples to be processed;

γ represents the preset length penalty term, with 0 ≤ γ ≤ 1.

In a possible implementation manner, classifying the malicious samples to be processed according to the similarity of every two malicious samples to be processed to obtain a classification result, which specifically includes:

for every two malicious samples to be processed, if the similarity of the two malicious samples to be processed is determined to be greater than a second preset threshold, determining that the two malicious samples to be processed are malicious samples of the same category.
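As an illustrative sketch of this thresholding step, the pairwise decisions can be turned into categories with a union-find structure. Treating "same category" as transitive (if A matches B and B matches C, all three share a category) is an assumption of this sketch, as are the function name `classify` and the sample identifiers.

```python
from typing import Dict, List, Tuple


def classify(sample_ids: List[str],
             similarity: Dict[Tuple[str, str], float],
             threshold: float) -> List[List[str]]:
    """Merge every pair whose similarity exceeds the threshold and return the
    resulting groups, each group being one category."""
    parent = {s: s for s in sample_ids}

    def find(x: str) -> str:
        # Follow parent links to the group representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), sim in similarity.items():
        if sim > threshold:  # the "second preset threshold"
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for s in sample_ids:
        groups.setdefault(find(s), []).append(s)
    return list(groups.values())


pairwise = {("s1", "s2"): 0.9, ("s2", "s3"): 0.85,
            ("s1", "s3"): 0.4, ("s3", "s4"): 0.1}
categories = classify(["s1", "s2", "s3", "s4"], pairwise, threshold=0.8)
```

Here `s1` and `s3` end up in the same category through `s2` even though their direct similarity is below the threshold; a stricter policy would require every pair in a group to exceed it.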

In a second aspect, an embodiment of the present application provides a malicious sample classification apparatus, including:

an acquisition unit, configured to acquire communication flow information of a malicious sample to be processed, wherein the communication flow information is data flow information flowing through each network node in the operation process of the malicious sample to be processed;

a generation unit, configured to respectively extract session communication information of each session stage from the communication flow information of each malicious sample to be processed, and to generate a corresponding session communication information sequence based on the session communication information of each session stage;

a determining unit, configured to respectively determine the similarity of every two malicious samples to be processed according to the session communication information sequences corresponding to every two malicious samples to be processed;

and a classification unit, configured to classify the malicious samples to be processed according to the similarity of every two malicious samples to be processed, and to obtain a classification result.

In a possible implementation manner, the generating unit is specifically configured to determine, for each malicious sample to be processed, each session stage in the communication traffic information of the malicious sample to be processed according to quintuple information corresponding to the malicious sample to be processed; extracting session information of each session stage from a preset field corresponding to each session stage; and determining the session information of each session stage as session communication information of each session stage.

In a possible implementation manner, the determining unit is specifically configured to sign two session communication information of the same session stage in a session communication information sequence corresponding to any two malicious samples to be processed, and generate signature information of each of the two session communication information; determining similarity scores of the two session communication information according to the similarity of the signature information of the two session communication information; and determining the similarity of the any two malicious samples to be processed according to the similarity scores of the two session communication information of the same session stage in the session communication information sequences corresponding to the any two malicious samples to be processed.

In a possible implementation manner, the determining unit is specifically configured to determine a matching score of signature information of the two session communication information according to similarity between signature information of the two session communication information, where the matching score characterizes a similarity matching degree of the two session communication information; calculating the distance between the two session communication information; and determining similarity scores of the two session communication information according to the matching scores of the signature information of the two session communication information and the distance between the two session communication information.

In a possible implementation manner, the determining unit is specifically configured to determine the similarity of the any two malicious samples according to a similarity score of two session communication information of each same session stage in the session communication information sequences corresponding to the any two malicious samples to be processed, a preset length penalty term, and a length of the session communication information sequences corresponding to the any two malicious samples to be processed.

In a possible implementation manner, the determining unit is specifically configured to block, for each session communication information, the session communication information to obtain a session communication information block; respectively carrying out hash calculation on each session communication information block one by one to obtain a hash value of each session communication information block; and generating signature information of the session communication information according to the hash value of each session communication information block.

In a possible implementation manner, the determining unit is specifically configured to block signature information of the two session communication information respectively, and generate respective corresponding signature information blocks; calculating the weighted editing distance between the signature information blocks at the same position of the signature information of the two session communication information one by one; if the weighted editing distance between any two signature information blocks is smaller than a first preset threshold value, determining that the any two signature information blocks are similar; and determining the matching score of the signature information of the two session communication information according to the number of the similar signature information block pairs and the number of the signature information blocks of the signature information of the two session communication information.

In a possible implementation manner, the determining unit is specifically configured to calculate an N-gram distance between the two session communication information based on an N-gram model; and determining similarity scores of the two session communication information according to the matching scores of the signature information of the two session communication information and the N-gram distance between the two session communication information.

In a possible implementation manner, the determining unit is specifically configured to calculate the similarity score of the two session communication information by using the following formula:

alpha and beta are constants.

In a possible implementation manner, the determining unit is specifically configured to calculate the similarity between the any two malicious samples to be processed by using the following formula:

In a possible implementation manner, the classification unit is specifically configured to determine, for each two malicious samples to be processed, that the two malicious samples to be processed are malicious samples of the same class if it is determined that the similarity of the two malicious samples to be processed is greater than a second preset threshold.

In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the malicious sample classification method of the present application when executing the program.

In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which when executed by a processor performs steps in a malicious sample classification method according to the present application.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

The embodiment of the application has the following beneficial effects:

According to the malicious sample classification method and device, electronic equipment and storage medium provided by the embodiments of the application, communication traffic information of the malicious samples to be processed is acquired, the communication traffic information being the data traffic information flowing through each network node while a malicious sample to be processed is running. Session communication information of each session stage is extracted from the communication traffic information of each malicious sample to be processed, and a corresponding session communication information sequence is generated based on the session communication information of each session stage. The similarity of every two malicious samples to be processed is then determined according to their corresponding session communication information sequences, and the malicious samples to be processed are classified according to these similarities to obtain a classification result. The inventors found that the communication traffic of two different malicious samples derived from the same malicious code family contains repeated or very similar payloads in the same session stage. Therefore, by comparing the session communication information of each session stage extracted from the communication traffic information of the malicious samples to be processed, the similarity between malicious samples can be determined quickly, and a uniform classification basis can be established for malicious samples of different types. The accuracy of malicious sample classification is thereby improved, computing resources are saved, and classification efficiency is improved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

FIG. 1 is a schematic diagram of an application scenario of a malicious sample classification method provided by an embodiment of the present application;

FIG. 2 is a schematic flow chart of an implementation of a malicious sample classification method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of an implementation flow of extracting session communication information of each session stage from communication flow information of each malicious sample to be processed according to an embodiment of the present application;

FIG. 4 is a schematic flowchart of an implementation of determining similarity between any two malicious samples to be processed according to an embodiment of the present application;

FIG. 5 is a schematic diagram of an implementation flow of generating signature information of each session communication information according to an embodiment of the present application;

FIG. 6 is a schematic diagram of an implementation flow chart for determining a similarity score of two session communication information according to an embodiment of the present application;

FIG. 7 is a schematic diagram of an implementation flow chart for determining matching scores of signature information of two session communication information according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a malicious sample classification device according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and not for limitation of the present application, and embodiments of the present application and features of the embodiments may be combined with each other without conflict.

In this context, it is to be understood that the technical terms referred to in the present application are:

1. Internet of Things (IoT): a network that interconnects various intelligent devices, sensors, tools and the like through the Internet so that they can communicate with one another and can exchange data and be controlled and managed. The Internet of Things is widely used in many fields such as the home, automobiles, medical treatment and industry.

2. Malicious samples: malicious samples refer to malware (or code) that is intended to attack a computer, server, client, internet of things device, computer network, or other intelligent device, or to steal user information. Malicious samples include, but are not limited to: viruses, trojans, worms, malicious advertising software, malicious installation software, spyware, malicious browser plug-ins, and the like.

3. Communication traffic: communication traffic refers to the data traffic involving a plurality of network nodes during network transmission, and includes traffic between a source node and a target node, traffic of network transit nodes, and the like. Traffic is one of the most important sources of data in network security analysis.

In the application, the communication traffic information of the malicious sample to be processed (i.e. the malicious sample to be classified) is the data traffic information flowing through each network node in the operation process of the malicious sample to be processed.

4. Command and control server (C&C server): a control center for malicious samples (i.e., malware) that can send commands to the malware to perform operations such as transmitting data, starting programs, and downloading updates.

5. Virtual network card: a virtual network card is a software interface device in a computer system that creates a virtual network interface card that enables a computer (or server, etc. device) to communicate between different virtual networks. Virtual network cards are often used in the field of network security for Hook technology and hidden communication of command and control servers.

6. Static analysis: static analysis is a method of analyzing a program, and does not require running the program being analyzed. By checking the source code, compiled code and binary code of the program, static analysis can find out the problems of loopholes, errors, potential safety hazards and the like in the program.

7. Dynamic analysis: dynamic analysis is a method for analyzing a program, which needs to run the analyzed program, and analyzes information such as operation, resource use, system call and the like of the program by observing the behavior of the program, so as to judge whether the logic and the function of the program are correct.

8. Traffic black hole: a traffic black hole refers to a node in the network that can absorb a large amount of data traffic without generating any response data. In the field of network security, traffic black holes are commonly used to divert and mitigate network attacks.

9. Sandbox: a virtual system program that allows software programs, browsers, or other programs to run in an isolated environment, so that changes made during the run can subsequently be deleted. It creates an independent working environment in which software can be run first; if the software exhibits malicious behavior, further running of the program is prohibited, without causing any harm to the system. In the field of network security, a sandbox refers to a tool used to test the behavior of untrusted software, files, or applications in an isolated environment.

10. C&C communication protocol (command and control communication protocol): a protocol used for managing and controlling infected computers in a network by sending them instructions, for example to obtain machine information, attack target machines, or distribute malware.

Referring first to FIG. 1, which is a schematic diagram of an application scenario of a malicious sample classification method provided by an embodiment of the present application, the scenario may include a sandbox server 101 and a black hole server 102, and may further include other network nodes, such as a router 103 and a command and control server (C&C server) 104; the connection relationship between the devices is shown in FIG. 1. The sandbox server 101 is configured with a virtual network card IP and port, the black hole server 102 is configured with a black hole interception program and a data packet processing plug-in, and the malicious sample to be processed is executed in the sandbox server 101. On its original access path, the traffic generated by executing the malicious sample to be processed either accesses the address of the command and control server 104 directly through the network, that is, the sandbox server 101 transmits the traffic to the router 103 and the router 103 forwards it to the command and control server 104, or directly performs network scanning and attempts to propagate. Such traffic may cause the network egress to be marked as unsafe or may pose a hidden infection risk to internal network devices, so a black hole interception program is additionally deployed in the black hole server 102 to intercept the traffic of the malicious sample. The communication traffic information of the malicious sample to be processed is the data traffic information flowing through each network node during the operation of the malicious sample to be processed. As shown in FIG. 1, it includes: the data traffic of the malicious sample to be processed that flows from the sandbox server 101 through the router 103, and the data traffic of the malicious sample to be processed that flows from the router 103 to the command and control server 104.

First, the sandbox server 101 encapsulates the traffic generated by executing the malicious sample to be processed and transmits it through a designated VPN (Virtual Private Network) virtual network card, then forwards the traffic that has passed through the VPN virtual network card to the network egress network card and routes it to the black hole server 102. The black hole server 102 filters out, according to set packet filtering rules, the traffic of the malicious sample to be processed that was transmitted through the VPN virtual network card, marks and encrypts that traffic, and forwards it to the black hole interception program. Using the data packet processing plug-in, the black hole server 102 changes the source IP (Internet Protocol) address, that of the sandbox server, into the IP address of the black hole server, so that a response message returned by the destination server, namely the command and control server 104, is returned directly to the black hole server 102, which prevents the communication traffic of the malicious sample from affecting other nodes in the system. Further, after the black hole server 102 obtains the communication traffic information of the malicious sample to be processed, it executes the malicious sample classification flow proposed by the embodiment of the present application.

In the embodiment of the present application, the network node devices other than the sandbox server and the black hole server may include, but are not limited to, a computer, a server, a terminal device, a client, an Internet of Things device, a network device or another intelligent device, depending on the actual application scenario, which is not limited in the embodiment of the present application.

Based on the above application scenario, an exemplary embodiment of the present application will be described in more detail below with reference to fig. 2 to 3, and it should be noted that the above application scenario is only shown for the convenience of understanding the spirit and principle of the present application, and the embodiments of the present application are not limited in any way herein. Rather, embodiments of the application may be applied to any scenario where applicable.

As shown in fig. 2, which is a schematic flow chart of an implementation of the malicious sample classification method according to the embodiment of the present application, the method can be applied to the black hole server 102 in fig. 1, and includes the following steps:

S21, acquiring communication traffic information of a malicious sample to be processed.

In specific implementation, the communication traffic information of the malicious sample to be processed is data traffic information flowing through each network node in the operation process of the malicious sample to be processed, and the process of acquiring the communication traffic information of the malicious sample to be processed by the black hole server is not described herein.

S22, respectively extracting session communication information of each session stage from the communication traffic information of each malicious sample to be processed, and generating a corresponding session communication information sequence based on the session communication information of each session stage.

In specific implementation, session communication information of each session stage may be extracted from the communication traffic information of each malicious sample to be processed according to the flow shown in fig. 3, and for each malicious sample to be processed, the following steps are performed:

S31, determining each session stage in the communication traffic information of the malicious sample to be processed according to five-tuple information corresponding to the malicious sample to be processed.

In specific implementation, the five-tuple includes: source IP address, destination IP address, source port, destination port, and transport layer protocol. The five-tuple distinguishes different sessions, and the session it identifies is unique. For example, "192.168.x.x, 10000, TCP (Transmission Control Protocol), 121.14.xx.xx, 80" constitutes a five-tuple, which characterizes: a device with IP address 192.168.x.x connects through port 10000, using the TCP protocol, to port 80 of a device with IP address 121.14.xx.xx.

In this step, after obtaining the communication traffic information of the malicious sample to be processed, the black hole server restores each session stage in that traffic in five-tuple form, according to the C&C communication protocol used by the malicious sample and the complete session stages required by the specific attack scenario, so as to restore complete sessions. These session stages include: session flows using the TCP protocol in the connection establishment stage of the malicious sample; session flows using IRC (Internet Relay Chat), SMTP (Simple Mail Transfer Protocol) and other protocols in the command receiving and response sending stage; and session flows using P2P (Peer to Peer), FTP (File Transfer Protocol) and other protocols in the data transfer stage.
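As a minimal sketch of step S31, captured packets could be grouped into sessions keyed by their five-tuple. The `FiveTuple` type, the `group_by_five_tuple` helper, the packet representation (a five-tuple paired with a hex payload string) and the placeholder addresses are all assumptions made for illustration; the embodiment does not prescribe a data structure.

```python
from collections import defaultdict
from typing import NamedTuple


class FiveTuple(NamedTuple):
    """Hypothetical container for the five fields named in the text."""
    src_ip: str
    src_port: int
    protocol: str
    dst_ip: str
    dst_port: int


def group_by_five_tuple(packets):
    """Group (five_tuple, payload_hex) pairs into sessions, keeping arrival order."""
    sessions = defaultdict(list)
    for five_tuple, payload_hex in packets:
        sessions[five_tuple].append(payload_hex)
    return dict(sessions)


# Illustrative packets only; the addresses are documentation-range placeholders.
conn = FiveTuple("192.0.2.10", 10000, "TCP", "198.51.100.7", 80)
other = FiveTuple("192.0.2.10", 10001, "TCP", "198.51.100.7", 6667)
packets = [
    (conn, "617263682061726d76340a"),
    (other, "50494e47"),
    (conn, "50494e47"),
]
sessions = group_by_five_tuple(packets)
```

Packets that share all five fields end up in the same list, so each key of `sessions` corresponds to one restored session.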

S32, extracting session information of each session stage from a preset field corresponding to each session stage.

In the implementation, session information of each session stage is extracted from the payload field (i.e., the data field) corresponding to each session stage.

S33, determining the session information of each session stage as session communication information of each session stage.

In this step, session information of each session stage is used as session communication information of each session stage.

The communication traffic of similar malicious samples often shows specific pattern behavior, in string form, in the payload field of the traffic messages. For example, for the same session stage in the communication traffic of two malicious samples, the data of the two malicious samples in the payload field of that session stage are as follows:

The data of the first malicious sample in the payload field of the session stage includes the following three parts:

Session part 1 payload: 617263682061726d76340a

Session part 2 payload: 50494e47

Session part 3 payload: 1b5c53376d0d0a

The string "617263682061726d76340a50494e471b5c53376d0d0a" formed by concatenating the data in the payload fields of the above three session parts is the session communication information of the first malicious sample in the session stage.

The data of the second malicious sample in the payload field of the session stage includes the following three parts:

Session part 1 payload: 617263682a53726dee340a

Session part 2 payload: 50494e47

Session part 3 payload: 369a3337659b0a

the string "617263682a53726dee340a50494e4750494e47" formed by concatenating the data in the payload fields of the above three session portions is the session communication information of the second malicious sample in the session stage.
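The concatenation described in steps S32 and S33 can be sketched as follows, using the three payload parts of the first malicious sample. The helper name `session_communication_info` is hypothetical, and the payloads are assumed to be available as hex strings.

```python
def session_communication_info(payload_parts):
    """Concatenate the hex payload strings of one session stage in order."""
    return "".join(part.strip().lower() for part in payload_parts)


# Payload parts of the first malicious sample, as listed in the example.
first_sample_parts = [
    "617263682061726d76340a",
    "50494e47",
    "1b5c53376d0d0a",
]
info = session_communication_info(first_sample_parts)
```

Decoding the second part with `bytes.fromhex("50494e47")` gives `b"PING"`, which illustrates the kind of recurring string pattern the payload field carries.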

S23, according to session communication information sequences corresponding to every two malicious samples to be processed in the malicious samples to be processed, similarity of every two malicious samples to be processed is respectively determined.

In implementation, the similarity of any two malicious samples to be processed can be determined according to the flow shown in fig. 4, which includes the following steps:

S41, for the two session communication information of the same session stage in the session communication information sequences corresponding to any two malicious samples to be processed, signing the two session communication information respectively to generate signature information for each of them.

In specific implementation, the any two malicious samples to be processed are two samples whose communication traffic information includes at least one identical session stage. If the communication traffic information of two malicious samples to be processed has no session stage in common, the two samples are dissimilar and belong to different categories, and their similarity need not be determined, which improves the classification efficiency of malicious samples.

In this step, the signature information of the session communication information may be generated by, but not limited to, a fuzzy hash algorithm, which is not limited in the embodiment of the present application.

Specifically, signature information of each session communication information may be generated according to a flow shown in fig. 5, including the steps of:

S51, for each session communication information, partitioning the session communication information into blocks to obtain session communication information blocks.

In specific implementation, for each of the two session communication information of the same session stage in the session communication information sequences corresponding to any two malicious samples to be processed, the session communication information is partitioned according to a first preset length to obtain a plurality of session communication information blocks. The first preset length is assumed to be k, that is, the block size is k; the value of k can be set according to actual requirements, which is not limited in the embodiment of the present application.

Still referring to the example listed in step S33, in the same session stage the session communication information of the first malicious sample is "617263682061726d76340a50494e471b5c53376d0d0a" and that of the second malicious sample is "617263682a53726dee340a50494e4750494e47". The two character strings are partitioned separately. Assuming the block size k=4, "617263682061726d76340a50494e471b5c53376d0d0a" may be divided into the following 11 blocks: "6172", "6368", "2061", "726d", "7634", "0a50", "494e", "471b", "5c53", "376d", "0d0a"; and "617263682a53726dee340a50494e4750494e47" may be divided into the following 10 blocks: "6172", "6368", "2a53", "726d", "ee34", "0a50", "494e", "4750", "494e", "47". The 10th block is shorter than k and may be padded with 0 so that its length equals k; the 10th block "47" thus becomes "4700" after two 0s are appended.
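The partitioning in step S51 can be sketched as below. The function name `split_into_blocks` and the `pad_char` parameter are illustrative assumptions; the block size k=4 and the zero padding follow the example above.

```python
def split_into_blocks(text, k, pad_char="0"):
    """Split text into blocks of length k, padding the last block with pad_char."""
    if k <= 0:
        raise ValueError("block size k must be positive")
    blocks = [text[i:i + k] for i in range(0, len(text), k)]
    if blocks and len(blocks[-1]) < k:
        # Pad a short final block on the right so every block has length k.
        blocks[-1] = blocks[-1].ljust(k, pad_char)
    return blocks


first_blocks = split_into_blocks("617263682061726d76340a50494e471b5c53376d0d0a", 4)
second_blocks = split_into_blocks("617263682a53726dee340a50494e4750494e47", 4)
```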

S52, hash calculation is carried out on each session communication information block one by one to obtain the hash value of each session communication information block.

In specific implementation, a local hash algorithm may be used to hash each session communication information block one by one to obtain the hash value of each session communication information block. The local hash algorithm may be, but is not limited to, the Spamsum algorithm, the Winnowing algorithm, an LSH (Locality Sensitive Hashing) algorithm, or any other hash algorithm.

The above example is continued, and the Spamsum algorithm is taken as an example for explanation.

For the 11 session communication information blocks "6172", "6368", "2061", "726d", "7634", "0a50", "494e", "471b", "5c53", "376d", "0d0a" corresponding to the first malicious sample, the Spamsum algorithm is applied to each block to obtain 11 64-bit hash values: hash1 = spamsum(6172), hash2 = spamsum(6368), hash3 = spamsum(2061), hash4 = spamsum(726d), hash5 = spamsum(7634), hash6 = spamsum(0a50), hash7 = spamsum(494e), hash8 = spamsum(471b), hash9 = spamsum(5c53), hash10 = spamsum(376d), hash11 = spamsum(0d0a).

For the 10 session communication information blocks "6172", "6368", "2a53", "726d", "ee34", "0a50", "494e", "4750", "494e", "4700" corresponding to the second malicious sample, the Spamsum algorithm is applied to each block to obtain 10 64-bit hash values: hash1 = spamsum(6172), hash2 = spamsum(6368), hash3 = spamsum(2a53), hash4 = spamsum(726d), hash5 = spamsum(ee34), hash6 = spamsum(0a50), hash7 = spamsum(494e), hash8 = spamsum(4750), hash9 = spamsum(494e), hash10 = spamsum(4700).

S53, generating signature information of the session communication information according to the hash value of each session communication information block.

In the implementation, for each session communication information, hash values of the session communication information blocks are sequentially connected into a complete character string to obtain hash signature information of the session communication information.
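Steps S51 to S53 can be sketched end to end as follows. This is an illustration under stated assumptions: `block_hash` uses a truncated BLAKE2b digest from the Python standard library purely as a stand-in 64-bit hash, and is not the Spamsum algorithm; the names `block_hash` and `signature` are hypothetical.

```python
import hashlib


def block_hash(block):
    """Stand-in 64-bit block hash (truncated BLAKE2b), NOT Spamsum."""
    return hashlib.blake2b(block.encode("ascii"), digest_size=8).hexdigest()


def signature(info, k=4):
    """Split info into k-character blocks, hash each block, concatenate the hashes."""
    padded = info + "0" * (-len(info) % k)  # zero-pad the final block to length k
    blocks = [padded[i:i + k] for i in range(0, len(padded), k)]
    return "".join(block_hash(block) for block in blocks)


sig1 = signature("617263682061726d76340a50494e471b5c53376d0d0a")
sig2 = signature("617263682a53726dee340a50494e4750494e47")
```

Each block contributes 16 hexadecimal characters (64 bits) to the signature, so identical blocks at the same position, such as "6172" and "6368" in both samples, produce identical signature segments.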

With the fuzzy hash algorithm, hash values can be calculated and compared quickly, which speeds up the classification of malicious samples and saves computing resources. Moreover, malicious samples usually have variants, for example variants that add obfuscation such as padding, shifting or encryption, and conventional malicious sample classification may fail on them. Because the fuzzy hash algorithm calculates hash values over the partitioned session communication information of a malicious sample, variants that are highly similar but contain obfuscated parts still yield similar hash values, so variants of malicious samples can be identified more accurately.

S42, determining similarity scores of the two session communication information according to the similarity of the signature information of the two session communication information.

In specific implementation, the similarity score of the two session communication information may be determined according to the flow shown in fig. 6, which includes the following steps:

S61, according to the similarity between the signature information of the two session communication information, determining the matching score of the signature information of the two session communication information.

Wherein the matching score characterizes the degree of similarity matching of the two session communication information.

In implementation, the matching score of the signature information of the two session communication information may be determined according to the flow shown in fig. 7, including the following steps:

S71, respectively partitioning the signature information of the two session communication information to generate corresponding signature information blocks.

In the specific implementation, for signature information of two session communication information of the same session stage in a session communication information sequence corresponding to any two malicious samples to be processed, the signature information of the two session communication information is respectively segmented according to a second preset length to obtain respective corresponding signature information blocks. The second preset length, that is, the size of the block, may be set according to the actual requirement, which is not limited in the embodiment of the present application. For example, but not limited to, set to 64 bits.

Continuing the above example, assume that the signature information of the session communication information corresponding to the first malicious sample in the same session stage, obtained according to step S53, is signature1, and that of the second malicious sample is signature2. Signature1 may be partitioned into n 64-bit signature information blocks: block1_1, block1_2, …, block1_n, and signature2 into m 64-bit signature information blocks: block2_1, block2_2, …, block2_m. When the last block block1_n or block2_m is shorter than 64 bits, 0s can be appended to it to pad it to 64 bits.

S72, calculating the weighted editing distance between the signature information blocks at the same position of the signature information of the two session communication information one by one.

In this step, the weighted edit distances of the signature information blocks block1_i and block2_i of the signature information of the two session communication information are calculated one by one.

Further, for comparison, each calculated weighted edit distance may be quantized to an integer between 0 and 100, respectively.

The embodiment of the application is not limited to calculating the weighted editing distance between the signature information blocks at the same position of the signature information of the two session communication information, but can also calculate Hamming distance, jaccard distance and the like between the signature information blocks at the same position of the signature information of the two session communication information, and any other algorithm capable of calculating the distance between two objects.

For example, when calculating the Jaccard distance between two signature information blocks, the intersection of the two signature information blocks, i.e., the characters appearing in both blocks, may be computed first, and the number of characters in the intersection is denoted as A; the union of the two signature information blocks is then computed, and the number of characters in the union, i.e., the number of all characters appearing in either block, is denoted as B; the Jaccard value between the two signature information blocks is then calculated as J = A/B. The value lies in [0,1]; a value closer to 1 indicates a higher similarity between the two signature information blocks, and a value closer to 0 indicates a greater difference between them. (Strictly, A/B is the Jaccard similarity coefficient, and the corresponding distance is 1 - A/B.)
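The A/B computation described above can be sketched as follows, treating each signature information block as a set of characters. The function name `jaccard_coefficient` and the convention of returning 1.0 for two empty blocks are assumptions for illustration.

```python
def jaccard_coefficient(block_a, block_b):
    """Return |A ∩ B| / |A ∪ B| over the character sets of the two blocks."""
    set_a, set_b = set(block_a), set(block_b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumed convention: two empty blocks are identical
    return len(set_a & set_b) / len(union)


same = jaccard_coefficient("6172", "6172")
partial = jaccard_coefficient("6172", "6368")  # shares only the character "6"
disjoint = jaccard_coefficient("12", "34")
```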

And S73, if the weighted editing distance between any two signature information blocks is smaller than a first preset threshold value, determining that any two signature information blocks are similar.

In this step, the smaller the edit distance, the fewer "add, delete, change" operations are needed to convert one signature information block into the other, which means the two blocks are more similar. The first preset threshold may be set according to empirical values, which is not limited by the embodiment of the present application.
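A weighted edit distance of the kind used in steps S72 and S73 can be sketched with the standard dynamic-programming recurrence. The weight parameters `w_insert`, `w_delete` and `w_substitute` and their default value of 1 are assumptions, since the embodiment does not specify the weights; the function names are likewise hypothetical.

```python
def weighted_edit_distance(a, b, w_insert=1, w_delete=1, w_substitute=1):
    """Minimum total cost of insertions, deletions and substitutions turning a into b."""
    prev = [j * w_insert for j in range(len(b) + 1)]
    for i, ch_a in enumerate(a, start=1):
        curr = [i * w_delete]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else w_substitute
            curr.append(min(
                prev[j] + w_delete,      # delete ch_a
                curr[j - 1] + w_insert,  # insert ch_b
                prev[j - 1] + cost,      # keep or substitute
            ))
        prev = curr
    return prev[-1]


def blocks_similar(a, b, threshold):
    """Two blocks are similar when their distance is below the first preset threshold."""
    return weighted_edit_distance(a, b) < threshold


d_equal = weighted_edit_distance("6172", "6172")
d_diff = weighted_edit_distance("2061", "2a53")
```

With unit weights the function reduces to the ordinary Levenshtein distance; raising `w_substitute` makes a changed character more expensive than an added or removed one.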

S74, determining the matching score of the signature information of the two session communication information according to the number of similar signature information block pairs and the number of the signature information blocks of the signature information of the two session communication information.

In this step, each time a pair of signature information blocks at the same position of the signature information of the two session communication information is determined to be similar, the count of similar signature information block pairs is increased by 1. After the number of similar signature information block pairs has been counted, the matching score of the signature information of the two session communication information can be calculated by the following formula:

wherein match_score represents the matching score of the signature information of the two session communication information;

matched_blocks represents the number of similar signature information block pairs among the signature information blocks of the signature information of the two session communication information;

n and m are the number of signature information blocks of signature information of the two session communication information, respectively.
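The counting part of steps S71 to S74 can be sketched as follows. The sketch returns only the three quantities from which the matching score is defined (matched_blocks, n and m) and makes no assumption about how they are combined. It assumes hex-encoded signatures, so a 64-bit block is 16 characters, and it uses the Hamming distance, one of the alternatives the text allows, as the default per-block distance; all function names are hypothetical.

```python
def split_signature(signature, block_len=16, pad_char="0"):
    """Split a hex signature into 64-bit (16-character) blocks, zero-padding the last."""
    blocks = [signature[i:i + block_len] for i in range(0, len(signature), block_len)]
    if blocks and len(blocks[-1]) < block_len:
        blocks[-1] = blocks[-1].ljust(block_len, pad_char)
    return blocks


def hamming_distance(a, b):
    """Number of positions at which two equal-length blocks differ."""
    return sum(x != y for x, y in zip(a, b))


def count_similar_pairs(sig1, sig2, threshold, block_len=16, distance=hamming_distance):
    """Return (matched_blocks, n, m) for blocks compared at the same position."""
    blocks1 = split_signature(sig1, block_len)
    blocks2 = split_signature(sig2, block_len)
    matched = sum(
        1 for b1, b2 in zip(blocks1, blocks2) if distance(b1, b2) < threshold
    )
    return matched, len(blocks1), len(blocks2)


# Illustrative signatures only, not real hash output.
sig_a = "a" * 16 + "b" * 16 + "c" * 8
sig_b = "a" * 16 + "d" * 16
result = count_similar_pairs(sig_a, sig_b, threshold=4)
```

A different per-block distance, such as a weighted edit distance, can be supplied through the `distance` parameter.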

S62, calculating the distance between the two session communication messages.

In one embodiment, the N-gram distance between two session communication information may be calculated based on an N-gram model.

Specifically, each session communication information (character string) is decomposed by an N-gram model into the set of all its sequences of N consecutive characters. N-gram models include the binary Bi-gram model and the ternary Tri-gram model, and the model may be selected as needed in implementation, which is not limited in the embodiment of the present application. The two session communication information are decomposed into their respective N-gram sequence sets using the selected N-gram model. For example, decomposing "617263682061726d76340a50494e471b5c53376d0d0a" with the Tri-gram model yields the following N-gram sequence set: {617, 172, 726, 263, ……, d0d, 0d0, d0a}.

After N-gram sequence sets corresponding to two session communication information are obtained, vectors are formed on probability distribution of each N-gram sequence set, respective N-gram matrixes are constructed, N-gram distances between the two session communication information are calculated according to elements in the two N-gram matrixes, and a specific calculation mode is a mature algorithm in the prior art and is not repeated here.
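The decomposition and distance calculation can be sketched as follows. The embodiment leaves the distance itself to known algorithms, so the metric chosen here, the total variation distance between the two N-gram frequency distributions, is an assumption made only for illustration, as are the function names.

```python
from collections import Counter


def ngrams(text, n=3):
    """All sequences of n consecutive characters, in order of occurrence."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_distribution(text, n=3):
    """Relative frequency of each n-gram in text."""
    grams = ngrams(text, n)
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


def ngram_distance(a, b, n=3):
    """Assumed metric: total variation distance between distributions, in [0, 1]."""
    pa, pb = ngram_distribution(a, n), ngram_distribution(b, n)
    return 0.5 * sum(abs(pa.get(g, 0.0) - pb.get(g, 0.0)) for g in set(pa) | set(pb))


first = "617263682061726d76340a50494e471b5c53376d0d0a"
second = "617263682a53726dee340a50494e4750494e47"
tri = ngrams(first, 3)
d = ngram_distance(first, second)
```

Under this choice a distance of 0 means the two strings have identical N-gram distributions and a distance of 1 means they share no N-gram at all.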

Because the session communication information is session stage information derived from a complete session, it is context dependent. For this reason the N-gram distance between the two session communication information is calculated: the N-gram matrix takes context into account when computing similarity, matching character sequences in larger units (several consecutive characters) rather than single characters, so more context-related information is captured and the accuracy of the similarity calculation improves. When malicious sample variants are obfuscated, the N-gram distance is better suited than conventional character-matching distances to matching texts that contain padding or markers such as tabs and special characters, which further strengthens the handling of obfuscated malicious sample variants.

The embodiment of the application is not limited to calculating the distance between two session communication information by using the N-gram model, but can also calculate the distance between two session communication information in any other mode capable of calculating the distance between two objects, for example, euclidean distance and the like.

S63, determining similarity scores of the two session communication information according to the matching scores of the signature information of the two session communication information and the distance between the two session communication information.

In the implementation, after calculating the N-gram distance between the two session communication information based on the N-gram model, the similarity score of the two session communication information can be determined according to the matching score of the signature information of the two session communication information and the N-gram distance between the two session communication information.

Specifically, the similarity score of the two session communication information may be calculated by the following formula:

wherein Score represents a similarity Score for two session communication information;

d represents an N-gram distance between two session communication information;

alpha and beta are constants.

Alpha and beta adjust the relative importance of the matching score and the distance; their values can be set according to actual requirements and empirical values.

Thus, the similarity score of the two session communication information of each same session stage in the session communication information sequence corresponding to the two malicious samples to be processed can be calculated.

S43, determining the similarity of any two malicious samples to be processed according to the similarity scores of the two session communication information of the same session stage in the session communication information sequences corresponding to the any two malicious samples to be processed.

In the implementation, according to the similarity score of two session communication information of each same session stage in the session communication information sequence corresponding to any two malicious samples to be processed, a preset length penalty term and the length of the session communication information sequence corresponding to the any two malicious samples to be processed, the similarity of the any two malicious samples to be processed is determined.

When calculating the similarity between two character strings, the influence of the length of the character strings on the similarity needs to be considered, if the lengths of the two character strings are very different, even if the two character strings are similar in content, the calculated similarity is also very low, so that the embodiment of the application introduces a length penalty term to penalize the length difference between two session communication information, thereby improving the accuracy of the similarity calculation.

Specifically, the similarity of the any two malicious samples to be processed can be calculated by the following formula:

wherein gamma represents a preset length penalty term, and 0 ≤ gamma ≤ 1.

The length penalty term is used to adjust the impact of the length of the session communication information (string).

S24, classifying the malicious samples to be processed according to the similarity of every two malicious samples to be processed, and obtaining a classification result.

In implementation, for every two malicious samples to be processed, if the similarity of the two malicious samples to be processed is determined to be greater than a second preset threshold, the two malicious samples to be processed are determined to be malicious samples of the same category, wherein the second preset threshold can be set according to actual requirements, for example, can be set to be the same as the first preset threshold or can be set to be other values, and the embodiment of the application is not limited to this.
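The thresholding in step S24 can be sketched as follows. The text only states that two samples whose similarity exceeds the second preset threshold belong to the same category; merging such pairs transitively with a union-find structure is an assumption of this sketch, and the similarity values and function name are hypothetical.

```python
def classify(num_samples, similarities, threshold):
    """Group sample indices whose pairwise similarity exceeds the threshold."""
    parent = list(range(num_samples))

    def find(x):
        # Follow parent links to the group representative, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), similarity in similarities.items():
        if similarity > threshold:
            parent[find(i)] = find(j)  # merge the two categories

    groups = {}
    for sample in range(num_samples):
        groups.setdefault(find(sample), []).append(sample)
    return sorted(groups.values())


# Illustrative pairwise similarities for four samples.
pairwise = {(0, 1): 0.9, (0, 2): 0.2, (1, 2): 0.1, (2, 3): 0.85}
categories = classify(4, pairwise, threshold=0.8)
```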

According to the malicious sample classification method provided by the embodiment of the present application, communication traffic information of the malicious samples to be processed is acquired, the communication traffic information being the data traffic information flowing through each network node while a malicious sample to be processed runs; session communication information of each session stage is extracted from the communication traffic information of each malicious sample to be processed, and a corresponding session communication information sequence is generated based on the session communication information of each session stage; the similarity of every two malicious samples to be processed is determined according to their corresponding session communication information sequences; and the malicious samples to be processed are classified according to the similarity of every two of them to obtain a classification result. Because the communication traffic generated by samples derived from the same malicious code family contains repeated or very similar payload data in the same session stage, such samples show a certain similarity when measured over the whole session. By extracting the session communication information of each session stage from the acquired communication traffic information and comparing the resulting sequences, the similarity between malicious samples can be established quickly and malicious samples of different types can be distinguished accurately, which improves both the accuracy and the efficiency of classification.

Based on the same inventive concept, the embodiment of the application also provides a malicious sample classification device. Because the principle by which the device solves the problem is similar to that of the malicious sample classification method, the implementation of the device can refer to the implementation of the method, and repeated description is omitted.

Fig. 8 is a schematic structural diagram of a malicious sample classification device according to an embodiment of the present application, which may include:

an obtaining unit 81, configured to obtain communication traffic information of a malicious sample to be processed, where the communication traffic information is data traffic information flowing through each network node during the running of the malicious sample to be processed;

a generating unit 82, configured to extract session communication information of each session stage from communication traffic information of each malicious sample to be processed, and generate a corresponding session communication information sequence based on the session communication information of each session stage;

a determining unit 83, configured to determine similarity of each two malicious samples to be processed according to session communication information sequences corresponding to each two malicious samples to be processed in the malicious samples to be processed;

and the classification unit 84 is configured to classify the malicious samples to be processed according to the similarity between every two malicious samples to be processed, so as to obtain a classification result.

In a possible implementation manner, the generating unit 82 is specifically configured to determine, for each malicious sample to be processed, each session stage in the communication traffic information of the malicious sample to be processed according to five-tuple information corresponding to the malicious sample to be processed; extract session information of each session stage from a preset field corresponding to each session stage; and determine the session information of each session stage as session communication information of each session stage.

In a possible implementation manner, the determining unit 83 is specifically configured to sign two session communication information in the same session stage in a session communication information sequence corresponding to any two malicious samples to be processed, and generate signature information of each of the two session communication information; determining similarity scores of the two session communication information according to the similarity of the signature information of the two session communication information; and determining the similarity of the any two malicious samples to be processed according to the similarity scores of the two session communication information of the same session stage in the session communication information sequences corresponding to the any two malicious samples to be processed.

In a possible implementation manner, the determining unit 83 is specifically configured to determine a matching score of signature information of the two session communication information according to similarity between signature information of the two session communication information, where the matching score characterizes a degree of similarity matching of the two session communication information; calculating the distance between the two session communication information; and determining similarity scores of the two session communication information according to the matching scores of the signature information of the two session communication information and the distance between the two session communication information.

In a possible implementation manner, the determining unit 83 is specifically configured to determine the similarity of the any two malicious samples to be processed according to the similarity score of two session communication information of each same session stage in the session communication information sequences corresponding to the any two malicious samples to be processed, a preset length penalty term, and the length of the session communication information sequences corresponding to the any two malicious samples to be processed.

In a possible implementation manner, the determining unit 83 is specifically configured to, for each session communication information, partition the session communication information into session communication information blocks; perform hash calculation on each session communication information block one by one to obtain a hash value of each session communication information block; and generate signature information of the session communication information according to the hash value of each session communication information block.
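The block-then-hash signing described above can be sketched as follows. Fixed-size blocks, MD5 as the per-block hash, and the mapping of each hash value to one character of a 64-symbol alphabet are assumptions for illustration; the application does not state the blocking rule or the hash function.

```python
import hashlib

# 64-symbol alphabet used to turn each block hash into one signature character.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def sign(info: bytes, block_size: int = 8) -> str:
    """Split `info` into fixed-size blocks, hash each block, and emit
    one signature character per block."""
    chars = []
    for start in range(0, len(info), block_size):
        block = info[start:start + block_size]
        digest = hashlib.md5(block).digest()
        chars.append(ALPHABET[digest[0] % 64])
    return "".join(chars)
```

Because each block contributes one character, a local change in the session communication information only alters the corresponding part of the signature, which is what makes signatures comparable block by block.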

In a possible implementation manner, the determining unit 83 is specifically configured to partition the signature information of each of the two session communication information into blocks to generate respective corresponding signature information blocks; calculate, one by one, the weighted edit distance between the signature information blocks at the same position in the signature information of the two session communication information; if the weighted edit distance between any two signature information blocks is smaller than a first preset threshold value, determine that the two signature information blocks are similar; and determine the matching score of the signature information of the two session communication information according to the number of similar signature information block pairs and the number of signature information blocks in the signature information of the two session communication information.
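A minimal sketch of this matching-score computation is given below. The operation weights, the block size, the threshold value, and the normalisation by the larger block count are assumptions; the application only states that the score depends on the number of similar block pairs and the number of signature blocks.

```python
def weighted_edit_distance(a, b, w_ins=1.0, w_del=1.0, w_sub=2.0):
    """Edit distance with per-operation weights (weights are assumed values)."""
    prev = [j * w_ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * w_del]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + w_del,          # delete ca
                           cur[j - 1] + w_ins,       # insert cb
                           prev[j - 1] + (0.0 if ca == cb else w_sub)))
        prev = cur
    return prev[-1]


def split_blocks(sig, size):
    """Partition a signature string into consecutive blocks of `size` characters."""
    return [sig[i:i + size] for i in range(0, len(sig), size)]


def match_score(sig_a, sig_b, size=4, threshold=2.0):
    """Fraction of same-position block pairs whose distance is below the threshold."""
    blocks_a, blocks_b = split_blocks(sig_a, size), split_blocks(sig_b, size)
    if not blocks_a or not blocks_b:
        return 0.0
    similar = sum(1 for x, y in zip(blocks_a, blocks_b)
                  if weighted_edit_distance(x, y) < threshold)
    # Assumed normalisation: divide by the larger number of blocks.
    return similar / max(len(blocks_a), len(blocks_b))
```

With these assumed parameters, two identical signatures score 1.0, and a signature pair that agrees in only one of two blocks scores 0.5.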

In a possible implementation manner, the determining unit 83 is specifically configured to calculate an N-gram distance between the two session communication information based on an N-gram model; and determining similarity scores of the two session communication information according to the matching scores of the signature information of the two session communication information and the N-gram distance between the two session communication information.
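One common way to obtain an N-gram distance is sketched below. Defining the distance as one minus the Jaccard similarity of the two byte N-gram sets is an assumption; the application does not define the N-gram distance, and the way it is combined with the matching score is not reproduced here.

```python
def ngrams(data: bytes, n: int = 2) -> set:
    """Return the set of contiguous byte n-grams of `data`."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def ngram_distance(a: bytes, b: bytes, n: int = 2) -> float:
    """Assumed definition: 1 - |A ∩ B| / |A ∪ B| over the n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 0.0  # two inputs too short to yield n-grams are treated as identical
    return 1.0 - len(ga & gb) / len(ga | gb)
```

Under this definition the distance lies in [0, 1], with 0 for identical N-gram sets and 1 for disjoint ones.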

In a possible implementation manner, the determining unit 83 is specifically configured to calculate the similarity score of the two session communication information by using the following formula:

alpha and beta are constants.

In a possible implementation manner, the determining unit 83 is specifically configured to calculate the similarity between the two malicious samples to be processed according to the following formula:

In a possible implementation manner, the classification unit 84 is specifically configured to determine, for each two malicious samples to be processed, that the two malicious samples to be processed are malicious samples of the same class if it is determined that the similarity between the two malicious samples to be processed is greater than a second preset threshold.
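The pairwise threshold rule can be turned into classes with a short sketch. Merging pairs transitively with a union-find structure is an assumption: the paragraph above only states that two samples whose similarity exceeds the second preset threshold belong to the same class. The function name and the dictionary layout of `similarity` are hypothetical.

```python
def classify(n_samples, similarity, threshold):
    """Group sample indices 0..n_samples-1 into classes.

    `similarity` maps an index pair (i, j) to the similarity of samples i and j.
    Pairs above `threshold` are merged, and merges are applied transitively.
    """
    parent = list(range(n_samples))

    def find(x):
        # Find the class representative, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in similarity.items():
        if score > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n_samples):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For example, with a threshold of 0.5, samples 0 and 1 (similarity 0.9) and samples 1 and 2 (similarity 0.8) end up in one class even if samples 0 and 2 are only 0.3 similar, which is a consequence of the assumed transitive merging.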

Based on the same technical concept, an embodiment of the present application further provides an electronic device 900. Referring to fig. 9, the electronic device 900 is configured to implement the malicious sample classification method described in the above method embodiments, and may include: a memory 901, a processor 902, and a computer program stored in the memory and executable on the processor, such as a malicious sample classification program. When executing the computer program, the processor implements the steps of the above malicious sample classification method embodiments, for example step S21 shown in fig. 2. Alternatively, when executing the computer program, the processor performs the functions of the modules/units of the apparatus embodiments described above, for example unit 81.

The specific connection medium between the memory 901 and the processor 902 is not limited in the embodiment of the present application. In the embodiment of the present application, the memory 901 and the processor 902 are connected through the bus 903 in fig. 9. The bus 903 is shown by a thick line in fig. 9; the connection manner between other components is illustrated only schematically and is not limiting. The bus 903 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 9, but this does not mean that there is only one bus or only one type of bus.

The memory 901 may be a volatile memory, such as a random-access memory (RAM); the memory 901 may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 901 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 901 may also be a combination of the above memories.

The processor 902 is configured to implement the malicious sample classification method shown in fig. 2, as follows:

The processor 902 is configured to invoke the computer program stored in the memory 901 to execute the steps shown in fig. 2: step S21, acquiring communication flow information of the malicious samples to be processed; step S22, extracting session communication information of each session stage from the communication flow information of each malicious sample to be processed, and generating a corresponding session communication information sequence based on the session communication information of each session stage; step S23, determining the similarity of every two malicious samples to be processed according to the session communication information sequences corresponding to every two malicious samples to be processed; and step S24, classifying the malicious samples to be processed according to the similarity of every two malicious samples to be processed to obtain a classification result.

An embodiment of the present application also provides a computer-readable storage medium, which stores the computer-executable instructions to be executed by the processor, including a program to be executed by the processor.

In some possible embodiments, aspects of the malicious sample classification method provided by the present application may also be implemented in the form of a program product that includes program code. When the program product runs on an electronic device, the program code causes the electronic device to perform the steps of the malicious sample classification method according to the various exemplary embodiments of the present application described above. For example, the electronic device may perform the steps shown in fig. 2: step S21, acquiring communication flow information of the malicious samples to be processed; step S22, extracting session communication information of each session stage from the communication flow information of each malicious sample to be processed, and generating a corresponding session communication information sequence based on the session communication information of each session stage; step S23, determining the similarity of every two malicious samples to be processed according to the session communication information sequences corresponding to every two malicious samples to be processed; and step S24, classifying the malicious samples to be processed according to the similarity of every two malicious samples to be processed to obtain a classification result.

It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A method of classifying malicious samples, comprising:

2. The method of claim 1, wherein the extracting session communication information of each session stage from the communication flow information of each malicious sample to be processed comprises:

3. The method of claim 1 or 2, wherein determining the similarity of each two malicious samples according to session communication information sequences corresponding to each two malicious samples, respectively, specifically includes:

4. The method of claim 3, wherein determining the similarity score for the two session communication information based on the similarity of the signature information of the two session communication information, specifically comprises:

calculating the distance between the two session communication information;

5. The method of claim 3, wherein determining the similarity of the any two malicious samples according to the similarity scores of the two session communication information of the same session stage in the session communication information sequences corresponding to the any two malicious samples specifically comprises:

6. The method of claim 3, wherein signing the two session communication information respectively generates signature information of each of the two session communication information, and specifically comprises:

7. The method according to claim 4, wherein determining the matching score of the signature information of the two session communication information according to the similarity between the signature information of the two session communication information, specifically comprises:

8. The method of claim 7, wherein calculating the distance of the two session communication information specifically comprises:

9. The method according to claim 8, wherein determining the similarity score of the two session communication information according to the matching score of the signature information of the two session communication information and the N-gram distance between the two session communication information, specifically comprises:

match_score represents the matching score of the signature information of the two session communication information;

alpha and beta are constants.

10. The method of claim 5, wherein determining the similarity of the any two malicious samples according to the similarity score of the two session communication information of each same session stage in the session communication information sequences corresponding to the any two malicious samples, the preset length penalty term, and the length of the session communication information sequences corresponding to the any two malicious samples, specifically comprises:

11. The method of claim 1, wherein classifying the malicious samples to be processed according to the similarity of every two malicious samples to be processed to obtain a classification result, specifically comprises:

12. A malicious sample classification device, comprising:

13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the malicious sample classification method of any one of claims 1-11 when the program is executed by the processor.

14. A computer readable storage medium having stored thereon a computer program, which when executed by a processor performs the steps in the malicious sample classification method as claimed in any one of claims 1 to 11.