
CN112214764B - Complex network-oriented malicious program classification method and system

Info

Publication number: CN112214764B
Application number: CN202010935440.1A
Authority: CN
Inventors: 石志鑫; 殷其雷; 姜建国; 黄伟庆; 吕彬; 康肖钰
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2020-09-08
Filing date: 2020-09-08
Publication date: 2024-01-09
Anticipated expiration: 2040-09-08
Also published as: CN112214764A

Abstract

Embodiments of the invention provide a complex network-oriented malicious program classification method and system. The method comprises the following steps: acquiring the complete network traffic of a network malicious program within a preset time, dividing the complete network traffic into a plurality of network activities, and carrying out network activity depiction on the plurality of network activities to obtain a network comprehensive behavior portrayal model; generating a behavior signature from the sample network behavior characteristics obtained from the network comprehensive behavior portrayal model, by means of a behavior distance measurement function and a preset clustering algorithm; and carrying out overall similarity calculation between known malicious training samples and an unknown test sample based on the behavior signature, to obtain the category attribution of the unknown network malicious program. By classifying network malicious programs on the basis of comprehensive behavior portraits, the embodiments describe the behavior characteristics of complex network malicious programs finely and comprehensively and so support a correct category attribution judgment; because the classification method does not target any specific network activity, protocol or format, it is widely applicable.

Description

Complex network-oriented malicious program classification method and system

Technical Field

The invention relates to the technical field of network security, in particular to a complex network malicious program classification method and system.

Background

In the technical field of network security, the classification of network malicious programs refers to analysis and accurate judgment of category attribution of detected malicious programs. Aiming at the known class samples, more targeted threat elimination and disassembly measures can be further adopted, the variation trend among the same class samples is analyzed, and the corresponding load signature and behavior feature library is updated; and aiming at unknown class samples, after a certain number of unknown class samples are accumulated, a semi-automatic or automatic analysis method is adopted, characteristics of each layer of the unknown class samples are extracted to serve as new class identifiers, and a malicious program threat information library is continuously enriched.

At present, classification methods for network malicious programs mainly follow the technical route of behavior analysis: the behavior characteristics of a network malicious program are characterized from multiple angles, and the similarities and differences between its behavior and that of known samples are analyzed and compared to reach a judgment. Specifically, Perdisci et al. propose a multi-level, network-behavior-based method for classifying HTTP malicious programs and extracting family signatures. It performs a coarse first clustering using relatively general network-layer statistical behavior features, performs a fine second clustering and cluster merging using behavior features extracted from the structure of the HTTP request header, and finally extracts the cluster "center point" features as family signatures for use in detection. In addition, a graph-based method for judging malicious program family consistency is also provided for the multi-scanner label setting. Dietrich et al. also focus on the classification of, and signature generation for, different families of malicious programs. They describe the C&C (Command & Control) traffic of malicious programs through three types of behavior features (traffic protocol, packet size sequence, and the number of distinct bytes in the HTTP protocol), define a general distance measure between different C&C traffic, form clusters by hierarchical clustering, evaluate the weight of each value in a cluster according to the distribution of feature values within the cluster, and finally use the trained clusters as family signatures for malicious program C&C traffic detection and family classification. Rafique et al. propose the FIRMA method, which first groups the behaviorally similar traffic generated by malicious programs, then extracts the character sets in each cluster that satisfy defined conditions as malicious program family signatures; after merging, clustering and format conversion, the signatures can be used in actual detection and family classification tasks. In addition, they also use the protocol behavior features and payload state features of malicious program families, adopting an evolutionary algorithm to classify malicious program families accurately, with better results than traditional machine learning methods.

In these existing schemes, the behavior portrait models for network malicious programs differ in granularity and dimension. They either characterize only coarse and incomplete network behavior features of the malicious program, such as simple statistics of upstream and downstream traffic size, intervals and counts of protocol flag bits, or they target one specific type of network activity or behavior of the malicious program, such as C&C channel activity, download activity, attack activity or DNS domain name resolution activity. Today's advanced network malicious programs behave in complex ways, run for long periods, and are numerous and growing in number. If such programs are characterized and classified with a behavior portrait aimed at one specific activity, only a partial view of their network behavior is obtained, which easily leads to classification errors. For example, two different families of network malware may employ the same DGA (Domain Generation Algorithm) domain name generation module and the same DDoS (Distributed Denial of Service) attack traffic module, and thus behave very similarly in terms of domain name requests and DDoS attack activity. If the two families are then classified with a behavior portrait model built for domain name or DDoS attack activity, it is difficult to find the differences between them and to distinguish them accurately. Similarly, the existing coarse and incomplete general behavior portrait models cannot describe the inherent behavior characteristics of malicious programs finely enough to reveal the similarities and differences between them and give a correct category judgment.

Disclosure of Invention

The embodiment of the invention provides a complex network-oriented malicious program classification method and a complex network-oriented malicious program classification system, which address the defects of the prior art, namely that it is limited to particular types of programs or activities and does not characterize malicious programs completely and in depth.

In a first aspect, an embodiment of the present invention provides a method for classifying malicious programs for a complex network, including:

acquiring complete network traffic of a network malicious program in preset time, dividing the complete network traffic into a plurality of network activities, and carrying out network activity depiction on the plurality of network activities to obtain a network comprehensive behavior portrayal model;

generating a behavior signature from the sample network behavior characteristics obtained from the network comprehensive behavior portrayal model, by means of a behavior distance measurement function and a preset clustering algorithm;

and respectively carrying out overall similarity calculation on the known malicious network training sample and the unknown test sample based on the behavior signature to obtain category attribution of the unknown network malicious program.

Further, the obtaining the complete network traffic of the network malicious program in the preset time divides the complete network traffic into a plurality of network activities, and performs network activity characterization on the plurality of network activities to obtain a network comprehensive behavior portrayal model, which specifically comprises:

Dividing a plurality of network flows into the same type of network activities based on preset binary group information, removing the network activities meeting preset definition, and obtaining an initial network activity dividing result;

and carrying out network activity depiction on the initial network activity division result from the whole data, the data stream, the data packet, the data type mark and the activity time ratio to obtain the network comprehensive behavior portrayal model.

Further, the dividing the network traffic into the same class of network activities based on the preset binary group information, removing the network activities meeting the preset definition, and obtaining an initial network activity dividing result, which specifically includes:

dividing a plurality of network flows with the same destination address, the same destination port and different transport layer protocols into the same type of network activities;

and removing network activities which only comprise single data packets, single data streams and unidirectional data streams in the same type of network activities to obtain the initial network activity dividing result.

Further, the network activity characterization is performed on the initial network activity division result from the data entity, the data flow, the data packet, the data type mark and the activity time ratio to obtain the network comprehensive behavior portrait model, which specifically comprises:

Numerical representation is carried out from the overall dimension of the TCP data and the overall dimension of the UDP data, and an overall data depiction result is obtained; the overall dimension of the TCP data comprises the number of uplink TCP bytes, the number of downlink TCP bytes, the number of uplink TCP packets, the number of downlink TCP packets and the ratio of the number of all TCP streams to the overall active time of network activity, and the overall dimension of the UDP data comprises the number of uplink UDP bytes, the number of downlink UDP bytes, the number of uplink UDP packets, the number of downlink UDP packets and the ratio of the number of all UDP streams to the overall active time of network activity;

carrying out numerical value set representation from the TCP data stream dimension and the UDP data stream dimension to obtain a data stream depiction result; the TCP data stream dimension comprises a TCP stream size sequence, a TCP stream duration sequence and a TCP stream interval sequence, and the UDP data stream dimension comprises a UDP stream size sequence, a UDP stream duration sequence and a UDP stream interval sequence;

performing discrete distribution representation from the TCP data packet dimension and the UDP data packet dimension to obtain a data packet characterization result; the TCP data packet dimension comprises an uplink TCP data packet and a downlink TCP data packet, and the UDP data packet dimension comprises an uplink UDP data packet and a downlink UDP data packet;

acquiring data packet type information from network activities to obtain a data type mark characterization result;

Calculating the ratio of the network activity time to the sample execution time to obtain an activity time ratio characterization result;

and combining the data overall characterization result, the data flow characterization result, the data packet characterization result, the data type mark characterization result and the activity time ratio characterization result into the network comprehensive behavior portrait model.

Further, generating a behavior signature from the sample network behavior characteristics obtained from the network comprehensive behavior portrayal model, by means of the behavior distance measurement function and the preset clustering algorithm, specifically includes:

respectively acquiring a first behavior portrait and a second behavior portrait corresponding to any two network activities;

acquiring distances among values, distances among value sets and distances among discrete distributions;

respectively calculating the data overall distance, the data flow distance and the data packet distance between the first behavior portrait and the second behavior portrait; the data overall distance is equal to the distance between the numerical values, the data flow distance is equal to the average of the value-set distance from the first behavior portrait to the second behavior portrait and the value-set distance from the second behavior portrait to the first behavior portrait, and the data packet distance is equal to the distance between the discrete distributions;

obtaining, according to the data overall distance, the data flow distance and the data packet distance, a matrix storing the distances between the behavior portraits corresponding to all network activities of a sample;

and grouping similar portraits into the same class cluster by applying the preset clustering algorithm to the matrix of distances between the behavior portraits, and taking all class clusters as the behavior signatures of the malicious sample.

Further, the overall similarity calculation is performed on the known malicious network training sample and the unknown test sample based on the behavior signature to obtain category attribution of the unknown network malicious program, which specifically comprises:

respectively acquiring a first behavior signature set corresponding to the unknown test sample, and a second behavior signature set corresponding to any malicious program in the known malicious network training sample set;

and calculating a similarity measurement function between the first behavior signature set and the second behavior signature set, and judging whether the unknown test sample belongs to the category of the known malicious network training sample.

Further, the calculating a similarity measure function between the first behavior signature set and the second behavior signature set to obtain the category attribution specifically includes:

respectively acquiring first behavior signatures in the first behavior signature set, a first network activity behavior portrait in the first behavior signatures, a second behavior signature in the second behavior signature set, and a second network activity portrait in the second behavior signature;

Obtaining a similarity between the first behavioral signature and the second behavioral signature based on a distance between the first network behavioral representation and the second network behavioral representation and a ratio of activity times between the first network behavioral representation and the second network behavioral representation;

obtaining the maximum value of the similarity of all behavior signatures in the first behavior signature set and the second behavior signature set, and setting the maximum value not smaller than a preset threshold value;

and calculating the sum of maximum values of the similarity of all behavior signatures in the first behavior signature set and all behavior signatures in the second behavior signature set, and marking the category attribution according to the sum of the maximum values and the maximum similarity of the current test sample on different training sample behavior signature sets.

In a second aspect, an embodiment of the present invention further provides a complex network malicious program classification system, including:

the portrayal module is used for acquiring the complete network traffic of a network malicious program in a preset time, dividing the complete network traffic into a plurality of network activities, and carrying out network activity depiction on the plurality of network activities to obtain a network comprehensive behavior portrayal model;

the signature module is used for generating a behavior signature from the sample network behavior characteristics obtained from the network comprehensive behavior portrayal model, by means of a behavior distance measurement function and a preset clustering algorithm;

and the judging module is used for respectively carrying out overall similarity calculation on the known malicious network training sample and the unknown test sample based on the behavior signature to obtain category attribution of the unknown network malicious program.

In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the steps of any one of the above-mentioned complex network malicious program classification methods when the processor executes the program.

In a fourth aspect, embodiments of the present invention also provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a complex network-oriented malicious program classification method as described in any of the above.

According to the complex network malicious program classification method and system provided by the embodiment of the invention, the behavior characteristics of the complex network malicious program are finely and comprehensively described by carrying out comprehensive behavior portrait classification on the network malicious program, so that correct category attribution judgment is further made, and the classification method is not specific to specific network activities, protocols and formats, and has strong applicability.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a method for classifying malicious programs for a complex network according to an embodiment of the present invention;

FIG. 2 is an overall flowchart of network malware classification provided by an embodiment of the present invention;

FIG. 3 is a graph showing a comparison of classification accuracy under the family discrimination criteria provided by the embodiment of the present invention;

FIG. 4 is a graph showing a comparison of classification accuracy under the scene discrimination criteria provided by the embodiment of the invention;

FIG. 5 is a schematic structural diagram of a system for classifying malicious programs for a complex network according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The terms "first", "second" in the embodiments of the present application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the terms "comprise" and "have," along with any variations thereof, are intended to cover non-exclusive inclusions. For example, a system, article, or apparatus that comprises a list of elements is not limited to only those elements or units listed but may alternatively include other elements not listed or inherent to such article, or apparatus. In the description of the present application, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.

Aiming at the limitations of the existing network malicious program classification method in the prior art when dealing with the current complex network malicious program, the embodiment of the invention provides a novel classification method aiming at the network malicious program, firstly provides a novel comprehensive behavior portrait model which can finely and comprehensively describe each network activity of the malicious program from a plurality of layers, and further provides a specific malicious program classification method matched with the behavior portrait model.

FIG. 1 is a schematic flow chart of a complex network malicious program classification method provided by an embodiment of the present invention. As shown in FIG. 1, the method includes:

S1, acquiring complete network traffic of a network malicious program in preset time, dividing the complete network traffic into a plurality of network activities, and carrying out network activity depiction on the plurality of network activities to obtain a network comprehensive behavior portrayal model;

S2, generating a behavior signature from the sample network behavior characteristics obtained from the network comprehensive behavior portrayal model, by means of a behavior distance measurement function and a preset clustering algorithm;

and S3, respectively carrying out overall similarity calculation on the known malicious network training sample and the unknown test sample based on the behavior signature to obtain category attribution of the unknown network malicious program.

Specifically, the overall flow of the network malicious program classification method provided by the embodiment of the invention is shown in fig. 2, and mainly comprises three specific steps, including: network activity characterization, behavior signature generation, and malicious program family judgment.

In the network activity depiction step, the complete network flow of a network malicious program in a certain time is divided into different network activities, and the network activities are respectively depicted by the comprehensive behavior portrayal model provided by the embodiment of the invention; in the behavior signature generation step, the main network behavior characteristics of the sample are mined from a plurality of network activity behavior portraits to be used as signatures through a clustering method and a behavior distance measurement function; finally, the test sample behavior signature with unknown category performs overall similarity calculation with the training sample behavior signature with known category, and determines category attribution of unknown complex network malicious program based on similarity.

According to the embodiment of the invention, the comprehensive behavior portrait classification is carried out on the network malicious programs, the behavior characteristics of the complex network malicious programs are finely and comprehensively described, the correct category attribution judgment is further made, and the classification method is not specific to specific network activities, protocols and formats, so that the applicability is strong.

Based on the above embodiment, step S1 in the method specifically includes:

dividing a plurality of network flows into the same type of network activities based on preset binary group information, and removing the network activities meeting a preset definition, to obtain an initial network activity division result;

and carrying out network activity depiction on the initial network activity division result in terms of the data as a whole, the data flows, the data packets, the data type mark and the activity time ratio, to obtain the network comprehensive behavior portrayal model.

In particular, since numerous network traffic generated by malicious programs over a long run time corresponds to a wide variety of network activities, all the characteristics exhibited by the traffic should not be directly regarded as behavioral characteristics of the sample, which may either cause incomplete and rough characterization due to the use of only a small number of characteristics, or may cause the generated behavioral model to be excessively complex and rich in scattered or noisy behaviors.

It can be understood that the strategy adopted by the embodiment of the invention is first to characterize each individual network activity of the sample in a fine-grained and comprehensive way, and then to mine the main behavior characteristics of the sample from these individual portraits. Original traffic is normally restored to a specific network activity on the basis of shared triplet information, namely destination address, destination port and transport layer protocol. To reduce the number of activities to be described and the computational overhead of the subsequent steps, the embodiment of the invention further merges traffic that has the same destination address and port but different transport layer protocols into the same activity; that is, the division is based only on the binary group of destination address and destination port. From the preliminary division result, the embodiment of the invention filters out activities that contain only a single packet, only a single stream, or only unidirectional streams. Such activities lack interaction and relatively persistent communication data, and can hardly represent a meaningful network activity. To characterize a complete network activity ACT carefully and comprehensively, it is expressed as:

ACT = (Act_bh, Flows_bh, Pkt_bh, Act_label, Act_we)

where Act_bh, Flows_bh and Pkt_bh represent the behavior characteristics of the network activity at three levels: the data as a whole, the data flows, and the data packets, respectively.
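The division and filtering of network activities described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Flow` record and its field names are hypothetical, and "unidirectional" is read here as "the activity carries packets in only one direction", which the text does not spell out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One transport-layer flow; all field names are illustrative assumptions."""
    dst_ip: str
    dst_port: int
    proto: str         # "TCP" or "UDP"
    up_packets: int    # packets sent by the sample
    down_packets: int  # packets received by the sample


def partition_activities(flows):
    """Group flows by (destination address, destination port), ignoring the protocol."""
    groups = defaultdict(list)
    for flow in flows:
        groups[(flow.dst_ip, flow.dst_port)].append(flow)
    return dict(groups)


def is_meaningful(activity):
    """Reject single-flow, single-packet and one-directional activities."""
    up = sum(f.up_packets for f in activity)
    down = sum(f.down_packets for f in activity)
    if len(activity) < 2:        # only a single flow
        return False
    if up + down < 2:            # only a single packet
        return False
    return up > 0 and down > 0   # traffic must go both ways


def initial_activities(flows):
    """Initial division result: grouped activities that survive the filter."""
    grouped = partition_activities(flows)
    return {key: act for key, act in grouped.items() if is_meaningful(act)}
```

Grouping on the two-element key keeps TCP and UDP flows to the same endpoint in one activity, which is what later allows Act_label to take the value "mix".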

Act_bh consists of 10 statistical behavior features in numerical form (F₁, F₂, ..., F₁₀). The first five features are, respectively, the number of upstream TCP bytes, the number of downstream TCP bytes, the number of upstream TCP packets, the number of downstream TCP packets and the number of TCP streams in the network activity, each divided by the overall active time of the network activity; the sixth through tenth features are calculated in the same way for the UDP traffic in the network activity. The overall active time of a network activity is defined as the sum of the durations of the streams it contains minus the overlapping portions. Dividing the TCP and UDP statistics by the overall active time of the activity eliminates behavior differences that might otherwise arise from different samples having different execution times.
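As an illustration of the Act_bh computation, the sketch below treats the overall active time as the length of the union of the flow time intervals (so overlapping portions count once) and divides each of the five per-protocol totals by it. The dictionary keys used for a flow are hypothetical, and dividing all five totals by the active time is one reading of the description.

```python
def active_time(intervals):
    """Total length of the union of (start, end) intervals; overlaps count once."""
    total = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def act_bh(flows):
    """Return (F1, ..., F10): TCP then UDP byte, packet and flow counts per second.

    Each flow is a dict with the assumed keys proto, start, end, up_bytes,
    down_bytes, up_pkts and down_pkts.
    """
    duration = active_time([(f["start"], f["end"]) for f in flows])
    if duration <= 0:
        raise ValueError("activity has no measurable active time")
    features = []
    for proto in ("TCP", "UDP"):
        selected = [f for f in flows if f["proto"] == proto]
        totals = [
            sum(f["up_bytes"] for f in selected),
            sum(f["down_bytes"] for f in selected),
            sum(f["up_pkts"] for f in selected),
            sum(f["down_pkts"] for f in selected),
            len(selected),  # number of flows of this protocol
        ]
        features.extend(total / duration for total in totals)
    return tuple(features)
```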

Flows_bh consists of six behavior features in the form of value sets (F₁₁, F₁₂, ..., F₁₆). F₁₁, F₁₂ and F₁₃ capture the behavior shown by the flow size sequence, the flow duration sequence and the flow interval sequence of the TCP traffic in the network activity; the other three features characterize the same three value sequences for the UDP flows. The conventional way to characterize a value sequence is to take its mean, but this is unsuitable for flow-related value sequences: within one network activity, different TCP/UDP flows correspond to different subtasks and therefore behave differently. For example, in a network activity that fetches web resources, a sample may obtain the page frame through one fast, short flow and then obtain the multimedia resources on the page through several long-lived, relatively slow flows, so a single mean cannot properly characterize the size or duration sequence of these flows. The embodiment of the invention therefore adopts a clustering method: each flow-related value sequence is clustered into several clusters, and the set formed by the cluster center values serves as the description of the sequence. The X-means algorithm is chosen to cluster the value sequences.
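The idea of describing a value sequence by a set of cluster centers rather than a single mean can be illustrated with the short sketch below. It is not X-means: to stay dependency-free it uses a simple gap-based grouping of sorted one-dimensional values as a stand-in, and the `rel_gap` parameter is a hypothetical tuning knob that does not come from the text. Values are assumed to be non-negative (sizes, durations, intervals).

```python
def cluster_centers(values, rel_gap=0.5):
    """Describe a value sequence by the means of its 1-D clusters.

    Stand-in for X-means: values are sorted, and a new cluster starts when the
    next value exceeds the running mean of the current cluster by more than
    rel_gap times that mean.
    """
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for value in ordered[1:]:
        mean = sum(clusters[-1]) / len(clusters[-1])
        if value - mean > rel_gap * max(mean, 1e-9):
            clusters.append([value])
        else:
            clusters[-1].append(value)
    return [sum(cluster) / len(cluster) for cluster in clusters]


# Three small flows and two large ones: the plain mean (2103 bytes) describes
# neither group, while the center set keeps both.
flow_sizes = [100, 110, 105, 5000, 5200]
print(cluster_centers(flow_sizes))
```

Whichever clustering algorithm is used, the resulting feature is a set whose size can differ between activities, which is why a dedicated distance between value sets is needed later.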

Pkt_bh consists of four behavior features in the form of discrete distributions (F₁₇, F₁₈, F₁₉, F₂₀), which describe the behavior of all upstream TCP data packets, downstream TCP data packets, upstream UDP data packets and downstream UDP data packets in the network activity. Taking the upstream TCP packets as an example, the embodiment of the invention calculates the proportion of these packets falling in each of six size ranges (0-64 bytes, 64-128 bytes, 128-256 bytes, 256-512 bytes, 512-1024 bytes, and greater than 1024 bytes) and composes the six proportions into the discrete distribution F₁₇. The other three features are calculated in the same way.
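A packet-size distribution of this kind can be computed as in the following sketch. The stated ranges share their boundary values (64 appears in two ranges), so the code assumes that a boundary value belongs to the lower bin, which keeps the last bin strictly "greater than 1024 bytes"; that tie-breaking rule is an assumption, as is returning all zeros when there are no packets.

```python
from bisect import bisect_left

# Upper edges of the first five size bins, in bytes; the sixth bin is open-ended.
SIZE_EDGES = (64, 128, 256, 512, 1024)


def size_distribution(packet_sizes):
    """Proportion of packets in each of the six size ranges."""
    counts = [0] * (len(SIZE_EDGES) + 1)
    for size in packet_sizes:
        # bisect_left puts a size equal to an edge into the lower bin.
        counts[bisect_left(SIZE_EDGES, size)] += 1
    total = len(packet_sizes)
    if total == 0:
        return [0.0] * len(counts)  # assumption: no packets gives an all-zero vector
    return [count / total for count in counts]
```

For example, `size_distribution([40, 64, 65, 1500])` yields `[0.5, 0.25, 0.0, 0.0, 0.0, 0.25]`.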

For the other two features: if the network activity contains only TCP packets, Act_label is set to "TCP"; if it contains only UDP packets, Act_label is set to "UDP"; and if it contains both types of packets, Act_label is set to "mix". Finally, to measure how important each network activity is, in terms of time, to the sample as a whole, Act_we is set to the ratio of the network activity's active time to the sample execution time.
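These two remaining features are simple enough to state directly in code. The function and parameter names below are hypothetical; only the three label values and the ratio itself come from the description.

```python
def act_label(protocols):
    """Return "TCP", "UDP" or "mix" for the transport protocols seen in one activity."""
    kinds = {p.upper() for p in protocols}
    if not kinds or not kinds <= {"TCP", "UDP"}:
        raise ValueError(f"unexpected protocols: {sorted(kinds)}")
    return kinds.pop() if len(kinds) == 1 else "mix"


def act_we(activity_active_time, sample_execution_time):
    """Ratio of the activity's active time to the sample's execution time."""
    if sample_execution_time <= 0:
        raise ValueError("sample execution time must be positive")
    return activity_active_time / sample_execution_time
```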

Compared with the behavior portraits proposed in other methods, the comprehensive behavior portraits model provided by the embodiment of the invention has finer, more comprehensive and more universal network behavior portraying capability of the malicious program, can better and more fully learn the behavior characteristics of the malicious program of the complex network, and further obtains more excellent malicious program classifying capability compared with the existing methods and portraits.

Based on any of the above embodiments, step S2 in the method specifically includes:

Specifically, on the basis of the above embodiment, it is necessary to further obtain the overall network behavior characteristics exhibited by the sample, for use as the network behavior signature of each sample. The embodiment of the invention therefore groups all network activity behavior portraits into different clusters by a clustering method, and each cluster represents one main behavior characteristic of the sample. The key to the clustering operation is how to evaluate quantitatively the differences between the behavior portraits of different activities. Because the portrait model provided by the embodiment of the invention contains three different types of behavior features (numerical values, value sets and discrete distributions), the distance between behavior features of the same type must be defined. Let a and b be two values, C and D be two value sets, and E = (e₁, e₂, ..., e_k) and F = (f₁, f₂, ..., f_k) be two discrete distributions of the same length. The following three functions, dis_value, dis_collection and dis_distribution, are defined to measure the distance between values, between value sets, and between discrete distributions, respectively:

wherein dis_distribution is the Hellinger distance.
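As an illustration of the last of the three distances, the following Python sketch computes the Hellinger distance in its standard textbook form. It assumes that both inputs are normalized probability vectors of equal length; the exact expression used in the patent's formula is not reproduced in this text, so this should be read as a sketch rather than the patent's definition.

```python
import math


def dis_distribution(e, f):
    """Hellinger distance between two discrete distributions of equal length.

    Assumes e and f are non-negative and each sums to 1 (an assumption of
    this sketch). The result lies between 0 and 1.
    """
    if len(e) != len(f):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(p) - math.sqrt(q)) ** 2 for p, q in zip(e, f))
    return math.sqrt(squared) / math.sqrt(2)


same = dis_distribution([0.5, 0.5], [0.5, 0.5])      # identical distributions
disjoint = dis_distribution([1.0, 0.0], [0.0, 1.0])  # no overlapping mass
```

Identical distributions give a distance of 0, and distributions with no overlapping mass give the maximum distance of 1, which makes the value convenient to combine with other bounded feature distances.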

Let X and Y be the behavior portraits corresponding to two network activities. The embodiment of the invention further defines the function dis_activity to measure the distance between them:

where dis_Act(X.F_i, Y.F_i) is dis_value(X.F_i, Y.F_i) for 1 ≤ i ≤ 10; dis_Flow(X.F_i, Y.F_i) is defined as the average of dis_collection(X.F_i, Y.F_i) and dis_collection(Y.F_i, X.F_i) for 11 ≤ i ≤ 16; and dis_Pkt(X.F_i, Y.F_i) is equivalent to dis_distribution(X.F_i, Y.F_i) for 17 ≤ i ≤ 20. With this distance function, the embodiment of the invention can obtain the distance matrix between the network activity behavior portraits contained in one network malicious program. Finally, the embodiment of the invention uses the DBSCAN clustering algorithm to group similar portraits into the same cluster, and each resulting cluster is regarded as one network behavior signature of the sample.
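The clustering step can be sketched as follows. This is a minimal, standard-library-only DBSCAN that works directly on a precomputed distance matrix, such as one filled with dis_activity values. The parameters eps and min_pts, the helper name signatures_from_labels, and the decision to discard noise points are assumptions of this sketch; the text does not state them. In practice a library implementation, for example scikit-learn's DBSCAN with metric="precomputed", would more likely be used.

```python
def dbscan_precomputed(dist, eps, min_pts):
    """Cluster points given a symmetric distance matrix.

    Returns one label per point: 0, 1, ... for clusters and -1 for noise.
    """
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"

    def neighbours(i):
        # A point counts as its own neighbour, as in the usual DBSCAN definition.
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels


def signatures_from_labels(portraits, labels):
    """Group portraits by cluster label; noise points are dropped (an assumption)."""
    groups = {}
    for portrait, lab in zip(portraits, labels):
        if lab != -1:
            groups.setdefault(lab, []).append(portrait)
    return list(groups.values())


# Toy distance matrix: portraits 0-2 are close to each other, 3 and 4 are isolated.
dist = [
    [0.0, 0.1, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.1, 0.9, 0.9],
    [0.1, 0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.0],
]
labels = dbscan_precomputed(dist, eps=0.2, min_pts=2)
signatures = signatures_from_labels(["a", "b", "c", "d", "e"], labels)
```

Working from a distance matrix rather than coordinates suits this setting, because the portraits mix values, sets and distributions and have no natural vector representation.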

Based on any of the above embodiments, step S3 in the method specifically includes:

Specifically, the class of a test malicious program is determined from its similarity to training malicious programs of known families, and the similarity between two malicious programs is determined by the similarity between their respective behavior signature sets. Let U and V be two malicious programs and let M and N be their respective behavior signature sets. The embodiment of the invention defines the similarity measurement function between them as:

where m and n are specific behavior signatures in M and N, respectively, and j and k are network activity behavior portraits in m and n, respectively. The idea behind this similarity measurement function is that the similarity of U to V is the sum, over each behavior signature m in the signature set M of U, of the similarity between m and the signature set N of V, and the similarity between m and N is in turn the maximum of the similarities between m and each behavior signature n in N. This maximum similarity is counted only if it is greater than a specified threshold max_threshold, which prevents the similarity between samples from accumulating through a large number of low-similarity behavior signatures. Finally, the similarity between two behavior signatures is determined by the most similar pair of network activity portraits they contain, and its calculation involves not only the distance between the two portraits but also their act_we properties. On this basis, the embodiment of the invention can calculate the similarity of the test malicious program to each training malicious program, and then labels the test malicious program with the category of the training malicious program most similar to it.
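The structure of this similarity calculation (sum over signatures, thresholded maximum, best portrait pair) can be sketched in Python as follows. Because the formula itself is not reproduced in this text, the portrait-pair similarity is passed in as a function; toy_pair_sim, the (value, act_we) portrait representation, and all sample data are hypothetical placeholders, not the patent's definitions.

```python
def signature_similarity(m, n, pair_sim):
    """Similarity of two signatures: their most similar pair of portraits."""
    return max(pair_sim(j, k) for j in m for k in n)


def program_similarity(M, N, pair_sim, max_threshold):
    """Sum, over signatures m in M, of the best match in N if it exceeds the threshold."""
    total = 0.0
    for m in M:
        best = max((signature_similarity(m, n, pair_sim) for n in N), default=0.0)
        if best > max_threshold:
            total += best
    return total


def classify(test_signatures, training, pair_sim, max_threshold):
    """Return the label of the most similar training program.

    `training` is a list of (label, signature_set) pairs.
    """
    best_label, best_score = None, float("-inf")
    for label, signature_set in training:
        score = program_similarity(test_signatures, signature_set, pair_sim, max_threshold)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def toy_pair_sim(j, k):
    """Hypothetical placeholder: portraits are (value, act_we) tuples.

    Closer values and larger act_we weights give a higher similarity.
    """
    (xj, wj), (xk, wk) = j, k
    return min(wj, wk) * max(0.0, 1.0 - abs(xj - xk))


training = [
    ("famA", [[(0.10, 0.5)], [(0.90, 0.4)]]),
    ("famB", [[(0.50, 0.9)]]),
]
test_signatures = [[(0.12, 0.5)], [(0.88, 0.4)]]
predicted = classify(test_signatures, training, toy_pair_sim, max_threshold=0.2)
```

Note that the measure is directional: it sums over the signatures of the test program only, so program_similarity(M, N, ...) and program_similarity(N, M, ...) can differ.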

Based on any of the above embodiments, the embodiments of the present invention have been thoroughly evaluated experimentally using the following data and methods:

The evaluation experiment is carried out on several public data sets containing complex network malicious programs. In particular, a number of network malicious program scenes with rich and variable behaviors are selected from the data sets used in the CTU-13, Stratosphere and PeerRush literature, and the information of the resulting evaluation data set is shown in the following table. To evaluate the class judgment result of a test malicious program, the embodiment of the invention defines two discrimination modes: family-based and scene-based. The family mode checks whether the family attribute in the class label predicted for the test malicious program is consistent with the real family of the sample. The scene mode checks whether the scene attribute in the class label predicted for the test malicious program is consistent with the real scene of the sample. Furthermore, the invention defines the family classification accuracy as the proportion of test malicious programs that are correctly classified under the family discrimination mode among all test malicious programs, and the scene classification accuracy is defined in the same way. Of the two discrimination modes, the family mode is looser and more common, since network malicious programs from different scenes in the test data set may well belong to the same family. However, malicious program samples of the same family that come from different scenes may still exhibit subtle behavior differences due to changes in code, environment, or even the attacker. To further evaluate how well the portrait and classification method of the present invention distinguish fine differences in network behavior, the embodiment of the invention defines the stricter scene discrimination mode. Table 1 gives the evaluation data information used by the network malicious program classification method.

TABLE 1

Experimental evaluation shows that the method has the following advantages:

(1) For different types of malicious program network activity, the comprehensive behavior portrait model provided by the embodiment of the invention combines different types of features from three different levels, and can therefore describe the behavior of each activity in a fine-grained and comprehensive way. In addition, the behavior portrait model is independent of payload content, has strong applicability, and can be used to analyze different types of network malicious programs.

(2) The corresponding classification method has excellent classification performance. Figs. 3 and 4 show its accuracy on the evaluation data under the two discrimination modes. As can be seen from the results, the best family and scene classification accuracies are 1.000 and 0.923, respectively. That is, the embodiment of the invention achieves 100% correct classification on family accuracy, the metric widely used by classification methods and devices, and still achieves a classification performance above 0.9 on scene accuracy, a stricter metric defined by the embodiment of the invention that is better suited to complex network malicious programs. In addition, the embodiment of the invention evaluates the classification results under the different discrimination modes using the F-measure, which combines precision and recall. For a particular malicious program class CAT_i, the programs of that class are taken as positive samples and all other malicious programs as negative samples, which yields the precision Pre_i and the recall Recall_i of CAT_i; the F-measure value of the class is then calculated as Fmscore_i. After the F-measure values of all classes have been obtained under a given discrimination criterion, the overall F-measure value is calculated by the following formula, where |CAT_i| is the number of malicious programs contained in CAT_i under the current discrimination criterion.

(3) Compared with other malicious program behavior portraits, the behavior portrait provided by the embodiment of the invention captures the behavior characteristics of complex network malicious programs better and in finer detail, which gives the classification method of the embodiment better classification performance. For comparison, three other behavior portraits that are also used for malicious program classification (BC, CO, and MAL) were selected and compared with the embodiment of the invention under the same conditions. The comparison results under the different discrimination criteria and metrics are shown in Table 2 (accuracy) and Table 3 (F-measure values). The best family and scene classification accuracies of the BC model are 0.923 and 0.692, and its best family and scene classification F-measure values are 0.905 and 0.594. The best family and scene classification accuracies of the CO model are 0.807 and 0.653, and its best family and scene classification F-measure values are 0.843 and 0.665. Both are inferior to the best results of the present invention, whose best family and scene classification accuracies are 1.000 and 0.923 and whose best F-measure values are 1.000 and 0.906. For the MAL model, the reported accuracy values are 0.923 and 0.897. Except for scene classification accuracy, the performance of the MAL model also falls behind that of the present invention.

TABLE 2

TABLE 3
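The evaluation metrics described above can be illustrated with a short Python sketch. It assumes that each class label is a (family, scene) pair, that Fmscore_i is the balanced F1 score, and that the overall F-measure is the average of the per-class values weighted by |CAT_i|; the patent's own formula is not reproduced in this text, so the weighting is an assumption suggested by the |CAT_i| term. All sample labels are hypothetical.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Proportion of test programs whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def class_fmeasure(true_labels, predicted_labels, cat):
    """F1 score for one class, treating `cat` as positive and all others as negative."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == cat and p == cat for t, p in pairs)
    fp = sum(t != cat and p == cat for t, p in pairs)
    fn = sum(t == cat and p != cat for t, p in pairs)
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * pre * rec / (pre + rec) if pre + rec else 0.0


def overall_fmeasure(true_labels, predicted_labels):
    """Per-class F-measures averaged with weights |CAT_i| (assumed weighting)."""
    sizes = Counter(true_labels)
    weighted = sum(
        size * class_fmeasure(true_labels, predicted_labels, cat)
        for cat, size in sizes.items()
    )
    return weighted / sum(sizes.values())


# Hypothetical (family, scene) labels for four test programs.
truth = [("famA", "s1"), ("famA", "s2"), ("famB", "s3"), ("famB", "s3")]
pred = [("famA", "s1"), ("famA", "s1"), ("famB", "s3"), ("famA", "s1")]

family_acc = accuracy([t[0] for t in truth], [p[0] for p in pred])
scene_acc = accuracy([t[1] for t in truth], [p[1] for p in pred])
family_f = overall_fmeasure([t[0] for t in truth], [p[0] for p in pred])
```

In this toy data the second program is assigned the right family but the wrong scene, so it counts as correct in family mode and incorrect in scene mode, which is why scene accuracy is the stricter of the two.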

The network malicious program classification system provided by the embodiment of the invention is described below. The system described below and the network malicious program classification method described above may be read with reference to each other.

Fig. 5 is a schematic structural diagram of a complex network malicious program classification system according to an embodiment of the present invention, where, as shown in fig. 5, the system includes: a depiction module 51, a signature module 52 and a judgment module 53; wherein:

the depiction module 51 is configured to acquire the complete network traffic of a network malicious program within a preset time, divide the complete network traffic into a plurality of network activities, and depict the plurality of network activities to obtain a network comprehensive behavior portrait model; the signature module 52 is configured to generate behavior signatures from the sample network behavior features obtained from the network comprehensive behavior portrait model, by means of a behavior distance measurement function and a preset clustering algorithm; the judgment module 53 is configured to perform overall similarity calculation on the known malicious network training samples and the unknown test sample based on the behavior signatures, so as to obtain the category attribution of the unknown network malicious program.

Fig. 6 is a schematic diagram of the physical structure of an electronic device. As shown in Fig. 6, the electronic device may include: a processor 610, a communication interface 620, a memory 630, and a communication bus 640, wherein the processor 610, the communication interface 620, and the memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a network malicious program classification method comprising: acquiring the complete network traffic of a network malicious program within a preset time, dividing the complete network traffic into a plurality of network activities, and depicting the plurality of network activities to obtain a network comprehensive behavior portrait model; generating behavior signatures from the sample network behavior features obtained from the network comprehensive behavior portrait model, by means of a behavior distance measurement function and a preset clustering algorithm; and performing overall similarity calculation on the known malicious network training samples and the unknown test sample based on the behavior signatures, to obtain the category attribution of the unknown network malicious program.

Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

In another aspect, embodiments of the present invention further provide a computer program product, including a computer program stored on a non-transitory computer readable storage medium, the computer program including program instructions which, when executed by a computer, enable the computer to perform the network malicious program classification method provided by the above method embodiments, the method including: acquiring the complete network traffic of a network malicious program within a preset time, dividing the complete network traffic into a plurality of network activities, and depicting the plurality of network activities to obtain a network comprehensive behavior portrait model; generating behavior signatures from the sample network behavior features obtained from the network comprehensive behavior portrait model, by means of a behavior distance measurement function and a preset clustering algorithm; and performing overall similarity calculation on the known malicious network training samples and the unknown test sample based on the behavior signatures, to obtain the category attribution of the unknown network malicious program.

In still another aspect, an embodiment of the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the network malicious program classification method provided in the above embodiments, the method including: acquiring the complete network traffic of a network malicious program within a preset time, dividing the complete network traffic into a plurality of network activities, and depicting the plurality of network activities to obtain a network comprehensive behavior portrait model; generating behavior signatures from the sample network behavior features obtained from the network comprehensive behavior portrait model, by means of a behavior distance measurement function and a preset clustering algorithm; and performing overall similarity calculation on the known malicious network training samples and the unknown test sample based on the behavior signatures, to obtain the category attribution of the unknown network malicious program.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. The complex network-oriented malicious program classification method is characterized by comprising the following steps of:

respectively carrying out overall similarity calculation on a known malicious network training sample and an unknown testing sample based on the behavior signature to obtain category attribution of the unknown network malicious program;

The method comprises the steps of obtaining the complete network flow of a network malicious program in a preset time, dividing the complete network flow into a plurality of network activities, and carrying out network activity depiction on the plurality of network activities to obtain a network comprehensive behavior portrait model, and specifically comprises the following steps:

network activity depiction is carried out on the initial network activity division result from the whole data, the data stream, the data packet, the data type mark and the activity time ratio to obtain the network comprehensive behavior portrayal model;

the method for generating the behavior signature by the network comprehensive behavior image model comprises the following steps of:

2. The complex network-oriented malicious program classification method according to claim 1, wherein the dividing the plurality of network traffic into the same class of network activities based on the preset binary group information, removing the network activities meeting the preset definition, and obtaining an initial network activity division result, specifically comprises:

3. The complex network-oriented malicious program classification method according to claim 1, wherein the network activity depiction is performed on the initial network activity division result from the whole data, the data stream, the data packet, the data type mark and the activity time ratio to obtain the network comprehensive behavior portrait model, and specifically comprises:

4. The complex network-oriented malicious program classification method according to claim 1, wherein the overall similarity calculation is performed on the known malicious network training sample and the unknown test sample based on the behavior signature, so as to obtain the category attribution of the unknown network malicious program, specifically including:

5. The complex network-oriented malicious program classification method according to claim 4, wherein the calculating a similarity metric function between the first behavior signature set and the second behavior signature set to obtain the class attribution specifically comprises:

6. A complex network-oriented malicious program classification system, comprising:

the judging module is used for respectively carrying out overall similarity calculation on the known malicious network training sample and the unknown testing sample based on the behavior signature to obtain category attribution of the unknown network malicious program;

7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the complex network oriented malicious program classification method of any one of claims 1 to 5 when the program is executed by the processor.

8. A non-transitory computer readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the complex network-oriented malicious program classification method of any one of claims 1 to 5.