CN114036515B

CN114036515B - Webshell malicious family clustering analysis method

Info

Publication number: CN114036515B
Application number: CN202111255079.9A
Authority: CN
Inventors: 周芳芳; 袁键; 陈茁; 王心远; 吕胜蓝; 范毅伦; 李影; 赵颖
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2021-10-27
Filing date: 2021-10-27
Publication date: 2022-08-16
Anticipated expiration: 2041-10-27
Also published as: CN114036515A

Abstract

The invention discloses a Webshell malicious family clustering analysis method, which relates to the technical field of information security. It comprises the following steps. Step 1: acquiring function call information, parameter values and return value information while the Webshell runs. Step 2: cleaning, splicing and sequencing the function call information. Step 3: vectorizing the function call sequence information from step 2. Step 4: calculating the information entropy of the parameter values and the return values, and ordering them according to the order of the function calls. Step 5: according to func_seq, argv_seq and return_seq obtained in steps 2 and 4, building an RNN model to predict the three sequences respectively and learn the family characteristics of the code. Step 6: mapping the original sequence data and the predicted sequence data into pixel points after minhash processing to form pixel maps. Step 7: superimposing the original pixel map obtained in step 6 on the predicted pixel map and drawing a final pixel map. Step 8: clustering the pixel map obtained in step 7 using the DBSCAN clustering algorithm.

Description

Webshell malicious family clustering analysis method

Technical Field

The invention belongs to the technical field of information security, and particularly relates to a webshell malicious family clustering analysis method.

Background

A Webshell is a command execution environment written in a scripting language. An attacker uploads a script file to a server and hides it among benign files in order to control the server. Webshells have become the primary source of harm to cloud host security. To prevent intrusion and protect the assets of cloud users in real time, an accurate and efficient malicious Webshell detection method is of great importance.

Most traditional Webshell prevention and control measures rely on predefined rules, and creating new rules and updating old ones is always slower than the rate at which Webshell variants appear, so malicious files easily bypass rule-based detection. To address the difficulty of updating rules and the failure to detect new variants in time, researchers have tried detecting malicious Webshells with heuristic algorithms or deep learning models. These methods still require substantial manual effort, such as manually labeling malicious file samples, manually verifying suspicious files on the fuzzy boundary between malicious and benign to reduce false alarms, and manually confirming the appearance of new variants.

Constructing a malicious file family system is one direction of malicious file detection. A complete Webshell family system can effectively reduce the workload of manual assistance in detection, improve the efficiency of discovering new variants, and improve the accuracy and precision of automatic detection. Two obstacles remain. First, Webshells mutate quickly: they can be written in many development languages and are easily altered by techniques such as obfuscation, encryption and packing, which greatly affects the accuracy and efficiency of malicious file detection methods. Second, traditional malicious file detection methods based on API call information overlook the security vulnerabilities introduced by function parameters, and experts lack an effective Webshell malicious family cluster analysis method for detecting, comparing and analyzing malicious Webshells.

Disclosure of Invention

To address the defects and shortcomings of the prior art, the invention aims to provide a Webshell malicious family cluster analysis method.

A Webshell malicious family cluster analysis method, characterized by comprising the following steps:

Step 1: acquiring function call information, parameter values and return value information while the Webshell runs;

Step 2: cleaning, splicing and sequencing the function call information;

Step 3: vectorizing the function call sequence information from step 2;

Step 4: calculating the information entropy of the parameter values and the return values, and ordering them according to the order of the function calls;

Step 5: according to func_seq, argv_seq and return_seq obtained in steps 2 and 4, building an RNN model to predict the three sequences respectively and learn the family characteristics of the code;

Step 6: mapping the original sequence data and the predicted sequence data into pixel points after minhash processing to form pixel maps;

Step 7: superimposing the original pixel map obtained in step 6 on the predicted pixel map and drawing a final pixel map;

Step 8: clustering the pixel map obtained in step 7 using the DBSCAN clustering algorithm.

Preferably, each Webshell sample in step 1 can be abstracted as multi-attribute event sequence data $x = \{x_1, x_2, \ldots, x_N\}$, where $x_i$ represents the operation information set of the i-th call made by the Webshell. The information set of each call can be divided into basic attribute information and extended attribute information, that is, $x_i$ can be described as a pair $\langle Basic^i, Extended^i \rangle$. The basic attribute information (Basic) comprises a calling function name (caller) and a called function name (callee); the extended attribute information (Extended) comprises a parameter value (argv), a return value (return) and taint information (taint) of the called function. For a Webshell sample $x_i$, the calling function, the called function, the parameter value, the return value and the taint information are each denoted by their own symbol.

Preferably, step 2 comprises the steps of:

Step 2.1: count the occurrences of all calling functions and called functions in a sample. The functions that occur most often are taken as core functions; the remaining functions are non-core functions and are treated as one and the same function in the subsequent encoding.

Step 2.2: following the n-gram idea from natural language processing, the calling function and the called function involved in the same function call are concatenated as strings, and the concatenated string is regarded as a minimum unit, item.

Step 2.3: the items obtained in step 2.2 are ordered according to the order of the function calls to obtain the function call sequence information of each Webshell sample. The function call sequence information in one sample can then be expressed as $\text{func\_seq} = \{item_1, item_2, \ldots, item_i, \ldots, item_n\}$. The purpose of the n-gram is to preserve the sequence information of the function calls.
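Steps 2.1 to 2.3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `build_func_seq`, the `"|"` separator and the `"<OTHER>"` placeholder for non-core functions are assumptions made for illustration, and the input is assumed to be a list of (caller, callee) pairs already in call order.

```python
from collections import Counter


def build_func_seq(calls, n_core):
    """Turn (caller, callee) pairs, given in call order, into a list of items.

    The n_core most frequent function names are kept as core functions;
    every other name is collapsed into one shared placeholder (assumption).
    """
    counts = Counter()
    for caller, callee in calls:
        counts[caller] += 1
        counts[callee] += 1
    core = {name for name, _ in counts.most_common(n_core)}

    def encode(name):
        return name if name in core else "<OTHER>"

    # Each item is the caller and callee of one call joined into a single
    # string; list order preserves the order of the calls.
    return [encode(caller) + "|" + encode(callee) for caller, callee in calls]


calls = [("main", "eval"), ("main", "base64_decode"),
         ("main", "eval"), ("helper", "strrev")]
print(build_func_seq(calls, n_core=2))
```

Collapsing the non-core names keeps the vocabulary of items bounded, which matters for the word vector training in step 3.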

Preferably, step 3 comprises the steps of:

step 3.1: similar to the idea of vectorizing a short text by using a word vector model in natural language processing, sequence information of each sample in the Webshell is equivalent to the short text, each minimum unit item is equivalent to a word in the short text, and function calling sequences of all samples form a corpus. Since each item is a "word", no word segmentation or the like is required.

Step 3.2: choose the word vector dimension of the CBOW model, input the function call sequence information obtained in step 2.3 into the CBOW model as the corpus for training, use the trained CBOW model to calculate the vector $V_{item}$ corresponding to each item, and save the parameter weights of the CBOW model.

Step 3.3: using the vectors $V_{item}$ obtained in step 3.2, for the function call sequence $\text{func\_seq} = \{item_1, item_2, \ldots, item_i, \ldots, item_n\}$ of each sample, the corresponding vectors $V_{sentence} = [V_1, V_2, \ldots, V_i, \ldots, V_n]$ are summed and averaged to obtain the vector corresponding to the function call sequence, $U_{func} = \frac{1}{n}\sum_{i=1}^{n} V_i$.
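The averaging in step 3.3 can be sketched as follows. The lookup table `item_vectors` is a hypothetical stand-in for the output of a trained CBOW model, and the two-dimensional vectors are toy values chosen only to keep the example readable.

```python
def sequence_vector(func_seq, item_vectors):
    """Average the item vectors of one sample's function call sequence."""
    vectors = [item_vectors[item] for item in func_seq]
    if not vectors:
        raise ValueError("empty function call sequence")
    dim = len(vectors[0])
    # Component-wise sum divided by the number of items in the sequence.
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Hypothetical item vectors; a real model would supply these.
item_vectors = {"main|eval": [1.0, 2.0], "main|<OTHER>": [3.0, 6.0]}
print(sequence_vector(["main|eval", "main|<OTHER>", "main|eval"], item_vectors))
```

Averaging yields a fixed-length vector regardless of how many calls a sample makes, at the cost of discarding call order within the averaged vector.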

Preferably, step 4 comprises the steps of:

Step 4.1: the parameter values and return values are classified. If a parameter value or return value is executable code, the function name in it is extracted; if it does not contain code, its information entropy is calculated. The parameter values and return values are then divided into categories according to function name or information entropy.

Step 4.2: the parameter values and the return values are each ordered according to the order of the function calls, giving a parameter value sequence argv_sequence and a return value sequence return_sequence.

Preferably, step 5 comprises the steps of:

Step 5.1: construct the RNN sample sets. In one sample $x_i$, each element of the three types of sequence information is regarded as a window, and the windows are arranged side by side. A sliding window of size K is set, in which the M-th element is the data to be predicted; each time, the M-th element is predicted from the preceding M-1 elements and the following K-M elements in the sliding window. Each time the sliding window moves forward by one window, one training sample t is formed. Performing these operations on all samples yields a function call sequence sample set Tf, a parameter sequence sample set Ta and a return value sequence sample set Tr.

Step 5.2: build a unidirectional LSTM network. The number of input neurons of the LSTM network is (K-1) × P and the output dimension is P, where P is the vector dimension of the element to be predicted, for example 255 for the word vector of a function. Using Tf, Ta and Tr obtained in step 5.1, part of the data is randomly drawn from each of the three types of samples to train three different LSTMs.

Step 5.3: predict on the Tf, Ta and Tr sample sets using the LSTM networks trained in step 5.2, and replace the data at the corresponding positions in the original sequences with the predicted results to obtain the predicted sequences $\text{func\_seq}^{pre}$, $\text{argv\_seq}^{pre}$ and $\text{return\_seq}^{pre}$; from $\text{func\_seq}^{pre}$, the corresponding vector is calculated according to step 3.
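The sample construction of step 5.1 can be sketched as below. The function name `make_windows` is hypothetical, and sequences shorter than K are assumed to yield no samples, since the text does not say how they are handled.

```python
def make_windows(seq, k, m):
    """Slide a window of size k over seq, one element at a time.

    For every window, the m-th element (1-based) is the prediction target
    and the remaining k - 1 elements form the context.
    """
    if not 1 <= m <= k:
        raise ValueError("m must lie in 1..k")
    samples = []
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        target = window[m - 1]
        context = window[:m - 1] + window[m:]  # m-1 before, k-m after
        samples.append((context, target))
    return samples


samples = make_windows(list(range(20)), k=16, m=9)
print(len(samples), samples[0])
```

Each context holds K-1 elements, which is consistent with the (K-1) × P input size stated for the LSTM in step 5.2.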

Preferably, step 6 comprises the steps of:

Step 6.1: take $\text{argv\_seq} = \{argv_1, argv_2, \ldots, argv_i, \ldots, argv_n\}$ as an example, where $argv_i$ is the parameter category assigned in step 4. Based on the minHash method, the signature corresponding to $argv_i$, i.e. several hash values, is obtained; the same operation is performed on func_seq and return_seq to obtain their hash values. The hash values calculated from the three sequences are grouped into hash value tuples $\langle x_i, y_i, z_i \rangle$, where $x_i$, $y_i$ and $z_i$ in the same tuple are the hash values at the same position of argv_seq, return_seq and func_seq respectively. $x_i$ and $y_i$ are taken modulo m and $z_i$ is taken modulo k, so that $x_i$ and $y_i$ map to integers in $[0, m)$ and $z_i$ maps to an integer in $[0, k)$. $x_i$ and $y_i$ are the x-axis and y-axis coordinates of the pixel point, and $z_i$ is its gray value. Notably, func_seq and $\text{func\_seq}^{pre}$ do not need the minhash operation: the vectors obtained for them in step 3 and step 5 ($U_{func}$ and its predicted counterpart) are used directly, and the modulo operation is applied to them directly.

Step 6.2: draw the original pixel map and the predicted pixel map from the pixel points obtained from the original sequences and the predicted sequences. Taking the original sequences func_seq, argv_seq and return_seq as an example, if several pixel points fall on the same position, the new gray value is added to the gray value already at that position and the average is taken.
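A rough sketch of step 6 follows. Several details are assumptions rather than statements from the text: the signature is computed over character 3-shingles with salted SHA-1 hashes, a pixel map is stored as a dictionary from (x, y) to gray value, and colliding pixels take the mean of all gray values that land on the same position. The names `minhash_signature` and `draw_pixel_map` are hypothetical.

```python
import hashlib


def minhash_signature(text, num_hashes=4, shingle=3):
    """Return num_hashes minimum hash values over the shingles of text."""
    shingles = {text[i:i + shingle]
                for i in range(max(1, len(text) - shingle + 1))}
    signature = []
    for seed in range(num_hashes):
        # One salted hash per seed stands in for one hash function.
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in shingles))
    return signature


def draw_pixel_map(triples, m=32, k=255):
    """Map (x_hash, y_hash, z_hash) triples to {(x, y): gray value}."""
    sums, counts = {}, {}
    for x_hash, y_hash, z_hash in triples:
        pos = (x_hash % m, y_hash % m)
        sums[pos] = sums.get(pos, 0) + z_hash % k
        counts[pos] = counts.get(pos, 0) + 1
    # Positions hit more than once receive the average gray value.
    return {pos: sums[pos] / counts[pos] for pos in sums}


print(draw_pixel_map([(33, 2, 300), (1, 34, 10), (5, 5, 254)]))
```

How the hash values of the three sequences are paired into triples is left to the caller here; one possible choice is to take the hash values at the same index of the three signatures for each call position.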

Preferably, in step 7 the pixel map corresponding to the original sequences is superimposed on the pixel map of the predicted sequences. If neither pixel map has a pixel point at a given position, the gray value at that position after superposition is 0; if exactly one of the two pixel maps has a pixel point there, the gray value after superposition is the value of that pixel point; if both have a pixel point there, the gray value after superposition is the average of the two gray values.
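The three cases of the superposition rule can be written out as a short sketch. It assumes each pixel map is a dictionary from (x, y) to gray value in which a missing key means "no pixel point"; that representation and the name `overlay` are illustrative choices, not part of the described method.

```python
def overlay(original, predicted, size=32):
    """Superimpose two sparse pixel maps into a dense size x size grid."""
    result = [[0.0] * size for _ in range(size)]  # no pixel anywhere -> 0
    for pos in set(original) | set(predicted):
        x, y = pos
        if pos in original and pos in predicted:
            # Both maps have a pixel here: average the two gray values.
            result[y][x] = (original[pos] + predicted[pos]) / 2
        elif pos in original:
            result[y][x] = original[pos]
        else:
            result[y][x] = predicted[pos]
    return result


grid = overlay({(0, 0): 10, (1, 0): 20}, {(1, 0): 40, (2, 1): 7}, size=3)
print(grid)
```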

Preferably, the operation method in step 8 is to compress the pixel map to 2 dimensions by using a t-SNE dimension reduction method, and then to set two parameters required by the DBSCAN clustering algorithm: radius eps and the number minpts of the minimum points in the neighborhood; and clustering the data after dimensionality reduction to obtain a clustered result of the DBSCAN.
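To show what the two parameters eps and minpts control, here is a compact pure-Python DBSCAN sketch operating on points that are assumed to be already reduced to 2 dimensions. It is written for illustration only (quadratic neighbor search, requires Python 3.8+ for `math.dist`); in practice a library implementation such as scikit-learn's `DBSCAN`, with its `eps` and `min_samples` parameters, would normally be used.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within distance eps of point i, including i itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels


points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 10)]
print(dbscan(points, eps=0.5, min_pts=3))
```

A larger eps or a smaller minpts merges more points into clusters; the isolated point at (10, 10) is reported as noise because its neighborhood contains fewer than minpts points.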

Compared with the prior art, the invention has the beneficial effects that: the invention clusters Webshell by using function calling information, parameter value information and return value information of Webshell, belongs to a dynamic detection method of Webshell malicious family based on multi-source data, and compared with a static analysis method and some single-source dynamic detection methods, the method realizes fusion of multi-source data, is more representative in representing malicious family, can improve the discovery efficiency of new varieties at the clustering level, and indirectly improves the accuracy and precision of automatic detection. And the workload of a manual assistance part in the detection can be effectively reduced.

Drawings

For ease of illustration, the invention is described in detail by the following detailed description and the accompanying drawings.

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a pixel map obtained by Webshell after feature processing;

fig. 3 is a two-dimensional scatter plot of clustering results after clustering using the DBSCAN clustering algorithm and assigning different colors to each cluster.

Detailed Description

In order that the objects, aspects and advantages of the invention will become more apparent, the invention will be described by way of example only, and with reference to the accompanying drawings. It is to be understood that this description is made only by way of example and not as a limitation on the scope of the invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.

It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.

As shown in fig. 1, a Webshell malicious family cluster analysis method includes the following steps:

Step 1: acquire the function call information, parameter values, return value information and so on while the Webshell runs. The data are the Webshell data of a single large enterprise user from September 13, 2020 to October 13, 2020. Each Webshell data sample can be divided into basic attribute information (Basic) and extended attribute information (Extended), where the basic attribute information is also commonly called opcode information. The basic attribute information comprises a calling function name (caller) and a called function name (callee); the extended attribute information comprises a parameter value (argv), a return value (return) and taint information (taint) of the called function. For a Webshell sample $x_i$, the calling function, the called function, the parameter value, the return value and the taint information are each denoted by their own symbol.

Table 1 is an example of a sample:

Table 1. Webshell sample example

Step 2: the function call information is cleaned, spliced and sequenced as follows:

Step 2.1: from the function call information obtained in step 1, count the occurrences of all calling functions and called functions in a sample. The 254 functions with the most occurrences are taken as core functions; the remaining functions are non-core functions and are treated as one and the same function in the subsequent encoding.

Step 2.2: following the n-gram idea from natural language processing, with n set to 2 (a bi-gram), the calling function and the called function involved in the same function call are concatenated as strings, and the concatenated string is regarded as a minimum unit, item.

Step 2.3: the items obtained in step 2.2 are ordered according to the sequence index of the function calls to obtain the function call sequence information of each Webshell sample. The function call sequence information in one sample can then be expressed as $\text{func\_seq} = \{item_1, item_2, \ldots, item_i, \ldots, item_n\}$.

Step 3: vectorize the function call sequence information from step 2.3.

Step 3.2: design a CBOW model with a word vector dimension of 255 and a distance of 5 between the target word and its context, input the function call sequence information of all samples obtained in step 2.3 into the CBOW model as the corpus for training, and use the trained CBOW model to calculate the vector $V_{item}$ corresponding to each item.

Step 4: order the parameter values and return values according to the order of the function calls, and calculate their information entropy.

Step 4.1: the parameter values and the return values are each ordered according to the order of the function calls, giving a parameter value sequence argv_sequence and a return value sequence return_sequence.

Step 4.2: the parameter values and return values are classified. If a parameter value or return value is executable code, its function name is extracted; if it does not contain code, its information entropy is calculated. The parameter values and return values are divided according to function name or information entropy: the values are divided equally into 50 intervals according to their information entropy, and each function name forms one category. This yields $U_{argv}$ and $U_{return}$ corresponding to argv_sequence and return_sequence.
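The entropy-based classification of step 4.2 might look like the sketch below. The text fixes 50 equal intervals but not the range they cover, so the upper bound of 8 bits (per-character Shannon entropy) is an assumption, as are the helper names and the idea that the function name of an executable value has already been extracted by some other means.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Per-character Shannon entropy of text, in bits."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(text).values())


def entropy_bin(entropy, max_entropy=8.0, n_bins=50):
    """Index of the equal-width interval containing entropy (assumed range)."""
    width = max_entropy / n_bins
    return min(int(entropy / width), n_bins - 1)


def categorize(value, function_name=None):
    """One category per function name, otherwise one per entropy interval."""
    if function_name is not None:  # value was recognized as executable code
        return "func:" + function_name
    return "entropy:%d" % entropy_bin(shannon_entropy(value))


print(categorize("ab"), categorize("eval($x)", function_name="eval"))
```

Binning turns a continuous entropy into a small discrete vocabulary, so that parameter and return values can be treated as sequence symbols in the same way as function call items.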

Step 5: according to func_seq, argv_sequence and return_sequence obtained in steps 2 and 4, build an RNN model to predict the three sequences respectively and learn the family characteristics of the code.

Step 5.1: construct the RNN sample sets. In one sample $x_i$, each element of the three types of sequence information is regarded as a window, and the windows are arranged side by side. A sliding window of size 16 is set, in which the 9th element is the data to be predicted; each time, the 9th element is predicted from the first 8 elements and the last 7 elements in the sliding window. Each time the sliding window moves forward by one window, one training sample t is formed. Performing these operations on all samples yields a function call sequence sample set Tf, a parameter sequence sample set Ta and a return value sequence sample set Tr.

Step 5.2: build a unidirectional LSTM network. The number of input neurons of the LSTM network is 15 × P and the output dimension is P, where P is the vector dimension of the element to be predicted, for example 255 for the word vector of a function. Using Tf, Ta and Tr obtained in step 5.1, 1/5 of the data is randomly drawn from each of the three types of samples to train three different LSTMs.

Step 5.3: predict on the Tf, Ta and Tr sample sets using the LSTM networks trained in step 5.2, and replace the data at the corresponding positions in the original sequences with the predicted results to obtain the predicted sequences $\text{func\_seq}^{pre}$, $\text{argv\_seq}^{pre}$ and $\text{return\_seq}^{pre}$; from $\text{func\_seq}^{pre}$, the corresponding vector is calculated according to step 3.

Step 6: and mapping the original sequence data and the predicted sequence data into pixel points after minhash processing to form a pixel map.

Step 6.1: take $\text{argv\_seq} = \{argv_1, argv_2, \ldots, argv_i, \ldots, argv_n\}$ as an example, where $argv_i$ is the parameter category assigned in step 4. Based on the minHash method, the signature corresponding to $argv_i$, i.e. several hash values, is obtained; the same operation is performed on func_seq and return_seq to obtain their hash values. The hash values calculated from the three sequences are grouped into hash value tuples $\langle x_i, y_i, z_i \rangle$, where $x_i$, $y_i$ and $z_i$ in the same tuple are the hash values at the same position of argv_seq, return_seq and func_seq respectively. $x_i$ and $y_i$ are taken modulo 32 and $z_i$ is taken modulo 255, so that $x_i$ and $y_i$ map to integers in $[0, 32)$ and $z_i$ maps to an integer in $[0, 255)$. $x_i$ and $y_i$ are the x-axis and y-axis coordinates of the pixel point, and $z_i$ is its gray value. Notably, func_seq and $\text{func\_seq}^{pre}$ do not need the minhash operation: the vectors obtained for them in step 3 and step 5 ($U_{func}$ and its predicted counterpart) are used directly, and the modulo operation is applied to them directly.

Step 6.2: and drawing an original pixel map and a predicted pixel map by using pixel points obtained by the original sequence and the predicted sequence. Taking the original sequences func _ seq, argv _ seq and return _ seq as examples, if the pixels with the same position appear, adding the corresponding gray value on the basis of the gray value of the original pixel and averaging.

Step 7: superimpose the original pixel map obtained in step 6 on the predicted pixel map and draw the final pixel map. The pixel map corresponding to the original sequences is superimposed on the pixel map of the predicted sequences. If neither pixel map has a pixel point at a given position, the gray value at that position after superposition is 0; if exactly one of the two pixel maps has a pixel point there, the gray value after superposition is the value of that pixel point; if both have a pixel point there, the gray value after superposition is the average of the two gray values.

Step 8: cluster the pixel map obtained in step 7 using the DBSCAN clustering algorithm. First, PCA reduces the data to 50-100 dimensions; then the t-SNE dimension reduction method compresses the pixel map to 2 dimensions; then the two parameters required by the DBSCAN clustering algorithm are set: the radius eps and the minimum number of points in the neighborhood, minpts. The dimension-reduced data are clustered to obtain the DBSCAN clustering result.

Experimental evaluation. The experimental data set consists of 5111 malicious Webshell samples provided by a large enterprise user; manual labeling of the data set yields 130 malicious family types. Clustering quality is evaluated with two metrics, accuracy (acc) and NMI, where accuracy is calculated using the Hungarian maximum matching algorithm. The experiments search for the best clustering result with the number of clusters in the range [50, 130], and the clustering result for 130 clusters is also given.
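For reference, NMI can be computed from two label lists as in the sketch below. The text does not say which normalization is used; this version divides the mutual information by the arithmetic mean of the two label entropies, which is one common convention and should be read as an assumption.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(true_labels, pred_labels):
    """Normalized mutual information between two labelings of the same items."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    mi = 0.0
    for (t, p), c in joint.items():
        mi += (c / n) * math.log(c * n / (true_counts[t] * pred_counts[p]))
    denom = (label_entropy(true_labels) + label_entropy(pred_labels)) / 2
    # Both labelings are a single cluster: treat them as identical.
    return mi / denom if denom > 0 else 1.0


print(nmi([0, 0, 1, 1], [1, 1, 0, 0]), nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

NMI is invariant to how cluster ids are numbered, which is why it needs no matching step, whereas accuracy first requires the Hungarian algorithm to align predicted clusters with the labeled families.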

In experiment 1, steps 1 through 8 are carried out. In step 6, $x_i$ and $y_i$ are each taken modulo 32 to obtain the x-axis and y-axis coordinates of the scatter points, so the pixel map is 32 × 32; fig. 2 is an example of the pixel map. In step 8, a PCA + t-SNE dimension reduction is used: PCA first reduces the data to 100 dimensions, and t-SNE then compresses the pixel map to 2 dimensions. Finally, the two DBSCAN parameters, the radius eps and the minimum number of points in the neighborhood minpts, are traversed over the ranges [0.0001, 2] and [2, 20] respectively to find the best combination. The resulting clustering accuracy is 58.8%, the NMI is 0.793, and the number of clusters k is 57.

In experiment 2, the steps are essentially the same as in experiment 1, except that in step 8 the traditional k-means clustering algorithm is used with k set to 130, i.e. equal to the number of true classes. The resulting accuracy is 56.8% and the NMI is 0.778.

In experiment 3, the steps are essentially the same as in experiment 1, except that $x_i$ and $y_i$ in step 6 are each taken modulo 128, so the pixel map is 128 × 128. The final clustering accuracy is 52.7%, the NMI is 0.787, and k is 67. Compared with experiment 1, the clustering accuracy of experiment 3 is lower. A possible reason is that the pixel map is too large, which makes the pixel matrix too sparse and degrades the final clustering result.

In experiment 4, the steps are essentially the same as in experiment 1, except that the dimension reduction in step 8 is omitted and clustering is performed directly. The resulting accuracy is 58.0%, the NMI is 0.808, and k is 65. The clustering accuracy of experiment 4 is close to that of experiment 1, which indicates that in this method dimension reduction does not noticeably improve clustering accuracy.

The experimental conditions are listed in table 2. Overall, experiments 1-4 are comparative experiments performed with alternative parameters and alternative procedures.

Table 2. Clustering experiment conditions

From table 2 and fig. 3, it is concluded that the beneficial effects of the present invention are: the invention clusters Webshell by using function calling information, parameter value information and return value information of Webshell, belongs to a dynamic detection method of Webshell malicious family based on multi-source data, and compared with a static analysis method and some single-source dynamic detection methods, the method realizes fusion of multi-source data, is more representative in representing malicious family, can improve the discovery efficiency of new varieties at the clustering level, and indirectly improves the accuracy and precision of automatic detection. And the workload of a manual assistance part in detection can be effectively reduced.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Furthermore, it should be understood that although the present description refers to embodiments, not every embodiment may contain only a single embodiment, and such description is for clarity only, and those skilled in the art should integrate the description, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.

Claims

1. A Webshell malicious family cluster analysis method, characterized by comprising the following steps:

Step 2: cleaning, splicing and sequencing the function call information;

Step 3: vectorizing the function call information from step 2;

Step 5: according to the function call sequence func_seq, the parameter value sequence argv_seq and the return value sequence return_seq obtained in steps 2 and 4, building an RNN model to predict the three sequences respectively and learn the characteristics of the code family;

Step 5.1: constructing an RNN sample set; in one sample $x_i$, each element of the three types of sequence information is regarded as a window, and the windows are arranged side by side; a sliding window of size K is set, in which the M-th element is the data to be predicted, and each time the M-th element is predicted from the preceding M-1 elements and the following K-M elements in the sliding window; each time the sliding window moves forward, one training sample t is formed, and performing this operation on all samples yields a function call sequence sample set Tf, a parameter value sequence sample set Ta and a return value sequence sample set Tr;

step 5.2: building a unidirectional LSTM network, wherein the number of input neurons of the LSTM network is (K-1) × P, the output dimension of the LSTM network is P, the P is the vector dimension of the element to be predicted, and the three different LSTMs are trained by respectively randomly extracting partial data from the three types of samples by utilizing Tf, Ta and Tr obtained in the step 5.1;

Step 5.3: predicting on the Tf, Ta and Tr sample sets using the LSTM networks trained in step 5.2, and replacing the data at the corresponding positions in the original sequences with the predicted results to obtain the predicted sequences $\text{func\_seq}^{pre}$, $\text{argv\_seq}^{pre}$ and $\text{return\_seq}^{pre}$, and calculating the corresponding vector from $\text{func\_seq}^{pre}$ according to step 3;

Step 6: mapping data in the original sequence in the step 5 and the predicted sequence data into pixel points after minhash processing to form a predicted pixel map;

Step 7: superimposing the predicted pixel map obtained in step 6 on the pixel map formed from the data in the original sequences to draw a final pixel map;

Step 8: clustering the pixel map obtained in step 7 using the DBSCAN clustering algorithm.

2. The Webshell malicious family cluster analysis method of claim 1, wherein: in step 1, each Webshell sample can be abstracted as multi-attribute event sequence data $x = \{x_1, x_2, \ldots, x_N\}$, where $x_i$ represents the operation information set of the i-th call made by the Webshell; the information set of each call can be divided into basic attribute information and extended attribute information, that is, $x_i$ can be described as a pair $\langle Basic^i, Extended^i \rangle$; the basic attribute information Basic comprises a calling function name and a called function name, and the extended attribute information Extended comprises a parameter value argv, a return value return and taint information; for a Webshell sample $x_i$, the calling function, the called function, the parameter value, the return value and the taint information are each denoted by their own symbol.

3. The webshell malicious family cluster analysis method of claim 1, wherein: the step 2 comprises the following steps:

step 2.1: counting the occurrences of all calling functions and called functions in a sample, and taking the functions that occur more often as core functions, the remaining functions being non-core functions that are treated as one and the same function in the subsequent encoding;

step 2.2: concatenating, as character strings, the calling function and the called function involved in the same function call, the concatenated string being regarded as a minimum unit item;

step 2.3: ordering the items obtained in step 2.2 by function call order to obtain the function call sequence information of each Webshell sample; the function call sequence information of one sample is then denoted func_seq = {item_1, item_2, …, item_i, …, item_n}; the purpose of the n-gram is to preserve the order information of the function calls.
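By way of non-limiting illustration, steps 2.1 to 2.3 could be sketched as follows. The threshold `min_count`, the placeholder name `OTHER` for non-core functions and the `->` separator are assumptions of this sketch; the claim only states that the more frequent functions are core functions and that non-core functions are treated as the same function.

```python
from collections import Counter


def build_func_seq(calls, min_count=2, placeholder="OTHER", sep="->"):
    """Build func_seq from (calling function, called function) pairs in call order."""
    # Step 2.1: count how often each function appears as caller or callee.
    counts = Counter()
    for caller, callee in calls:
        counts[caller] += 1
        counts[callee] += 1
    core = {name for name, c in counts.items() if c >= min_count}

    def normalise(name):
        # All non-core functions collapse into one placeholder name.
        return name if name in core else placeholder

    # Step 2.2: concatenate caller and callee into one item.
    # Step 2.3: keep the items in call order.
    return [normalise(a) + sep + normalise(b) for a, b in calls]


calls = [
    ("main", "base64_decode"),
    ("main", "eval"),
    ("eval", "system"),
    ("main", "strrev"),
]
func_seq = build_func_seq(calls)
```

In this example only `main` and `eval` occur at least twice, so every other function name is replaced by the placeholder before concatenation.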

4. The webshell malicious family cluster analysis method of claim 3, wherein: the step 3 comprises the following steps:

step 3.1: treating the function call sequence information of each Webshell sample as a short text and each minimum unit item as a word in that text, the function call sequences of all samples forming a corpus; since each item is already a word, no word segmentation is needed;

step 3.2: choosing the word vector dimension of the CBOW model, inputting the function call sequence information obtained in step 2.3 into the CBOW model as the corpus for training, computing the vector V_item corresponding to each item with the trained CBOW model, and saving the parameter weights of the CBOW model.

5. The webshell malicious family cluster analysis method of claim 4, wherein: step 4 comprises the following steps:

step 4.1: classifying the parameter values and the return values: if a parameter value or return value is runnable code, extracting the function name it contains; if it contains no code, calculating its information entropy; and dividing the parameter values and the return values according to the function name or the information entropy;

step 4.2: ordering the parameter values and the return values respectively by function call order to obtain the parameter value sequence argv_seq and the return value sequence return_seq.
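By way of non-limiting illustration, step 4.1 could be sketched as follows. The regular expression used to decide whether a value contains code, the `func:` and `entropy:` label prefixes and the bucket width of 1 bit are assumptions of this sketch; the claim only requires that values be divided by function name or by information entropy.

```python
import math
import re
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the character distribution of text, in bits."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


# Assumed heuristic: an identifier followed by "(" is treated as a call.
CALL_PATTERN = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)\s*\(")


def encode_value(value, bucket_width=1.0):
    """Label a parameter or return value by function name or entropy bucket."""
    match = CALL_PATTERN.search(value)
    if match:
        return "func:" + match.group(1)
    bucket = int(shannon_entropy(value) // bucket_width)
    return "entropy:" + str(bucket)


# Step 4.2: values are encoded in function call order.
argv_seq = [encode_value(v) for v in ["system('id')", "aaaa", "abcd"]]
```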

6. The webshell malicious family cluster analysis method of claim 5, wherein: step 6 comprises the following steps:

step 6.1: for argv_seq = {argv_1, argv_2, …, argv_i, …, argv_n}, obtaining the signature corresponding to argv_i, namely a plurality of hash values, based on the minHash method, and performing the same operation on func_seq and return_seq to obtain the corresponding hash values; grouping the hash values computed from the three different sequences to obtain hash value tuples <xi, yi, zi>, wherein xi, yi and zi in the same tuple come respectively from the hash values at the same positions of argv_seq, return_seq and func_seq; taking xi and yi modulo m and taking zi modulo k, i.e. mapping xi and yi to integers in [0, m) and zi to an integer in [0, k), wherein xi and yi correspond respectively to the x-axis and y-axis coordinates of the pixel point and zi is its gray value; func_seq and func_seq^pre do not undergo the minhash operation, the vector U_func obtained in step 3 and the corresponding predicted vector obtained in step 5 being used directly in the modulo operation;

step 6.2: drawing an original pixel map and a predicted pixel map from the pixel points obtained from the original sequences and the predicted sequences respectively; if pixel points from the original sequences func_seq, argv_seq and return_seq fall on the same position, adding their gray values to the gray value of the existing pixel point and taking the average.
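By way of non-limiting illustration, the minHash and modulo mapping of steps 6.1 and 6.2 could be sketched as follows. The character shingle size, the SHA-1 based hash family, the number of hash functions and the values m = 64 and k = 256 are assumptions of this sketch. For simplicity all three sequences are passed through minHash here; the variant in which func_seq uses its vectors directly is not shown.

```python
import hashlib


def shingles(text, size=2):
    """Character shingles of text (the whole text if it is too short)."""
    if len(text) <= size:
        return {text}
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def minhash_signature(text, num_hashes=4):
    """One minimum hash value per seeded hash function."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return signature


def to_pixels(argv_seq, return_seq, func_seq, m=64, k=256, num_hashes=4):
    """Map three aligned sequences to a pixel map {(x, y): gray value}."""
    sums, counts = {}, {}
    for a, r, f in zip(argv_seq, return_seq, func_seq):
        xs = minhash_signature(a, num_hashes)
        ys = minhash_signature(r, num_hashes)
        zs = minhash_signature(f, num_hashes)
        # Hash values at the same signature position form one tuple <xi, yi, zi>.
        for x, y, z in zip(xs, ys, zs):
            pos = (x % m, y % m)
            sums[pos] = sums.get(pos, 0) + z % k
            counts[pos] = counts.get(pos, 0) + 1
    # Step 6.2: pixels landing on the same position are averaged.
    return {pos: sums[pos] / counts[pos] for pos in sums}


pixels = to_pixels(["a1", "b2"], ["ok", "err"], ["main->eval", "eval->OTHER"])
```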

7. The webshell malicious family cluster analysis method of claim 6, wherein: the operation of step 7 is to superimpose the pixel map corresponding to the original sequences and the pixel map of the predicted sequences; if neither pixel map has a pixel point at a given position, the gray value of that position after superposition is 0; if exactly one of the two pixel maps has a pixel point at the position, the gray value after superposition is the value of that pixel point; if both pixel maps have a pixel point at the position, the gray value after superposition is the average of the gray values of the two pixel points.
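By way of non-limiting illustration, the superposition rule of claim 7 could be sketched as follows. Representing a pixel map as a dictionary from (x, y) to gray value, with absent keys meaning "no pixel point", is an assumption of this sketch, as are the names `superimpose` and `gray_at`.

```python
def superimpose(original, predicted):
    """Combine two pixel maps given as {(x, y): gray value} dictionaries."""
    merged = {}
    for pos in original.keys() | predicted.keys():
        if pos in original and pos in predicted:
            # Both maps have a pixel here: average the two gray values.
            merged[pos] = (original[pos] + predicted[pos]) / 2
        elif pos in original:
            merged[pos] = original[pos]
        else:
            merged[pos] = predicted[pos]
    return merged


def gray_at(pixel_map, pos):
    """Positions with no pixel point in either map have gray value 0."""
    return pixel_map.get(pos, 0)


original = {(0, 0): 100, (1, 2): 40}
predicted = {(0, 0): 50, (3, 3): 7}
final_map = superimpose(original, predicted)
```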

8. The webshell malicious family cluster analysis method of claim 7, wherein: the operation of step 8 is to compress the pixel maps to 2 dimensions using the t-SNE dimensionality reduction method, then to set the two parameters required by the DBSCAN clustering algorithm, namely the radius eps and the minimum number of points minpts in the neighborhood, and to cluster the dimension-reduced data to obtain the DBSCAN clustering result.
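By way of non-limiting illustration, the role of eps and minpts in the clustering of claim 8 could be sketched with the minimal DBSCAN below. The input is assumed to be already reduced to 2 dimensions, so the t-SNE step is omitted, and the sample points are invented for the example. In practice a library implementation (for example scikit-learn's `TSNE` and `DBSCAN`) would likely be used instead.

```python
import math


def dbscan(points, eps, minpts):
    """Return one cluster label per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # All points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < minpts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
                continue
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= minpts:
                queue.extend(nbrs)  # j is a core point: expand the cluster
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, minpts=3)
```

Here the two tight groups each form a cluster and the isolated point is labelled as noise.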