CN114036515B - Webshell malicious family clustering analysis method - Google Patents


Info

Publication number
CN114036515B
Authority
CN
China
Prior art keywords
sequence
seq
information
return
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111255079.9A
Other languages
Chinese (zh)
Other versions
CN114036515A (en)
Inventor
周芳芳
袁键
陈茁
王心远
吕胜蓝
范毅伦
李影
赵颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202111255079.9A
Publication of CN114036515A
Application granted
Publication of CN114036515B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/561: Virus type analysis

Abstract

The invention discloses a Webshell malicious family clustering analysis method, which relates to the technical field of information security. The method comprises the following steps. Step 1: acquiring function call information, parameter values, and return value information while the Webshell runs. Step 2: cleaning, splicing, and ordering the function call information (the calling and called function names). Step 3: vectorizing the function call sequence information from step 2. Step 4: calculating the information entropy of the parameter values and return values, and ordering them by function call order. Step 5: using the func_seq, argv_seq, and return_seq obtained in steps 2 and 4, building an RNN model that predicts each of the three sequences and learns the family characteristics of the code. Step 6: after minhash processing, mapping the original sequence data and the predicted sequence data to pixel points to form pixel maps. Step 7: superposing the original pixel map obtained in step 6 on the predicted pixel map and drawing the final pixel map. Step 8: clustering the pixel maps obtained in step 7 with the DBSCAN clustering algorithm.

Description

Webshell malicious family clustering analysis method
Technical Field
The invention belongs to the technical field of information security, and particularly relates to a webshell malicious family clustering analysis method.
Background
A Webshell is a command execution environment written in a scripting language. An attacker uploads a script file to a server and hides it among benign files in order to control the server. Webshells have become a primary threat to cloud host security. To prevent intrusions and protect the assets of cloud users in real time, an accurate and efficient method for detecting malicious Webshells is essential.
Most traditional Webshell defenses rely on predefined rules. New rules are created and old rules are updated more slowly than Webshells mutate, so malicious files easily bypass rule-based detection. To address the difficulty of updating rules and the failure to detect new variants in time, researchers have tried to detect malicious Webshells with heuristic algorithms or deep learning models. These methods still require substantial manual effort, such as manually labeling malicious file samples, manually verifying suspicious files that lie on the fuzzy boundary between malicious and benign in order to reduce false alarms, and manually confirming the appearance of new variants.
Constructing a malicious file family system is one direction in malicious file detection. A complete Webshell family system can effectively reduce the manual workload in detection, improve the efficiency of discovering new variants, and improve the accuracy and precision of automatic detection. Two problems remain. First, Webshells mutate quickly: they can be written in many development languages and are easily modified by techniques such as obfuscation, encryption, and packing, which greatly affects the accuracy and efficiency of malicious file detection methods. Second, traditional malicious file detection methods based on API call information overlook security vulnerabilities introduced through function parameters, and experts lack an effective Webshell malicious family cluster analysis method for detecting, comparing, and analyzing malicious Webshells.
Disclosure of Invention
In view of the defects and shortcomings of the prior art, the invention aims to provide a Webshell malicious family cluster analysis method.
A Webshell malicious family cluster analysis method comprises the following steps:
Step 1: acquiring function call information, parameter values, and return value information while the Webshell runs;
Step 2: cleaning, splicing, and ordering the function call information (the calling and called function names);
Step 3: vectorizing the function call sequence information from step 2;
Step 4: calculating the information entropy of the parameter values and return values, and ordering them by function call order;
Step 5: using the func_seq, argv_seq, and return_seq obtained in steps 2 and 4, building an RNN model that predicts each of the three sequences and learns the family characteristics of the code;
Step 6: after minhash processing, mapping the original sequence data and the predicted sequence data to pixel points to form pixel maps;
Step 7: superposing the original pixel map obtained in step 6 on the predicted pixel map and drawing the final pixel map;
Step 8: clustering the pixel maps obtained in step 7 with the DBSCAN clustering algorithm.
Preferably, each Webshell sample in step 1 can be abstracted as a multi-attribute event sequence $x = \{x_1, x_2, \ldots, x_N\}$, where $x_i$ represents the set of operation information for the $i$-th call made by the Webshell. The information for each call can be divided into basic attribute information and extended attribute information, i.e., $x_i$ can be described as a pair $\langle Basic_i, Extended_i \rangle$. The basic attribute information (Basic) comprises the calling function name (caller) and the called function name (callee); the extended attribute information (Extended) comprises the parameter values (argv), return value (return), and taint information (taint) of the called function. For the $i$-th call $x_i$ of a Webshell sample, the calling function, called function, parameter value, return value, and taint information are denoted $caller_i$, $callee_i$, $argv_i$, $return_i$, and $taint_i$, respectively.
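The pair structure described above can be pictured as a small data model. The following is a minimal sketch only: the class names, the field name `ret` (used because `return` is a Python keyword), and the toy values are illustrative assumptions, not part of the patent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Basic:
    caller: str  # name of the calling function
    callee: str  # name of the called function


@dataclass(frozen=True)
class Extended:
    argv: str                    # parameter value passed to the callee
    ret: str                     # return value of the callee
    taint: Optional[str] = None  # taint information, if any was recorded


@dataclass(frozen=True)
class CallEvent:
    """One element x_i = <Basic_i, Extended_i> of a sample."""
    basic: Basic
    extended: Extended


# A sample x = {x_1, ..., x_N} is an ordered list of call events.
Sample = List[CallEvent]

# Invented toy values, for illustration only.
sample: Sample = [
    CallEvent(Basic("main", "base64_decode"), Extended("ZWNobyAxOw==", "echo 1;")),
    CallEvent(Basic("main", "eval"), Extended("echo 1;", "", taint="argv")),
]
```

Keeping the basic and extended attributes in separate records mirrors the way the later steps consume them: step 2 reads only the basic attributes, while step 4 reads the parameter and return values.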
Preferably, step 2 comprises the following steps:
Step 2.1: counting the occurrences of every calling function $caller_i$ and called function $callee_i$ in the samples. The functions with the most occurrences are taken as core functions; the remaining functions are non-core functions and are treated as one and the same function in the subsequent encoding.
Step 2.2: following the n-gram idea from natural language processing, concatenating the $caller_i$ and $callee_i$ strings involved in the same function call, and treating the concatenated string as a minimum unit, item.
Step 2.3: ordering the items obtained in step 2.2 by function call order to obtain the function call sequence information of each Webshell sample. The function call sequence of one sample can then be expressed as $func\_seq = \{item_1, item_2, \ldots, item_i, \ldots, item_n\}$. The purpose of the n-gram is to preserve the order information of the function call process.
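Steps 2.1 to 2.3 can be sketched as follows. This is a minimal illustration under stated assumptions: the placeholder name `OTHER` for non-core functions and the `->` separator are hypothetical choices, since the text does not say how the concatenated item is written.

```python
from collections import Counter
from typing import List, Set, Tuple

Call = Tuple[str, str]  # (caller, callee) for one function call


def core_functions(samples: List[List[Call]], top_n: int) -> Set[str]:
    """Step 2.1: the top_n most frequent function names across all samples."""
    counts = Counter()
    for calls in samples:
        for caller, callee in calls:
            counts[caller] += 1
            counts[callee] += 1
    return {name for name, _ in counts.most_common(top_n)}


def build_func_seq(calls: List[Call], core: Set[str],
                   other: str = "OTHER", sep: str = "->") -> List[str]:
    """Steps 2.2 and 2.3: one item per call, kept in call order."""
    def norm(name: str) -> str:
        # All non-core functions collapse into a single placeholder name.
        return name if name in core else other

    return [f"{norm(caller)}{sep}{norm(callee)}" for caller, callee in calls]
```

For example, with `core = {"main", "eval"}`, the calls `[("main", "eval"), ("main", "base64_decode")]` become `["main->eval", "main->OTHER"]`.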
Preferably, step 3 comprises the following steps:
Step 3.1: this step follows the idea of vectorizing short texts with a word vector model in natural language processing. The sequence information of each Webshell sample is treated as a short text, each minimum unit item as a word in that text, and the function call sequences of all samples together as the corpus. Since each item is already a "word", no word segmentation or similar processing is required.
Step 3.2: choosing the word vector dimension of the CBOW model, inputting the function call sequence information obtained in step 2.3 into the CBOW model as the corpus for training, using the trained CBOW model to calculate the vector $V_{item}$ corresponding to each item, and saving the parameter weights of the CBOW model.
Step 3.3: using the vectors $V_{item}$ obtained in step 3.2, summing the vectors $V_{sentence} = [V_1, V_2, \ldots, V_i, \ldots, V_n]$ corresponding to the function call sequence $func\_seq = \{item_1, item_2, \ldots, item_i, \ldots, item_n\}$ of each sample and taking the average to obtain the vector corresponding to the function call sequence:
$$U_{func} = \frac{1}{n} \sum_{i=1}^{n} V_i$$
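The averaging in step 3.3 is simple enough to show directly. The sketch below assumes the item vectors have already been produced by a trained CBOW model; the dictionary `item_vectors` and its two-dimensional toy vectors are hypothetical stand-ins for that model's output.

```python
from typing import Dict, List


def sequence_vector(func_seq: List[str],
                    item_vectors: Dict[str, List[float]]) -> List[float]:
    """Average the item vectors of one function call sequence (step 3.3)."""
    if not func_seq:
        raise ValueError("func_seq must contain at least one item")
    vectors = [item_vectors[item] for item in func_seq]
    n = len(vectors)
    dim = len(vectors[0])
    # Component-wise sum divided by the sequence length n.
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


# Toy vectors standing in for trained CBOW embeddings.
item_vectors = {"main->eval": [1.0, 2.0], "main->OTHER": [3.0, 6.0]}
u_func = sequence_vector(["main->eval", "main->OTHER"], item_vectors)
```

Averaging yields a fixed-length vector regardless of how many calls a sample makes, at the cost of discarding call order within the averaged vector.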
Preferably, step 4 comprises the following steps:
Step 4.1: classifying the parameter values and return values. If a parameter value or return value is executable code, the function name in the code is extracted; if it contains no code, its information entropy is calculated. The parameter values and return values are then divided into categories by function name or information entropy.
Step 4.2: ordering the parameter values and the return values by function call order to obtain the parameter value sequence argv_sequence and the return value sequence return_sequence.
Preferably, step 5 comprises the following steps:
Step 5.1: constructing the RNN sample sets. For one sample $x_i$, each element of the three types of sequence information is regarded as one of a row of adjacent windows. A sliding window of size K is set, and the M-th element in the window is the data to be predicted: each time, the M-th element is predicted from the preceding M-1 elements and the following K-M elements in the sliding window. Each time the sliding window moves forward by one element, one training sample t is formed. Performing this operation on all samples yields the function call sequence sample set Tf, the parameter value sequence sample set Ta, and the return value sequence sample set Tr.
Step 5.2: building a unidirectional LSTM network. The number of input neurons of the LSTM network is (K-1) × P and the output dimension is P, where P is the vector dimension of the element to be predicted (for example, 255 for the word vector of a function). Using the Tf, Ta, and Tr obtained in step 5.1, part of the data is randomly extracted from each of the three sample sets to train three different LSTMs.
Step 5.3: using the LSTM networks trained in step 5.2 to predict on the Tf, Ta, and Tr sample sets, and replacing the data at the corresponding positions in the original sequences with the predicted results to obtain the predicted sequences $func\_seq_{pre}$, $argv\_seq_{pre}$, and $return\_seq_{pre}$ corresponding to the three sequences. From $func\_seq_{pre}$, the corresponding vector $U_{func_{pre}}$ is calculated according to step 3.
Preferably, step 6 comprises the following steps:
Step 6.1: taking $argv\_seq = \{argv_1, argv_2, \ldots, argv_i, \ldots, argv_n\}$ as an example, where $argv_i$ is the parameter category assigned in step 4, the signature corresponding to $argv_i$, i.e., several hash values, is obtained with the minHash method. The same operation is performed on func_seq and return_seq to obtain their hash values. The hash values calculated from the three sequences are grouped into hash value tuples $\langle x_i, y_i, z_i \rangle$, where $x_i$, $y_i$, and $z_i$ in the same tuple are the hash values at the same position of argv_seq, return_seq, and func_seq, respectively. $x_i$ and $y_i$ are taken modulo m and $z_i$ is taken modulo k, so that, after rounding down, $x_i$ and $y_i$ are mapped to integers in $[0, m)$ and $z_i$ is mapped to an integer in $[0, k)$. Here $x_i$ and $y_i$ are the x-axis and y-axis coordinates of the pixel point, and $z_i$ is its gray value. Notably, func_seq and $func\_seq_{pre}$ need no minhash operation: the vectors $U_{func}$ and $U_{func_{pre}}$ obtained in steps 3 and 5 are used directly in the modulo operation.
Step 6.2: drawing the original pixel map and the predicted pixel map from the pixel points obtained from the original sequences and the predicted sequences. Taking the original sequences func_seq, argv_seq, and return_seq as an example, if several pixel points fall on the same position, their gray values are added to the gray value already at that position and the average is taken.
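Step 6.1 relies on minHash signatures. The sketch below shows one common way to compute such a signature with seeded hashes; it is not taken from the patent. Treating a category label as text, splitting it into character shingles, and using salted BLAKE2b as the hash family are all assumptions made for illustration, as are the names `shingles` and `minhash_signature`.

```python
import hashlib
from typing import List, Set


def shingles(text: str, width: int = 3) -> Set[str]:
    """Split text into overlapping character n-grams (always non-empty)."""
    if len(text) <= width:
        return {text}
    return {text[i:i + width] for i in range(len(text) - width + 1)}


def minhash_signature(tokens: Set[str], num_hashes: int = 4) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all tokens."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(16, "big")  # BLAKE2b accepts a salt of up to 16 bytes
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(tok.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for tok in tokens
        ))
    return signature


sig = minhash_signature(shingles("base64_decode"))
```

Each entry of `sig` is one hash value; values of this kind, taken at the same position from the three sequences, would form the tuples $\langle x_i, y_i, z_i \rangle$ before the modulo operations.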
Preferably, step 7 superposes the pixel map of the original sequences on the pixel map of the predicted sequences. If neither pixel map has a pixel point at a given position, the gray value at that position after superposition is 0; if exactly one pixel map has a pixel point there, the gray value after superposition is the value of that pixel point; if both pixel maps have pixel points there, the gray value after superposition is the average of the two gray values.
Preferably, step 8 compresses the pixel map to 2 dimensions with the t-SNE dimension reduction method, then sets the two parameters required by the DBSCAN clustering algorithm, the radius eps and the minimum number of points in the neighborhood minpts, and clusters the dimension-reduced data to obtain the DBSCAN clustering result.
Compared with the prior art, the invention has the following beneficial effects. The invention clusters Webshells using their function call information, parameter value information, and return value information, and is a dynamic detection method for Webshell malicious families based on multi-source data. Compared with static analysis methods and some single-source dynamic detection methods, it fuses multi-source data and is more representative in characterizing malicious families. At the clustering level it can improve the efficiency of discovering new variants, indirectly improve the accuracy and precision of automatic detection, and effectively reduce the manual workload in detection.
Drawings
For ease of illustration, the invention is described in detail by the following detailed description and the accompanying drawings.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a pixel map obtained from a Webshell after feature processing;
FIG. 3 is a two-dimensional scatter plot of the clustering results obtained with the DBSCAN clustering algorithm, with a different color assigned to each cluster.
Detailed Description
In order that the objects, aspects and advantages of the invention will become more apparent, the invention will be described by way of example only, and with reference to the accompanying drawings. It is to be understood that this description is made only by way of example and not as a limitation on the scope of the invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.
As shown in FIG. 1, a Webshell malicious family cluster analysis method includes the following steps:
Step 1: acquiring the function call information, parameter values, return value information, and related data produced while the Webshell runs. The data are the Webshell data of a single large enterprise user for the period from September 13, 2020 to October 13, 2020. Each Webshell data sample can be divided into basic attribute information (Basic) and extended attribute information (Extended); the basic attribute information is also commonly called opcode information. The basic attribute information comprises the calling function name (caller) and the called function name (callee); the extended attribute information comprises the parameter values (argv), return value (return), and taint information (taint) of the called function. For the $i$-th call $x_i$ of a Webshell sample, these are denoted $caller_i$, $callee_i$, $argv_i$, $return_i$, and $taint_i$, respectively.
Table 1 is an example of a sample:
TABLE 1 Webshell sample example
(Table image not reproduced.)
Step 2: cleaning, splicing, and ordering the function call information (the calling and called function names). The specific method is as follows:
Step 2.1: from the function call information obtained in step 1, counting the occurrences of every calling function $caller_i$ and called function $callee_i$ in the samples. The 254 functions with the most occurrences are taken as core functions; the remaining functions are non-core functions and are treated as one and the same function in the subsequent encoding.
Step 2.2: following the n-gram idea from natural language processing, with n set to 2 (i.e., bi-gram), concatenating the $caller_i$ and $callee_i$ strings involved in the same function call, and treating the concatenated string as a minimum unit, item.
Step 2.3: ordering the items obtained in step 2.2 by the sequence index of the function calls to obtain the function call sequence information of each Webshell sample. The function call sequence of one sample can then be expressed as $func\_seq = \{item_1, item_2, \ldots, item_i, \ldots, item_n\}$.
Step 3: vectorizing the function call sequence information from step 2.3.
Step 3.2: designing a CBOW model with a word vector dimension of 255 and a distance of 5 between the target word and its context, inputting the function call sequence information of all samples from step 2.3 into the CBOW model as the corpus for training, and using the trained CBOW model to calculate the vector $V_{item}$ corresponding to each item.
Step 3.3: using the vectors $V_{item}$ obtained in step 3.2, summing the vectors $V_{sentence} = [V_1, V_2, \ldots, V_i, \ldots, V_n]$ corresponding to the function call sequence $func\_seq = \{item_1, item_2, \ldots, item_i, \ldots, item_n\}$ of each sample and taking the average to obtain the vector corresponding to the function call sequence:
$$U_{func} = \frac{1}{n} \sum_{i=1}^{n} V_i$$
Step 4: ordering the parameter values and return values by function call order, and calculating the information entropy of the parameter values and return values.
Step 4.1: ordering the parameter values and the return values by function call order to obtain the parameter value sequence argv_sequence and the return value sequence return_sequence.
Step 4.2: classifying the parameter values and return values. If a parameter value or return value is executable code, its function name is extracted; if it contains no code, its information entropy is calculated. The parameter values and return values are divided into categories by function name or information entropy: the information entropy values are divided equally into 50 intervals, and each function name forms one category. This yields $U_{argv}$ and $U_{return}$ corresponding to argv_sequence and return_sequence.
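The classification in step 4.2 can be sketched as below. Several details are assumptions rather than statements from the text: entropy is taken as character-level Shannon entropy in bits, the 50 equal intervals are assumed to span 0 to 8 bits (`h_max`), and detecting whether a value is executable code is left to the caller, who passes the extracted function name if there is one.

```python
import math
from collections import Counter
from typing import Optional, Tuple


def shannon_entropy(value: str) -> float:
    """Character-level Shannon entropy of a string, in bits."""
    if not value:
        return 0.0
    total = len(value)
    probs = [count / total for count in Counter(value).values()]
    return sum(-p * math.log2(p) for p in probs)


def entropy_bin(h: float, h_max: float = 8.0, bins: int = 50) -> int:
    """Index of the equal-width interval containing h, clipped to [0, bins - 1]."""
    idx = int(h / h_max * bins)
    return min(max(idx, 0), bins - 1)


def categorize(value: str, func_name: Optional[str] = None) -> Tuple[str, object]:
    """One category per function name for code, otherwise one per entropy bin."""
    if func_name is not None:
        return ("func", func_name)
    return ("entropy", entropy_bin(shannon_entropy(value)))
```

Binning by entropy groups values that look alike statistically (for example, long encoded payloads) without storing the raw strings, which vary from one variant to the next.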
Step 5: using the func_seq, argv_sequence, and return_sequence obtained in steps 2 and 4, building an RNN model that predicts each of the three sequences and learns the family characteristics of the code.
Step 5.1: constructing the RNN sample sets. For one sample $x_i$, each element of the three types of sequence information is regarded as one of a row of adjacent windows. A sliding window of size 16 is set, and the 9th element in the window is the data to be predicted: each time, the 9th element is predicted from the first 8 elements and the last 7 elements in the sliding window. Each time the sliding window moves forward by one element, one training sample t is formed. Performing this operation on all samples yields the function call sequence sample set Tf, the parameter value sequence sample set Ta, and the return value sequence sample set Tr.
Step 5.2: building a unidirectional LSTM network. The number of input neurons of the LSTM network is 15 × P and the output dimension is P, where P is the vector dimension of the element to be predicted (for example, 255 for the word vector of a function). Using the Tf, Ta, and Tr obtained in step 5.1, 1/5 of the data is randomly extracted from each of the three sample sets to train three different LSTMs.
Step 5.3: using the LSTM networks trained in step 5.2 to predict on the Tf, Ta, and Tr sample sets, and replacing the data at the corresponding positions in the original sequences with the predicted results to obtain the predicted sequences $func\_seq_{pre}$, $argv\_seq_{pre}$, and $return\_seq_{pre}$ corresponding to the three sequences. From $func\_seq_{pre}$, the corresponding vector $U_{func_{pre}}$ is calculated according to step 3.
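The window construction of step 5.1 can be illustrated with a short sketch. The function name `window_samples` and the (context, target) tuple layout are illustrative assumptions; the window size 16 and target position 9 follow the text.

```python
from typing import List, Sequence, Tuple


def window_samples(seq: Sequence, k: int = 16, m: int = 9) -> List[Tuple[list, object]]:
    """Slide a window of size k over seq, one element at a time.

    For each window, the m-th element (1-based) is the prediction target and
    the other k - 1 elements form the context.
    """
    samples = []
    for start in range(len(seq) - k + 1):
        window = list(seq[start:start + k])
        target = window[m - 1]
        context = window[:m - 1] + window[m:]  # first m-1 and last k-m elements
        samples.append((context, target))
    return samples


# A toy sequence of 20 elements yields 20 - 16 + 1 = 5 training samples.
toy = window_samples(list(range(20)))
```

Each context holds 15 elements, which matches the 15 × P input size given for the LSTM in step 5.2 once every element is represented by a P-dimensional vector.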
Step 6: after minhash processing, mapping the original sequence data and the predicted sequence data to pixel points to form pixel maps.
Step 6.1: taking $argv\_seq = \{argv_1, argv_2, \ldots, argv_i, \ldots, argv_n\}$ as an example, where $argv_i$ is the parameter category assigned in step 4, the signature corresponding to $argv_i$, i.e., several hash values, is obtained with the minHash method. The same operation is performed on func_seq and return_seq to obtain their hash values. The hash values calculated from the three sequences are grouped into hash value tuples $\langle x_i, y_i, z_i \rangle$, where $x_i$, $y_i$, and $z_i$ in the same tuple are the hash values at the same position of argv_seq, return_seq, and func_seq, respectively. $x_i$ and $y_i$ are taken modulo 32 and $z_i$ is taken modulo 255, so that, after rounding down, $x_i$ and $y_i$ are mapped to integers in $[0, 32)$ and $z_i$ is mapped to an integer in $[0, 255)$. Here $x_i$ and $y_i$ are the x-axis and y-axis coordinates of the pixel point, and $z_i$ is its gray value. Notably, func_seq and $func\_seq_{pre}$ need no minhash operation: the vectors $U_{func}$ and $U_{func_{pre}}$ obtained in steps 3 and 5 are used directly in the modulo operation.
Step 6.2: drawing the original pixel map and the predicted pixel map from the pixel points obtained from the original sequences and the predicted sequences. Taking the original sequences func_seq, argv_seq, and return_seq as an example, if several pixel points fall on the same position, their gray values are added to the gray value already at that position and the average is taken.
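The mapping from tuples to a pixel map in steps 6.1 and 6.2 can be sketched as follows. Two points are assumptions: the pixel map is stored as a dictionary from position to gray value so that "no pixel" stays distinct from gray value 0, and colliding pixels are averaged over all values landing on the same position, which is one reading of the averaging rule in step 6.2.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

PixelMap = Dict[Tuple[int, int], float]


def build_pixel_map(tuples: Iterable[Tuple[float, float, float]],
                    size: int = 32, gray_levels: int = 255) -> PixelMap:
    """Map <x, y, z> tuples to pixel positions and averaged gray values."""
    buckets = defaultdict(list)
    for x, y, z in tuples:
        # Modulo first, then round down, so float inputs also land on integers.
        pos = (math.floor(x % size), math.floor(y % size))
        buckets[pos].append(math.floor(z % gray_levels))
    return {pos: sum(vals) / len(vals) for pos, vals in buckets.items()}


# Toy tuples: the first two collide at position (1, 2).
pixels = build_pixel_map([(33, 2, 300), (1, 34, 100), (5, 5, 10)])
```

Here `pixels[(1, 2)]` is the average of 300 mod 255 = 45 and 100, and `pixels[(5, 5)]` is 10.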
Step 7: superposing the original pixel map obtained in step 6 on the predicted pixel map and drawing the final pixel map. The pixel map of the original sequences is superposed on the pixel map of the predicted sequences. If neither pixel map has a pixel point at a given position, the gray value at that position after superposition is 0; if exactly one pixel map has a pixel point there, the gray value after superposition is the value of that pixel point; if both pixel maps have pixel points there, the gray value after superposition is the average of the two gray values.
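The three superposition cases of step 7 translate directly into code. The sketch below assumes each pixel map is a dictionary from (x, y) position to gray value, with missing keys meaning "no pixel point"; the function name `superpose` and the row-major output grid are illustrative choices.

```python
from typing import Dict, List, Tuple

PixelMap = Dict[Tuple[int, int], float]


def superpose(original: PixelMap, predicted: PixelMap, size: int = 32) -> List[List[float]]:
    """Combine two pixel maps into a size x size grid of gray values."""
    grid = [[0.0] * size for _ in range(size)]  # neither map has a pixel: 0
    for pos in set(original) | set(predicted):
        x, y = pos
        if pos in original and pos in predicted:
            # Both maps have a pixel here: average the two gray values.
            grid[y][x] = (original[pos] + predicted[pos]) / 2
        elif pos in original:
            grid[y][x] = float(original[pos])
        else:
            grid[y][x] = float(predicted[pos])
    return grid


final = superpose({(1, 2): 100.0, (3, 3): 40.0}, {(1, 2): 50.0, (0, 0): 7.0}, size=4)
```

Flattening the resulting grid row by row gives one fixed-length feature vector per sample, which is the form the dimension reduction and clustering in step 8 operate on.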
Step 8: clustering the pixel maps obtained in step 7 with the DBSCAN clustering algorithm. First, PCA is used to reduce the data to 50-100 dimensions; then the t-SNE dimension reduction method compresses the pixel map to 2 dimensions; then the two parameters required by the DBSCAN clustering algorithm are set, namely the radius eps and the minimum number of points in the neighborhood minpts. The dimension-reduced data are then clustered to obtain the DBSCAN clustering result.
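To make the roles of eps and minpts concrete, the following is a minimal pure-Python sketch of the DBSCAN procedure on points that have already been reduced to 2 dimensions. It is a simplified illustration with quadratic-time neighbor search, not the implementation used in the experiments; the PCA and t-SNE stages are omitted, and in practice a library implementation would normally be used instead.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1


def dbscan(points: Sequence[Sequence[float]], eps: float, minpts: int) -> List[int]:
    """Label each point with a cluster index, or NOISE (-1)."""
    n = len(points)
    labels: List[Optional[int]] = [None] * n  # None means not yet visited

    def neighbors(i: int) -> List[int]:
        # All points within distance eps of point i, including i itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < minpts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= minpts:  # j is a core point: expand from it
                queue.extend(j_nbrs)
        cluster += 1
    return labels


toy_points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 10)]
toy_labels = dbscan(toy_points, eps=0.5, minpts=3)
```

Unlike k-means, the number of clusters is not set in advance; it follows from eps and minpts, which is why the experiments search over both parameters.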
The experimental evaluation is as follows. The experimental data set consists of 5111 malicious Webshell samples provided by a large enterprise user; manual labeling of the data set yielded 130 malicious family types. Clustering quality is evaluated with two indexes, accuracy (acc) and NMI, where accuracy is calculated using the Hungarian maximum matching algorithm. The experiments search for the best clustering result over cluster counts in the range [50, 130], and the clustering result for a cluster count of 130 is also given.
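As an illustration of the NMI index, the sketch below computes normalized mutual information between true family labels and cluster labels. The text does not state which normalization is used; the arithmetic mean of the two entropies is assumed here. The accuracy index, which requires an optimal matching between clusters and families, is not shown.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def label_entropy(labels: Sequence[Hashable]) -> float:
    """Entropy (natural log) of a label assignment."""
    n = len(labels)
    return sum(-(c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(true: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Mutual information divided by the mean of the two label entropies."""
    n = len(true)
    joint = Counter(zip(true, pred))
    count_true, count_pred = Counter(true), Counter(pred)
    mi = sum(
        (c / n) * math.log((c / n) / ((count_true[t] / n) * (count_pred[p] / n)))
        for (t, p), c in joint.items()
    )
    denom = (label_entropy(true) + label_entropy(pred)) / 2
    return mi / denom if denom > 0 else 1.0
```

NMI is invariant to how clusters are numbered, so `nmi([0, 0, 1, 1], [1, 1, 0, 0])` scores a perfect match, while an assignment that is independent of the true labels scores 0.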
In experiment 1, steps 1 through 8 were performed. In step 6, $x_i$ and $y_i$ were each taken modulo 32 to obtain the x-axis and y-axis coordinates of the scattered points, so the size of the pixel map is 32 × 32; FIG. 2 is an example of such a pixel map. In step 8, a PCA + t-SNE dimension reduction method was adopted: PCA first reduced the data to 100 dimensions, and t-SNE then compressed the pixel map to 2 dimensions. Finally, the two DBSCAN parameters were traversed, the radius eps over the range [0.0001, 2] and the minimum number of points in the neighborhood minpts over the range [2, 20], to find the best result. The corresponding clustering accuracy is 58.8% and the NMI is 0.793, with a cluster number k of 57.
Experiment 2 followed substantially the same steps as experiment 1, except that in step 8 the traditional k-means clustering algorithm was used with k set to 130, i.e., equal to the number of true classes. The resulting accuracy is 56.8% and the NMI is 0.778.
Experiment 3 followed basically the same steps as experiment 1, except that $x_i$ and $y_i$ in step 6 were each taken modulo 128, so the size of the pixel map is 128 × 128. The final clustering accuracy is 52.7% and the NMI is 0.787, with k equal to 67. Compared with experiment 1, the clustering accuracy of experiment 3 is lower; a possible reason is that the pixel map dimension is too large, which makes the pixel matrix too sparse and affects the final clustering result.
Experiment 4 followed basically the same steps as experiment 1, except that the dimension reduction in step 8 was omitted and clustering was performed directly. The final accuracy is 58.0% and the NMI is 0.808, with k equal to 65. The clustering accuracy of experiment 4 is close to that of experiment 1, which shows that in this method dimension reduction does not noticeably improve clustering accuracy.
The experimental conditions are listed in Table 2. In general, experiments 1-4 are comparative experiments performed with alternative parameters and alternative procedures.
TABLE 2 Clustering test conditions
(Table image not reproduced.)
From Table 2 and FIG. 3, the beneficial effects of the invention can be concluded as follows. The invention clusters Webshells using their function call information, parameter value information, and return value information, and is a dynamic detection method for Webshell malicious families based on multi-source data. Compared with static analysis methods and some single-source dynamic detection methods, it fuses multi-source data and is more representative in characterizing malicious families. At the clustering level it can improve the efficiency of discovering new variants, indirectly improve the accuracy and precision of automatic detection, and effectively reduce the manual workload in detection.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution; the description is written this way for clarity only. Those skilled in the art should treat the description as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments that those skilled in the art can understand.

Claims (8)

1. A Webshell malicious family cluster analysis method, characterized by comprising the following steps:
Step 1: acquiring function call information, parameter values, and return value information while the Webshell runs;
Step 2: cleaning, splicing, and ordering the function call information [formula image FDA0003746132350000011];
Step 3: vectorizing the function call information from step 2;
Step 4: calculating the information entropy of the parameter values and the return values, and ordering them according to the order of function calls;
Step 5: building RNN models from the function call sequence func_seq, the parameter value sequence argv_seq, and the return value sequence return_seq obtained in steps 2 and 4, to predict each of the three sequences and learn the features of the code family;
Step 5.1: constructing the RNN sample set; for one sample $x_i$, each element of the three types of sequence information is regarded as a window, with the windows arranged side by side; a sliding window of size K is set, the M-th element in the window is the data to be predicted, and each time the M-th element is predicted from the M-1 elements before it and the K-M elements after it within the sliding window; each forward move of the sliding window forms one training sample t, and performing this operation on all samples yields a function call sequence sample set Tf, a parameter value sequence sample set Ta, and a return value sequence sample set Tr;
Step 5.2: building a unidirectional LSTM network whose number of input neurons is (K-1) × P and whose output dimension is P, where P is the vector dimension of the element to be predicted; using Tf, Ta, and Tr from step 5.1, three different LSTMs are trained, each on part of the data randomly drawn from the corresponding sample set;
Step 5.3: predicting the Tf, Ta, and Tr sample sets with the LSTM networks trained in step 5.2, and replacing the data at the corresponding positions in the original sequences with the predicted results to obtain the predicted sequences func_seq_pre, argv_seq_pre, and return_seq_pre corresponding to the three sequences; and calculating from func_seq_pre, according to step 3, the corresponding vector [formula image FDA0003746132350000021];
Step 6: mapping the data in the original sequences of step 5 and the predicted sequence data to pixel points after minhash processing, to form a predicted pixel map;
Step 7: superimposing the predicted pixel map obtained in step 6 on the pixel map formed from the data in the original sequences to draw the final pixel map;
Step 8: clustering the pixel maps obtained in step 7 using the DBSCAN clustering algorithm.
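The sample construction of step 5.1 can be illustrated with a minimal sketch. The function name `sliding_window_samples`, the stride of one element per move, and the toy sequence are assumptions made for illustration; the claim does not fix them.

```python
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def sliding_window_samples(seq: List[T], k: int, m: int) -> List[Tuple[List[T], T]]:
    """Build (context, target) training pairs from one sequence.

    k is the window size and m the 1-based position of the element to
    predict, so each context holds the m-1 elements before the target
    and the k-m elements after it (k-1 elements in total).
    """
    if not 1 <= m <= k:
        raise ValueError("m must satisfy 1 <= m <= k")
    samples = []
    # Assumption: the window advances by one element per move.
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        target = window[m - 1]
        context = window[:m - 1] + window[m:]
        samples.append((context, target))
    return samples


# Toy function call sequence (hypothetical items).
func_seq = ["a", "b", "c", "d", "e"]
samples = sliding_window_samples(func_seq, k=3, m=2)
```

Each context has K-1 elements, which matches the (K-1) × P input size stated in step 5.2 once every element is a P-dimensional vector.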
2. The Webshell malicious family cluster analysis method of claim 1, wherein: in step 1, each Webshell sample can be abstracted as multi-attribute event sequence data $x = \{x_1, x_2, \ldots, x_N\}$, where $x_i$ represents the set of operation information of the i-th call made by the Webshell; the information set of each call can be divided into basic attribute information and extended attribute information, i.e., $x_i$ can be described as a pair $\langle Basic_i, Extended_i \rangle$; the basic attribute information Basic comprises the calling function name and the called function name, and the extended attribute information Extended comprises the parameter value argv, the return value return, and taint information; for $x_i$ of a Webshell sample, the calling function, the called function, the parameter value, the return value, and the taint information are respectively [formula image FDA0003746132350000022].
3. The Webshell malicious family cluster analysis method of claim 1, wherein step 2 comprises the following steps:
Step 2.1: counting the occurrences of all calling functions [formula image FDA0003746132350000023] and called functions [formula image FDA0003746132350000024] in a sample, and taking the functions that occur more often as core functions; the remaining functions are non-core functions and are treated as the same function in subsequent encoding;
Step 2.2: string-splicing the [formula image FDA0003746132350000031] and [formula image FDA0003746132350000032] involved in the same function call, the result [formula image FDA0003746132350000033] being regarded as a minimum unit item;
Step 2.3: ordering the items obtained in step 2.2 according to the order of function calls to obtain the function call sequence information of each Webshell sample; the function call sequence information of one sample is then denoted func_seq = $\{item_1, item_2, \ldots, item_i, \ldots, item_n\}$; the purpose of the n-gram is to preserve sequence information during function calls.
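Steps 2.1 to 2.3 can be sketched as follows. The cutoff `top_n`, the placeholder name `OTHER`, the separator `->`, and the example call trace are all hypothetical; the claim only says that the more frequent functions are core functions and that the rest are treated as one function.

```python
from collections import Counter
from typing import List, Tuple


def build_func_seq(calls: List[Tuple[str, str]], top_n: int = 2,
                   placeholder: str = "OTHER", sep: str = "->") -> List[str]:
    """Turn an ordered list of (caller, callee) pairs into a list of items."""
    # Step 2.1: count how often each function appears as caller or callee.
    counts: Counter = Counter()
    for caller, callee in calls:
        counts[caller] += 1
        counts[callee] += 1
    # Assumption: the top_n most frequent functions are the core functions.
    core = {name for name, _ in counts.most_common(top_n)}

    def clean(name: str) -> str:
        # Non-core functions collapse into one shared placeholder.
        return name if name in core else placeholder

    # Steps 2.2 and 2.3: splice caller and callee, keeping call order.
    return [clean(caller) + sep + clean(callee) for caller, callee in calls]


# Hypothetical trace of one Webshell sample, in call order.
calls = [("main", "eval"), ("main", "base64_decode"),
         ("main", "eval"), ("helper", "eval")]
func_seq = build_func_seq(calls)
```

Collapsing non-core functions keeps the vocabulary small, which is one plausible reason for the cleaning step before the items are embedded.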
4. The Webshell malicious family cluster analysis method of claim 3, wherein step 3 comprises the following steps:
Step 3.1: treating the sequence information of each Webshell sample as a short text and each minimum unit item as a word in that short text, so that the function call sequences of all samples form a corpus; since each item is already a word, no word segmentation is needed;
Step 3.2: choosing the word vector dimension of the CBOW model, inputting the function call sequence information obtained in step 2.3 into the CBOW model as the corpus for training, calculating the vector $V_{item}$ corresponding to each item with the trained CBOW model, and saving the parameter weights of the CBOW model;
Step 3.3: using the vectors $V_{item}$ obtained in step 3.2, for the function call sequence func_seq = $\{item_1, item_2, \ldots, item_i, \ldots, item_n\}$ of each sample, summing the corresponding vectors $V_{sentence} = [V_1, V_2, \ldots, V_i, \ldots, V_n]$ and averaging to obtain the vector corresponding to the function call sequence [formula image FDA0003746132350000034].
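The averaging in step 3.3 can be illustrated without a trained model. In this sketch the dictionary `item_vectors` stands in for the output of the CBOW model; its contents, the two-dimensional vectors, and the function name `mean_vector` are assumptions for illustration only.

```python
from typing import Dict, List


def mean_vector(func_seq: List[str],
                item_vectors: Dict[str, List[float]]) -> List[float]:
    """Sum the vectors of all items in the sequence and divide by its length."""
    if not func_seq:
        raise ValueError("func_seq must not be empty")
    dim = len(item_vectors[func_seq[0]])
    total = [0.0] * dim
    for item in func_seq:
        for j, value in enumerate(item_vectors[item]):
            total[j] += value
    return [t / len(func_seq) for t in total]


# Hypothetical item embeddings (a real CBOW model would supply these).
item_vectors = {"main->eval": [1.0, 0.0], "main->OTHER": [0.0, 2.0]}
func_seq = ["main->eval", "main->OTHER", "main->eval"]
u_func = mean_vector(func_seq, item_vectors)
```

Averaging gives every sample a vector of the same dimension regardless of how many calls it made, at the cost of discarding call order inside the vector itself.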
5. The Webshell malicious family cluster analysis method of claim 4, wherein step 4 comprises the following steps:
Step 4.1: classifying the parameter values and the return values; if a parameter value or return value is runnable code, extracting the function name it contains; if it contains no code, calculating its information entropy; and dividing the parameter values and return values according to the function name or the information entropy;
Step 4.2: ordering the parameter values and the return values separately according to the order of function calls to obtain the parameter value sequence argv_seq and the return value sequence return_seq.
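The entropy calculation in step 4.1 can be sketched as below. The claim does not say over which symbols the entropy is taken or in which unit, so this example assumes character-level Shannon entropy in bits; the function name `shannon_entropy` is likewise an assumption, and the detection of runnable code is not shown.

```python
import math
from collections import Counter


def shannon_entropy(value: str) -> float:
    """Character-level Shannon entropy of a string, in bits per character."""
    if not value:
        return 0.0
    counts = Counter(value)
    n = len(value)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


low = shannon_entropy("aaaa")   # a single repeated symbol
mid = shannon_entropy("aabb")   # two equally likely symbols
```

Encoded or encrypted payloads tend to score higher than plain text under this measure, which is a common motivation for using entropy as a feature of parameter and return values.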
6. The Webshell malicious family cluster analysis method of claim 5, wherein step 6 comprises the following steps:
Step 6.1: for argv_seq = $\{argv_1, argv_2, \ldots, argv_i, \ldots, argv_n\}$, obtaining the signature corresponding to each $argv_i$, i.e., a plurality of hash values, by the minHash method, and performing the same operation on func_seq and return_seq to obtain the corresponding hash values; grouping the hash values calculated from the three different sequences into hash value tuples $\langle x_i, y_i, z_i \rangle$, where $x_i$, $y_i$, and $z_i$ in the same tuple come from the hash values at the same position of argv_seq, return_seq, and func_seq respectively; taking $x_i$ and $y_i$ modulo m and $z_i$ modulo k, i.e., mapping $x_i$ and $y_i$ to integers in $[0, m)$ and $z_i$ to an integer in $[0, k)$; $x_i$ and $y_i$ are the x-axis and y-axis coordinates of a pixel point, and $z_i$ is its gray value; func_seq and func_seq_pre need no minhash operation, and the modulo operation is applied directly to the vector $U_{func}$ obtained in step 3 and the vector [formula image FDA0003746132350000041] obtained in step 5;
Step 6.2: drawing the original pixel map and the predicted pixel map from the pixel points obtained from the original sequences and the predicted sequences; in the original sequences func_seq, argv_seq, and return_seq, if pixel points fall on the same position, the corresponding gray value is added to the gray value of the existing pixel point and the result is averaged.
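The hashing and modulo mapping of step 6.1 can be sketched as follows. This is not the patented implementation: the character shingling, the seeded SHA-1 hash family, the signature length, the example strings, and the integer list `gray_source` (standing in for values derived from the function call vector) are all assumptions chosen to keep the example self-contained.

```python
import hashlib
from typing import List, Tuple


def minhash_signature(text: str, num_hashes: int = 4, shingle: int = 3) -> List[int]:
    """MinHash signature of a string over its character shingles."""
    shingles = {text[i:i + shingle]
                for i in range(max(1, len(text) - shingle + 1))}
    signature = []
    for seed in range(num_hashes):
        # One seeded hash function per signature slot; keep the minimum.
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles))
    return signature


def to_pixels(xs: List[int], ys: List[int], zs: List[int],
              m: int, k: int) -> List[Tuple[int, int, int]]:
    """Map value triples to (x, y, gray) with x, y in [0, m) and gray in [0, k)."""
    return [(x % m, y % m, z % k) for x, y, z in zip(xs, ys, zs)]


argv_sig = minhash_signature("eval($_POST['x'])")   # hypothetical parameter value
ret_sig = minhash_signature("uid=33(www-data)")     # hypothetical return value
gray_source = [3, 141, 59, 265]                     # hypothetical integers
pixels = to_pixels(argv_sig, ret_sig, gray_source, m=64, k=256)
```

The modulus m sets the side length of the pixel map and k the number of gray levels, so both control how often distinct values collide on the same pixel.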
7. The Webshell malicious family cluster analysis method of claim 6, wherein the operation in step 7 is to superimpose the pixel map corresponding to the original sequences and the pixel map of the predicted sequences: if neither pixel map has a pixel point at a given position, the gray value at that position after superposition is 0; if exactly one of the two pixel maps has a pixel point at the position, the gray value after superposition is the value of that pixel point; if both pixel maps have a pixel point at the position, the gray value after superposition is the average of the two gray values.
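The three superposition cases of step 7 can be written down directly. The sparse dictionary representation, in which a missing position means "no pixel point" and therefore gray value 0, and the function name `overlay` are assumptions of this sketch.

```python
from typing import Dict, Tuple

PixelMap = Dict[Tuple[int, int], float]  # (x, y) -> gray value


def overlay(original: PixelMap, predicted: PixelMap) -> PixelMap:
    """Superimpose two sparse pixel maps following the three cases of step 7."""
    merged: PixelMap = {}
    for pos in original.keys() | predicted.keys():
        if pos in original and pos in predicted:
            # Both maps have a pixel here: average the two gray values.
            merged[pos] = (original[pos] + predicted[pos]) / 2
        elif pos in original:
            merged[pos] = original[pos]
        else:
            merged[pos] = predicted[pos]
    # Positions absent from both maps stay absent, i.e. gray value 0.
    return merged


original = {(0, 0): 10, (1, 1): 20}
predicted = {(1, 1): 40, (2, 2): 6}
final_map = overlay(original, predicted)
```

Reading an unset position with `final_map.get(pos, 0)` returns 0, which corresponds to the first case of the claim.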
8. The Webshell malicious family cluster analysis method of claim 7, wherein the operation in step 8 is to compress the pixel maps to 2 dimensions using the t-SNE dimensionality reduction method, then set the two parameters required by the DBSCAN clustering algorithm, the radius eps and the minimum number of points in the neighborhood minpts, and cluster the dimension-reduced data to obtain the DBSCAN clustering result.
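The role of eps and minpts in step 8 can be shown with a small pure-Python DBSCAN. The sketch assumes the pixel maps have already been reduced to 2-dimensional points (the t-SNE step is not shown), counts a point as part of its own neighborhood, and uses hypothetical coordinates; in practice a library implementation would normally be used instead.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1


def dbscan(points: Sequence[Sequence[float]], eps: float, minpts: int) -> List[Optional[int]]:
    """Label each point with a cluster index, or NOISE if it joins no cluster."""
    n = len(points)
    labels: List[Optional[int]] = [None] * n  # None means not yet visited

    def neighbors(i: int) -> List[int]:
        # Brute-force eps-neighborhood; the point itself is included.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < minpts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= minpts:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
    return labels


# Hypothetical 2-D embeddings of seven samples.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, minpts=3)
```

Here the first three points and the next three form two families, while the isolated last point is labeled as noise, which is how a previously unseen variant without close relatives would surface.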

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111255079.9A CN114036515B (en) 2021-10-27 2021-10-27 Webshell malicious family clustering analysis method


Publications (2)

Publication Number Publication Date
CN114036515A CN114036515A (en) 2022-02-11
CN114036515B true CN114036515B (en) 2022-08-16

Family

ID=80142093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111255079.9A Active CN114036515B (en) 2021-10-27 2021-10-27 Webshell malicious family clustering analysis method

Country Status (1)

Country Link
CN (1) CN114036515B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599686B (en) * 2016-10-12 2019-06-21 四川大学 A kind of Malware clustering method based on TLSH character representation
CN110225007A (en) * 2019-05-27 2019-09-10 国家计算机网络与信息安全管理中心 The clustering method of webshell data on flows and controller and medium
CN110458187B (en) * 2019-06-27 2020-07-31 广州大学 Malicious code family clustering method and system
CN113032780A (en) * 2021-03-01 2021-06-25 厦门服云信息科技有限公司 Webshell detection method based on image analysis, terminal device and storage medium

Also Published As

Publication number Publication date
CN114036515A (en) 2022-02-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant