CN110572397B - Flow-based webshell detection method

Info

Publication number: CN110572397B
Application number: CN201910851073.4A
Authority: CN
Inventors: 徐钟豪; 孟雷; 谢忱
Original assignee: Shanghai Douxiang Information Technology Co ltd
Current assignee: Shanghai Douxiang Information Technology Co ltd
Priority date: 2019-09-10
Filing date: 2019-09-10
Publication date: 2022-05-24
Anticipated expiration: 2039-09-10
Also published as: CN110572397A

Abstract

The invention relates to a flow-based webshell detection method for performing webshell detection on the flow generated by the same website. The method comprises the following steps. Establishing a training model: the training flows are grouped according to a url + parameter pattern, and the attributes of each url + parameter group are analyzed to obtain a url + parameter entity and a pattern portrait for each group of training flows. Detecting: for a website whose training model has been established, the flows to be detected are grouped according to the url + parameter pattern, and the url + parameters of all flows to be detected are compared with the url + parameters of all training flows; if a flow to be detected has the same url but different parameters, its statistical score, similarity score and suspicion degree are each judged in order to detect abnormal flows containing a webshell, and false-alarm reduction is applied to the abnormal flows. The method thus establishes a flow-based webshell training model, performs detection according to that model, and accurately and effectively detects the flows to be removed with a low false-alarm rate.

Description

Flow-based webshell detection method

Technical Field

The invention relates to the technical field of network security, in particular to a method for detecting webshell based on flow.

Background

With the rapid development and popularization of internet technology, the server security problem is more and more serious, and even the normal operation of network services is seriously threatened. It is therefore important to detect vulnerabilities and backdoors of the server to ensure server security. The webshell is a pure text file, is easy to deform, is flexible to use, and is easy to confuse or hide feature codes, so that the webshell is difficult to quickly and accurately detect and remove based on a feature matching detection method.

Currently, most existing webshell detection is based on rule matching, which produces a large number of false positives and false negatives and is completely ineffective against unknown types of webshell. In existing products, flow-based webshell detection depends on the extraction of normal samples and the selection of features, and a single flow cannot reflect the essential features of a webshell, so no good model exists in existing products. In addition, existing machine-learning model training depends heavily on sample quality, but good positive and negative samples are often hard to acquire. For example, positive samples differ greatly from one another and have varied characteristics, so it is hard to find characteristics common to normal samples that can serve as training features; negative samples are rare and hard to collect, and the data produced in a security research lab differs somewhat from real attack traffic. Because model features are hard to design and extract and samples are hard to collect, existing machine-learning or statistical models struggle to be accurate and effective. Moreover, the prior art can hardly reduce false positives and false negatives at the same time: adding rules to reduce false negatives generates more false positives, while removing rules to reduce false positives generates more false negatives.

Therefore, it is necessary to provide a flow-based webshell detection method that establishes a flow-based webshell training model, performs detection according to the training model, and accurately and effectively detects the flows to be removed with low false-alarm and missed-detection rates.

Disclosure of Invention

The invention aims to provide a flow-based webshell detection method that establishes a flow-based webshell training model, performs detection according to the training model, and accurately and effectively detects the flows to be removed with low missed-detection and false-alarm rates.

In order to solve the problems in the prior art, the invention provides a flow-based webshell detection method, which is used for performing webshell detection on the flow generated by the same website and comprises the following steps:

establishing a training model, grouping training flows according to a url + parameter mode, analyzing the attribute of each group of url + parameters to obtain a url + parameter entity of each group of training flows, and obtaining a mode portrait of each group of training flows according to the url + parameter entity of each group of training flows to obtain the training model; obtaining the jump relation between each group of url + parameters according to the jump relation between each training flow, thereby obtaining the graph model of each group of url + parameters, wherein all the graph models form a jump model graph;

detecting: for a website whose training model has been established, grouping the flows to be detected according to the url + parameter pattern, wherein the parameter is the access key value of the flow to be detected; analyzing the attributes of each url + parameter group to obtain a url + parameter entity for each group of flows to be detected; comparing the url + parameters of all flows to be detected with the url + parameters of all training flows; if there are flows to be detected with the same url but different parameters, calculating the statistical score and the similarity score of each such flow, comparing these scores with preset thresholds, and judging the suspicion degree of each such flow so as to detect abnormal flows containing a webshell; and performing false-alarm reduction on the abnormal flows to obtain the flows to be removed and the normal flows.

Optionally, in the flow-based webshell detection method, when the flow to be detected has the same url and different parameters compared with the training flow, calculating the statistical score includes the following steps:

extracting the group of flows to be detected corresponding to the same url + different parameters together with its statistical features, wherein the statistical features comprise out-degree, in-degree, access IP diversity and useragent diversity, and the statistical score is calculated as follows:

wherein T is the statistical score of each flow to be detected in the extracted group of flows to be detected, and T₁, T₂ … T₄ are the feature values of the respective statistical features.

Optionally, in the method for detecting a webshell based on a flow, T₁, T₂ … T₄ are the feature values of the respective statistical features, and each feature value is calculated as T_x = 1 - e^(-x), wherein T_x is the feature value of any one of the statistical features, e is a constant, and x is the out-degree count, in-degree count, access IP diversity count or useragent diversity count of the extracted group of flows to be detected.
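As an illustration only, the feature-value mapping T_x = 1 - e^(-x) can be sketched in Python as follows. The function name and the sample counts are hypothetical; the formula that combines T₁ … T₄ into the statistical score T is not reproduced in the text, so the sketch stops at the individual feature values.

```python
import math


def feature_value(x: int) -> float:
    """Map a non-negative count x into [0, 1) using T_x = 1 - e^(-x)."""
    if x < 0:
        raise ValueError("count must be non-negative")
    return 1.0 - math.exp(-x)


# Hypothetical counts for one extracted group of flows to be detected.
counts = {
    "out_degree": 0,
    "in_degree": 1,
    "access_ip_diversity": 3,
    "useragent_diversity": 2,
}
features = {name: feature_value(n) for name, n in counts.items()}
```

A count of 0 maps to 0, and larger counts approach 1 without reaching it, which keeps the four features on a common scale.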

Optionally, in the method for detecting a webshell based on a flow,

if, in the extracted group of flows to be detected, a flow to be detected jumps to a training flow, one out-degree is recorded for that flow to be detected;

if, in the extracted group of flows to be detected, a training flow jumps to a flow to be detected, one in-degree is recorded for that flow to be detected;

and the numbers of out-degrees and in-degrees of all flows to be detected in the group are counted.
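The counting rule above can be sketched as follows, assuming (hypothetically) that each observed jump is available as a (source, destination) pair of url + parameter keys; the text does not specify a data representation.

```python
def count_degrees(jumps, group_keys, training_keys):
    """Count out-degrees and in-degrees for a group of flows to be detected.

    jumps: iterable of (source_key, destination_key) pairs (hypothetical format).
    group_keys: url + parameter keys of the extracted group to be detected.
    training_keys: url + parameter keys known from the training flows.
    """
    jumps = list(jumps)
    # Out-degree: a flow to be detected jumps to a training flow.
    out_degree = sum(
        1 for src, dst in jumps if src in group_keys and dst in training_keys
    )
    # In-degree: a training flow jumps to a flow to be detected.
    in_degree = sum(
        1 for src, dst in jumps if src in training_keys and dst in group_keys
    )
    return out_degree, in_degree


observed = [
    ("/a.php?cmd", "/index.php?"),
    ("/index.php?", "/a.php?cmd"),
    ("/index.php?", "/a.php?cmd"),
    ("/b.php?", "/c.php?"),
]
degrees = count_degrees(observed, {"/a.php?cmd"}, {"/index.php?"})
```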

Optionally, in the flow-based webshell detection method, when the flow to be detected has the same url and different parameters as compared with the training flow, calculating the similarity score includes the following steps:

extracting the group of flows to be detected corresponding to the same url + different parameters, and acquiring at least one group of training flows, which are identical to the url of the group of flows to be detected, in the training model;

the similarity score is calculated by the formula:

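wherein D is the similarity score of each flow to be detected in the extracted group of flows to be detected, S₁, S₂ … S_z are the similarity values between the extracted group of flows to be detected and each obtained group of training flows, and z is the number of obtained groups of training flows.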

Optionally, in the flow-based webshell detection method, a calculation formula of similarity values between the extracted group of flows to be detected and the obtained groups of training flows is as follows:

or

S is the similarity value between the extracted group of flows to be detected and each acquired group of training flows, i is a lower limit, n is an upper limit, p is a sequence calculation value in the user portrait of each acquired group of training flows, y_i is the vector value of each attribute extracted from the same url + different parameters, Y_i is the frequency of each attribute in the user portrait of each acquired group of training flows, y_i and Y_i correspond one-to-one by attribute, and J is an empirically obtained value used for averaging.

Optionally, in the flow-based webshell detection method, the manner of determining the statistical score and the similarity score of each flow to be detected corresponding to the same url + different parameters is as follows:

a threshold of the statistical score and a threshold of the similarity score are preset, and the statistical score and the similarity score of each flow to be detected corresponding to the same url + different parameters are compared with these preset thresholds; if the statistical score is lower than its threshold and the similarity score is lower than its threshold, the suspicion degree is judged; otherwise the suspicion degree is not calculated and the flow to be detected is confirmed as normal flow.
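A minimal sketch of this gating step is shown below. The threshold values are placeholders chosen for illustration; the text does not give concrete values.

```python
def needs_suspicion_check(stat_score, sim_score, stat_threshold, sim_threshold):
    """Return True only when both scores fall below their preset thresholds."""
    return stat_score < stat_threshold and sim_score < sim_threshold


# Hypothetical thresholds; real values would be tuned per website.
STAT_THRESHOLD = 0.5
SIM_THRESHOLD = 0.5

low_both = needs_suspicion_check(0.2, 0.1, STAT_THRESHOLD, SIM_THRESHOLD)
high_stat = needs_suspicion_check(0.7, 0.1, STAT_THRESHOLD, SIM_THRESHOLD)
```

Flows for which the function returns False are treated as normal flow without any suspicion degree calculation.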

Optionally, in the flow-based webshell detection method, the suspicion degree determination includes a general webshell suspicion degree determination and an ice scorpion type webshell suspicion degree determination; a flow to be detected is confirmed as normal flow only if both determinations judge it to be normal, and otherwise it is abnormal flow.

Optionally, in the method for detecting a webshell based on a flow, before determining a suspicious degree of a general webshell, the method further includes the following steps:

obtaining keywords of known webshells, counting the number of webshell keywords contained in the request body part and the response part of each flow to be detected, and recording the number as n₁;

obtaining the cookie parameters in each flow to be detected and the cookie parameters in all training flows, counting the number of cookie parameters in each flow to be detected that do not exist in any training flow, and recording the number as n₂;

the suspicion degree score is calculated as K = 1 - e^(q*n), wherein K is the suspicion degree score of each flow to be detected, e is a constant, q is a weight for adjusting the convergence speed, n = n₁ + n₂, and the n₁ and n₂ used in each calculation belong to the same flow to be detected.
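The following sketch illustrates one possible reading of this calculation. Several points are assumptions not stated in the text: n₁ is taken as the number of distinct known keywords present, the keyword list is invented for the example, and q is given a negative default so that the printed form K = 1 - e^(q*n) stays in [0, 1) and grows with n.

```python
import math


def keyword_hits(text, keywords):
    """n1 (assumed reading): number of distinct known keywords found in text."""
    return sum(1 for kw in keywords if kw in text)


def unseen_cookie_count(flow_cookies, training_cookies):
    """n2: cookie parameter names in the flow that never occur in training."""
    return len(set(flow_cookies) - set(training_cookies))


def suspicion_score(n1, n2, q=-0.5):
    """K = 1 - e^(q*n) with n = n1 + n2; q < 0 is an assumption of this sketch."""
    return 1.0 - math.exp(q * (n1 + n2))


keywords = ["eval(", "base64_decode"]  # hypothetical keyword list
n1 = keyword_hits("cmd=eval(base64_decode($_POST[x]))", keywords)
n2 = unseen_cookie_count(["PHPSESSID", "k"], ["PHPSESSID"])
score = suspicion_score(n1, n2)
```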

Optionally, in the method for detecting a webshell based on a flow, the step of judging the suspicious degree of a general webshell includes the following steps:

setting a threshold value of a suspicious degree score, comparing the calculated suspicious degree score with the threshold value of the suspicious degree score, if the suspicious degree score is higher than the threshold value of the suspicious degree score, confirming that the flow to be detected contains a common webshell, and confirming that the flow to be detected is abnormal flow; otherwise, judging that the flow to be detected is normal.

Optionally, in the method for detecting a webshell based on a flow, the determining of the suspiciousness of the ice scorpion type webshell includes the following steps:

if the responseBody information entropy formula of the flow to be detected is calculated to be 0, the length of the responseBody exceeds 1000, and a blank space is not contained in the flow to be detected, judging that the flow to be detected contains an ice scorpion type webshell, and confirming that the flow to be detected is an abnormal flow; otherwise, judging that the flow to be detected is normal.

Optionally, in the flow-based webshell detection method, the false alarm reduction processing is performed on abnormal flow, and the method includes the following steps:

counting resource files in the training model, wherein the resource files are provided by responseBody in all training traffic, and loading data provided by responseBody in each abnormal traffic;

if the data provided by the responseBody of any abnormal flow exists in the resource files, that abnormal flow is reclassified as normal flow; otherwise, it is confirmed as a flow to be removed and is removed.
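The false-alarm reduction step can be sketched as a set-membership test. The dictionary layout of a flow and the use of exact string equality on the response body are assumptions of this example; the text does not say how the comparison is performed.

```python
def reduce_false_alarms(abnormal_flows, training_response_bodies):
    """Split abnormal flows into (normal, to_remove).

    A flow is reclassified as normal when its responseBody already appears
    among the resource files collected from the training flows.
    """
    resource_files = set(training_response_bodies)
    normal, to_remove = [], []
    for flow in abnormal_flows:
        if flow["responseBody"] in resource_files:
            normal.append(flow)
        else:
            to_remove.append(flow)
    return normal, to_remove


abnormal = [
    {"id": 1, "responseBody": "body { margin: 0 }"},
    {"id": 2, "responseBody": "uid=33(www-data)"},
]
normal, to_remove = reduce_false_alarms(abnormal, ["body { margin: 0 }", "ok"])
```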

Optionally, in the method for detecting a webshell based on traffic, in the process of establishing a training model, the training traffic and the traffic to be detected are grouped according to a url + parameter mode, and the method further includes the following steps:

by analyzing the flow, the request type is found to be get or post; if the request type is get, the url and the parameter names that follow it in the request are extracted to form the url + parameter pattern; if the request type is post, the contents of the requestBody are spliced after the url to form the url + parameter pattern.
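A rough sketch of deriving a url + parameter key with Python's standard `urllib.parse` module follows. It assumes that only parameter names (not values) enter the key, that a post body is form-encoded, and that the path alone identifies the url because all flows belong to the same website; none of these details is fixed by the text.

```python
from urllib.parse import parse_qsl, urlsplit


def url_param_pattern(method, url, request_body=""):
    """Build a 'path?name1&name2' grouping key from a request."""
    parts = urlsplit(url)
    # get: parameter names come from the query string;
    # post: they are taken from the (assumed form-encoded) request body.
    raw = parts.query if method.upper() == "GET" else request_body
    names = sorted({key for key, _ in parse_qsl(raw, keep_blank_values=True)})
    return parts.path + "?" + "&".join(names)


get_key = url_param_pattern("GET", "http://example.com/a.php?id=1&cmd=ls")
post_key = url_param_pattern("POST", "http://example.com/a.php", "pass=abc")
```

Sorting the names makes the key independent of parameter order, so `?id=1&cmd=ls` and `?cmd=ls&id=2` fall into the same group.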

Optionally, in the method for detecting a webshell based on a flow, in the detection process, the method further includes the following steps:

comparing the url + parameters of all the flows to be detected with the url + parameters of all the training flows, and if the flows to be detected have the same url and the same parameters, confirming the flow to be detected as the normal flow; if the url is different, the similarity score of the flow to be detected is defaulted to be 0.
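The three outcomes of this comparison can be sketched as below. The `Match` enum and the layout of the training index (url mapped to a set of parameter-name groups) are hypothetical choices made for the example.

```python
from enum import Enum


class Match(Enum):
    KNOWN = "same url, same parameters"        # confirmed as normal flow
    NEW_PARAMS = "same url, different parameters"  # goes on to scoring
    NEW_URL = "different url"                  # similarity score defaults to 0


def classify(url, params, training):
    """Compare one flow's url + parameters with the training model.

    training: dict mapping url -> set of frozenset(parameter names).
    """
    if url not in training:
        return Match.NEW_URL
    if frozenset(params) in training[url]:
        return Match.KNOWN
    return Match.NEW_PARAMS


training = {"/a.php": {frozenset({"id"}), frozenset({"id", "page"})}}
known = classify("/a.php", ["id"], training)
new_params = classify("/a.php", ["cmd"], training)
new_url = classify("/shell.php", ["cmd"], training)
```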

Optionally, in the method for detecting a webshell based on a flow, the attributes of each group of url + parameters in the training model include: requestMethod, requestHeader, referrer, requestContentType, requestBody_null, requestBody_xml, requestBody_json, requestBody_kv_base64, requestKv_len, cookie_key, cookie_num, responseHeader, responseContentType, responseBody_garble, responseBody_keynum, and responseBody.

Optionally, in the method for detecting a webshell based on traffic, the attributes involved in the calculation of the similarity score include requestMethod, requestHeader, referrer, requestContentType, requestBody_null, requestBody_xml, requestBody_json, requestBody_kv_base64, requestBody_kv_len, cookie_num, responseHeader, responseContentType, and responseBody.

The flow-based webshell detection method performs webshell detection on the flow generated by the same website. By establishing the training model, the user portrait and the flow jump model graph are obtained, which realizes the establishment of a flow-based webshell training model. For a website whose training model has been established, the url + parameters of all flows to be detected are compared with the url + parameters of all training flows; if a flow to be detected has the same url but different parameters, its statistical score, similarity score and suspicion degree are each judged to detect abnormal flows containing a webshell, so that flow-based webshells are detected and missed detections are reduced. By applying false-alarm reduction to the abnormal flows, the flows to be removed are detected accurately and effectively with a low false-alarm rate.

Drawings

Fig. 1 is a flowchart of a method for detecting a webshell based on a flow according to an embodiment of the present invention;

FIG. 2 is a flowchart of establishing a training model according to an embodiment of the present invention;

fig. 3 is a detection flowchart according to an embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described in more detail below with reference to the schematic drawings. The advantages and features of the present invention will become more apparent from the following description. Note that the drawings are in a very simplified form and not to precise scale; they serve only to illustrate the embodiments of the present invention conveniently and clearly.

Hereinafter, if the method described herein comprises a series of steps, the order of such steps presented herein is not necessarily the only order in which such steps may be performed, and some of the described steps may be omitted and/or some other steps not described herein may be added to the method.

In the prior art, most webshell detection methods rely on rule matching and cannot accurately and effectively detect the flows to be removed with low false-alarm and missed-detection rates. It is therefore necessary to provide a flow-based webshell detection method. As shown in fig. 1, which is a flowchart of a flow-based webshell detection method according to an embodiment of the present invention, the method performs webshell detection on the flow generated by the same website and includes the following steps:

establishing a training model, extracting the generated flow of the same website as a training flow within a period of time (for example, one month), grouping the training flows according to a url + parameter mode, wherein the parameter is an access key value of the training flow, analyzing the attribute of each group of url + parameters to obtain a url + parameter entity of each group of training flows, and obtaining a mode portrait of each group of training flows according to the url + parameter entity of each group of training flows to obtain the training model; obtaining the jump relation between each group of url + parameters according to the jump relation between each training flow, thereby obtaining the graph model of each group of url + parameters, wherein all the graph models form a jump model graph;

detecting, for example, extracting all flows in a specific time period (for example, 2 hours) of the website to be detected as the flow to be detected, grouping the flow to be detected according to a url + parameter mode, wherein the parameter is an access key value of the flow to be detected, analyzing attributes of url + parameters of each group to obtain url + parameter entities of the flow to be detected of each group, comparing the url + parameters of all the flows to be detected with the url + parameters of all the training flows, if the flows to be detected have the same url but different parameters, respectively calculating statistical scores and similarity scores of the flows to be detected under the condition, comparing the statistical scores and the similarity scores of the flows to be detected with a preset threshold value, and judging the doubtful degree of the flows to be detected under the condition to detect abnormal flows containing webshell, and carrying out false alarm reduction treatment on the abnormal flow to obtain the flow to be removed and the normal flow.

Therefore, the method performs webshell detection on the flow generated by the same website. By establishing the training model, the user portrait and the flow jump model graph are obtained, which realizes the establishment of a flow-based webshell training model. For a website whose training model has been established, the url + parameters of all flows to be detected are compared with the url + parameters of all training flows; if a flow to be detected has the same url but different parameters, its statistical score, similarity score and suspicion degree are each judged to detect abnormal flows containing a webshell, so that flow-based webshells are detected and missed detections are reduced. By applying false-alarm reduction to the abnormal flows, the flows to be removed are detected accurately and effectively with a low false-alarm rate.

As shown in fig. 2, which is a flowchart for establishing a training model according to an embodiment of the present invention, in the process of establishing the training model the training flows and the flows to be detected are grouped according to the url + parameter pattern. By analyzing the flow, the request type is found to be get or post: if the request type is get, the url and the parameter names that follow it in the request are extracted to form the url + parameter pattern; if the request type is post, the contents of the requestBody are spliced after the url to form the url + parameter pattern. After the url + parameters of all flows of the same website have been analyzed, flows with the same url and parameters are attributed to the same url + parameter group.

Further, analyzing the attribute of each url + parameter to obtain the user portrait of each url + parameter, and counting the frequency of each attribute as follows:

(1) requestMethod: extract the request methods in all training flows and count the frequency of each specific method type;

(2) requestHeader: collect each parameter type in the request header and count the frequency of each parameter type;

(3) referrer: collect the links jumped from in the request header referrer, apply the same url + parameter operation with values erased, and then count the jump frequency of each link;

(4) requestContentType: extract all types of Content-Type in http.request.header_names of all training flows and count their frequencies;

(5) requestBody_null: determine whether http.response.body of the training flow is empty, and count the frequency of empty bodies;

(6) requestBody_xml: determine whether the training flow has an xml structure, and count the frequency of the xml structure;

(7) requestBody_json: determine whether http.response.body of the training flow has a json structure, and count the frequency of the json structure;

(8) requestBody_kv_base64: if http.response.body of the training flow is of json type, determine whether its parameters are base64-encoded, and count the frequency of base64 encoding;

(9) request_kv_len: if http.response.body of the training flow is of json type, determine whether its parameters exceed a certain threshold, and count the frequency of exceeding that threshold;

(10) cookie_key: count the parameter types of the cookies in all training flows and their occurrence frequencies;

(11) cookie_num: count the number of cookie parameters in all training flows and the frequency of training flows for each number;

(12) responseHeader: collect the response header types in the training flows and count the frequency of training flows for each type;

(13) responseContentType: collect all types of Content-Type in http.response.header_names of all training flows and count their frequencies;

(14) responseBody_garble: determine whether the responseBody is garbled, and count the frequency of garbled bodies;

(15) responseBody_keynum: if the responseBody is in key-value pair form, record the number of key-value pairs;

(16) responseBody common subsequence: if the responseBody is in xml form, extract its tag sequence to facilitate longest common subsequence computation.

Further, the jump relationship between the url + parameter groups is obtained from the jump relationship between the training flows, specifically as follows: the jump relationship between url + parameters of the same website or the same domain name is obtained through the jump relation between the training flow refer and the http. An in-degree is a jump to the current url + parameter entity, and an out-degree is a jump from the current url + parameter entity to a different url + parameter entity. A plurality of out-degrees and in-degrees are obtained for the different url + parameter entities so as to establish a graph model of the training flows, and all the graph models form a jump model graph.

Preferably, cookie parameters in all training traffic may be obtained first for facilitating suspicion degree calculation. The resource files in the training model, which are provided by responseBody in all training flows, may also be counted to participate in reducing false positives.

As shown in fig. 3, fig. 3 is a detection flowchart provided in the embodiment of the present invention, in the detection process, url + parameters of all flows to be detected are compared with url + parameters of all training flows, and if there are flows to be detected with the same url and the same parameters, the flow to be detected is determined as a normal flow; if the url is different from the flow to be detected, the similarity score of the flow to be detected is defaulted to be 0; if the url of the flow to be detected is the same but the parameters of the flow to be detected are different, calculating the statistical score and the similarity score of each flow to be detected under the condition, comparing and judging the statistical score and the similarity score of each flow to be detected with a preset threshold value, and judging the doubtful degree of each flow to be detected under the condition, wherein the specific judgment is as follows:

compared with the training flow, when the flow to be detected has the same url and different parameters, the calculation of the statistical score comprises the following steps:

Preferably, each feature value is calculated as T_x = 1 - e^(-x), with domain [0, +∞) and value range [0, 1), wherein T_x is the feature value of any one of the statistical features, e is a constant, and x is the out-degree count, in-degree count, access IP diversity count or useragent diversity count of the extracted group of flows to be detected.

The number of the out-degree, in-degree, access IP diversity and useragent diversity is obtained by the following steps:

in the extracted group of flows to be detected, if a flow to be detected jumps to a training flow, one out-degree is recorded for that flow, and the out-degrees of all flows to be detected in the extracted group are counted over the detection time period;

in the extracted group of flows to be detected, if a training flow jumps to a flow to be detected, one in-degree is recorded for that flow, and the in-degrees of all flows to be detected in the extracted group are counted over the detection time period;

all flows to be detected in the detection time period are examined, and the number of distinct access IPs (after deduplication) for the extracted same url + different parameters is counted;

all flows to be detected in the detection time period are examined, and the number of distinct user agents (after deduplication) for the extracted same url + different parameters is counted. Useragent diversity is not considered for the ice scorpion type webshell, because the ice scorpion type webshell uses user agents at random and therefore shows many different user agents.
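The two diversity counts amount to counting distinct values after deduplication. A small sketch follows, in which the field names `ip` and `user_agent` are hypothetical and each flow is represented as a plain dictionary.

```python
def diversity(flows, field):
    """Number of distinct non-empty values of `field` across the given flows."""
    return len({flow[field] for flow in flows if flow.get(field)})


# Hypothetical flows sharing the same url + different parameters.
group = [
    {"ip": "10.0.0.1", "user_agent": "Mozilla/5.0"},
    {"ip": "10.0.0.1", "user_agent": "curl/8.0"},
    {"ip": "10.0.0.2", "user_agent": "Mozilla/5.0"},
    {"ip": "10.0.0.2"},
]
ip_diversity = diversity(group, "ip")
useragent_diversity = diversity(group, "user_agent")
```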

Compared with the training flow, when the flow to be detected has the same url and different parameters, the calculation of the similarity score comprises the following steps:

the similarity score is calculated by the formula:

(1)：

or (2):

Preferably, calculating the similarity score comprises the steps of:

extracting vector values of user images with the same url + different parameters to be detected, wherein the calculation formula of the similarity score is as follows:

wherein S is the similarity score of the same url + different parameters to be detected; y₁, y₂ … y₁₃ are the vector values of the 13 attributes in the user portrait of the same url + different parameters to be detected; Y₁, Y₂ … Y₁₃ are the frequencies of the 13 attributes in the user portrait of the url + parameter in the training model, the url in the training model that participates in the calculation being the same as the url to be detected; y₁ and Y₁, y₂ and Y₂ … y₁₃ and Y₁₃ correspond one-to-one by attribute; and J is an empirically obtained value used for averaging. Because some features are mutually exclusive or contained in one another, averaging with J keeps the final result between 0 and 1; J may be 12, for example.

Furthermore, the portrait matching the flow to be detected, or the portrait with the same url but different parameters, can be found in the user portraits of the training model. y₁, y₂ … y₁₃ may be calculated as follows:

y₁: requestMethod: in the training flows, the different httpRequestMethod types are obtained from the training flows with the same url + parameter; these types are then used as features, and the specific features and feature values of a flow to be detected with the same url + different parameters are determined by whether it matches each type. For example, if the get, post and options types appear in the request method of the training flow url + a1, it is determined whether a flow to be detected has the corresponding features: if the flow to be detected is a get request, the feature requestMethod_get is 1, the feature requestMethod_post is 0, and the feature requestMethod_options is 0;

y₂: requestHeader: in the training model, the request header types appearing in the training flows under the same url + parameter are counted, and each type is used as a requestHeader feature. For example, if the request header types found in all training flows under a given url + parameter of the training model are content-length, connection, accept, host and user-agent, then when a new flow to be detected arrives, its vector for this part is determined over the features requestHeader_content-length, requestHeader_connection, requestHeader_accept, requestHeader_host and requestHeader_user-agent; if the flow to be detected contains user-agent, host and accept, the corresponding features requestHeader_user-agent, requestHeader_host and requestHeader_accept are 1;

y₃: referrer: in the training model, the types of referrer of the training flows under the same url + parameter are counted and features are extracted from them. For example, if part of the training flow under the same url + parameter jumps from url a and part jumps from url b, the features used in the similarity calculation are the referrer features extracted for url a and url b;

y₄: requestContentType: one of the parameters in the requestHeader is contentType, which has its own parameter array; all training flows under the same url + parameter are examined to obtain all types covered by contentType. For example, if the contentType types extracted from all training flows of a certain url + parameter of the training model include application/x-www-form-urlencoded, text/html, text/plain and the like, the similarity features under that url + parameter are contentType.application/x-www-form-urlencoded, contentType.text/html and contentType.text/plain. When a flow to be detected is examined, the content of its contentType is checked; if it contains text/html and text/plain, the feature values corresponding to these two types are both 1, and the rest are 0;

y₅: requestBody_null: judging whether the http request body is empty; if not, the value is 1;

y₆: requestBody_xml: judging whether the http request body is in xml format; if so, the value is 1;

y₇: requestBody_json: judging whether the http request body is in json format; if so, the value is 1;

y₈: requestBody_kv_base64: judging whether the http request body contains a Base64 string; if so, the value is 1;

y₉: requestBody_kv_len: judging whether the http request body contains a long string exceeding a certain threshold; if so, the value is 1;

y₁₀: cookie_num: in the training model, the numbers of cookies occurring in the training flows under the same url + parameter are counted; for example, if the cookies of the training flows contain 3 or 4 parameter key-value pairs, features are extracted according to cookie_num. If the flow to be detected has 3 key-value pairs, the feature value of cookie_num_3 is 1, and the feature values of the remaining features such as cookie_num_1 and cookie_num_2 are all 0;

y₁₁: responseHeader: in the training model, the types of responseHeader parameter in the training flows are counted, and features are extracted according to the counted parameter types. When a flow to be detected is compared, it is judged whether it matches a given responseHeader parameter type; if so, the feature value is 1, otherwise it is 0;

y₁₂: responseContentType: the http response Content-Type values of the training flows in the training model are counted, and when a flow to be detected is examined, features are extracted according to the counted types; for example, if text/html appears in the training model, the feature value corresponding to responseContentType.text/html is 1 when the similarity is calculated;

y₁₃: responseBody: this part is divided into three features. The first feature: if the response returns garbled text (messy code), the extracted feature is responseBody_garble. The second feature is for the json format: the number of key-value pairs is taken as the feature type, for example responseBody.1, responseBody.2 and responseBody.3; the specific numbers are obtained from statistics over all training flows with the same url and the same parameter in the training model. The third feature is the responseBody common subsequence, for the xml format: all dom subsequences of the url + parameter in the training model are extracted first, and then the dom tree subsequence of the flow to be detected is extracted.
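The binary encoding used for attributes such as requestMethod, requestHeader and requestContentType can be sketched as follows. This is only an illustration: the function name `encode` and the "attribute.value" naming of features are assumptions made for the example and are not prescribed by the method.

```python
from typing import Dict, Iterable, Set


def encode(attribute: str, training_values: Iterable[str], observed: Set[str]) -> Dict[str, int]:
    """One feature per value seen in training; 1 if the flow under test has it, else 0.

    Hypothetical helper: the "attribute.value" feature names are an assumption.
    """
    return {
        f"{attribute}.{value}": int(value in observed)
        for value in sorted(set(training_values))
    }


# Training flows of one url + parameter group used get, post and options;
# the flow to be detected is a get request.
method_vector = encode("requestMethod", ["get", "post", "options", "get"], {"get"})

# Training flows carried five header types; the flow to be detected carries three of them.
header_vector = encode(
    "requestHeader",
    ["content-length", "connection", "accept", "host", "user-agent"],
    {"user-agent", "host", "accept"},
)

print(method_vector)
print(header_vector)
```

Values that never occurred in training produce no feature at all in this sketch, which matches the description in that the feature set is derived from the training flows only.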

Further, the attributes of the user portrait in the training model that participate in the similarity calculation are requestMethod, requestHeader, referrer, requestContentType, requestBody_null, requestBody_xml, requestBody_json, requestBody_kv_base64, requestBody_kv_len, cookie_num, responseHeader, responseContentType, responseBody_garble, responseBody_keynum and responseBody. Of these, requestMethod, requestHeader, referrer, requestContentType, requestBody_null, requestBody_xml, requestBody_json, requestBody_kv_base64, requestBody_kv_len, cookie_num, responseHeader and responseContentType correspond one-to-one with the attributes y₁～y₁₂ of the flow to be detected when the similarity value is calculated. If the responseBody of the flow to be detected is garbled text or in key-value pair form, responseBody_garble or responseBody_keynum correspondingly participates in the similarity value calculation, and the calculation formula is the formula (1) described above; if the responseBody of the flow to be detected is in xml form, the sequence calculation value in the user portrait of the training flow is extracted to participate in the calculation of the similarity score, and the calculation formula is the formula (2).

Preferably, the manner of determining the statistical score and the similarity score of each flow to be detected corresponding to the same url + different parameters is as follows: respectively presetting a threshold value of a statistical score and a threshold value of a similarity score, wherein the threshold value of the statistical score can be 0.3, the threshold value of the similarity score can be 0.4, comparing and judging the statistical score and the similarity score of each to-be-detected flow corresponding to the same url + different parameters with the preset threshold values, judging the suspicious degree if the statistical score is lower than the threshold value of the statistical score and the similarity score is lower than the threshold value of the similarity score, and otherwise, not calculating the suspicious degree and confirming that the to-be-detected flow is the normal flow.
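The threshold comparison above can be sketched as a small gate function. The function name and the default arguments are illustrative assumptions; the values 0.3 and 0.4 are the example thresholds given in the description.

```python
STAT_THRESHOLD = 0.3  # example threshold of the statistical score
SIM_THRESHOLD = 0.4   # example threshold of the similarity score


def needs_suspicion_check(stat_score: float, sim_score: float,
                          stat_threshold: float = STAT_THRESHOLD,
                          sim_threshold: float = SIM_THRESHOLD) -> bool:
    """True only when BOTH scores are below their thresholds.

    In every other case the flow to be detected is treated as normal
    and the suspicion degree is not calculated.
    """
    return stat_score < stat_threshold and sim_score < sim_threshold


print(needs_suspicion_check(0.1, 0.2))  # both below: suspicion degree is judged
print(needs_suspicion_check(0.1, 0.5))  # similarity high enough: normal flow
```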

Furthermore, in the flow-based webshell detection method, the suspicion degree judgment comprises a general webshell suspicion degree judgment and an ice scorpion type webshell suspicion degree judgment. The flow to be detected is confirmed to be normal flow only if both judgments find it normal; otherwise, it is abnormal flow.

Specifically, the method further comprises the following steps before judging the suspiciousness of the general webshell:

acquiring keywords of known webshells, including: preg_replace, system_exec, eval, assert, system, z0, z1, z2, array_map, portscan, secinfo, nowshow, mysql ladmin, sqlfile, phpenv, secinfo, b37k, FromBase64String, /etc/hosts, /etc/passwd, caidao, chopper, 64che10rpass, action=cmd, /(etc./proc) and the like; the number of webshell keywords in the request body part and the response part of each flow to be detected is counted and recorded as n₁;

obtaining the cookie parameters in each flow to be detected and the cookie parameters in all training flows, counting the number of cookie parameters in each flow to be detected that do not exist in any training flow, and recording the number as n₂;

The calculation formula of the suspicion degree score is: K＝1-e^(-q*n), with domain [0, +∞) and value range [0, 1), wherein K is the suspicion degree score of each flow to be detected, e is the natural constant, q is a weight for adjusting the convergence speed, which may be 1 for example and may be adjusted according to the accuracy of the final result, and n＝n₁+n₂, where the n₁ and n₂ participating in each calculation belong to the same flow to be detected.

Preferably, the general webshell suspicion degree judgment method comprises the following steps: setting a threshold value of the suspicious degree score, for example, the threshold value of the suspicious degree score may be 0.4, comparing the calculated suspicious degree score with the threshold value of the suspicious degree score, if the suspicious degree score is higher than the threshold value of the suspicious degree score, confirming that the flow to be detected contains a general webshell, and confirming that the flow to be detected is abnormal flow; otherwise, judging that the flow to be detected is normal.
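The general webshell suspicion judgment can be sketched as follows, reading the score as K = 1 - e^(-q*n) with n = n₁ + n₂. The function names, the shortened keyword list, and the choice to count each keyword at most once per flow are assumptions made for illustration; plain substring matching is used only to keep the sketch short.

```python
import math
from typing import Iterable, Set

# A few of the keywords listed in the description (assumed subset for the example).
WEBSHELL_KEYWORDS = ("eval", "assert", "preg_replace", "array_map",
                     "/etc/passwd", "caidao", "chopper")


def count_keywords(request_body: str, response_body: str,
                   keywords: Iterable[str] = WEBSHELL_KEYWORDS) -> int:
    """n1: number of distinct keywords found in the request body or the response."""
    text = (request_body + "\n" + response_body).lower()
    return sum(1 for keyword in keywords if keyword in text)


def count_unseen_cookie_params(flow_cookies: Iterable[str], training_cookies: Set[str]) -> int:
    """n2: cookie parameters of the flow that never appear in any training flow."""
    return sum(1 for name in set(flow_cookies) if name not in training_cookies)


def suspicion_score(n1: int, n2: int, q: float = 1.0) -> float:
    """K = 1 - e^(-q*n), n = n1 + n2; rises from 0 toward 1 as n grows."""
    return 1.0 - math.exp(-q * (n1 + n2))


def is_general_webshell(score: float, threshold: float = 0.4) -> bool:
    """Abnormal when the score is higher than the threshold (0.4 in the example)."""
    return score > threshold


n1 = count_keywords("z0=eval(base64_decode($_POST[z1]))", "")
n2 = count_unseen_cookie_params(["PHPSESSID", "x"], {"PHPSESSID"})
score = suspicion_score(n1, n2)
print(n1, n2, round(score, 4), is_general_webshell(score))
```

With q = 1 and the threshold 0.4, a single keyword hit or a single unseen cookie parameter already gives K ≈ 0.63 and marks the flow as abnormal; a larger threshold or a smaller q makes the judgment less sensitive.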

Preferably, the ice scorpion type webshell suspicion degree judgment is performed as follows: if the information entropy calculated for the responseBody of the flow to be detected is 0, the length of the responseBody exceeds 1000, and the responseBody contains no space, the flow to be detected is judged to contain an ice scorpion type webshell and is confirmed to be abnormal flow; otherwise, the flow to be detected is judged to be normal.
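The three conditions of the ice scorpion type judgment can be sketched as follows. The description does not give the entropy formula, so character-level Shannon entropy is assumed here; the entropy value 0 and the length 1000 are taken literally from the text and exposed as parameters. All function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits (assumed formula)."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(text).values())


def looks_like_ice_scorpion(response_body: str,
                            entropy_target: float = 0.0,
                            min_length: int = 1000) -> bool:
    """All three conditions must hold: entropy, length and absence of spaces."""
    return (
        math.isclose(shannon_entropy(response_body), entropy_target, abs_tol=1e-9)
        and len(response_body) > min_length
        and " " not in response_body
    )


def is_normal(general_abnormal: bool, ice_scorpion_abnormal: bool) -> bool:
    """A flow is normal only if both suspicion judgments found it normal."""
    return not (general_abnormal or ice_scorpion_abnormal)


print(looks_like_ice_scorpion("a" * 1001))
print(is_normal(general_abnormal=False, ice_scorpion_abnormal=False))
```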

By adopting the method for judging the statistical score, the similarity score and the suspicion degree, all the flow to be detected is judged for three times, so that the condition of missing report can be greatly reduced.

Further, in order to improve the accuracy of webshell detection, the false alarm reduction treatment of the abnormal flow comprises the following steps: counting the resource files in the training model, wherein the resource files are those provided by the responseBody in all training flows, mainly for the xml format, and include resource files with suffixes such as js, html, htm, jpg, png, bmp, svg, jpeg, pdf, json, xml, zip, rar, txt, cgi, doc, docx, csv, xls, xlsx, ppt and pptx; then loading the data provided by the responseBody in the user attributes corresponding to each abnormal flow; if the data provided by the responseBody of an abnormal flow exists in the resource files, the abnormal flow is reconfirmed as normal flow; otherwise, it is confirmed as flow to be removed and is removed.
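The false alarm reduction step can be sketched as a set lookup. The sketch assumes that the "data provided by the responseBody" can be represented as a resource path string; that representation, the function names and the sample paths are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Suffixes listed in the description.
RESOURCE_SUFFIXES = {
    "js", "html", "htm", "jpg", "png", "bmp", "svg", "jpeg", "pdf", "json", "xml",
    "zip", "rar", "txt", "cgi", "doc", "docx", "csv", "xls", "xlsx", "ppt", "pptx",
}


def collect_resources(training_items: Iterable[str]) -> Set[str]:
    """Keep the training items whose suffix marks them as a resource file."""
    return {
        item for item in training_items
        if "." in item and item.rsplit(".", 1)[-1].lower() in RESOURCE_SUFFIXES
    }


def reduce_false_alarms(abnormal_flows: Iterable[Tuple[str, str]],
                        resources: Set[str]) -> Tuple[List[str], List[str]]:
    """Split (flow_id, resource) pairs into flows to be removed and flows restored to normal."""
    to_remove: List[str] = []
    normal: List[str] = []
    for flow_id, item in abnormal_flows:
        # A flow whose resource was already seen in training is a false alarm.
        (normal if item in resources else to_remove).append(flow_id)
    return to_remove, normal


resources = collect_resources(["/static/app.js", "/img/logo.png", "/api/run"])
to_remove, normal = reduce_false_alarms(
    [("f1", "/static/app.js"), ("f2", "/upload/shell.php")], resources
)
print(to_remove, normal)
```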

In addition, if the user judges from the results displayed at the front end that a false alarm exists, the flow is confirmed manually, and the confirmed flow is manually added to the trained user portrait during the detection process, so that similar false alarms are avoided next time and the training model is improved.

Furthermore, the invention also adds the detected flow with normal detection of the suspicious degree and normal statistical characteristics into the user portrait, thereby facilitating the next detection and continuously improving the training model.

In conclusion, in the flow-based webshell detection method provided by the invention, webshell detection is performed on the flow generated by the same website, and the user portrait and the flow jump model diagram are obtained by establishing the training model, so that a flow-based webshell training model is established; websites with established training models are detected, the url + parameters of all flows to be detected are compared with the url + parameters of all training flows, and if flows to be detected with the same url and different parameters are found, the statistical score, the similarity score and the suspicion degree of each such flow are judged respectively to detect abnormal flow containing a webshell, so that flow-based webshell detection is achieved and missed detections are reduced; by carrying out false alarm reduction treatment on the abnormal flow, the flow to be removed is detected accurately and effectively with fewer false alarms.

The above description is only a preferred embodiment of the present invention, and does not limit the present invention in any way. It will be understood by those skilled in the art that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for detecting a webshell based on flow is characterized in that the method for detecting the webshell detects the flow generated by the same website, and comprises the following steps:

detecting: detecting a website with an established training model, grouping the flows to be detected according to a url + parameter mode, wherein the parameter is an access key value of the flow to be detected, analyzing the attributes of each group of url + parameters to obtain a url + parameter entity of each group of flows to be detected, and comparing the url + parameters of all the flows to be detected with the url + parameters of all the training flows; if flows to be detected have the same url but different parameters, respectively calculating the statistical score and the similarity score of each such flow to be detected; respectively presetting a threshold value of the statistical score and a threshold value of the similarity score, and comparing the statistical score and the similarity score of each flow to be detected corresponding to the same url + different parameters with the preset threshold values; if the statistical score is lower than the threshold value of the statistical score and the similarity score is lower than the threshold value of the similarity score, judging the suspicion degree; otherwise, not calculating the suspicion degree and confirming that the flow to be detected is normal flow; and carrying out false alarm reduction treatment on the abnormal flow to obtain the flow to be removed and the normal flow.

2. The method for flow-based webshell detection as claimed in claim 1, wherein calculating the statistical score when the flow to be detected has the same url and different parameters compared to the training flow comprises the steps of:

3. The flow-based webshell detection method of claim 2, wherein T₁, T₂…T₄ are the feature values of the statistical features, and the calculation formula of each feature value is as follows:

T_a＝1-e^(-x)，

wherein T_a is the feature value of any one of the statistical features, a is the serial number 1, 2, 3 or 4 of the feature value, e is a constant, and x is the out-degree, in-degree, access IP diversity or user diversity count of the extracted group of flows to be detected.

4. The method of flow-based webshell detection of claim 3,

5. The method for flow-based webshell detection as claimed in claim 1, wherein calculating the similarity score when the flow to be detected has the same url and different parameters compared to the training flow comprises the steps of:

the similarity score is calculated by the formula:

6. The flow-based webshell detection method as claimed in claim 5, wherein the calculation formula of the similarity value between the extracted group of flows to be detected and each acquired group of training flows is as follows:

or

S is the similarity value between the extracted group of flows to be detected and each acquired group of training flows, i is the lower limit, n is the upper limit, p is the sequence calculation value in the user portrait of each acquired group of training flows, y_i is the vector value of each attribute extracted for the same url + different parameters, Y_i is the frequency of each attribute in the user portrait of each acquired group of training flows, y_i and Y_i correspond one-to-one by attribute, and J is an empirically obtained value used for averaging.

7. The method for detecting the webshell based on the flow according to claim 1, wherein the suspicion degree judgment includes a general webshell suspicion degree judgment and an ice scorpion type webshell suspicion degree judgment, and the flow to be detected is confirmed to be normal flow only if both the general webshell suspicion degree judgment and the ice scorpion type webshell suspicion degree judgment find it normal; otherwise, the flow is judged to be abnormal.

8. The method for detecting the webshell based on the traffic as claimed in claim 7, wherein the method for detecting the suspicious degree of the generic webshell further comprises the following steps:

The calculation formula of the suspicion degree score is: K＝1-e^(-q*n), wherein K is the suspicion degree score of each flow to be detected, e is a constant, q is the weight for adjusting the convergence speed, and n＝n₁+n₂, where the n₁ and n₂ participating in each calculation belong to the same flow to be detected.

9. The method for detecting the webshell based on the traffic as claimed in claim 8, wherein the determining the suspiciousness of the generic webshell comprises the following steps:

10. The method for detecting the webshell based on the flow according to claim 7, wherein the judgment of the doubtful degree of the ice scorpion type webshell comprises the following steps:

11. The method for detecting the webshell based on the flow according to claim 1, wherein the abnormal flow is processed by reducing false alarm, and the method comprises the following steps:

12. The method for detecting a webshell based on traffic of claim 1, wherein in the process of establishing the training model, the training traffic and the traffic to be detected are grouped according to a url + parameter mode, and the method further comprises the following steps:

13. The method for detecting a webshell based on traffic flow of claim 1, wherein the detecting process further comprises the following steps:

14. The method for detecting the webshell based on the traffic as claimed in claim 1, wherein the attributes of each group of url + parameters in the training model comprise: requestMethod, requestHeader, referrer, requestContentType, requestBody_null, requestBody_xml, requestBody_json, requestBody_kv_base64, requestBody_kv_len, cookie_key, cookie_num, responseHeader, responseContentType, responseBody_garble, responseBody_keynum, and responseBody.

15. The method of detecting a webshell based on traffic of claim 1, wherein the attributes involved in the calculation of the similarity score include requestMethod, requestHeader, referrer, requestContentType, requestBody_null, requestBody_xml, requestBody_json, requestBody_kv_base64, requestBody_kv_len, cookie_num, responseHeader, responseContentType, and responseBody.