CN113110986A - WebShell script file detection method and system - Google Patents

Publication number
CN113110986A
Authority
CN
China
Prior art keywords
program
control flow
webshell
external input
command execution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010032760.6A
Other languages
Chinese (zh)
Inventor
杨荣海
王大伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd
Priority to CN202010032760.6A
Publication of CN113110986A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/36: Preventing errors by testing or debugging software
    • G06F11/3668: Software testing
    • G06F11/3672: Test management
    • G06F11/3688: Test management for test execution, e.g. scheduling of test suites
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/36: Preventing errors by testing or debugging software
    • G06F11/3668: Software testing
    • G06F11/3672: Test management
    • G06F11/3684: Test management for test design, e.g. generating new test cases
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Abstract

Embodiments of the invention provide a method and a system for detecting WebShell script files. A machine learning model is built on program analysis of the command execution program of a webpage to be detected, so that the extracted program features better reflect script behavior and WebShell detection becomes more accurate. The method comprises: constructing a program control flow and a program data flow for a command execution program of a webpage file in a training sample; extracting control flow features from the program control flow and data flow features from the program data flow, wherein the control flow features and the data flow features accurately describe the script behavior of the command execution program; training on the control flow features and the data flow features with a machine learning algorithm to generate a machine learning model; and detecting WebShell in the command execution program of a webpage file to be detected by using the machine learning model.

Description

WebShell script file detection method and system
Technical Field
The invention relates to the technical field of webpage security, in particular to a method and a system for detecting a WebShell script file.
Background
WebShell script file detection is an important problem. As Web services have developed, more and more attackers upload a crafted WebShell to a page directory on the server and then access the WebShell page to steal users' private information or to take control of the server. This has become a major Web security issue.
Traditional WebShell script detection relies primarily on two types of features:
1. Text features. The sample is examined using the number of calls to high-risk system functions, the number of occurrences of blacklisted keywords, or statistical features of the script file such as information entropy, index of coincidence, and compression ratio.
2. Syntactic features. The sample is parsed into an abstract syntax tree, and features such as function calls and obfuscation operations (for example, string concatenation) are extracted from the tree.
Syntactic features are more accurate than text features. Nevertheless, both types of features are relatively low-level, and on some samples they discriminate poorly. For example, eval() is a dangerous function commonly used by WebShell, but it is also called many times in normal script files. Text features or syntactic features alone cannot tell whether a given call is harmful. Existing solutions therefore have low accuracy on such samples.
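To make this limitation concrete, the following Python sketch computes three of the text features named above. The function names, the regular expression, and the two PHP snippets are illustrative assumptions and are not taken from the patent. It shows that a lexical call count assigns the same value to a harmful and a harmless use of eval().

```python
import math
import re
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Information entropy of the text, in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())

def index_of_coincidence(text: str) -> float:
    """Probability that two characters drawn without replacement are equal."""
    n = len(text)
    if n < 2:
        return 0.0
    return sum(c * (c - 1) for c in Counter(text).values()) / (n * (n - 1))

def count_eval_calls(source: str) -> int:
    """Number of textual eval( call sites, a purely lexical feature."""
    return len(re.findall(r"\beval\s*\(", source))

# Hypothetical samples: the first executes attacker input, the second a constant.
malicious = '<?php eval($_POST["c"]); ?>'
benign = '<?php eval("return 1 + 1;"); ?>'

# Both samples contain exactly one eval( call site, so this feature alone
# cannot separate them.
same_count = count_eval_calls(malicious) == count_eval_calls(benign)
```

Telling the two samples apart requires knowing where the argument of eval() comes from, which is the kind of information that control flow and data flow analysis provide.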
Disclosure of Invention
Embodiments of the invention provide a method and a system for detecting WebShell script files, in which a machine learning model is built on program analysis of the command execution program of a webpage to be detected, so that the extracted program features better reflect script behavior and the accuracy of WebShell detection is improved.
A first aspect of the embodiments of the present application provides a method for detecting a WebShell script file, including:
constructing a program control flow and a program data flow for a command execution program of a webpage file in a training sample;
extracting control flow features from the program control flow and data flow features from the program data flow, wherein the control flow features and the data flow features accurately describe the script behavior of the command execution program;
training on the control flow features and the data flow features with a machine learning algorithm to generate a machine learning model;
and detecting WebShell in the command execution program of a webpage file to be detected by using the machine learning model.
Preferably, before detecting WebShell in the command execution program of the webpage file to be detected by using the machine learning model, the method further includes:
constructing a blacklist and whitelist filtering mechanism to filter out normal execution programs and WebShells from the command execution programs, leaving the undetermined execution programs;
the detecting of WebShell in the command execution program of the webpage file to be detected by using the machine learning model then includes:
performing WebShell detection on the undetermined execution programs by using the machine learning model.
Preferably, constructing the blacklist and whitelist filtering mechanism includes:
constructing the blacklist and whitelist filtering mechanism through at least one of rule matching, a hash algorithm, and a word vector matching algorithm, wherein the hash algorithm includes a strong hash algorithm and/or a weak hash algorithm.
Preferably, constructing the program control flow and the program data flow for the command execution program includes:
representing basic code blocks in the command execution program as nodes, wherein directed edges between nodes represent paths of the program control flow and back edges between nodes represent possible loops, so as to construct the program control flow;
and traversing the program control flow, recording the initialization points and reference points of variables, and storing the parameter information and data information corresponding to the initialization points and reference points, so as to construct the program data flow.
Preferably, extracting the control flow features from the program control flow includes:
extracting, as the control flow features, at least one of the following from the program control flow: a loop condition, a judgment condition, an external input value or a variable related to the external input value in the judgment condition, a comparison object in the judgment condition, and a judgment result, wherein the comparison object is the value against which the external input value or the variable related to the external input value is compared.
Preferably, extracting the data flow features from the program data flow includes:
extracting the dangerous functions in the program data flow;
analyzing the dangerous functions with taint propagation to determine whether the parameters of a dangerous function can receive external input;
and analyzing, by means of reaching definitions analysis, whether the external input can be passed to the dangerous function.
Preferably, analyzing the dangerous functions with taint propagation to determine whether the parameters of a dangerous function can receive external input includes:
determining whether the parameters of the dangerous function can receive external input by checking whether the path conditions and/or the function call relations are valid.
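As an illustration of the data flow analysis in this aspect, the following Python sketch runs a forward taint analysis over a small control flow graph. It is a simplified stand-in in the spirit of reaching definitions: instead of tracking every definition, it tracks only which variables may hold an externally supplied value at each point. The statement encoding, the DANGEROUS set, and the block names are assumptions made for this example and do not come from the patent.

```python
DANGEROUS = {"eval", "system", "exec"}  # illustrative sink list (assumption)

def transfer(stmts, tainted, findings=None):
    """Apply one basic block to the set of possibly tainted variables."""
    tainted = set(tainted)
    for stmt in stmts:
        if stmt[0] == "assign":
            _, target, sources, from_input = stmt
            if from_input or any(s in tainted for s in sources):
                tainted.add(target)      # new definition carries external input
            else:
                tainted.discard(target)  # clean definition kills the old one
        elif stmt[0] == "call":
            _, func, args = stmt
            hit = func in DANGEROUS and any(a in tainted for a in args)
            if findings is not None and hit:
                findings.append((func, tuple(args)))
    return tainted

def tainted_sinks(blocks, edges):
    """Return dangerous calls that external input may reach."""
    preds = {b: [] for b in blocks}
    for src, dsts in edges.items():
        for dst in dsts:
            preds[dst].append(src)
    out = {b: set() for b in blocks}
    changed = True
    while changed:  # iterate to a fixed point; loops in the graph are handled
        changed = False
        for b in blocks:
            in_set = set().union(*(out[p] for p in preds[b]))
            new_out = transfer(blocks[b], in_set)
            if new_out != out[b]:
                out[b], changed = new_out, True
    findings = []
    for b in blocks:
        in_set = set().union(*(out[p] for p in preds[b]))
        transfer(blocks[b], in_set, findings)
    return findings

blocks = {
    "B0": [("assign", "cmd", [], True)],   # $cmd = $_POST["c"]
    "B1": [("assign", "cmd", [], False)],  # $cmd = "ls"
    "B2": [("call", "system", ["cmd"])],   # system($cmd)
}
branching = {"B0": ["B1", "B2"], "B1": ["B2"]}  # B1 can be skipped
straight = {"B0": ["B1"], "B1": ["B2"]}         # B1 always runs
```

With the branching edges, the external input assigned in B0 can reach system() along the path that skips B1, so the call is reported. With the straight-line edges, the constant assignment in B1 always overwrites the input and nothing is reported. The sketch does not check whether a path condition is actually satisfiable.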
A second aspect of the embodiments of the present application provides a system for detecting a WebShell script file, including:
a construction unit, configured to construct a program control flow and a program data flow for a command execution program of a webpage file in a training sample;
an extraction unit, configured to extract control flow features from the program control flow and data flow features from the program data flow, wherein the control flow features and the data flow features accurately describe the script behavior of the command execution program;
a training unit, configured to train on the control flow features and the data flow features with a machine learning algorithm to generate a machine learning model;
and a detection unit, configured to detect WebShell in the command execution program of a webpage file to be detected by using the machine learning model.
Preferably, the system further comprises:
a filtering unit, configured to construct a blacklist and whitelist filtering mechanism to filter out normal execution programs and WebShells from the command execution programs, leaving the undetermined execution programs;
the detection unit is specifically configured to:
perform WebShell detection on the undetermined execution programs by using the machine learning model.
Preferably, the filtering unit is specifically configured to:
construct the blacklist and whitelist filtering mechanism through at least one of rule matching, a hash algorithm, and a word vector matching algorithm, wherein the hash algorithm includes a strong hash algorithm and/or a weak hash algorithm.
Preferably, the construction unit is specifically configured to:
represent basic code blocks in the command execution program as nodes, wherein directed edges between nodes represent paths of the program control flow and back edges between nodes represent possible loops, so as to construct the program control flow;
and traverse the program control flow, record the initialization points and reference points of variables, and store the parameter information and data information corresponding to the initialization points and reference points, so as to construct the program data flow.
Preferably, the extraction unit is specifically configured to:
extract, as the control flow features, at least one of the following from the program control flow: a loop condition, a judgment condition, an external input value or a variable related to the external input value in the judgment condition, a comparison object in the judgment condition, and a judgment result, wherein the comparison object is the value against which the external input value or the variable related to the external input value is compared.
Preferably, the extraction unit is specifically configured to:
extract the dangerous functions in the program data flow;
analyze the dangerous functions with taint propagation to determine whether the parameters of a dangerous function can receive external input;
and analyze, by means of reaching definitions analysis, whether the external input can be passed to the dangerous function.
Preferably, the extraction unit is specifically configured to:
determine whether the parameters of the dangerous function can receive external input by checking whether the path conditions and/or the function call relations are valid.
A third aspect of the embodiments of the present application provides a computer apparatus including a processor, wherein the processor, when executing a computer program stored in a memory, implements the method for detecting a WebShell script file provided in the first aspect of the embodiments of the present application.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for detecting a WebShell script file provided in the first aspect of the embodiments of the present application.
As can be seen from the above technical solutions, the embodiments of the invention have the following advantages:
In the embodiments of the present application, a command execution program of a webpage file in a training sample is obtained; a program control flow and a program data flow are constructed for the command execution program; control flow features and data flow features are extracted from the program control flow and the program data flow respectively, wherein the control flow features and the data flow features accurately describe the script behavior of the command execution program; a machine learning model is generated by training on these features with a machine learning algorithm; and WebShell in the command execution program of a webpage file to be detected is detected by using the machine learning model. Because the machine learning model is learned from the control flow features and data flow features of the training samples, and these features accurately describe the script behavior of the command execution program, the model identifies WebShell in the command execution program to be detected with higher accuracy.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a detection method for a WebShell script file in an embodiment of the present application;
FIG. 2 is a schematic diagram of another embodiment of a detection method for a WebShell script file in an embodiment of the present application;
FIG. 3 shows the detailed steps of step 101 in the embodiment of FIG. 1 of the present application;
FIG. 4 is a schematic diagram of a process for constructing a program control flow in an embodiment of the present application;
FIG. 5 is a schematic diagram of a process for constructing a program data flow in an embodiment of the present application;
FIG. 6 shows the detailed steps of step 102 in the embodiment of FIG. 1 of the present application;
FIG. 7 shows the detailed steps of step 602 in the embodiment of FIG. 6 of the present application;
FIG. 8 is a schematic diagram of an embodiment of a detection system for a WebShell script file in an embodiment of the present application.
Detailed Description
Embodiments of the invention provide a method and a system for detecting WebShell script files, in which program analysis is performed on the command execution program of a webpage so that the extracted program features better reflect script behavior and the accuracy of WebShell identification is improved.
To help those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, the technical terms used in this application are explained first. They are used with the same meaning in later sections and are not explained again.
WebShell: a command execution program in the form of a webpage file, also called a script file. It is an important means by which attackers compromise websites, and it usually exists as a webpage file such as asp, php, jsp, or cgi.
Program analysis: analysis of the internal operation flow of a program written in a given language, including data flow analysis and control flow analysis.
Control flow: describes the order and the calling relationships of the program's logical execution.
Data flow: describes how data flows and propagates, and what state it is in, while the program runs.
Abstract Syntax Tree (AST): an abstract representation of a source code syntax structure. It represents the syntactic structure of the programming language in the form of a tree, each node on the tree representing a structure in the source code.
Generalization ability: the ability of a machine learning algorithm to adapt to previously unseen samples; informally, its ability to learn. A good machine learning algorithm has strong generalization ability.
Next, the method for detecting a WebShell script file in the present application is described. Referring to FIG. 1, an embodiment of the method for detecting a WebShell script file in the present application includes:
101. Constructing a program control flow and a program data flow for a command execution program of a webpage file in a training sample;
A webpage file is generally executed by a corresponding command execution program (script). To improve the security of the webpage, the command execution program of the webpage file is examined so that any WebShell script file in it is discovered.
Specifically, WebShell is a command execution program in the form of a webpage file, also called a script file. It is an important means by which attackers compromise websites, and it usually exists as a webpage file such as asp, php, jsp, or cgi.
Unlike methods that examine the text features or syntactic features of the command execution program, this method, before any webpage file is examined, constructs a program control flow and a program data flow for the command execution program of each webpage file in the training sample, so that program analysis can be performed on the command execution program. The training sample contains command execution programs of normal webpages and command execution programs of problem webpages (webpages containing WebShell). The construction of the program control flow and the program data flow is described in the following embodiments and is not repeated here.
102. Extracting control flow features from the program control flow and data flow features from the program data flow, wherein the control flow features and the data flow features accurately describe the script behavior of the command execution program;
After the program control flow and the program data flow corresponding to the command execution program are obtained, the control flow features and the data flow features are extracted from them respectively. These features accurately describe the script behavior of the command execution program; in other words, they characterize the control information and the data information of the program.
The extraction of the control flow features and the data flow features is also described in detail in the following embodiments and is not repeated here.
103. Training on the control flow features and the data flow features with a machine learning algorithm to generate a machine learning model;
The control flow features and the data flow features obtained in step 102 are used as training input to a machine learning algorithm, which yields a machine learning model.
Specifically, the machine learning algorithm includes, but is not limited to, a support vector machine (SVM), a convolutional neural network (CNN), a decision tree, a random forest, and logistic regression; no limitation is imposed here.
104. Identifying WebShell in the command execution program of the webpage to be detected by using the machine learning model.
After the machine learning model is obtained, the control flow features and the data flow features of the command execution program of the webpage to be detected are input into the model, and the model outputs the WebShell detection result for that command execution program.
In this embodiment of the application, a program control flow and a program data flow are constructed for a command execution program of a webpage file in a training sample; control flow features and data flow features are extracted from them respectively, wherein the control flow features and the data flow features accurately describe the script behavior of the command execution program; a machine learning model is generated by training on these features with a machine learning algorithm; and WebShell in the command execution program of a webpage file to be detected is detected by using the machine learning model. Because the machine learning model is learned from the control flow features and data flow features of the training samples, and these features accurately describe the script behavior of the command execution program, the model identifies WebShell in the command execution program to be detected with higher accuracy.
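Steps 103 and 104 can be sketched in Python as follows. The sketch uses logistic regression, one of the algorithms listed above, written in plain Python so that it runs without third-party libraries. The three-element feature layout (input compared with a constant, number of dangerous calls reachable by external input, number of loops) and the six training vectors are invented for illustration and are not data from the patent.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def score(model, x) -> float:
    weights, bias = model
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def train(X, y, lr=0.5, epochs=500):
    """Fit logistic regression by per-sample gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(score((weights, bias), xi)) - yi
            weights = [w - lr * err * v for w, v in zip(weights, xi)]
            bias -= lr * err
    return weights, bias

def predict(model, x) -> int:
    """Return 1 for WebShell and 0 for a normal command execution program."""
    return 1 if sigmoid(score(model, x)) >= 0.5 else 0

# Hypothetical feature vectors:
# [input_compared_with_constant, tainted_dangerous_calls, loop_count]
X = [[1, 1, 0], [1, 2, 1], [0, 1, 0],   # WebShell samples
     [0, 0, 1], [0, 0, 0], [1, 0, 2]]   # normal samples
y = [1, 1, 1, 0, 0, 0]

model = train(X, y)                 # step 103: train on the extracted features
verdict = predict(model, [1, 3, 0]) # step 104: classify an unseen program
```

In practice the feature vectors would come from the control flow and data flow extraction of step 102, and any of the other listed algorithms could replace the classifier.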
Based on the embodiment described in FIG. 1, the following steps may also be performed before step 104 to improve the efficiency of WebShell detection. Referring to FIG. 2, another embodiment of the method for detecting a WebShell script file in the present application includes:
201. Constructing a blacklist and whitelist filtering mechanism to filter out normal execution programs and WebShells from the command execution programs, leaving the undetermined execution programs;
To improve the efficiency of detecting WebShell in webpage files, a blacklist and whitelist filtering mechanism can be constructed before the machine learning model is used to identify WebShell in the command execution program to be detected. The mechanism filters out the normal execution programs and the WebShells among the command execution programs and leaves the undetermined execution programs, on which detection is then performed by using the machine learning model.
Specifically, the blacklist and whitelist filtering mechanism can be constructed through at least one of rule matching, a hash algorithm, and a word vector matching algorithm, wherein the hash algorithm includes a strong hash algorithm and/or a weak hash algorithm.
Specifically, rule matching builds regular-expression rules for known WebShells and for known normal command execution programs, so that WebShells and normal programs among the command execution programs to be detected can be matched quickly and thereby identified. To improve matching efficiency further, the character strings in the regular-expression rules can be combined into an automaton that is used to match the command execution program to be detected. The automaton is only one way of improving string matching efficiency; other approaches may also be used, and no specific limitation is imposed here.
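A minimal Python sketch of rule matching follows. The rules are invented examples, not rules from the patent. Combining all rules of one list into a single compiled pattern lets each list be checked in one pass; note that Python's re module is a backtracking engine, so this only approximates the automaton idea described above.

```python
import re

# Hypothetical rules for known WebShells (blacklist) and normal programs (whitelist).
BLACK_RULES = [
    r"eval\s*\(\s*\$_(?:POST|GET|REQUEST)\b",
    r"assert\s*\(\s*\$_(?:POST|GET|REQUEST)\b",
]
WHITE_RULES = [
    r"\A\s*<\?php\s+phpinfo\(\);\s*\?>\s*\Z",
]

def compile_rules(rules):
    """Merge a list of rules into one alternation and compile it once."""
    return re.compile("|".join(f"(?:{r})" for r in rules), re.IGNORECASE)

BLACK = compile_rules(BLACK_RULES)
WHITE = compile_rules(WHITE_RULES)

def match_rules(source: str) -> str:
    """Classify a script as 'webshell', 'normal', or 'undetermined'."""
    if BLACK.search(source):
        return "webshell"
    if WHITE.search(source):
        return "normal"
    return "undetermined"
```

For example, match_rules('<?php eval($_POST["cmd"]); ?>') returns "webshell", while a script that matches neither list is left as "undetermined" for later analysis.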
With the hash algorithm, a first hash value is computed in advance for each known WebShell and each known normal command execution program. If the hash value of the command execution program to be detected matches a first hash value, the program is identified as the corresponding WebShell or normal command execution program; otherwise it is identified as neither.
It should be noted that with a conventional strong hash such as MD5, even a slight change to the command execution program produces a completely different hash value, whereas a weak hash algorithm (e.g., ssdeep) gives a slightly modified WebShell a hash value similar to that of the original. To improve the WebShell detection rate, the weak hash algorithm is preferred in this embodiment.
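The difference between the two kinds of hash can be seen in the Python sketch below. ssdeep is not part of the Python standard library, so the similarity ratio from difflib.SequenceMatcher is used here purely as a stand-in for the similarity score a weak hash would provide; the two PHP snippets are hypothetical.

```python
import difflib
import hashlib

def strong_hash(source: str) -> str:
    """MD5 digest: any change to the input yields an unrelated digest."""
    return hashlib.md5(source.encode("utf-8")).hexdigest()

def similarity(a: str, b: str) -> float:
    """Stand-in for weak-hash similarity (not ssdeep): ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, a, b).ratio()

known_webshell = '<?php eval($_POST["cmd"]); ?>'
modified_webshell = '<?php eval($_POST["c"]); ?>'  # parameter name shortened

# Exact matching on the strong hash misses the modified sample ...
exact_match = strong_hash(known_webshell) == strong_hash(modified_webshell)
# ... while a similarity measure still rates the two samples as close.
close_match = similarity(known_webshell, modified_webshell) > 0.9
```

Here exact_match is False and close_match is True, which is why a similarity-preserving hash catches lightly edited variants that an exact digest comparison misses.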
In addition, the word vector algorithm is another way of constructing the blacklist and whitelist. Specifically, the known WebShells, the known normal command execution programs, and the command execution program to be detected are each converted into vectors. The distance between the word vector of the program to be detected and the word vector of a known WebShell is then calculated; if the distance is less than or equal to a preset threshold, the program to be detected is WebShell, and otherwise it is not.
Similarly, the distance between the word vector of the program to be detected and the word vector of a known normal command execution program is calculated; if the distance is less than or equal to the preset threshold, the program to be detected is a normal command execution program, and otherwise it is not.
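The distance test can be sketched in Python as follows. The patent does not say how programs are converted into vectors, so this sketch assumes a simple token-count vector and cosine distance; a learned embedding could be substituted. The threshold of 0.3 and the sample scripts are likewise assumptions.

```python
import math
import re
from collections import Counter

def to_vector(source: str) -> Counter:
    """Token-count vector of a script (assumed representation)."""
    return Counter(re.findall(r"\w+", source.lower()))

def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 if norm == 0 else 1.0 - dot / norm

def match_vectors(source, known_webshells, known_normals, threshold=0.3):
    """Compare against the blacklist first, then against the whitelist."""
    vec = to_vector(source)
    if any(cosine_distance(vec, to_vector(k)) <= threshold for k in known_webshells):
        return "webshell"
    if any(cosine_distance(vec, to_vector(k)) <= threshold for k in known_normals):
        return "normal"
    return "undetermined"

KNOWN_WEBSHELLS = ['<?php eval($_POST["cmd"]); ?>']
KNOWN_NORMALS = ['<?php echo date("Y-m-d"); ?>']
```

A sample that shares three of its four tokens with the known WebShell, such as '<?php eval($_POST["x"]); ?>', lies at distance 0.25 and is matched, while an unrelated script stays "undetermined".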
It should be noted that the blacklist and whitelist can be constructed by any one of the above methods or by a weighted combination of several of them, which is not limited here. When all three methods are used, the command execution program to be detected is determined to be WebShell as soon as any one method identifies it as WebShell.
202. Performing WebShell detection on the undetermined execution programs by using the machine learning model.
After the blacklist and whitelist filtering mechanism has filtered out the normal execution programs and the WebShells, and the undetermined execution programs have been obtained, WebShell detection is performed on the undetermined execution programs by using the machine learning model.
In this embodiment of the application, before the machine learning model is applied to the command execution program of the webpage file to be detected, the program is first screened by the blacklist and whitelist filtering mechanism. Because this mechanism is computationally faster, the efficiency of WebShell detection in the command execution program of the webpage file to be detected is improved.
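The two-stage flow of steps 201 and 202 can be sketched in Python as follows. The filter functions, the known-bad hash set, and toy_model are placeholders invented for this example; toy_model merely stands in for the trained machine learning model.

```python
import hashlib
import re
from typing import Callable, List, Optional, Tuple

Filter = Callable[[str], Optional[str]]  # returns "webshell", "normal", or None

def triage(source: str, filters: List[Filter]) -> str:
    """Step 201: any filter reporting WebShell decides the verdict."""
    verdicts = [f(source) for f in filters]
    if "webshell" in verdicts:
        return "webshell"
    if "normal" in verdicts:
        return "normal"
    return "undetermined"

def detect(source: str, filters: List[Filter],
           model: Callable[[str], bool]) -> Tuple[str, str]:
    """Step 202: run the slower model only on undetermined programs."""
    verdict = triage(source, filters)
    if verdict != "undetermined":
        return verdict, "list"
    return ("webshell" if model(source) else "normal"), "model"

# Hypothetical filters and model.
KNOWN_BAD = {hashlib.sha256(b'<?php eval($_POST["cmd"]); ?>').hexdigest()}

def hash_filter(source: str) -> Optional[str]:
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    return "webshell" if digest in KNOWN_BAD else None

def rule_filter(source: str) -> Optional[str]:
    ok = re.fullmatch(r"\s*<\?php\s+phpinfo\(\);\s*\?>\s*", source)
    return "normal" if ok else None

def toy_model(source: str) -> bool:
    return "$_POST" in source and "system(" in source

FILTERS = [hash_filter, rule_filter]
```

The second element of the returned tuple records which stage produced the verdict, which makes it easy to measure how much work the list filters save.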
Based on the embodiment described in FIG. 1, step 101 is described in detail below. FIG. 3 shows the detailed steps of step 101:
301. Representing basic code blocks in the command execution program as nodes, wherein directed edges between nodes represent paths of the program control flow and back edges between nodes represent possible loops, so as to construct the program control flow;
The program control flow describes the order and the calling relationships of the program's logical execution.
For ease of explanation, the construction of the program control flow is described with reference to FIG. 4. When the program control flow is constructed, the basic code blocks of the command execution program are represented as nodes, the directed edges (directed arrows) between nodes represent paths of the program control flow, and the back edges (reverse arrows) between nodes represent possible loops; together these form the control flow graph in FIG. 4.
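A control flow graph of this kind can be held in a simple adjacency mapping. The Python sketch below finds the back edges with a depth-first search: an edge is a back edge when it points to a node that is still on the current search path. The block names are hypothetical and do not reproduce FIG. 4.

```python
def find_back_edges(cfg, entry):
    """Return edges (u, v) where v is an ancestor of u on the DFS path."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    color = {node: WHITE for node in cfg}
    back_edges = []

    def visit(u):
        color[u] = GRAY
        for v in cfg[u]:
            if color[v] == GRAY:
                back_edges.append((u, v))  # closes a loop
            elif color[v] == WHITE:
                visit(v)
        color[u] = BLACK

    visit(entry)
    return back_edges

# Nodes are basic code blocks; each list holds the successors of a block.
cfg = {
    "entry": ["loop_head"],
    "loop_head": ["loop_body", "exit"],
    "loop_body": ["loop_head"],   # jump back to the loop condition
    "exit": [],
}

loops = find_back_edges(cfg, "entry")
```

For this graph the only back edge is the one from loop_body to loop_head, which marks the single loop in the program.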
302. Traversing the program control flow, recording the initialization points and reference points of variables, and storing the parameter information and data information corresponding to the initialization points and reference points, so as to construct the program data flow.
The program data flow describes how data flows and propagates, and what state it is in, while the program runs. When the program data flow is constructed, the program control flow is traversed, the initialization points and reference points of variables are recorded, and the parameter information and data information corresponding to these points are stored. For ease of understanding, FIG. 5 provides a schematic diagram of the program data flow corresponding to FIG. 4.
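The traversal can be sketched in Python as follows. Each basic block is assumed to be a list of statements reduced to a (defined variable, used variables) pair; this encoding, the block names, and the PHP statements in the comments are assumptions for illustration and do not reproduce FIG. 4 or FIG. 5.

```python
from collections import defaultdict, deque

def build_def_use(cfg, blocks, entry):
    """Walk the control flow graph and record where variables are
    initialized (defs) and referenced (uses) as (block, index) pairs."""
    defs, uses = defaultdict(list), defaultdict(list)
    seen, queue = {entry}, deque([entry])
    while queue:
        block = queue.popleft()
        for index, (target, used) in enumerate(blocks[block]):
            for var in used:
                uses[var].append((block, index))
            if target is not None:
                defs[target].append((block, index))
        for succ in cfg[block]:
            if succ not in seen:      # visit each block once, loops included
                seen.add(succ)
                queue.append(succ)
    return dict(defs), dict(uses)

cfg = {"B0": ["B1"], "B1": ["B2", "B3"], "B2": ["B3"], "B3": []}
blocks = {
    "B0": [("cmd", ["_POST"])],  # $cmd = $_POST["cmd"];
    "B1": [(None, ["cmd"])],     # if ($cmd == ...)
    "B2": [(None, ["cmd"])],     # system($cmd);
    "B3": [],
}

defs, uses = build_def_use(cfg, blocks, "B0")
```

The resulting tables show that cmd is initialized once in B0 from external input and referenced in both the condition (B1) and the call (B2), which is the information later data flow feature extraction relies on.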
This embodiment of the application describes in detail how the program control flow and the program data flow are constructed for the command execution program, which makes the embodiments of the application easier to implement.
Based on the embodiment described in fig. 1 and 3, step 102 is described in detail below, and fig. 6 is a detailed step of step 102:
601. and extracting at least one characteristic of a circulation condition, a judgment condition, an external input value or a variable related to the external input value in the judgment condition, a comparison object and a judgment result in the judgment condition in the program control flow as the control flow characteristic, wherein the comparison object is used for comparing with the external input value or the variable related to the external input value.
The control flow characteristic is a characteristic reflecting program control information, and can be any of various types of characteristics in the program control flow. As a preferred control flow characteristic, the embodiment of the present application mainly uses at least one of a loop condition in the program control flow, a judgment condition, an external input value or a variable related to the external input value in the judgment condition, a comparison object in the judgment condition, and a judgment result, where the comparison object is the object compared with the external input value or the variable related to the external input value.
Because an attacker typically sets a password on the WebShell (e.g., $cmd == "passswd" in fig. 4), malicious actions are triggered only when the received request contains the correct password ("passswd"). The embodiment therefore prefers at least one of the judgment condition, the external input value or the variable related to the external input value in the judgment condition, the comparison object in the judgment condition, and the judgment result as the control flow characteristic.
It is easily understood that a web command execution program containing one or more of these features (the judgment condition, the external input value or the variable related to the external input value in the judgment condition, the comparison object in the judgment condition, and the judgment result) may be either a WebShell or a non-WebShell.
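The following sketch shows one way the control flow characteristics of step 601 could be turned into numeric features. The feature names, the list of external input sources, the condition tuple format, and the function `control_flow_features` are all hypothetical; the embodiment does not specify a concrete encoding.

```python
import re

# Assumed PHP superglobals treated as external input sources.
EXTERNAL_SOURCES = ("$_GET", "$_POST", "$_REQUEST", "$_COOKIE")
# A constant is taken to be a quoted string or a numeric literal.
CONSTANT = re.compile(r"""^(".*"|'.*'|-?\d+(\.\d+)?)$""")


def is_external(operand, related_vars):
    """True if the operand is an external input or a variable related to one."""
    return operand.startswith(EXTERNAL_SOURCES) or operand in related_vars


def is_constant(operand):
    return CONSTANT.match(operand) is not None


def control_flow_features(conditions, related_vars, loop_conditions):
    """Summarize judgment conditions given as (left, operator, right) tuples."""
    features = {
        "num_loop_conditions": len(loop_conditions),
        "num_judgment_conditions": len(conditions),
        "input_compared_with_constant": 0,
        "comparison_objects": [],
    }
    for left, _operator, right in conditions:
        for operand, other in ((left, right), (right, left)):
            if is_external(operand, related_vars) and is_constant(other):
                features["input_compared_with_constant"] += 1
                features["comparison_objects"].append(other)
                break
    return features


features = control_flow_features(
    conditions=[("$cmd", "==", '"passswd"'), ("$i", "<", "10")],
    related_vars={"$cmd"},
    loop_conditions=[("$i", "<", "10")],
)
print(features)
```

In this example only the password check counts as a comparison between external input and a constant; the loop bound on `$i` does not, because `$i` is unrelated to any external input.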
602. Extract the dangerous functions in the program data flow, analyze each dangerous function by taint propagation to judge whether its parameters can receive external input, and analyze the external input by reaching-definition analysis to judge whether it can be passed to the dangerous function.
The data flow characteristic, i.e., a characteristic reflecting data flow information, can likewise take various forms in the program data flow. As a preferred data flow characteristic, the embodiment of the present application mainly uses the dangerous functions in the program data flow, whether the parameters of a dangerous function can receive external input, and whether the external input can be passed to the dangerous function.
Because attackers want flexible control over the WebShell (e.g., executing arbitrary commands), the operations to be performed are usually not written into the WebShell script in advance. Instead, the script receives external input and executes it (e.g., $_GET["cmd"] receives the external input in fig. 5). Therefore, however the script is obfuscated, the commands associated with the external input are eventually executed.
Therefore, the embodiment of the application mainly uses the following as data flow characteristics: the dangerous functions, whether the parameters of a dangerous function can receive external input as determined by taint propagation analysis, and whether the external input can be passed to the dangerous function as determined by reaching-definition analysis.
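To make the combination of reaching-definition analysis and taint propagation in step 602 more concrete, here is a small sketch over a hypothetical three-block program. The statement format, the sets `EXTERNAL_SOURCES` and `DANGEROUS_FUNCTIONS`, and the function `find_tainted_sinks` are assumptions for illustration; a production analysis would also handle function calls, sanitizers, and PHP semantics that this sketch ignores.

```python
EXTERNAL_SOURCES = {"$_GET", "$_POST", "$_REQUEST"}
DANGEROUS_FUNCTIONS = {"eval", "system", "exec", "passthru", "shell_exec"}


def merge(envs):
    """Union the reaching definitions coming from all predecessor blocks."""
    merged = {}
    for env in envs:
        for var, ids in env.items():
            merged.setdefault(var, set()).update(ids)
    return merged


def walk(block_name, stmts, env):
    """Return the definitions leaving the block and a snapshot before each statement."""
    env = {var: set(ids) for var, ids in env.items()}
    seen = []
    for i, stmt in enumerate(stmts):
        seen.append((i, stmt, {v: set(d) for v, d in env.items()}))
        kind, name, _operands = stmt
        if kind == "assign":
            env[name] = {(block_name, i)}  # a new definition kills the older ones
    return env, seen


def find_tainted_sinks(blocks, preds):
    # 1. Reaching definitions: iterate until no block's output changes.
    out = {b: {} for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, stmts in blocks.items():
            new_out, _ = walk(b, stmts, merge(out[p] for p in preds.get(b, [])))
            if new_out != out[b]:
                out[b], changed = new_out, True

    uses = []
    for b, stmts in blocks.items():
        _, seen = walk(b, stmts, merge(out[p] for p in preds.get(b, [])))
        uses.extend((b, i, stmt, env) for i, stmt, env in seen)

    # 2. Taint propagation: a definition is tainted if any operand is tainted.
    tainted = set()

    def operand_tainted(op, env):
        return op in EXTERNAL_SOURCES or any(d in tainted for d in env.get(op, ()))

    changed = True
    while changed:
        changed = False
        for b, i, (kind, _name, operands), env in uses:
            if kind == "assign" and (b, i) not in tainted and any(
                operand_tainted(o, env) for o in operands
            ):
                tainted.add((b, i))
                changed = True

    # 3. Report dangerous calls whose arguments can carry external input.
    return [
        (b, i, name)
        for b, i, (kind, name, operands), env in uses
        if kind == "call"
        and name in DANGEROUS_FUNCTIONS
        and any(operand_tainted(o, env) for o in operands)
    ]


blocks = {
    "B1": [("assign", "$cmd", ["$_GET"])],
    "B2": [("assign", "$safe", ['"ls"']), ("call", "system", ["$safe"])],
    "B3": [("call", "system", ["$cmd"])],
}
preds = {"B2": ["B1"], "B3": ["B1"]}
sinks = find_tainted_sinks(blocks, preds)
print(sinks)
```

Only the call in block `B3` is reported: its argument `$cmd` is reached by a definition that reads `$_GET`, whereas the call in `B2` receives a constant string.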
In the embodiment of the application, the process of extracting the control flow characteristics and the data flow characteristics from the program control flow and the program data flow is described in detail, which improves the implementability of the embodiment of the application.
Based on the embodiment described in fig. 6, to improve the accuracy of identifying whether the parameters of the dangerous function can receive external input in step 602, the following method can further be performed. Referring to fig. 7, fig. 7 shows a detailed step of step 602:
701. Judge whether the parameters of the dangerous function can receive external input by judging whether the path condition and/or the function call relationship is valid.
Specifically, whether a parameter of the dangerous function can receive external input may be determined by checking whether the path condition or the function call relationship is valid.
For example, on path A, the variable X receives an external input, operation a is performed next, and the result of operation a is passed to the variable Y. If operation a is a = X × 0, it overwrites every external input with 0, so the path condition is invalid: path A can receive an external input, but that input is always zeroed, and path A is therefore an invalid path.
For another example, in a function call relationship, E calls a function F, and F receives external input and performs an operation G. If the result of G after F receives the external input is a function of the external input, E is judged to be capable of receiving external input; if the result of G is a constant unrelated to the external input, E is judged not to be capable of receiving external input.
In this embodiment, whether the parameters of the dangerous function can receive external input is judged by checking whether the path condition and/or the function call relationship is valid, which improves the accuracy of that judgment.
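The two examples in step 701 can be sketched with a tiny symbolic check. The expression encoding (nested tuples of the form `(operator, left, right)`) and the function `depends_on_input` are hypothetical, and only the multiply-by-zero case from the text is recognized as erasing the input; a real implementation would need a far richer set of simplification rules.

```python
def depends_on_input(expr, external_vars):
    """Return True if the value of expr can still depend on an external input.

    expr is a number (constant), a string (variable name), or a tuple
    (operator, left, right).
    """
    if isinstance(expr, (int, float)):
        return False
    if isinstance(expr, str):
        return expr in external_vars
    operator, left, right = expr
    if operator == "*" and (left == 0 or right == 0):
        # Multiplying by zero overwrites whatever the input was.
        return False
    return depends_on_input(left, external_vars) or depends_on_input(right, external_vars)


# Path A from the text: Y = X * 0, so the external input X never reaches Y.
path_a_valid = depends_on_input(("*", "X", 0), {"X"})

# Function call relationship: E calls F, and F applies operation G to its input.
g_related = ("+", "X", 1)  # result of G is a function of the external input
g_constant = 42            # result of G is a constant unrelated to the input
e_receives_input = depends_on_input(g_related, {"X"})
e_receives_constant = depends_on_input(g_constant, {"X"})

print(path_a_valid, e_receives_input, e_receives_constant)
```

Treating path A as invalid prevents the taint analysis from reporting a dangerous function whose argument is, in practice, always the constant 0.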
Referring to fig. 8, the following describes the detection system for a WebShell script file in an embodiment of the present application. One embodiment of the detection system includes:
a constructing unit 801, configured to construct a program control flow and a program data flow for a command execution program of a webpage file in a training sample;
an extracting unit 802, configured to extract the control flow features and the data flow features corresponding to the program control flow and the program data flow, respectively, where the control flow features and the data flow features accurately describe the script behavior of the command execution program;
a training unit 803, configured to train on the control flow features and the data flow features by using a machine learning algorithm, so as to generate a machine learning model;
a detecting unit 804, configured to detect the WebShell in the command execution program of a webpage file to be detected by using the machine learning model.
Preferably, the system further comprises:
a filtering unit 805, configured to construct a black and white list filtering mechanism to filter out the normal execution programs and the WebShells in the command execution program, so as to obtain the uncertain execution programs in the command execution program;
the detecting unit 804 is specifically configured to:
perform WebShell detection on the uncertain execution programs in the command execution program by using the machine learning model.
Preferably, the filtering unit 805 is specifically configured to:
construct the black and white list filtering mechanism through at least one of rule matching, a hash algorithm and a word vector matching algorithm, wherein the hash algorithm comprises a strong hash algorithm and/or a weak hash algorithm.
The constructing unit 801 is specifically configured to:
represent basic code blocks in the command execution program as nodes, wherein directed edges among the nodes represent paths of the program control flow, and reverse edges among the nodes represent possible loops, so as to construct the program control flow;
and traverse the program control flow, record the initialization points and reference points of variables, and store the parameter information and data information corresponding to the initialization points and reference points, so as to construct the program data flow.
Preferably, the extracting unit 802 is specifically configured to:
extract at least one of a loop condition, a judgment condition, an external input value or a variable related to the external input value in the judgment condition, a comparison object in the judgment condition, and a judgment result in the program control flow as the control flow characteristic, wherein the comparison object is used for comparing with the external input value or the variable related to the external input value.
Preferably, the extracting unit 802 is specifically configured to:
extract a dangerous function in the program data flow;
analyze the dangerous function by using taint propagation to judge whether parameters in the dangerous function can receive external input;
and analyze, by using reaching-definition analysis, whether the external input can be transferred to the dangerous function.
Preferably, the extracting unit 802 is specifically configured to:
judge whether the parameters in the dangerous function can receive external input by judging whether the path condition and/or the function call relationship is valid.
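The black and white list filtering performed by the filtering unit 805 might look like the sketch below. It assumes SHA-256 as the strong hash and a single regular expression as the matching rule; the weak hash and word vector matching mentioned above are not shown. The class `ListFilter`, its return labels, and the sample file contents are hypothetical.

```python
import hashlib
import re


def sha256_hex(content: bytes) -> str:
    """Strong hash of the file content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(content).hexdigest()


class ListFilter:
    """Sort files into 'normal', 'webshell', or 'uncertain' before ML detection."""

    def __init__(self, white_hashes, black_hashes, black_rules=()):
        self.white_hashes = set(white_hashes)
        self.black_hashes = set(black_hashes)
        self.black_rules = [re.compile(rule) for rule in black_rules]

    def classify(self, content: bytes) -> str:
        digest = sha256_hex(content)
        if digest in self.black_hashes:
            return "webshell"
        if digest in self.white_hashes:
            return "normal"
        if any(rule.search(content) for rule in self.black_rules):
            return "webshell"
        return "uncertain"


known_normal = b"<?php echo 'hello'; ?>"
known_shell = b"<?php system($_GET['cmd']); ?>"
list_filter = ListFilter(
    white_hashes={sha256_hex(known_normal)},
    black_hashes={sha256_hex(known_shell)},
    black_rules=[rb"eval\s*\(\s*\$_(GET|POST|REQUEST)"],
)

print(list_filter.classify(known_normal))
print(list_filter.classify(known_shell))
print(list_filter.classify(b"<?php eval($_POST['x']); ?>"))
print(list_filter.classify(b"<?php echo date('Y'); ?>"))
```

Only files classified as `uncertain` would then be passed on to the machine learning model, which keeps the more expensive flow analysis away from files that are already known.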
In the embodiment of the application, the constructing unit 801 constructs a program control flow and a program data flow for the command execution program of a webpage file in a training sample. The extracting unit 802 extracts the corresponding control flow features and data flow features from the program control flow and the program data flow, respectively, where the control flow features and the data flow features accurately describe the script behavior of the command execution program. The training unit 803 trains on the control flow features and the data flow features by using a machine learning algorithm to generate a machine learning model. The detecting unit 804 uses the machine learning model to detect WebShells in the command execution program of a webpage file to be detected. Because the machine learning model in this embodiment is learned from the control flow features and data flow features of the training samples, and those features accurately describe the script behavior of the command execution program, the model identifies WebShells in the command execution program to be detected with higher accuracy.
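The training and detection stages can be sketched end to end as follows. The embodiment does not name a specific machine learning algorithm, so a nearest-centroid classifier is used here purely as a stand-in; the three-element feature vectors, the labels, and the function names are likewise assumptions for illustration.

```python
def train_nearest_centroid(samples):
    """Compute one mean feature vector (centroid) per label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(model, features):
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical feature vectors:
# [input compared with a constant, tainted dangerous call present, has loop]
training_samples = [
    ([1, 1, 0], "webshell"),
    ([1, 1, 1], "webshell"),
    ([0, 0, 1], "normal"),
    ([0, 0, 0], "normal"),
    ([1, 0, 0], "normal"),
]
model = train_nearest_centroid(training_samples)

print(predict(model, [1, 1, 0]))
print(predict(model, [0, 0, 1]))
```

Any classifier that accepts fixed-length feature vectors could take the place of the stand-in; the point of the sketch is only that the model is trained on control flow and data flow features and not on raw script text.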
The detection system of the WebShell script file in the embodiment of the present invention is described above from the perspective of the modular functional entity, and the computer apparatus in the embodiment of the present invention is described below from the perspective of hardware processing:
the computer device is used for realizing the functions of the detection system of the WebShell script file, and one embodiment of the computer device in the embodiment of the invention comprises:
a processor and a memory;
the memory is used for storing a computer program, and the processor is used for realizing the following steps when executing the computer program stored in the memory:
constructing a program control flow and a program data flow for a command execution program of a webpage file in a training sample;
extracting corresponding control flow characteristics and data flow characteristics in the program control flow and the program data flow respectively, wherein the control flow characteristics and the data flow characteristics accurately describe script behaviors of the command execution program;
training the control flow characteristics and the data flow characteristics by using a machine learning algorithm to generate a machine learning model;
and detecting the WebShell in the command execution program of the webpage file to be detected by utilizing the machine learning model.
In some embodiments of the present invention, the processor may be further configured to:
constructing a black and white list filtering mechanism to filter out the normal execution programs and the WebShells in the command execution program, so as to obtain the uncertain execution programs in the command execution program;
the detection of the WebShell in the command execution program of the webpage file to be detected by using the machine learning model comprises:
and performing WebShell detection on the uncertain execution programs in the command execution program by utilizing the machine learning model.
In some embodiments of the present invention, the processor may be further configured to:
and constructing the black and white list filtering mechanism through at least one of rule matching, hash algorithm and word vector matching algorithm, wherein the hash algorithm comprises strong hash algorithm and/or weak hash algorithm.
In some embodiments of the present invention, the processor may be further configured to:
representing basic code blocks in the command execution program by using nodes, wherein directed edges among the nodes represent paths of program control flow, and reverse edges among the nodes represent possible loops to construct the program control flow;
and traversing the program control flow, recording an initialization point and a reference point of a variable, and storing parameter information and data information corresponding to the initialization point and the reference point to construct the program data flow.
In some embodiments of the present invention, the processor may be further configured to:
and extracting at least one of a loop condition, a judgment condition, an external input value or a variable related to the external input value in the judgment condition, a comparison object in the judgment condition, and a judgment result in the program control flow as the control flow characteristic, wherein the comparison object is used for comparing with the external input value or the variable related to the external input value.
In some embodiments of the present invention, the processor may be further configured to:
extracting a dangerous function in the program data flow;
analyzing the dangerous function by utilizing taint propagation to judge whether parameters in the dangerous function can receive external input;
and analyzing, by utilizing reaching-definition analysis, whether the external input can be transferred to the dangerous function.
In some embodiments of the present invention, the processor may be further configured to:
and judging whether the parameters in the dangerous function can receive external input by judging whether the path condition and/or the function call relationship is valid.
It is to be understood that, when the processor in the computer apparatus described above executes the computer program, the functions of each unit in the corresponding apparatus embodiments may also be implemented, and are not described herein again. Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory and executed by the processor to implement the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used for describing the execution process of the computer program in the detection system of the WebShell script file. For example, the computer program may be divided into units in the detection system of the above-described WebShell script file, and each unit may implement a specific function as described in the detection system of the above-described corresponding WebShell script file.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing equipment. The computer device may include, but is not limited to, a processor, a memory. It will be appreciated by those skilled in the art that the processor, memory are merely examples of a computer apparatus and are not meant to be limiting, and that more or fewer components may be included, or certain components may be combined, or different components may be included, for example, the computer apparatus may also include input output devices, network access devices, buses, etc.
The processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer device and connects the various parts of the overall computer device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor may implement various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the terminal, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
The present invention also provides a computer-readable storage medium for implementing the functions of the WebShell script file detection system, on which a computer program is stored, which, when executed by a processor, may be configured to perform the following steps:
constructing a program control flow and a program data flow for a command execution program of a webpage file in a training sample;
extracting corresponding control flow characteristics and data flow characteristics in the program control flow and the program data flow respectively, wherein the control flow characteristics and the data flow characteristics accurately describe script behaviors of the command execution program;
training the control flow characteristics and the data flow characteristics by using a machine learning algorithm to generate a machine learning model;
and detecting the WebShell in the command execution program of the webpage file to be detected by utilizing the machine learning model.
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
constructing a black and white list filtering mechanism to filter out the normal execution programs and the WebShells in the command execution program, so as to obtain the uncertain execution programs in the command execution program;
the detection of the WebShell in the command execution program of the webpage file to be detected by using the machine learning model comprises:
and performing WebShell detection on the uncertain execution programs in the command execution program by utilizing the machine learning model.
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
and constructing the black and white list filtering mechanism through at least one of rule matching, hash algorithm and word vector matching algorithm, wherein the hash algorithm comprises strong hash algorithm and/or weak hash algorithm.
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
representing basic code blocks in the command execution program by using nodes, wherein directed edges among the nodes represent paths of program control flow, and reverse edges among the nodes represent possible loops to construct the program control flow;
and traversing the program control flow, recording an initialization point and a reference point of a variable, and storing parameter information and data information corresponding to the initialization point and the reference point to construct the program data flow.
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
and extracting at least one of a loop condition, a judgment condition, an external input value or a variable related to the external input value in the judgment condition, a comparison object in the judgment condition, and a judgment result in the program control flow as the control flow characteristic, wherein the comparison object is used for comparing with the external input value or the variable related to the external input value.
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
extracting a dangerous function in the program data flow;
analyzing the dangerous function by utilizing taint propagation to judge whether parameters in the dangerous function can receive external input;
and analyzing, by utilizing reaching-definition analysis, whether the external input can be transferred to the dangerous function.
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
and judging whether the parameters in the dangerous function can receive external input by judging whether the path condition and/or the function call relationship is valid.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (14)

1. A detection method of a WebShell script file is characterized by comprising the following steps:
constructing a program control flow and a program data flow for a command execution program of a webpage file in a training sample;
extracting corresponding control flow characteristics and data flow characteristics in the program control flow and the program data flow respectively, wherein the control flow characteristics and the data flow characteristics accurately describe script behaviors of the command execution program;
training the control flow characteristics and the data flow characteristics by using a machine learning algorithm to generate a machine learning model;
and detecting the WebShell in the command execution program of the webpage file to be detected by utilizing the machine learning model.
2. The method according to claim 1, wherein before the detecting the WebShell in the command execution program of the webpage file to be detected by using the machine learning model, the method further comprises:
constructing a black and white list filtering mechanism to filter out the normal execution programs and the WebShells in the command execution program, so as to obtain the uncertain execution programs in the command execution program;
the detection of the WebShell in the command execution program of the webpage file to be detected by using the machine learning model comprises:
and performing WebShell detection on the uncertain execution programs in the command execution program by utilizing the machine learning model.
3. The method of claim 2, wherein the constructing a black and white list filtering mechanism comprises:
and constructing the black and white list filtering mechanism through at least one of rule matching, a hash algorithm and a word vector matching algorithm, wherein the hash algorithm comprises a strong hash algorithm and/or a weak hash algorithm.
4. The method of claim 1, wherein the constructing a program control flow and a program data flow for the command execution program comprises:
representing basic code blocks in the command execution program by using nodes, wherein directed edges among the nodes represent paths of program control flow, and reverse edges among the nodes represent possible loops to construct the program control flow;
and traversing the program control flow, recording an initialization point and a reference point of a variable, and storing parameter information and data information corresponding to the initialization point and the reference point to construct the program data flow.
5. The method of claim 4, wherein extracting corresponding control flow features in the program control flow comprises:
and extracting at least one of a loop condition, a judgment condition, an external input value or a variable related to the external input value in the judgment condition, a comparison object in the judgment condition, and a judgment result in the program control flow as the control flow characteristic, wherein the comparison object is used for comparing with the external input value or the variable related to the external input value.
6. The method of claim 4, wherein extracting corresponding data flow features in the program data flow comprises:
extracting a dangerous function in the program data flow;
analyzing the dangerous function by utilizing taint propagation to judge whether parameters in the dangerous function can receive external input;
and analyzing, by utilizing reaching-definition analysis, whether the external input can be transferred to the dangerous function.
7. The method of claim 6, wherein the analyzing the dangerous function by utilizing taint propagation to judge whether parameters in the dangerous function can receive external input comprises:
and judging whether the parameters in the dangerous function can receive external input by judging whether the path condition and/or the function call relationship is valid.
8. A detection system for WebShell script files, comprising:
a building unit, configured to construct a program control flow and a program data flow for a command execution program of a webpage file in a training sample;
an extracting unit, configured to extract corresponding control flow characteristics and data flow characteristics from the program control flow and the program data flow, respectively, wherein the control flow characteristics and the data flow characteristics accurately describe script behaviors of the command execution program;
a training unit, configured to train on the control flow characteristics and the data flow characteristics by using a machine learning algorithm to generate a machine learning model;
and a detection unit, configured to detect the WebShell in the command execution program of a webpage file to be detected by utilizing the machine learning model.
9. The system according to claim 8, characterized in that said building unit is specifically configured to:
representing basic code blocks in the command execution program by using nodes, wherein directed edges among the nodes represent paths of program control flow, and reverse edges among the nodes represent possible loops to construct the program control flow;
and traversing the program control flow, recording an initialization point and a reference point of a variable, and storing parameter information and data information corresponding to the initialization point and the reference point to construct the program data flow.
10. The system according to claim 9, wherein the extraction unit is specifically configured to:
and extracting at least one characteristic of a circulation condition, a judgment condition, an external input value or a variable related to the external input value in the judgment condition, a comparison object and a judgment result in the judgment condition in the program control flow as the control flow characteristic, wherein the comparison object is used for comparing with the external input value or the variable related to the external input value.
11. The system according to claim 10, wherein the extraction unit is specifically configured to:
extracting dangerous functions from the program data flow;
analyzing each dangerous function by taint propagation to judge whether the parameters of the dangerous function can receive external input;
and analyzing, by reaching definitions, whether the external input can be passed to the dangerous function.
12. The system according to claim 11, wherein the extraction unit is specifically configured to:
judging whether the parameters in the dangerous function can receive external input by judging whether the path condition and/or the function call relationship is valid.
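The data flow analysis of claims 11 and 12 can be sketched as follows, under stated assumptions. The CFG encoding, the `EXTERNAL` and `DANGEROUS` sets, and the function names are hypothetical and chosen only for illustration. The sketch combines a standard reaching-definitions analysis with a simple taint fixpoint; it does not model the path-condition or call-relationship validity checks of claim 12.

```python
EXTERNAL = {"$_GET", "$_POST"}   # assumed external-input sources
DANGEROUS = {"eval", "system"}   # assumed dangerous functions

# cfg maps block name -> (statements, successors). Statements are
# ("assign", target, [operands]) or ("call", function, [arguments]).
# A definition is identified as (variable, block, statement_index).


def reaching_definitions(cfg):
    """Forward may-analysis: IN[b] is the union of OUT[p] over predecessors."""
    preds = {b: [] for b in cfg}
    for b, (_, succs) in cfg.items():
        for s in succs:
            preds[s].append(b)
    out = {b: set() for b in cfg}
    inn = {b: set() for b in cfg}
    changed = True
    while changed:
        changed = False
        for b, (stmts, _) in cfg.items():
            inn[b] = set().union(*(out[p] for p in preds[b]))
            current = set(inn[b])
            for i, (kind, target, _) in enumerate(stmts):
                if kind == "assign":  # kill older definitions, add this one
                    current = {d for d in current if d[0] != target}
                    current.add((target, b, i))
            if current != out[b]:
                out[b], changed = current, True
    return inn


def tainted_sinks(cfg):
    """Dangerous calls whose arguments may carry external input."""
    inn = reaching_definitions(cfg)
    tainted, hits, changed = set(), set(), True
    while changed:  # repeat until no new tainted definition appears
        changed, hits = False, set()
        for b, (stmts, _) in cfg.items():
            current = set(inn[b])
            for i, (kind, name, operands) in enumerate(stmts):
                dirty = any(
                    op in EXTERNAL
                    or any(d in tainted for d in current if d[0] == op)
                    for op in operands
                )
                if kind == "assign":
                    d = (name, b, i)
                    if dirty and d not in tainted:
                        tainted.add(d)
                        changed = True
                    current = {x for x in current if x[0] != name} | {d}
                elif dirty and name in DANGEROUS:
                    hits.add((b, i, name))
    return hits


cfg = {
    "B0": ([("assign", "$a", ["$_GET"]), ("assign", "$b", ["'ls'"])], ["B1", "B2"]),
    "B1": ([("assign", "$a", ["'id'"])], ["B3"]),
    "B2": ([("assign", "$c", ["$a"])], ["B3"]),
    "B3": ([("call", "system", ["$a"]), ("call", "eval", ["$c"]),
            ("call", "system", ["$b"])], []),
}
hits = tainted_sinks(cfg)
```

In the example, `system($a)` is flagged because the tainted definition of `$a` in `B0` still reaches `B3` along the path through `B2`, even though `B1` overwrites `$a` with a constant. `system($b)` is not flagged, since `$b` only ever holds a literal.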
13. A computer device comprising a processor, characterized in that the processor, when executing a computer program stored in a memory, implements the WebShell script file detection method as claimed in any one of claims 1 to 7.
14. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the WebShell script file detection method as claimed in any one of claims 1 to 7.
CN202010032760.6A 2020-01-13 2020-01-13 WebShell script file detection method and system Pending CN113110986A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010032760.6A CN113110986A (en) 2020-01-13 2020-01-13 WebShell script file detection method and system

Publications (1)

Publication Number Publication Date
CN113110986A (en) 2021-07-13

Family

ID=76709994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010032760.6A Pending CN113110986A (en) 2020-01-13 2020-01-13 WebShell script file detection method and system

Country Status (1)

Country Link
CN (1) CN113110986A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407809A (en) * 2016-09-20 2017-02-15 四川大学 A Linux platform malicious software detection method
CN106961419A (en) * 2017-02-13 2017-07-18 深信服科技股份有限公司 WebShell detection methods, apparatus and system
CN107194251A (en) * 2017-04-01 2017-09-22 中国科学院信息工程研究所 Android platform malicious application detection method and device
CN107659570A (en) * 2017-09-29 2018-02-02 杭州安恒信息技术有限公司 Webshell detection methods and system based on machine learning and static and dynamic analysis

Similar Documents

Publication Publication Date Title
US9621570B2 (en) System and method for selectively evolving phishing detection rules
JP6731988B2 (en) System and method for detecting malicious files using a trained machine learning model
JP6636096B2 (en) System and method for machine learning of malware detection model
JP6736532B2 (en) System and method for detecting malicious files using elements of static analysis
JP6715292B2 (en) System and method for detecting malicious files using machine learning
US8762948B1 (en) System and method for establishing rules for filtering insignificant events for analysis of software program
CN111382434B (en) System and method for detecting malicious files
KR101874373B1 (en) A method and apparatus for detecting malicious scripts of obfuscated scripts
CN107659570A (en) Webshell detection methods and system based on machine learning and static and dynamic analysis
JP2020115328A (en) System and method for classification of objects of computer system
KR101858620B1 (en) Device and method for analyzing javascript using machine learning
EP3474175A1 (en) System and method of managing computing resources for detection of malicious files based on machine learning model
US20170372069A1 (en) Information processing method and server, and computer storage medium
CN109214178B (en) APP application malicious behavior detection method and device
CN111222137A (en) Program classification model training method, program classification method and device
CN111538978A (en) System and method for executing tasks based on access rights determined from task risk levels
CN112817877B (en) Abnormal script detection method and device, computer equipment and storage medium
CN113111346A (en) Multi-engine WebShell script file detection method and system
CN113110986A (en) WebShell script file detection method and system
CN115310087A (en) Website backdoor detection method and system based on abstract syntax tree
CN111400708A (en) Method and device for malicious code detection
CN112380530B (en) Homologous APK detection method, terminal device and storage medium
Surendran et al. Android Malware Detection Based on Informative Syscall Subsequences
CN117251853A (en) WebShell detection method and device based on machine learning
US20210203677A1 (en) Learning method, learning device, and learning program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination