Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
Fig. 1 is a flowchart of an embodiment of the webshell file detection method of the present invention. As shown in Fig. 1, the webshell file detection method in this embodiment includes:
S101: perform preliminary matching on the file to be detected according to a preset pre-rule, where the pre-rule is used to perform string matching on the file to be detected.
The file to be detected may be a web script file, for example a JSP script file. A script is an executable file written in a particular descriptive language according to a certain format. Scripts are typically invoked and executed on demand by an application. Scripts are now widely used in web server programs, both to distribute computation between clients and servers and to provide customized user interfaces for web-based applications. A normal web script file analyzes, computes on, and stores the data submitted by a user, then returns the result to the user's browser over a protocol such as HTTP. A webshell file, by contrast, is a web script file crafted by an attacker and placed on a server so that the attacker can access it to manipulate the programs and data on that server.
As an example, the pre-rules may be any rules set as needed.
A regular expression is a logical formula that operates on strings: a "regular string" is formed from predefined specific characters and combinations of them, including ordinary characters (e.g., the letters a through z) and special characters (called "metacharacters"), and this "regular string" expresses a filtering logic for strings. In other words, a regular expression is a text pattern that describes one or more strings to be matched when searching text.
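As a small illustration of ordinary characters and metacharacters working together, the sketch below uses Python's standard `re` module. The pattern and the sample lines are invented for illustration and are not taken from the rule set of the invention.

```python
import re

# Ordinary characters match themselves; metacharacters such as \b (word
# boundary), \s (whitespace), * (repetition) and \( (a literal parenthesis)
# carry special meaning.
# Hypothetical pattern: the word "getParameter", optional whitespace, then "(".
pattern = re.compile(r"\bgetParameter\s*\(")

samples = [
    'String cmd = request.getParameter("cmd");',   # contains the call
    'String cmd = request.getParameter ("cmd");',  # whitespace before "(" is allowed
    'String s = "forgetParameters";',               # no word boundary and no "("
]

matches = [bool(pattern.search(s)) for s in samples]
print(matches)
```

The same filtering logic written as plain string comparisons would need several separate checks, which is why regular rules are expressive but comparatively expensive to evaluate.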
S102: if the file matches the pre-rule, acquire the regular rule corresponding to the pre-rule and perform full-text regular matching of the file to be detected against that regular rule; if the file matches the regular rule, determine that the file to be detected is a webshell file.
In some embodiments, the regular rule corresponding to each pre-rule may be stored in a table, and the regular rule may be determined by looking up the table with the pre-rule. The invention does not limit how the correspondence between pre-rules and regular rules is stored or looked up.
A webshell, also known as a web Trojan, is a backdoor that an attacker leaves on a web server using a scripting language. It is a code execution environment that exists in the form of a web page file such as asp, php, jsp or cgi, and it is mainly used for website management, server management, permission management and similar operations. It is used by uploading a code file and then performing many routine operations through website access, which greatly simplifies the management of websites and servers. For the same reason, the code in such a file can be modified and used as a backdoor program to take control of the website server. The regular rules are intended to detect webshell files that have been modified and used as backdoor programs.
According to the webshell file detection method provided by this embodiment, preliminary matching is performed on the file to be detected according to a preset pre-rule, which is used for string matching on the file; if the pre-rule matches, the corresponding regular rule is acquired and full-text regular matching is performed; and if the regular rule matches, the file is determined to be a webshell file. Rule matching thus proceeds in two stages: the pre-rule performs a preliminary screening, and the regular expression rule is applied afterwards. Resource-intensive regular matching is performed only on the suspected webshell files screened out by the pre-rule and is avoided for the large number of normal files, which greatly improves detection speed and saves system resources.
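The two-stage flow of S101 and S102 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the table `RULE_TABLE`, the function `is_webshell`, and every pre-rule and regular rule in it are hypothetical examples, not rules disclosed by the invention, and a dictionary is only one possible way to store the correspondence.

```python
import re

# Hypothetical rule table: each pre-rule (a plain string) maps to its
# pre-compiled regular rules. All entries are illustrative assumptions.
RULE_TABLE = {
    "getRuntime": [
        re.compile(r"\bRuntime\.getRuntime\(\s*\)\.exec\(\s*request\.getParameter\("),
    ],
    "ProcessBuilder": [
        re.compile(r"\bnew\s+ProcessBuilder\(\s*request\.getParameter\("),
    ],
}

def is_webshell(text: str) -> bool:
    """Stage 1: cheap substring test. Stage 2: full-text regular matching."""
    for pre_rule, regular_rules in RULE_TABLE.items():
        if pre_rule not in text:
            continue  # pre-rule missed: the regular rules are never evaluated
        if any(rule.search(text) for rule in regular_rules):
            return True
    return False

suspicious = '<% Runtime.getRuntime().exec(request.getParameter("c")); %>'
benign_hit = '<% long mem = Runtime.getRuntime().freeMemory(); %>'
benign = '<% out.println("hello"); %>'

print(is_webshell(suspicious), is_webshell(benign_hit), is_webshell(benign))
```

In this sketch `benign_hit` passes the pre-rule but fails the regular rule, while `benign` never reaches the regular matching stage at all, which is where the saving comes from.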
Fig. 2 is a flowchart of another embodiment of the webshell file detection method of the present invention. As shown in Fig. 2, the webshell file detection method in this embodiment includes:
S201: perform a depth-first traversal of a prefix tree, which serves as the preset pre-rule, against the file to be detected, where the prefix tree includes at least one end leaf node, each end leaf node corresponds to at least one regular rule, and the pre-rule is used to perform string matching on the file to be detected.
A prefix tree, also called a dictionary tree, word search tree, or trie, is a multi-way tree structure for fast lookup and a variant of the hash tree. It is typically used to count and sort large numbers of strings (though not only strings), so search engine systems often use it for word frequency statistics. Its advantages are that it minimizes unnecessary string comparisons and offers high query efficiency. The core idea of the prefix tree is to trade space for time: the common prefixes of the strings are used to reduce query time and thereby improve efficiency.
Prefix trees generally have the following features (see Fig. 4-1):
1) The root node contains no character, and every node other than the root contains one character.
2) Concatenating the characters on the path from the root node to a given node yields the string corresponding to that node.
3) The child nodes of any one node all contain different characters.
4) Each edge corresponds to a letter, and each node corresponds to a prefix. A leaf node corresponds to the longest prefix, i.e., the word itself.
Prefix trees are also widely applied, for example to fast string retrieval, string sorting, longest-common-prefix computation, and prefix-based auto-completion.
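The features listed above can be made concrete with a minimal Python sketch of a prefix tree. The class and method names (`TrieNode`, `Trie`, `insert`, `contains`, `starts_with`) and the sample words are assumptions chosen for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode; keys are unique per node
        self.is_end = False   # True if a stored string ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()  # the root itself holds no character

    def insert(self, word):
        node = self.root
        for ch in word:
            # Reuse the existing child for a shared prefix, else create one.
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def _walk(self, text):
        """Follow text from the root; return the node reached, or None."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_end

    def starts_with(self, prefix):
        return self._walk(prefix) is not None


trie = Trie()
for w in ("tea", "ten", "to"):
    trie.insert(w)

print(trie.contains("tea"), trie.contains("te"), trie.starts_with("te"))
```

Here "te" is a prefix shared by "tea" and "ten" and is stored only once, but it is not itself a stored word, so `contains("te")` is false while `starts_with("te")` is true.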
Performing a depth-first traversal of the prefix tree against the file to be detected means matching the file against the nodes of each branch of the prefix tree in turn.
A tree includes the following kinds of node: the root node is the topmost node of the tree; a child node is any node other than the root node that is connected beneath another node; a leaf node is a node with no further nodes connected beneath it (i.e., the end of a branch), also called an end leaf node or end node.
In some embodiments, the prefix tree is traversed depth-first against the file to be detected; if a branch of the prefix tree is completely matched by content in the file, the leaf node of that branch is obtained and the regular rule corresponding to that leaf node is looked up.
In some embodiments, each leaf node of the prefix tree may correspond to one regular rule or at least two regular rules.
S202: if, during the depth-first traversal, the file to be detected matches an end leaf node of the prefix tree, acquire the at least one regular rule corresponding to that end leaf node and perform full-text regular matching of the file against each of these regular rules in turn; if the file matches any one of them, determine that the file to be detected is a webshell file.
In some embodiments, pre-compiled regular rules are attached to the end leaf nodes of the prefix tree. The overall structure of the combined rules is shown in Fig. 4-1, in which three regular rules correspond to the leaf node r of the getClassLoader branch of the prefix tree.
In some embodiments, a leaf node and its corresponding at least one regular rule may be stored in a structured database; the at least one regular rule may be determined by looking up the table with the leaf node, and the file to be detected is then matched against each regular rule in turn.
The overall matching flow is shown in Fig. 4-2: in the matching stage, the prefix tree is traversed depth-first, and regular rules are loaded for regular matching only when an end leaf node of the prefix tree is reached. The short text prefix tree and the regular rules jointly determine whether a file is a webshell file, so most normal script files can be ruled out quickly and resource-intensive computation such as regular expression matching is reduced.
In some embodiments, if the file to be detected matches none of the at least one regular rule, the depth-first traversal of the prefix tree continues. If the traversal completes without finding any regular rule that the file matches, the file is determined to be a normal file; if the file matches a regular rule attached to the leaf node of some branch of the prefix tree, the file is determined to be a webshell file.
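The flow of S201 and S202, including the case where traversal continues after the regular rules of one leaf fail, can be sketched in Python as below. This is one possible reading of the description, not the patented implementation: the function names `build_trie` and `detect`, the dictionary-based node layout, and all short texts and regular rules in `RULES` are hypothetical.

```python
import re

def new_node():
    return {"children": {}, "rules": []}

def build_trie(rules):
    """rules maps each short text to a list of regex pattern strings."""
    root = new_node()
    for short_text, patterns in rules.items():
        node = root
        for ch in short_text:
            node = node["children"].setdefault(ch, new_node())
        node["rules"] = [re.compile(p) for p in patterns]  # attached at the end node
    return root

def detect(text, node, positions=None):
    """Depth-first traversal; positions are indices in text where the next
    character of the current branch would have to appear."""
    if positions is None:
        positions = range(len(text))
    for ch, child in node["children"].items():
        nxt = [p + 1 for p in positions if p < len(text) and text[p] == ch]
        if not nxt:
            continue  # the file cannot match this branch: prune it
        # End of a short text reached: only now run the costly regular rules.
        if child["rules"] and any(r.search(text) for r in child["rules"]):
            return True
        if detect(text, child, nxt):
            return True
    return False  # traversal finished without any regular rule matching

# Hypothetical short texts and regular rules, for illustration only.
RULES = {
    "getParameter": [r"\bexec\(\s*request\.getParameter\("],
    "getClassLoader": [r"\bgetClassLoader\(\s*\)\.loadClass\(", r"\bdefineClass\("],
}
root = build_trie(RULES)

webshell = 'Runtime.getRuntime().exec(request.getParameter("c"));'
normal_with_hit = 'String n = request.getParameter("name"); out.print(n);'
plain = 'out.println("hi");'
print(detect(webshell, root), detect(normal_with_hit, root), detect(plain, root))
```

For `plain`, the traversal stops at the first level because the file contains no "g", so no regular rule is ever evaluated; for `normal_with_hit`, the getParameter branch matches but its regular rule does not, and the traversal continues until every branch has been tried.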
This approach can greatly improve the detection speed for script files. In an experiment, a sample set consisting of 100 black samples and 117936 white samples was scanned once with 250 plain regular rules and once with 250 prefix tree + regular rules; with the latter, the scanning time was reduced by 53%, while the detection rate and the false alarm rate were identical.
Compared with the webshell file detection method shown in Fig. 1, the method of this embodiment specifies that a prefix tree serving as the preset pre-rule is traversed depth-first against the file to be detected, where the prefix tree includes at least one end leaf node and each end leaf node corresponds to at least one regular rule; if the file matches an end leaf node of the prefix tree during the traversal, the at least one regular rule corresponding to that end leaf node is acquired and the file is matched against each of these regular rules in turn over its full text, and if any one of them matches, the file is determined to be a webshell file. The prefix tree is traversed depth-first, and the regular expression rules attached to a leaf node are retrieved only when every node of the corresponding branch has been matched. Resource-intensive regular matching is therefore performed only on the suspected webshell files screened out by the prefix tree and is avoided for the large number of normal files, which greatly improves detection speed and saves system resources. In addition, because a suspected file can be matched against multiple regular rules, webshell files can be detected more accurately.
Fig. 3 is a flowchart of some embodiments of generating the prefix tree according to the present invention. As shown in Fig. 3, the method in this embodiment includes:
S301: generate a short text rule according to preset webshell samples and preset non-webshell samples.
Short text is relatively short text or characters (e.g., short text extracted word by word) that are extracted from the preset webshell samples and the preset non-webshell samples and that are characteristic of webshell files or of non-webshell files.
Among the short texts extracted from the preset webshell samples and the preset non-webshell samples, those characteristic of webshell samples are combined to obtain the short text rule.
The text or characters characteristic of webshell files or non-webshell files may be obtained by a word segmentation method based on string matching (also called mechanical word segmentation), a statistical word segmentation method, an understanding-based word segmentation method, short text analysis based on deep learning, or the like.
In some alternative implementations, generating the short text rule from the preset webshell sample and the preset non-webshell sample includes: determining at least one high-frequency character string corresponding to the webshell sample and the non-webshell sample according to the preset webshell sample and the preset non-webshell sample; and generating a short text rule according to at least one high-frequency character string corresponding to the webshell sample.
Term frequency-inverse document frequency (TF-IDF) is a common weighting technique used in information retrieval and data mining. TF stands for term frequency, and IDF stands for inverse document frequency.
In some embodiments, the high frequency strings may be manually screened.
In some optional implementations, determining at least one high frequency string corresponding to the webshell sample and the non-webshell sample from the preset webshell sample and the preset non-webshell sample includes: and determining at least one high-frequency character string corresponding to the webshell sample and the non-webshell sample through a TF-IDF algorithm according to the preset webshell sample and the preset non-webshell sample.
TF-IDF is a statistical method used to evaluate how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency at which it appears across the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the relevance between a document and a user query.
In some embodiments, the short text rule is derived from high-frequency strings obtained by TF-IDF-based machine learning over a large number of known webshell samples and normal script files. As shown in Fig. 4-3, keywords in the webshell samples, or in the normal script files, can be enumerated using the TF-IDF algorithm; the enumerated keywords may be used directly as the high-frequency strings, or the high-frequency strings may be selected from them by setting a threshold or by manual screening. The high-frequency strings capture features common to certain types of webshell scripts. Their main function is preliminary screening by means of the prefix tree generated from them, which excludes most normal files that are obviously not webshells. In the rule initialization stage, the short text rules are parsed character by character to generate a prefix tree structure, which speeds up matching.
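As a rough illustration of threshold-based screening, the Python sketch below scores identifier tokens with one common TF-IDF variant, tf × log(N / df), and keeps the tokens from webshell samples whose score reaches a threshold. The tokenizer, the scoring variant, the threshold value, the function names, and the three toy samples are all assumptions; the source does not specify them.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumption: identifiers are the candidate short texts.
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)

def tf_idf(documents):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency: count each term once per document
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores.append({t: (c / total) * math.log(n_docs / df[t])
                       for t, c in counts.items()})
    return scores

def high_frequency_strings(webshell_samples, normal_samples, threshold):
    scores = tf_idf(webshell_samples + normal_samples)
    selected = set()
    for doc_scores in scores[:len(webshell_samples)]:  # webshell documents only
        selected.update(t for t, s in doc_scores.items() if s >= threshold)
    return selected

# Toy samples invented for illustration.
webshell_samples = ['Runtime.getRuntime().exec(request.getParameter("c"));']
normal_samples = ['out.println(request.getAttribute("name"));',
                  'out.println("hello");']

selected = high_frequency_strings(webshell_samples, normal_samples, threshold=0.1)
print(sorted(selected))
```

On this toy data the token `request` is scored lower because it also appears in a normal sample, whereas an uninformative token such as `c` still passes the threshold, which suggests why manual screening is offered as an alternative or complement.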
Generating a short text rule from the at least one high-frequency string corresponding to the webshell samples means combining those high-frequency strings, which yields at least one word of high importance in the webshell samples.
Referring to Fig. 4-4, the prefix tree contains a short text rule composed of short texts such as getHeader, getParameter, getClassLoader and getDecoder, and each short text may be linked to one or more pre-compiled regular rules.
S302, analyzing at least one character string in the short text rule to generate a prefix tree.
In some embodiments, referring to fig. 4-4, at least one string in the short text rule is parsed into characters, and then a prefix tree is constructed from the characters.
In some alternative implementations, parsing at least one string in the short text rule to generate a prefix tree includes: and determining the prefix tree by taking each character of at least one character string in the short text rule as a node of the prefix tree.
Each character of the at least one string in the short text rule is taken in turn as a node of the prefix tree; where strings share the same preceding character, the next character of each string becomes a branch under that character. Along a shared prefix, identical characters are represented by the same node and differing characters are represented by different nodes, and so on.
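The effect of representing shared characters by the same node can be checked with a short Python sketch that builds a prefix tree from the four short texts named for Fig. 4-4 and counts its nodes. The nested-dictionary representation and the function names are assumptions made for brevity.

```python
def build_prefix_tree(short_texts):
    """Nested dicts: each key is one character, each value is that node's children."""
    root = {}
    for text in short_texts:
        node = root
        for ch in text:
            node = node.setdefault(ch, {})  # same character on a shared prefix -> same node
    return root

def count_nodes(node):
    """Number of character nodes below node (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.values())

short_texts = ["getHeader", "getParameter", "getClassLoader", "getDecoder"]
tree = build_prefix_tree(short_texts)

total_chars = sum(len(s) for s in short_texts)
node_count = count_nodes(tree)
branches_after_get = sorted(tree["g"]["e"]["t"])
print(total_chars, node_count, branches_after_get)
```

The four strings contain 45 characters in total but need only 36 nodes, because the common prefix "get" is stored once and the tree first branches at the fourth character.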
In the embodiments of generating the prefix tree provided by the present invention, the starting point is that detection of the file content features of webshell files is generally implemented by regular expression matching, which has two problems: regular expression rules cannot accurately match script types, and efficiency is low when large files are detected. Therefore, a short text rule is generated from the preset webshell samples and the preset non-webshell samples, and at least one string in the short text rule is parsed to generate a prefix tree. When a file is then detected, regular expression matching is skipped unless the file matches an end leaf node of the prefix tree, which greatly improves detection efficiency.
As an example, the file to be detected is a webshell file:
As an example, the regular rule corresponding to the "getDeclaredMethod" branch of the prefix tree is:
\bClassLoader\.getSystemClassLoader\(\s*\)(?:.|\n){0,80}\bProxy\.class\.getDeclaredMethod\(\s*"defineClass0"
Matching process: following the "getDeclaredMethod" branch of the prefix tree, the character g is searched for in the file; starting from the position of g, the characters e, t, D, e, c, l and so on are matched in sequence up to the final character d. If the short text is completely matched, the regular expression rule linked at its end is retrieved and regular expression matching over the full text begins; if the regular expression matches, the file is identified as a webshell file.
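The matching process just described can be reproduced with the regular rule quoted above, using Python's `re` module. The two file contents below are invented for illustration (they are not the example file of the source), and a plain substring test stands in for the traversal of the "getDeclaredMethod" branch.

```python
import re

SHORT_TEXT = "getDeclaredMethod"

# The regular rule quoted above, split over two raw-string literals.
REGULAR_RULE = re.compile(
    r'\bClassLoader\.getSystemClassLoader\(\s*\)(?:.|\n){0,80}'
    r'\bProxy\.class\.getDeclaredMethod\(\s*"defineClass0"'
)

def check(text):
    """Return (short text matched, regular rule matched) for one file."""
    short_hit = SHORT_TEXT in text  # stand-in for matching g, e, t, D, ... up to d
    regex_hit = short_hit and REGULAR_RULE.search(text) is not None
    return short_hit, regex_hit

# Hypothetical file contents.
suspect = (
    'ClassLoader cl = ClassLoader.getSystemClassLoader();\n'
    'Method m = Proxy.class.getDeclaredMethod("defineClass0", ClassLoader.class);\n'
)
harmless = 'Method m = Foo.class.getDeclaredMethod("bar");\n'

print(check(suspect), check(harmless))
```

Both files contain the short text, so both reach the regular matching stage, but only the first satisfies the regular rule; the `(?:.|\n){0,80}` part allows up to 80 arbitrary characters, including line breaks, between the two calls.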
Fig. 5 is a schematic structural diagram of an embodiment of the webshell file detection device of the present invention. As shown in Fig. 5, the device 500 includes:
a first processing module 501, configured to perform preliminary matching on a file to be detected according to a preset pre-rule, where the pre-rule is used to perform string matching on the file to be detected;
a second processing module 502, configured to acquire the regular rule corresponding to the pre-rule if the file matches the pre-rule, perform full-text regular matching of the file to be detected against the regular rule, and determine that the file to be detected is a webshell file if the file matches the regular rule.
Optionally, the first processing module 501 is further configured to:
perform a depth-first traversal of a prefix tree, which serves as the preset pre-rule, against the file to be detected, where the prefix tree includes at least one end leaf node and each end leaf node corresponds to at least one regular rule; and the second processing module 502 is further configured to:
if the file to be detected matches an end leaf node of the prefix tree during the depth-first traversal, acquire the at least one regular rule corresponding to that end leaf node, perform full-text regular matching of the file against each of these regular rules in turn, and determine that the file to be detected is a webshell file if the file matches any one of them.
Optionally, the apparatus further comprises:
a third processing module, configured to generate a short text rule according to preset webshell samples and preset non-webshell samples;
a fourth processing module, configured to parse at least one string in the short text rule to generate a prefix tree.
Optionally, the fourth processing module is further configured to:
determine the prefix tree by taking each character of the at least one string in the short text rule as a node of the prefix tree.
Optionally, the third processing module is further configured to:
determine at least one high-frequency string corresponding to the webshell samples and the non-webshell samples according to the preset webshell samples and the preset non-webshell samples;
generate a short text rule according to the at least one high-frequency string corresponding to the webshell samples.
Optionally, the third processing module is further configured to:
determine, by means of a TF-IDF algorithm, at least one high-frequency string corresponding to the webshell samples and the non-webshell samples according to the preset webshell samples and the preset non-webshell samples.
Examples are as follows:
Fig. 6 is a schematic diagram of the physical structure of an electronic device. As shown in Fig. 6, the electronic device may include a processor 601, a communications interface 602, a memory 603 and a communication bus 604, where the processor 601, the communications interface 602 and the memory 603 communicate with one another through the communication bus 604. The processor 601 may call logic instructions in the memory 603 to perform the following method: performing preliminary matching on the file to be detected according to a preset pre-rule, where the pre-rule is used to perform string matching on the file to be detected; if the file matches the pre-rule, acquiring the regular rule corresponding to the pre-rule and performing full-text regular matching of the file against the regular rule; and if the file matches the regular rule, determining that the file to be detected is a webshell file.
Further, the logic instructions in the memory 603 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art or in part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, an embodiment of the present invention further provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions that, when executed by a computer, enable the computer to perform the webshell file detection method provided in the foregoing embodiments, for example including: performing preliminary matching on the file to be detected according to a preset pre-rule, where the pre-rule is used to perform string matching on the file to be detected; if the file matches the pre-rule, acquiring the regular rule corresponding to the pre-rule and performing full-text regular matching of the file against the regular rule; and if the file matches the regular rule, determining that the file to be detected is a webshell file.
In still another aspect, the present invention further provides a non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, performs the webshell file detection method provided in the foregoing embodiments, for example including: performing preliminary matching on the file to be detected according to a preset pre-rule, where the pre-rule is used to perform string matching on the file to be detected; if the file matches the pre-rule, acquiring the regular rule corresponding to the pre-rule and performing full-text regular matching of the file against the regular rule; and if the file matches the regular rule, determining that the file to be detected is a webshell file.
The device embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solutions, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods of the embodiments or of certain parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.