CN116522329A - Webshell file detection method and device - Google Patents

Publication number
CN116522329A
Authority
CN
China
Prior art keywords
file
webshell
rule
detected
regular
Legal status
Pending
Application number
CN202210068318.8A
Other languages
Chinese (zh)
Inventor
闫雪
Current Assignee
Qax Technology Group Inc
Secworld Information Technology Beijing Co Ltd
Original Assignee
Qax Technology Group Inc
Secworld Information Technology Beijing Co Ltd
Application filed by Qax Technology Group Inc and Secworld Information Technology Beijing Co Ltd
Priority to CN202210068318.8A
Publication of CN116522329A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a webshell file detection method and device. The method comprises: performing preliminary matching on a file to be detected according to a preset pre-rule, wherein the pre-rule is used for string matching against the file to be detected; if the pre-rule matches, acquiring the regex rule corresponding to the pre-rule and performing full-text regex matching on the file to be detected; and if the regex rule matches, determining that the file to be detected is a webshell file. This improves detection speed and saves system resources.

Description

Webshell file detection method and device
Technical Field
The invention relates to the field of information security, in particular to a webshell file detection method and device.
Background
A webshell file is a type of Trojan backdoor. After intruding into a website, an attacker often places such backdoor files in the web directory of the website server, mixed in with normal web page files.
At present, the content of webshell files is generally detected by regular expression feature matching, which consumes substantial system resources (both CPU and memory).
Because most files are normal, the regex rules for webshell files fail to match the vast majority of files when a large number of web script files are scanned. A large amount of system resources is therefore spent on regex matching that turns out to be unnecessary, and scanning is slow.
Disclosure of Invention
Aiming at the problems in the prior art, the embodiment of the invention provides a webshell file detection method and device.
Specifically, the embodiment of the invention provides the following technical scheme:
In a first aspect, an embodiment of the present invention provides a webshell file detection method, including: performing preliminary matching on the file to be detected according to a preset pre-rule, wherein the pre-rule is used for string matching against the file to be detected; if the pre-rule matches, acquiring the regex rule corresponding to the pre-rule and performing full-text regex matching on the file to be detected; and if the regex rule matches, determining that the file to be detected is a webshell file.
Further, performing preliminary matching on the file to be detected according to a preset pre-rule includes: performing depth-first traversal on a prefix tree serving as the preset pre-rule according to the file to be detected, wherein the prefix tree comprises at least one end leaf node, and each end leaf node corresponds to at least one regex rule. Acquiring the regex rule corresponding to the pre-rule if the pre-rule matches, performing full-text regex matching on the file to be detected, and determining that the file is a webshell file if the regex rule matches includes: if the file to be detected matches an end leaf node of the prefix tree during depth-first traversal, acquiring the at least one regex rule corresponding to that end leaf node, performing full-text regex matching of the file against the at least one regex rule in sequence, and, if the file matches any one of the at least one regex rule, determining that the file is a webshell file.
Further, before the preliminary matching of the file to be detected according to the preset pre-rule, the method further includes: generating a short text rule according to a preset webshell sample and a preset non-webshell sample; and parsing at least one character string in the short text rule to generate the prefix tree.
Further, parsing at least one character string in the short text rule to generate the prefix tree includes: taking each character of the at least one character string in the short text rule as a node of the prefix tree, and thereby determining the prefix tree.
Further, generating a short text rule according to the preset webshell sample and the preset non-webshell sample includes: determining at least one high-frequency character string corresponding to the webshell sample and the non-webshell sample according to the preset webshell sample and the preset non-webshell sample; and generating the short text rule according to the at least one high-frequency character string corresponding to the webshell sample.
Further, determining at least one high-frequency character string corresponding to the webshell sample and the non-webshell sample according to the preset webshell sample and the preset non-webshell sample includes: determining the at least one high-frequency character string through a TF-IDF algorithm according to the preset webshell sample and the preset non-webshell sample.
In a second aspect, an embodiment of the present invention further provides a webshell file detection device, including: a first processing module, configured to perform preliminary matching on the file to be detected according to a preset pre-rule, wherein the pre-rule is used for string matching against the file to be detected; and a second processing module, configured to acquire the regex rule corresponding to the pre-rule if the pre-rule matches, perform full-text regex matching on the file to be detected, and determine that the file to be detected is a webshell file if the regex rule matches.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the steps of the webshell file detection method according to the first aspect are implemented when the processor executes the program.
In a fourth aspect, embodiments of the present invention also provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the webshell file detection method according to the first aspect.
In a fifth aspect, embodiments of the present invention further provide a computer program product having stored thereon executable instructions that, when executed by a processor, cause the processor to implement the steps of the webshell file detection method of the first aspect.
According to the webshell file detection method and device, preliminary matching is performed on the file to be detected according to a preset pre-rule, which is used for string matching against the file; if the pre-rule matches, the corresponding regex rule is acquired and full-text regex matching is performed; and if the regex rule matches, the file to be detected is determined to be a webshell file. When rules are matched against the files to be detected, the pre-rule is used for preliminary screening first, and regular expression matching follows. This avoids regex matching on the large number of normal files: the resource-intensive regex matching is performed only on the suspected webshell files that pass the pre-rule screening, which greatly improves detection speed and saves system resources.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of an embodiment of a webshell file detection method of the present invention;
FIG. 2 is a flowchart of another embodiment of a webshell file detection method of the present invention;
FIG. 3 is a flow chart of some embodiments of how a prefix tree is generated;
FIG. 4-1 is a schematic diagram of an application scenario of a prefix tree;
FIG. 4-2 is a schematic diagram of an application scenario of the webshell file detection method;
FIG. 4-3 is a schematic diagram of an application scenario for generating short text rules;
FIG. 4-4 is a schematic diagram of an application scenario in which a prefix tree is generated from short text rules;
FIG. 5 is a schematic structural diagram of an embodiment of the webshell file detection device of the present invention;
FIG. 6 is a schematic structural diagram of an embodiment of an electronic device according to the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of an embodiment of a webshell file detection method of the present invention. As shown in fig. 1, the webshell file detection method in the embodiment of the invention includes:
S101, performing preliminary matching on the file to be detected according to a preset pre-rule, wherein the pre-rule is used for string matching against the file to be detected.
The file to be detected may be a web script file, for example a jsp script file. A script is an executable file written in a particular descriptive language and is usually invoked and executed on demand by an application. Scripts are widely used in web server programs, which split computation between client and server when providing customized user interfaces for web-based applications. A normal web script file parses, computes on, and stores the data submitted by the user, then returns the result to the user's browser over a protocol such as HTTP. A webshell file, by contrast, is a web script file crafted by an attacker and placed on the server so that the attacker can access it to manipulate the programs and data on the server.
As an example, the pre-rules may be any rules set as needed.
A regular expression is a logical formula that operates on character strings. A "regular string" is formed from predefined specific characters and combinations of them, including ordinary characters (for example, the letters a to z) and special characters (called "metacharacters"), and this "regular string" expresses a filtering logic for character strings. In other words, a regular expression is a text pattern that describes one or more strings to be matched when searching text.
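As a small illustration of ordinary characters versus metacharacters, the following sketch uses Python's standard `re` module. The pattern and the sample strings are hypothetical and are not taken from the rule set of the invention.

```python
import re

# Hypothetical pattern: the literal text "getParameter", optional whitespace,
# then an opening parenthesis. The letters are ordinary characters that match
# themselves; "\s", "*" and "\(" are metacharacter constructs.
pattern = re.compile(r"getParameter\s*\(")

samples = [
    'String cmd = request.getParameter("cmd");',  # the call is present
    "String name = getParameterName;",            # no parenthesis follows
]

results = [bool(pattern.search(s)) for s in samples]
print(results)  # [True, False]
```

The second sample contains the ordinary characters of the pattern but fails on the escaped parenthesis, which is the kind of filtering logic the paragraph above describes.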
S102, if the pre-rule matches, acquiring the regex rule corresponding to the pre-rule and performing full-text regex matching on the file to be detected; if the regex rule matches, determining that the file to be detected is a webshell file.
In some embodiments, the regex rules corresponding to each pre-rule may be stored in a table, and the regex rule corresponding to a matched pre-rule may be determined by looking it up in the table. The invention does not limit how the correspondence between pre-rules and regex rules is stored or searched.
A webshell is a backdoor that an attacker leaves on a web server using a scripting language; it is also known as a web Trojan. It is a code execution environment that exists in the form of a web page file such as asp, php, jsp, or cgi, and was originally intended for operations such as website management, server management, and permission management: after a code file is uploaded, many routine operations can be carried out by visiting it through the website, which greatly facilitates the management of websites and servers. For the same reason, the code in such a file can be modified and used as a backdoor program to control the website server. The regex rules are used to detect webshell files that have been modified and used as backdoor programs in this way.
According to the webshell file detection method provided by the embodiment of the invention, preliminary matching is performed on the file to be detected according to a preset pre-rule, which is used for string matching against the file; if the pre-rule matches, the corresponding regex rule is acquired and full-text regex matching is performed; and if the regex rule matches, the file to be detected is determined to be a webshell file. In other words, the pre-rule is used for preliminary screening first, and regular expression matching follows. This avoids regex matching on the large number of normal files: the resource-intensive regex matching is performed only on the suspected webshell files that pass the pre-rule screening, which greatly improves detection speed and saves system resources.
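The two-stage flow of S101 and S102 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the table `RULE_TABLE`, the function `is_webshell`, the pre-rule strings, and the regex patterns are all hypothetical, and a plain substring test stands in for the pre-rule string matching.

```python
import re

# Hypothetical table: pre-rule string -> precompiled regex rules linked to it.
RULE_TABLE = {
    "getRuntime": [re.compile(r"Runtime\.getRuntime\(\s*\)\.exec\(")],
    "defineClass": [re.compile(r"\bdefineClass\w*\(")],
}

def is_webshell(text: str) -> bool:
    """Stage 1: cheap string match. Stage 2: full-text regex, only if stage 1 hits."""
    for pre_rule, regex_rules in RULE_TABLE.items():
        if pre_rule not in text:
            continue  # pre-rule failed, so the costly regex rules are skipped
        if any(rx.search(text) for rx in regex_rules):
            return True
    return False

normal = "<html><body>Hello</body></html>"
near_miss = "Object r = getRuntimeInfo();"  # pre-rule hits, regex does not
suspicious = 'Runtime.getRuntime().exec(request.getParameter("c"));'

print(is_webshell(normal), is_webshell(near_miss), is_webshell(suspicious))
```

For the normal page no regex is ever executed, which is where the saving described above would come from; the near-miss case shows that a pre-rule hit alone does not classify a file as a webshell.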
Fig. 2 is a flowchart of another embodiment of the webshell file detection method of the present invention. As shown in fig. 2, the webshell file detection method in the embodiment of the invention includes:
S201, performing depth-first traversal on a prefix tree serving as the preset pre-rule according to the file to be detected, wherein the prefix tree comprises at least one end leaf node, each end leaf node corresponds to at least one regex rule, and the pre-rule is used for string matching against the file to be detected.
A prefix tree, also called a dictionary tree, word search tree, or trie, is a multi-way tree structure for fast lookup and is sometimes described as a variant of a hash tree. It is typically used to count and sort large numbers of strings (though it is not limited to strings), and is therefore often used by search engine systems for text word-frequency statistics. Its advantages are that it minimizes unnecessary string comparisons and offers high query efficiency. The core idea of the prefix tree is to trade space for time: the common prefixes of strings are exploited to reduce query time and thereby improve efficiency.
Prefix trees generally have several features, and reference may be made to fig. 4-1:
1) The root node does not contain a character and each child node except the root node contains a character.
2) The characters on the path from the root node to a certain node are connected, namely the character string corresponding to the node.
3) All child nodes of each node contain different characters.
4) Each edge corresponds to a letter, each node corresponds to a prefix, and a leaf node corresponds to the longest prefix, that is, the word itself.
Prefix tree applications are also very widespread, such as fast retrieval of strings, string ordering, longest common prefix, automatically matching prefix display suffix, etc.
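As an illustration of one of these applications (completing a prefix to the stored strings), here is a minimal Python sketch of a prefix tree built from nested dictionaries. The names `END`, `build_trie`, and `complete` are hypothetical, and the stored words are borrowed from the short texts mentioned later in this description.

```python
END = ""  # hypothetical end-of-word marker; it cannot collide with a one-character key

def build_trie(words):
    root = {}  # the root node holds no character
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})  # children of a node have distinct characters
        node[END] = True
    return root

def complete(trie, prefix):
    """Return every stored word that starts with `prefix`."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [(node, prefix)]
    while stack:  # depth-first walk below the prefix node
        current, text = stack.pop()
        for key, child in current.items():
            if key == END:
                found.append(text)  # path from the root spells the word
            else:
                stack.append((child, text + key))
    return sorted(found)

trie = build_trie(["getHeader", "getParameter", "getDecoder"])
print(complete(trie, "getD"))  # ['getDecoder']
```

The lookup cost depends on the length of the prefix rather than on the number of stored words, which reflects the space-for-time trade described above.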
Performing depth-first traversal of the prefix tree according to the file to be detected means traversing and matching the nodes of each branch of the prefix tree against the file.
A tree includes the following kinds of node. Root node: the topmost node of the tree. Child node: a node other than the root node that is connected below another node. Leaf node: a node with no nodes connected below it (that is, an end); it is also called an end leaf node or end node.
In some embodiments, the prefix tree is traversed depth-first against the file to be detected; if a branch of the prefix tree is completely matched by content in the file, the leaf node of that branch is obtained and the regex rule corresponding to the leaf node is looked up.
In some embodiments, each leaf node of the prefix tree may correspond to one regular rule or at least two regular rules.
S202, if the matching between the file to be detected and one tail end leaf node of the prefix tree is successful in depth-first traversal, at least one regular rule corresponding to the tail end leaf node is obtained, full-text regular matching is carried out on the file to be detected and the at least one regular rule in sequence, and if the matching between the file to be detected and one regular rule in the at least one regular rule is successful, the file to be detected is determined to be a webshell file.
In some embodiments, precompiled regex rules are attached at the end leaf nodes of the prefix tree. The overall rule structure is shown in FIG. 4-1, in which the leaf node r of the getClassLoader branch corresponds to three regex rules.
In some embodiments, each leaf node and its corresponding at least one regex rule may be stored in a structured database; the at least one regex rule is determined by querying a table according to the leaf node, and the regex rules are then matched against the file to be detected in turn.
The overall matching flow is shown in FIG. 4-2: in the matching stage, the prefix tree is traversed depth-first, and regex rules are loaded for regex matching only when an end leaf node of the prefix tree is reached. The short text prefix tree and the regex rules jointly determine whether a file is a webshell file, so most normal script files can be ruled out quickly and resource-intensive computation such as regular expression matching is reduced.
In some embodiments, if the file to be detected fails to match every one of the at least one regex rule, depth-first traversal of the prefix tree continues. Either the traversal completes without finding a regex rule that matches the file, in which case the file is determined to be a normal file; or the file matches a regex rule at the leaf node of some branch of the prefix tree, in which case it is determined to be a webshell file.
In this way, the detection speed for script files can be greatly improved. Experiments show that, for a sample set of 100 black (webshell) samples and 117936 white (normal) samples, scanning with 250 prefix tree + regex rules takes 53% less time than scanning with 250 ordinary regex rules, while the detection rate and the false alarm rate are identical.
Compared with the webshell file detection method shown in FIG. 1, the method of this embodiment specifies that a prefix tree serving as the preset pre-rule is traversed depth-first according to the file to be detected, wherein the prefix tree comprises at least one end leaf node and each end leaf node corresponds to at least one regex rule. If the file to be detected matches an end leaf node of the prefix tree during depth-first traversal, the at least one regex rule corresponding to that end leaf node is acquired and full-text regex matching is performed against the at least one regex rule in sequence; if the file matches any one of them, it is determined to be a webshell file. The prefix tree is traversed depth-first, and the regex rules of a branch's leaf node are fetched only when all nodes of that branch have been matched. This avoids regex matching on the large number of normal files: the resource-intensive regex matching, possibly against several rules, is performed only on the suspected webshell files screened by the prefix tree, which greatly improves detection speed and saves system resources. Matching against multiple regex rules can also detect webshell files more accurately.
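The flow of S201 and S202, including the "continue traversing if the regex rules fail" case, can be sketched as below. This is one plausible reading, not the patented implementation: the text does not spell out how the file drives the traversal, so the sketch tries every start position in the file and descends the tree from there. The class `Node`, the functions `build` and `detect`, and the rules in `RULES` are hypothetical.

```python
import re

class Node:
    def __init__(self):
        self.children = {}     # character -> Node
        self.regex_rules = []  # non-empty only where a short text ends

def build(rules):
    """rules: dict mapping a short text to a list of regex pattern strings."""
    root = Node()
    for short_text, patterns in rules.items():
        node = root
        for ch in short_text:
            node = node.children.setdefault(ch, Node())
        node.regex_rules = [re.compile(p) for p in patterns]
    return root

def detect(text, root):
    """True if a short text matches AND one of its linked regex rules matches."""
    tried = set()  # leaves whose regex rules were already run on this text
    for start in range(len(text)):
        node, i = root, start
        while i < len(text):
            node = node.children.get(text[i])
            if node is None:
                break  # this branch does not match here
            if node.regex_rules and id(node) not in tried:
                tried.add(id(node))
                # Full-text regex matching, one rule after another.
                if any(rx.search(text) for rx in node.regex_rules):
                    return True
                # All rules failed: keep traversing instead of giving up.
            i += 1
    return False  # traversal finished with no regex match: normal file

# Hypothetical short texts and regex rules, for illustration only.
RULES = {
    "getClassLoader": [r"getClassLoader\(\s*\)\.loadClass\("],
    "getRuntime": [r"Runtime\.getRuntime\(\s*\)\.exec\("],
}
root = build(RULES)
print(detect("x = Runtime.getRuntime().exec(cmd);", root))
```

A file such as `"obj.getClassLoader();"` reaches a leaf but fails its regex rule, so traversal continues and the file is finally reported as normal.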
FIG. 3 is a flow chart of some embodiments of how the present invention generates a prefix tree. As shown in fig. 3, the method of the embodiment of the present invention includes:
S301, generating a short text rule according to a preset webshell sample and a preset non-webshell sample.
Short texts are relatively short texts or characters (for example, short texts extracted word by word) that are extracted from the preset webshell samples and preset non-webshell samples and that are characteristic of webshell files or of non-webshell files.
From the short texts extracted from the preset webshell samples and preset non-webshell samples, those characteristic of webshell samples are combined to obtain the short text rule.
Such texts or characters may be obtained by word segmentation based on string matching (also called mechanical word segmentation), statistical word segmentation, understanding-based word segmentation, short text analysis based on deep learning, and the like.
In some alternative implementations, generating the short text rule from the preset webshell sample and the preset non-webshell sample includes: determining at least one high-frequency character string corresponding to the webshell sample and the non-webshell sample according to the preset webshell sample and the preset non-webshell sample; and generating a short text rule according to at least one high-frequency character string corresponding to the webshell sample.
Term frequency-inverse document frequency (TF-IDF) is a common weighting technique used in information retrieval and data mining. TF is the term frequency, and IDF is the inverse document frequency.
In some embodiments, the high frequency strings may be manually screened.
In some optional implementations, determining at least one high frequency string corresponding to the webshell sample and the non-webshell sample from the preset webshell sample and the preset non-webshell sample includes: and determining at least one high-frequency character string corresponding to the webshell sample and the non-webshell sample through a TF-IDF algorithm according to the preset webshell sample and the preset non-webshell sample.
TF-IDF is a statistical method for evaluating how important a word is to one document in a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with its frequency of occurrence across the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the relevance between a document and a user query.
In some embodiments, the short text rule is derived from high-frequency character strings obtained by TF-IDF-based machine learning over a large number of known webshell samples and normal script files. As shown in FIG. 4-3, the keywords in the webshell samples, or in the normal script files, may be enumerated with the TF-IDF algorithm; the enumerated keywords may be used directly as high-frequency character strings, or the high-frequency character strings may be selected from them by setting a threshold or by manual screening. The high-frequency character strings capture features common to certain types of webshell scripts. Their main function is preliminary screening: the prefix tree generated from them eliminates most normal files that are obviously not webshells. In the rule initialization stage, the short text rules are parsed character by character to generate the prefix tree structure, which improves matching speed.
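The threshold-based selection can be sketched in Python as follows. This is a minimal sketch under several assumptions that the text does not fix: tokens are identifier-like words, TF is the token count divided by the document length, IDF is the logarithm of the number of documents over the document frequency, and a token is kept if its best score in any webshell sample reaches the threshold. The function names, the toy samples, and the threshold value are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[A-Za-z_]\w*", text)  # assumed tokenization

def tfidf_by_doc(docs):
    token_lists = [tokenize(d) for d in docs]
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))  # document frequency of each token
    n_docs = len(docs)
    scores = []
    for tokens in token_lists:
        counts, total = Counter(tokens), len(tokens) or 1
        scores.append({t: (c / total) * math.log(n_docs / df[t])
                       for t, c in counts.items()})
    return scores

def candidate_short_texts(webshell_docs, normal_docs, threshold):
    scores = tfidf_by_doc(webshell_docs + normal_docs)
    best = Counter()
    for doc_scores in scores[:len(webshell_docs)]:  # webshell samples only
        for token, s in doc_scores.items():
            best[token] = max(best[token], s)
    return sorted(t for t, s in best.items() if s >= threshold)

# Toy corpus, invented for this sketch.
webshell_docs = ["out print getRuntime exec", "getRuntime exec out"]
normal_docs = ["out print title", "out print body", "out print footer"]
print(candidate_short_texts(webshell_docs, normal_docs, 0.2))
# ['exec', 'getRuntime']
```

Tokens such as `out` and `print` appear in nearly every sample, so their IDF is close to zero and they fall below the threshold, while tokens concentrated in the webshell samples survive as candidate short texts.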
And generating a short text rule according to at least one high-frequency character string corresponding to the webshell sample, namely integrating the at least one high-frequency character string together to obtain at least one word with higher importance in the webshell sample.
Referring to FIG. 4-4, the prefix tree contains a short text rule consisting of short texts such as getHeader, getParameter, getClassLoader, and getDecoder, and each short text can be linked to one or more precompiled regex rules.
S302, analyzing at least one character string in the short text rule to generate a prefix tree.
In some embodiments, referring to fig. 4-4, at least one string in the short text rule is parsed into characters, and then a prefix tree is constructed from the characters.
In some alternative implementations, parsing at least one character string in the short text rule to generate a prefix tree includes: taking each character of the at least one character string in the short text rule as a node of the prefix tree, thereby determining the prefix tree.
The characters of each character string in the short text rule are taken in order as nodes of the prefix tree. If two character strings share the same preceding characters, those characters are represented by the same nodes, and the next character of each string becomes a branch under the last shared character; where the characters differ, they are represented by different nodes, and so on.
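The node-sharing rule described above can be sketched as follows. The class name `TrieNode` and the helpers `build_prefix_tree` and `count_nodes` are hypothetical names chosen for this illustration; the four short texts are the ones named for fig. 4-4.

```python
class TrieNode:
    """One character position in the prefix tree."""

    def __init__(self):
        self.children = {}      # maps a character to the child TrieNode
        self.is_end = False     # True when a complete short text ends here


def build_prefix_tree(short_texts):
    """Insert every short text character by character, sharing common prefixes."""
    root = TrieNode()
    for text in short_texts:
        node = root
        for ch in text:
            # Reuse the existing node when the character is already present,
            # otherwise open a new branch under the current node.
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


def count_nodes(node):
    """Count all character nodes below (and excluding) the given node."""
    return sum(1 + count_nodes(child) for child in node.children.values())


short_texts = ["getHeader", "getParameter", "getClassLoader", "getDecoder"]
root = build_prefix_tree(short_texts)

# Walk down the shared "get" prefix and list the branches that follow it.
node = root
for ch in "get":
    node = node.children[ch]
print(sorted(node.children))   # ['C', 'D', 'H', 'P']
print(count_nodes(root))       # 36 nodes, versus 45 characters stored separately
```

The shared prefix "get" is stored once, so the four strings branch only at their fourth character.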
In the embodiments of the present invention that describe how to generate the prefix tree, the starting point is that detection of the content features of webshell files is generally implemented by regular expression matching, which has two problems: regular expression rules cannot accurately match script types, and efficiency is low when large files are detected. Therefore, a short text rule is generated from a preset webshell sample and a preset non-webshell sample, and at least one character string in the short text rule is parsed to generate a prefix tree. When a file is detected, if the file does not successfully match a tail end leaf node of the prefix tree, no regular expression rule matching is performed, which greatly improves detection efficiency.
As an example, the file to be detected is a webshell file:
As an example, the regular expression rule corresponding to the "getDeclaredMethod" branch of the prefix tree is:
\bClassLoader\.getSystemClassLoader\(\s*\)(?:.|\n){0,80}\bProxy\.class\.getDeclaredMethod\(\s*"defineClass0"
Matching process: following the "getDeclaredMethod" branch of the prefix tree, the character g is searched for in the file; starting from the position of g, the characters e, t, D, e, c, l, and so on up to the final d are matched in sequence. If the short text is matched completely, the regular expression rule linked to the end of the short text is retrieved and regular expression matching is started over the full text; if the regular expression matches, the file is identified as a webshell file.
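The two-stage matching process above can be sketched as follows. This is an illustrative sketch under several assumptions: the prefix tree is stored as nested dictionaries, the hypothetical key `END` marks a complete short text and holds its linked regular expressions, the file is scanned from every start position, and the three sample texts are invented for the example. Only the short text "getDeclaredMethod" and its regular expression are taken from the description.

```python
import re

END = "$rules"  # hypothetical marker key; holds the regexes linked to a short text


def build_prefix_tree(rules):
    """rules maps each short text to the regex patterns linked to its last node."""
    root = {}
    for short_text, patterns in rules.items():
        node = root
        for ch in short_text:
            node = node.setdefault(ch, {})
        node.setdefault(END, []).extend(re.compile(p) for p in patterns)
    return root


def is_webshell(root, text):
    """Run a full-text regex only after its short text was found in the file."""
    tried = set()  # patterns already run against the full text
    for start in range(len(text)):
        node = root
        i = start
        # Follow the tree character by character from this start position.
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            for rx in node.get(END, ()):
                if rx.pattern in tried:
                    continue
                tried.add(rx.pattern)
                if rx.search(text):   # second stage: full-text regex matching
                    return True
    return False


tree = build_prefix_tree({
    "getDeclaredMethod": [
        r'\bClassLoader\.getSystemClassLoader\(\s*\)(?:.|\n){0,80}'
        r'\bProxy\.class\.getDeclaredMethod\(\s*"defineClass0"'
    ],
})

# Hypothetical inputs for illustration only.
suspicious = (
    "ClassLoader cl = ClassLoader.getSystemClassLoader();\n"
    'Method m = Proxy.class.getDeclaredMethod("defineClass0", ClassLoader.class);'
)
no_keyword = 'String name = request.getParameter("name");'
keyword_only = 'Method m = Foo.class.getDeclaredMethod("run");'

print(is_webshell(tree, suspicious))    # True
print(is_webshell(tree, no_keyword))    # False, the regex is never run
print(is_webshell(tree, keyword_only))  # False, short text hit but regex fails
```

For a file that never contains the short text, the regular expression is never evaluated, which is the efficiency gain the description attributes to the prefix tree.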
Fig. 5 is a schematic structural diagram of an embodiment of the webshell file detection device of the present invention. As shown in fig. 5, the apparatus 500 includes:
the first processing module 501 is configured to perform primary matching on a file to be detected according to a preset pre-rule, where the pre-rule is used to perform character string matching on the file to be detected;
the second processing module 502 is configured to obtain a regular rule corresponding to the pre-rule if the matching with the pre-rule is successful, perform full-text regular matching on the file to be detected against the regular rule, and determine that the file to be detected is a webshell file if the matching with the regular rule is successful.
Optionally, the first processing module 501 is further configured to:
perform depth-first traversal on a prefix tree serving as the preset pre-rule according to the file to be detected, wherein the prefix tree comprises at least one tail end leaf node, and each tail end leaf node corresponds to at least one regular rule;
and the second processing module 502 is further configured to:
if the file to be detected successfully matches a tail end leaf node of the prefix tree during the depth-first traversal, obtain the at least one regular rule corresponding to that tail end leaf node, perform full-text regular matching of the file to be detected against the at least one regular rule in sequence, and determine that the file to be detected is a webshell file if it successfully matches any one of the at least one regular rule.
Optionally, the apparatus further comprises:
a third processing module, configured to generate a short text rule according to a preset webshell sample and a preset non-webshell sample;
a fourth processing module, configured to parse at least one character string in the short text rule to generate a prefix tree.
Optionally, the fourth processing module is further configured to:
determine the prefix tree by taking each character of the at least one character string in the short text rule as a node of the prefix tree.
Optionally, the third processing module is further configured to:
determine at least one high-frequency character string corresponding to the webshell sample and the non-webshell sample according to the preset webshell sample and the preset non-webshell sample;
generate a short text rule according to the at least one high-frequency character string corresponding to the webshell sample.
Optionally, the third processing module is further configured to:
determine the at least one high-frequency character string corresponding to the webshell sample and the non-webshell sample through a TF-IDF algorithm according to the preset webshell sample and the preset non-webshell sample.
Examples are as follows:
Fig. 6 illustrates the physical structure of an electronic device. As shown in fig. 6, the electronic device may include: a processor 601, a communication interface (Communications Interface) 602, a memory 603 and a communication bus 604, wherein the processor 601, the communication interface 602 and the memory 603 communicate with one another through the communication bus 604. The processor 601 may call logic instructions in the memory 603 to perform the following method: performing primary matching on the file to be detected according to a preset pre-rule, wherein the pre-rule is used for performing character string matching on the file to be detected; if the matching with the pre-rule is successful, obtaining a regular rule corresponding to the pre-rule and performing full-text regular matching on the file to be detected against the regular rule; and if the file to be detected is successfully matched with the regular rule, determining that the file to be detected is a webshell file.
Further, the logic instructions in the memory 603 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, an embodiment of the present invention further provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the webshell file detection method provided in the foregoing embodiments, for example, including: performing primary matching on the file to be detected according to a preset pre-rule, wherein the pre-rule is used for performing character string matching on the file to be detected; if the matching with the pre-rule is successful, obtaining a regular rule corresponding to the pre-rule and performing full-text regular matching on the file to be detected against the regular rule; and if the file to be detected is successfully matched with the regular rule, determining that the file to be detected is a webshell file.
In still another aspect, the present invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the webshell file detection method provided in the above embodiments, for example, including: performing primary matching on the file to be detected according to a preset pre-rule, wherein the pre-rule is used for performing character string matching on the file to be detected; if the matching with the pre-rule is successful, obtaining a regular rule corresponding to the pre-rule and performing full-text regular matching on the file to be detected against the regular rule; and if the file to be detected is successfully matched with the regular rule, determining that the file to be detected is a webshell file.
The apparatus embodiments described above are merely illustrative, in which the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on such understanding, the foregoing technical solutions may be embodied essentially or in part in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the various embodiments or methods of some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A webshell file detection method, characterized by comprising:
performing primary matching on a file to be detected according to a preset pre-rule, wherein the pre-rule is used for performing character string matching on the file to be detected;
if the matching with the pre-rule is successful, acquiring a regular rule corresponding to the pre-rule, and performing full-text regular matching on the file to be detected and the regular rule; and if the file to be detected is successfully matched with the regular rule, determining that the file to be detected is a webshell file.
2. The webshell file detection method according to claim 1, wherein performing primary matching on the file to be detected according to the preset pre-rule includes:
performing depth-first traversal on a prefix tree serving as the preset pre-rule according to the file to be detected, wherein the prefix tree comprises at least one tail end leaf node, and each tail end leaf node corresponds to at least one regular rule; and
wherein, if the matching with the pre-rule is successful, acquiring the regular rule corresponding to the pre-rule, performing full-text regular matching on the file to be detected and the regular rule, and, if the matching with the regular rule is successful, determining that the file to be detected is a webshell file, includes:
if the file to be detected successfully matches a tail end leaf node of the prefix tree during the depth-first traversal, acquiring the at least one regular rule corresponding to that tail end leaf node, performing full-text regular matching of the file to be detected against the at least one regular rule in sequence, and, if the file to be detected successfully matches any one of the at least one regular rule, determining that the file to be detected is a webshell file.
3. The webshell file detection method according to claim 2, wherein before the primary matching is performed on the file to be detected according to the preset pre-rule, the method further comprises:
generating a short text rule according to a preset webshell sample and a preset non-webshell sample;
and analyzing at least one character string in the short text rule to generate the prefix tree.
4. The webshell file detection method of claim 3, wherein parsing at least one string in the short text rule to generate the prefix tree includes:
and taking each character of at least one character string in the short text rule as a node of the prefix tree, and determining the prefix tree.
5. The webshell file detection method according to claim 3, wherein the generating short text rules according to the preset webshell samples and the preset non-webshell samples includes:
determining at least one high-frequency character string corresponding to a webshell sample and a non-webshell sample according to the preset webshell sample and the preset non-webshell sample;
and generating the short text rule according to at least one high-frequency character string corresponding to the webshell sample.
6. The webshell file detection method according to claim 5, wherein the determining at least one high frequency string corresponding to the webshell sample and the non-webshell sample according to the preset webshell sample and the preset non-webshell sample includes:
and determining at least one high-frequency character string corresponding to the webshell sample and the non-webshell sample through a TF-IDF algorithm according to the preset webshell sample and the preset non-webshell sample.
7. A webshell file detection device, characterized by comprising:
a first processing module, configured to perform primary matching on a file to be detected according to a preset pre-rule, wherein the pre-rule is used for performing character string matching on the file to be detected;
and a second processing module, configured to acquire a regular rule corresponding to the pre-rule if the matching with the pre-rule is successful, perform full-text regular matching on the file to be detected and the regular rule, and determine that the file to be detected is a webshell file if the file to be detected is successfully matched with the regular rule.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor performs the steps of the webshell file detection method of any of claims 1 to 6 when the program is executed.
9. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the webshell file detection method of any of claims 1 to 6.
10. A computer program product having stored thereon executable instructions which, when executed by a processor, cause the processor to carry out the steps of the webshell file detection method according to any of claims 1 to 6.
CN202210068318.8A 2022-01-20 2022-01-20 Webshell file detection method and device Pending CN116522329A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210068318.8A CN116522329A (en) 2022-01-20 2022-01-20 Webshell file detection method and device

Publications (1)

Publication Number Publication Date
CN116522329A true CN116522329A (en) 2023-08-01

Family

ID=87394513

Country Status (1)

Country Link
CN (1) CN116522329A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Country or region after: China
Address after: Room 332, 3 / F, Building 102, 28 xinjiekouwei street, Xicheng District, Beijing 100088
Applicant after: QAX Technology Group Inc.
Applicant after: Qianxin Wangshen information technology (Beijing) Co.,Ltd.
Address before: Room 332, 3 / F, Building 102, 28 xinjiekouwei street, Xicheng District, Beijing 100088
Applicant before: QAX Technology Group Inc.
Country or region before: China
Applicant before: LEGENDSEC INFORMATION TECHNOLOGY (BEIJING) Inc.