CN113821448A - Webshell code detection method and device and readable storage medium - Google Patents

Info

Publication number
CN113821448A
Authority
CN
China
Prior art keywords
code data
code
preset
preprocessed
detected
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111382367.0A
Other languages
Chinese (zh)
Inventor
徐钟豪
陈伟
谢忱
刘伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Douxiang Information Technology Co ltd
Original Assignee
Shanghai Douxiang Information Technology Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G06F11/3616 Software analysis for verifying properties of programs using software metrics

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application provides a Webshell code detection method and device and a readable storage medium. The Webshell code detection method comprises the following steps: acquiring code data to be detected; preprocessing the code data to be detected to obtain preprocessed code data; performing semantic abstraction processing on the preprocessed code data to obtain semantically abstracted code data having a uniform code format; extracting a plurality of preset features from the semantically abstracted code data; and determining a detection result of the code data to be detected based on the plurality of preset features and a pre-trained detection model, the detection result indicating whether the code data to be detected is Webshell code data. The detection method enables effective and accurate detection of Webshell.

Description

Webshell code detection method and device and readable storage medium
Technical Field
The application relates to the technical field of network security, in particular to a Webshell code detection method and device and a readable storage medium.
Background
A Webshell is a code execution environment in the form of a web page file such as ASP (Active Server Pages), PHP (Hypertext Preprocessor), or JSP (Java Server Pages), and is also called a web page backdoor. After intruding into a website, an attacker usually mixes an ASP or PHP backdoor file with the normal web page files in the web server directory, and can then access the backdoor through a browser to obtain a command execution environment and thereby control the web server. A Webshell backdoor is highly concealed and can pass through the server's firewall. Using a Webshell generally leaves no records in the system log and only some data submission records in the website log, so an inexperienced administrator is unlikely to find traces of the intrusion.
Most existing Webshell detection technologies are rule-based. Rule-based methods have the drawbacks of high resource overhead, many restrictive conditions, a high miss rate, and ease of bypassing; for example, a rule-based detection method can easily be bypassed by adding obfuscating characters or symbols to the code. Therefore, existing Webshell detection technology cannot detect Webshell effectively and accurately.
Disclosure of Invention
The embodiment of the application aims to provide a detection method and device for a Webshell code and a readable storage medium, which are used for realizing effective and accurate detection of the Webshell.
In a first aspect, an embodiment of the present application provides a method for detecting a Webshell code, including: acquiring code data to be detected; preprocessing the code data to be detected to obtain preprocessed code data; performing semantic abstraction processing on the preprocessed code data to obtain code data subjected to the semantic abstraction processing; the code data after the semantic abstraction processing has a uniform code format; extracting a plurality of preset features from the code data subjected to the semantic abstraction; determining a detection result of the code data to be detected based on the plurality of preset features and a pre-trained detection model; and the detection result is used for representing whether the code data to be detected is Webshell code data or not.
Compared with the prior art, the embodiment of the application detects Webshell from the code data itself: the code data is preprocessed and then semantically abstracted to obtain code data with a uniform code format; a plurality of preset features are then extracted from the processed code data, and the detection result is determined using a pre-trained detection model and the plurality of preset features. Because this detection approach combines features of the code data with a pre-trained detection model, it offers flexible features, a low false alarm rate, low maintenance cost, and resistance to bypassing. Effective and accurate detection of Webshell can therefore be achieved.
As a possible implementation manner, the preprocessing the code data to be detected to obtain preprocessed code data includes: deleting useless information in the code data to be detected, the useless information comprising comment information and HTML (Hypertext Markup Language) code fragments; and extracting PHP code segments from the code data to be detected after the useless information is deleted, the PHP code segments being the preprocessed code data.
In the embodiment of the application, the useless information in the code data is deleted, the PHP code segment is extracted, the code data is effectively processed, and further more flexible feature extraction can be realized based on the processed code data.
As a possible implementation manner, the performing semantic abstraction processing on the preprocessed code data to obtain the code data after the semantic abstraction processing includes: replacing the self-defined character string content in the preprocessed code data with first preset content; replacing variable declaration content in the preprocessed code data with second preset content; and replacing the parameter value corresponding to the method calling content in the preprocessed code data with a preset parameter value.
In the embodiment of the application, custom string content is replaced with the first preset content, variable declaration content is replaced with the second preset content, and the parameter values of method calls are replaced with the preset parameter value. This achieves effective semantic abstraction of the code data, and more flexible feature extraction can then be performed on the processed code data.
As a possible implementation, the preset features include: a first preset feature extracted through a preset bag-of-words model.
In the embodiment of the application, effective extraction of some basic features can be realized through a preset bag-of-words model.
As a possible implementation, the preset features include: overall features of the code data after the semantic abstraction processing, the overall features including length, entropy, the number of special symbols, and the number of preset special symbols; sensitive keyword features contained in the code data after the semantic abstraction processing, the sensitive keyword features including the number of character string confusion type keywords, the number of character string deformation processing type keywords, the number of command execution type keywords, and the number of all sensitive keywords; and confusion deformation type data features in the code data after the semantic abstraction processing, the confusion deformation type data features including the number of pure number parameters, the number of words repeated multiple times, the number of sensitive keywords among the 3 most frequent words, whether the code is readable, the number of continuous symbols, and the number of the longest continuous symbols.
In the embodiment of the application, effective and accurate detection of Webshell can be realized by extracting the features.
As a possible implementation manner, the determining a detection result of the code data to be detected based on the plurality of preset features and a pre-trained detection model includes: normalizing the preset features to obtain normalized features; the characteristic value of the normalized characteristic is between 0 and 1; and inputting the normalized features into the pre-trained detection model to obtain a detection result of the code data to be detected.
In the embodiment of the application, the preset features are subjected to normalization processing and then input into the detection model trained in advance, so that the detection precision and the detection efficiency of the detection model are improved.
As a possible implementation manner, the detection method further includes: generating a front-end display result according to the code data subjected to the semantic abstraction, the preset features and the detection result; and displaying the front-end display result.
In the embodiment of the application, after the detection result is obtained, a corresponding display result can be generated and displayed according to data in the processing process, namely code data and preset features after semantic abstraction processing, and the detection result, so that flexible display of the front-end display result is realized.
As a possible implementation manner, the detection method further includes: acquiring a training data set; the training dataset comprises: normal code data and abnormal code data; the normal code data does not comprise a Webshell code, and the abnormal code data comprises the Webshell code; preprocessing the training data set to obtain a preprocessed training data set; performing semantic abstraction processing on the preprocessed training data set to obtain a training data set subjected to the semantic abstraction processing; the code data in the training data set after the semantic abstraction processing has a uniform code format; extracting a plurality of preset features from the data subjected to the semantic abstraction; and training the initial detection model according to the preset characteristics to obtain the trained detection model.
In the embodiment of the application, a series of processing is performed on the normal code data and the abnormal code data, corresponding features are extracted, and then the features are input into the detection model for training, so that the obtained trained detection model can realize effective and accurate detection of Webshell.
In a second aspect, an embodiment of the present application provides a detection apparatus for Webshell code, including functional modules for implementing the method of the first aspect and any one of the possible implementation manners of the first aspect.
In a third aspect, an embodiment of the present application provides a readable storage medium, where a computer program is stored on the readable storage medium, and when the computer program is executed by a computer, the method for detecting a Webshell code is performed as described in the first aspect and any one of the possible implementation manners of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a flowchart of a Webshell code detection method provided in an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a Webshell code detection device provided in an embodiment of the present application.
Reference numerals: 200, detection device for Webshell code; 210, obtaining module; 220, processing module.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The technical solution provided by the embodiments of the application can be applied to various scenarios that require Webshell code detection, for example network security detection systems and internet security maintenance systems. The technical solution mainly detects Webshell code of the PHP type, so the Webshell code mentioned in the subsequent embodiments should be understood as PHP Webshell code.
Based on the application scenario, the hardware operating environment corresponding to the technical solution may be hardware with data processing capability and data storage capability, for example: a server, a browser, a client, a server + browser, a server + client, and the like, which are not limited in the embodiments of the present application.
Based on the above application scenario, Fig. 1 shows a flowchart of a Webshell code detection method provided in an embodiment of the present application. The detection method includes:
Step 110: Acquire the code data to be detected.
Step 120: Preprocess the code data to be detected to obtain preprocessed code data.
Step 130: Perform semantic abstraction processing on the preprocessed code data to obtain the code data after the semantic abstraction processing. The code data after the semantic abstraction processing has a uniform code format.
Step 140: Extract a plurality of preset features from the code data after the semantic abstraction processing.
Step 150: Determine the detection result of the code data to be detected based on the plurality of preset features and the pre-trained detection model. The detection result indicates whether the code data to be detected is Webshell code data.
Compared with the prior art, the embodiment of the application detects Webshell from the code data itself: the code data is preprocessed and then semantically abstracted to obtain code data with a uniform code format; a plurality of preset features are then extracted from the processed code data, and the detection result is determined using a pre-trained detection model and the plurality of preset features. Because this detection approach combines features of the code data with a pre-trained detection model, it offers flexible features, a low false alarm rate, low maintenance cost, and resistance to bypassing. Effective and accurate detection of Webshell can therefore be achieved.
Next, a detailed embodiment of the detection method will be described.
In step 110, the acquired code data to be detected may be code data captured in the monitoring process of network security; or code data and the like which are uploaded by the user and need to be detected; the examples of the present application are not intended to be limiting.
In step 120, preprocessing the code data to be detected mainly consists of removing useless information from it and extracting the PHP code mentioned in the foregoing embodiments. Thus, as an alternative embodiment, step 120 includes: deleting useless information in the code data to be detected, the useless information comprising comment information and HTML (Hypertext Markup Language) code fragments; and extracting PHP code segments from the code data to be detected after the useless information is deleted, the PHP code segments being the preprocessed code data.
It can be understood that comment information, HTML code segments, and PHP code segments each have a fixed representation form, or information format, so each type of information can be identified by its preset field format or field form and processed accordingly.
For example: when deleting the useless information, determining the useless information from the code data to be detected by using a preset field format or field information corresponding to the useless information, and then deleting the useless information. For another example: when the PHP code segment is extracted, the PHP code segment is determined from the code data to be detected after the useless information is deleted by using the field format or the field information corresponding to the preset PHP code segment, and then the PHP code segment is extracted from the code data to be detected after the useless information is deleted.
Furthermore, the PHP code segment finally extracted from the code data to be detected after the useless information is deleted is the preprocessed code data, which corresponds to the detection of Webshell code in PHP code described in the foregoing embodiment.
In the embodiment of the application, the useless information in the code data is deleted, the PHP code segment is extracted, the code data is effectively processed, and further more flexible feature extraction can be realized based on the processed code data.
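The preprocessing of step 120 can be sketched as follows. This is a minimal illustration, not the implementation of the application: the regular expressions stand in for the "preset field formats" (which the application does not specify), the function name preprocess is hypothetical, and the sketch extracts the PHP segments first (which discards the surrounding HTML) and then strips comments, while leaving string literals untouched.

```python
import re

# Assumed "field format" of a PHP segment: <?php ... ?>, <?= ... ?>, or an
# unterminated block running to the end of the file.
PHP_BLOCK = re.compile(r"<\?(?:php\b|=)?(.*?)(?:\?>|\Z)", re.DOTALL | re.IGNORECASE)

# String literals are matched first so that "//" or "#" inside a string
# (for example a URL) is not mistaken for the start of a comment.
STRING_OR_COMMENT = re.compile(
    r'("(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\')'  # group 1: string literal, kept
    r"|(/\*.*?\*/|//[^\n]*|#[^\n]*)",             # group 2: comment, dropped
    re.DOTALL,
)


def strip_comments(php_code: str) -> str:
    """Remove /* */, // and # comments while keeping string literals."""
    return STRING_OR_COMMENT.sub(lambda m: m.group(1) or "", php_code)


def preprocess(raw: str) -> str:
    """Return only the PHP code of a page, without HTML and comments."""
    segments = [m.group(1) for m in PHP_BLOCK.finditer(raw)]
    code = strip_comments("\n".join(segments))
    # Drop blank lines and surrounding whitespace left behind by the removal.
    return "\n".join(line.strip() for line in code.splitlines() if line.strip())
```

A regex-based approach like this is only approximate (for example, a "?>" inside a string literal would end the segment early); a real tokenizer would be more robust.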
In step 130, semantic abstraction processing is performed on the preprocessed code data to obtain code data with a uniform code format. It can be understood that normal code contains many user-defined variable names, character strings, and other content, and that these user-defined variable names and user-defined method names are usually meaningless, so the code data needs to be abstracted and converted into a standard, unified format.
As an alternative embodiment, step 130 includes: replacing the self-defined character string content in the preprocessed code data with first preset content; replacing variable declaration content in the preprocessed code data with second preset content; and replacing the parameter value corresponding to the method calling content in the preprocessed code data with a preset parameter value.
Custom string content is mostly meaningless, so it can be replaced with a fixed value (i.e., the first preset content). The first preset content may be string.
Variable declaration content, such as $name, is also generally meaningless, so a variable declaration may be replaced with a fixed value (i.e., the second preset content). The second preset content may be user_define_variable.
For method call content, the original method name is kept and is not abstracted. The parameter values passed in the method call (i.e., the parameter values corresponding to the method call content), such as those in Add(1,2), may be replaced with a fixed value (i.e., the preset parameter value); for example, the parameters 1,2 are replaced with a fixed value, which may be user_define_param.
In the embodiment of the application, custom string content is replaced with the first preset content, variable declaration content is replaced with the second preset content, and the parameter values of method calls are replaced with the preset parameter value. This achieves effective semantic abstraction of the code data, and more flexible feature extraction can then be performed on the processed code data.
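The three replacements of step 130 can be illustrated with the following sketch. The placeholder values string, user_define_variable, and user_define_param come from the description above; everything else (the regular expressions, the function name abstract_code, the keyword exclusion list, and the choice to rewrite only innermost argument lists so that nested method names survive) is an assumption made for illustration.

```python
import re

STRING = re.compile(r'"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'')
VARIABLE = re.compile(r"\$[A-Za-z_]\w*")
# A name followed by an argument list that contains no nested parentheses.
CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(([^()]*)\)")

# Control-flow keywords look like calls to the regex but are not method calls.
CONTROL_KEYWORDS = {"if", "elseif", "while", "for", "foreach", "switch", "catch"}


def _replace_arguments(match: re.Match) -> str:
    name, args = match.group(1), match.group(2)
    if name.lower() in CONTROL_KEYWORDS:
        return match.group(0)            # leave control structures unchanged
    if not args.strip():
        return f"{name}()"               # no parameters to abstract
    return f"{name}(user_define_param)"  # keep the method name, drop the values


def abstract_code(code: str) -> str:
    """Apply the three replacements in order: strings, variables, parameters."""
    code = STRING.sub("string", code)
    code = VARIABLE.sub("user_define_variable", code)
    return CALL.sub(_replace_arguments, code)
```

With this sketch, Add(1,2) becomes Add(user_define_param), and a nested call such as eval(base64_decode($_POST["x"])) keeps both method names, which is what the later keyword features rely on. Note that the sketch also abstracts PHP superglobals such as $_POST; whether they should be preserved is not stated in the application.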
After the semantic abstraction process is implemented in step 130, a plurality of preset features are extracted from the semantically abstracted code data in step 140. It should be noted that the features are usually expressed in the form of feature vectors, and therefore, in the description of the following embodiments, the features should be understood as feature vectors.
In the embodiment of the present application, the preset features may include two parts, one part is features extracted by using a conventional feature extraction method; and another part is a specific feature.
As an optional implementation manner, the first partial feature is a feature extracted by a preset bag-of-words model. The bag-of-words model is a common feature extraction model in NLP (Natural Language Processing); m sets of multidimensional vectors can be extracted through the model, and these m sets of multidimensional vectors can be used as the features of the first part.
In the embodiment of the application, effective extraction of some basic features can be realized through a preset bag-of-words model.
Of course, in practical applications, the extraction of the first partial feature may also be implemented by using other implementable feature models, which are not limited in the embodiments of the present application.
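As an illustration of the first partial feature, the following sketch builds a small bag-of-words model over semantically abstracted code. The tokenization rule, the vocabulary size limit, and the function names are assumptions; the application does not specify how its preset bag-of-words model is configured.

```python
import re
from collections import Counter

# Assumed tokenization: identifiers and method names are the "words".
TOKEN = re.compile(r"[A-Za-z_]\w*")


def tokenize(code: str) -> list[str]:
    return TOKEN.findall(code)


def build_vocabulary(samples: list[str], max_size: int = 1000) -> dict[str, int]:
    """Map the most frequent tokens of the training samples to vector indices."""
    counts = Counter(tok for sample in samples for tok in tokenize(sample))
    # Sort by descending count, then alphabetically, so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_size]
    return {tok: index for index, (tok, _) in enumerate(ranked)}


def vectorize(code: str, vocabulary: dict[str, int]) -> list[int]:
    """Count how often each vocabulary token occurs in one code sample."""
    vector = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(code)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector
```

Vectorizing each of the m samples with the same vocabulary yields the m multidimensional vectors mentioned above.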
For the second partial features, as an optional implementation, the number of features is 14 in total, and all the finally extracted features may constitute m 14-dimensional feature vectors as the second partial features.
In the embodiments of the present application, the second partial feature includes three types:
The first class of features, the overall features of the code, includes: length, entropy, the number of special symbols, and the number of preset special symbols. With reference to the foregoing embodiment, the length is the length of the code after the semantic abstraction processing, and the entropy is the entropy of the code after the semantic abstraction processing. The number of special symbols is the number of all special characters. The number of preset special symbols is the number of designated special symbols (e.g., $).
The second class of features is the sensitive keyword features. It can be understood that Webshell code typically uses techniques such as string confusion and transformation to hide command execution statements, so these features need to be extracted from the code. Specifically, the sensitive keyword features include: the number of character string confusion type keywords, the number of character string deformation processing type keywords, the number of command execution type keywords, and the number of all sensitive keywords.
The third class of features, the confusion deformation type data features, includes: the number of pure number parameters, the number of words repeated multiple times, the number of sensitive keywords among the 3 most frequent words, whether the code is readable, the number of continuous symbols, and the number of the longest continuous symbols.
Whether the code is readable can be represented by 0 or 1: the feature value is 1 if the code is readable and 0 otherwise.
In addition, most of the above features are counts, and their values can be obtained by statistics. For example, the number of character string confusion type keywords can be determined by counting the occurrences of such keywords in the code data.
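The overall features and the continuous-symbol features can be computed with simple statistics, as in the sketch below. Several definitions here are assumptions, since the application does not fix them: a "special symbol" is taken to be any character that is not alphanumeric, whitespace, or an underscore; the entropy is the Shannon entropy over characters; and a "continuous symbol" run is two or more adjacent special symbols.

```python
import math
import re
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in Counter(text).values())


def overall_features(code: str, preset_symbols: str = "$") -> dict[str, float]:
    """Length, entropy, special-symbol count, and preset-special-symbol count."""
    special = [ch for ch in code
               if not ch.isalnum() and not ch.isspace() and ch != "_"]
    return {
        "length": len(code),
        "entropy": shannon_entropy(code),
        "special_symbols": len(special),
        "preset_special_symbols": sum(ch in preset_symbols for ch in special),
    }


# Two or more adjacent characters that are neither word characters nor whitespace.
SYMBOL_RUN = re.compile(r"[^\w\s]{2,}")


def symbol_run_features(code: str) -> dict[str, int]:
    """Number of continuous-symbol runs and the length of the longest run."""
    runs = SYMBOL_RUN.findall(code)
    return {
        "continuous_symbol_runs": len(runs),
        "longest_symbol_run": max((len(r) for r in runs), default=0),
    }
```

Heavily obfuscated code tends to have higher entropy and longer symbol runs than ordinary application code, which is why such statistics are useful inputs to the detection model.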
And, to realize statistics for each of the above features, it is necessary to determine each object first, for example: further statistics can be performed only by determining the character string confusion type keywords in the code data. As an optional implementation manner, when determining each object, the determination may be performed by using a preset field format or field form of each object, which may specifically refer to an implementation manner of determining the useless information and the PHP code fragment in the foregoing embodiment, and a description is not repeated here.
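The counting of sensitive keywords can be sketched as follows. The application does not enumerate its keyword lists, so the three lists below are hypothetical examples built from well-known PHP functions, and the dictionary keys are illustrative names for the character string confusion, character string deformation processing, and command execution categories.

```python
import re
from collections import Counter

# Hypothetical keyword lists; the actual preset lists are not given in the source.
SENSITIVE_KEYWORDS = {
    "string_confusion": {"base64_decode", "str_rot13", "gzinflate", "gzuncompress"},
    "string_deformation": {"chr", "strrev", "str_replace", "substr"},
    "command_execution": {"eval", "assert", "system", "exec", "shell_exec", "passthru"},
}
ALL_SENSITIVE = set().union(*SENSITIVE_KEYWORDS.values())

WORD = re.compile(r"[A-Za-z_]\w*")


def keyword_features(code: str) -> dict[str, int]:
    """Count sensitive keywords per category, in total, and among the top 3 words."""
    counts = Counter(word.lower() for word in WORD.findall(code))
    features = {
        category: sum(counts[keyword] for keyword in keywords)
        for category, keywords in SENSITIVE_KEYWORDS.items()
    }
    features["all_sensitive"] = sum(features.values())
    top3 = [word for word, _ in counts.most_common(3)]
    features["sensitive_in_top3"] = sum(word in ALL_SENSITIVE for word in top3)
    return features
```

Because the semantic abstraction keeps method names unchanged, these keywords remain countable even after variables, strings, and parameter values have been replaced.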
Based on the two parts of features, the two parts of features are extracted according to respective corresponding feature extraction modes, and then the two parts of features are combined to obtain the final preset features.
After the plurality of preset features are obtained in step 140, in step 150, a detection result of the code data to be detected is determined based on the plurality of preset features and a pre-trained detection model.
For the pre-trained detection model, it may be a model based on a supervised machine learning algorithm, including but not limited to: random forest algorithm, support vector machine algorithm, logistic regression algorithm, etc.
When selecting a particular algorithm, the choice may be made by evaluating the effect of each candidate algorithm, for example by selecting the best-performing algorithm through cross validation.
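Selection by cross validation can be sketched in plain Python as follows. The two candidate "algorithms" are deliberately trivial stand-ins (a majority-class baseline and a one-feature threshold rule) so that the example stays self-contained; in practice the candidates would be models such as random forest, support vector machine, or logistic regression from a machine learning library. All function names are hypothetical, and the sketch assumes k does not exceed the number of samples.

```python
from statistics import mean


def k_fold_indices(n: int, k: int):
    """Yield (train_indices, test_indices) for k contiguous folds over n samples."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_val_accuracy(fit, X, y, k: int = 5) -> float:
    """Mean accuracy of `fit` over k folds; `fit` returns a predict function."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(sum(predict(X[i]) == y[i] for i in test) / len(test))
    return mean(scores)


def fit_majority(X, y):
    """Toy candidate: always predict the most common training label."""
    label = max(set(y), key=y.count)
    return lambda x: label


def fit_threshold(X, y):
    """Toy candidate: threshold feature 0 at the midpoint of the class means."""
    pos = [x[0] for x, t in zip(X, y) if t == 1]
    neg = [x[0] for x, t in zip(X, y) if t == 0]
    cut = (mean(pos) + mean(neg)) / 2
    return lambda x: int(x[0] > cut)


def select_model(candidates: dict, X, y, k: int = 5):
    """Return the name of the best candidate and all cross-validation scores."""
    scores = {name: cross_val_accuracy(fit, X, y, k) for name, fit in candidates.items()}
    return max(scores, key=scores.get), scores
```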
As an alternative embodiment, the training process of the detection model includes: acquiring a training data set; the training dataset includes: normal code data and abnormal code data; the normal code data does not include Webshell codes, and the abnormal code data includes Webshell codes; preprocessing a training data set to obtain a preprocessed training data set; performing semantic abstraction processing on the preprocessed training data set to obtain a training data set subjected to the semantic abstraction processing; the code data in the training data set after the semantic abstraction processing has a uniform code format; extracting a plurality of preset features from the data subjected to semantic abstraction processing; and training the initial detection model according to a plurality of preset characteristics to obtain the trained detection model.
The normal code data and the abnormal code data can both be collected through GitHub (a code hosting platform): the abnormal code data can be malicious Webshell source code collected through GitHub, and the normal code data can be code from open source projects collected through GitHub.
Based on the normal code data and the abnormal code data, a training data set may be composed. The subsequent preprocessing, semantic abstraction, and preset feature extraction processes based on the training data set are the same as those in the previous embodiments from step 120 to step 140, and will not be described again.
Furthermore, a plurality of preset features are input into the initial detection model for training, and the trained detection model can be obtained.
In the training process of the detection model, some embodiments may be adopted to improve the accuracy of the detection model. For example, a number of training iterations is preset, and training stops once that number is reached. For another example, after training, the model is tested to measure its detection accuracy, and the detection model is adjusted or further trained based on that accuracy.
In the embodiment of the application, a series of processing is performed on the normal code data and the abnormal code data, corresponding features are extracted, and then the features are input into the detection model for training, so that the obtained trained detection model can realize effective and accurate detection of Webshell.
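As an illustration of the training step, the sketch below trains a logistic regression classifier (one of the supervised algorithms named above) with batch gradient descent for a preset number of iterations. It is a teaching sketch in plain Python, not the model of the application; the hyperparameters, the label convention (1 for abnormal code data, 0 for normal code data), and the function names are assumptions.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, epochs: int = 2000, lr: float = 1.0):
    """Fit weights and bias by batch gradient descent on the log loss.

    X is a list of feature vectors (ideally normalized to [0, 1]);
    y holds 1 for abnormal (Webshell) samples and 0 for normal samples.
    Training stops after the preset number of epochs.
    """
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, target in zip(X, y):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - target
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict(model, x) -> int:
    """Return 1 (Webshell) if the predicted probability is at least 0.5."""
    weights, bias = model
    return int(sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5)
```

After training, the model can be evaluated on held-out samples, and the result used to decide whether to adjust or continue training.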
Based on the above description of the training process of the detection model, in step 150, a plurality of preset features are input into the trained detection model, so as to obtain the detection result output by the detection model.
In the embodiment of the application, the features can be normalized so that the detection model learns them better. Thus, as an alternative embodiment, step 150 includes: normalizing the preset features to obtain normalized features, the feature value of each normalized feature being between 0 and 1; and inputting the normalized features into the pre-trained detection model to obtain the detection result of the code data to be detected.
The normalization process can be implemented by using a normalization process technology mature in the field, and is not described in detail in the embodiments of the present application.
In the embodiment of the application, the preset features are subjected to normalization processing and then input into the detection model trained in advance, so that the detection precision and the detection efficiency of the detection model are improved.
Similarly, when the detection model is trained, the preset features input into the initial detection model may also be normalized features.
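One common way to obtain feature values between 0 and 1 is min-max scaling, sketched below. The application does not say which normalization technique it uses, so min-max scaling is an assumption here, as are the function names. The minimum and maximum of each feature are learned from the training data and reused at detection time, and values outside the training range are clipped so that the result stays within [0, 1].

```python
def fit_min_max(rows: list[list[float]]) -> tuple[list[float], list[float]]:
    """Learn the per-feature minimum and maximum from the training features."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def normalize(row: list[float], lows: list[float], highs: list[float]) -> list[float]:
    """Scale one feature vector into [0, 1] using the learned ranges."""
    scaled = []
    for value, low, high in zip(row, lows, highs):
        if high == low:
            scaled.append(0.0)  # constant feature carries no information
        else:
            scaled.append(min(1.0, max(0.0, (value - low) / (high - low))))
    return scaled
```

Using the same learned ranges for training and detection keeps the two sets of features on the same scale.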
The detection result obtained in step 150 may represent whether the code to be detected is Webshell code data, and if the code to be detected is Webshell code data, the code to be detected is an abnormal code; and if the code data is not Webshell code data, the code to be detected is normal code.
In order to embody the interpretability of the detection model, in the embodiment of the present application, corresponding front-end feedback can also be performed. In the front-end feedback, as an optional implementation manner, the server sends the corresponding display content to the front end (e.g., a browser and a client), and then the front end performs display based on the display content. As another optional implementation manner, if the hardware operating environment corresponding to the detection method is a front end, the front end may perform presentation based on the corresponding presentation content.
Therefore, as an optional implementation manner, the detection method further includes: generating a front-end display result according to the code data subjected to the semantic abstraction, the plurality of preset features and the detection result; and displaying the front-end display result.
As an optional implementation, the front-end display result includes: the code data after semantic abstraction, specified features among the plurality of preset features, and the detection result.
The specified features among the plurality of preset features may include: the length; the entropy; the number of special symbols; the number of character string confusion type keywords; the number of character string deformation processing type keywords; the number of command execution type keywords; code readability; and the number of consecutive symbols.
Of course, the specified features may also be other preset features, which is not limited in the embodiment of the present application.
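As a non-limiting illustration, a few of the features listed above can be computed as sketched below. The application does not define how each feature is calculated, so the use of Shannon entropy over characters, the definition of a special symbol as any character that is neither a word character nor whitespace, and the keyword list `COMMAND_EXEC_KEYWORDS` are all assumptions of this sketch.

```python
import math
import re
from collections import Counter
from typing import Dict

# Hypothetical keyword list; the application does not enumerate the keywords.
COMMAND_EXEC_KEYWORDS = ("eval", "system", "exec", "passthru", "shell_exec", "assert")


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for empty input."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def extract_features(code: str) -> Dict[str, float]:
    """Compute a small subset of the preset features for one code sample."""
    symbol_runs = re.findall(r"[^\w\s]+", code)   # runs of consecutive special symbols
    words = re.findall(r"[A-Za-z_]\w*", code)     # identifier-like words
    return {
        "length": len(code),
        "entropy": shannon_entropy(code),
        "special_symbol_count": sum(len(run) for run in symbol_runs),
        "longest_symbol_run": max((len(run) for run in symbol_runs), default=0),
        "command_exec_keyword_count": sum(w.lower() in COMMAND_EXEC_KEYWORDS for w in words),
    }


print(extract_features("eval(ARG);"))
```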
Based on the content to be displayed, the front-end display result is generated in a preset display form or a preset display mode and is then displayed at the front end.
In the embodiment of the application, after the detection result is obtained, a corresponding display result can be generated and displayed from the data produced during processing, namely the code data after semantic abstraction processing and the preset features, together with the detection result, so that the front-end display result can be presented flexibly.
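As a non-limiting illustration, the front-end display result could be assembled as a JSON-serializable structure that a server sends to a browser or client. The field names, the feature names in `SPECIFIED_FEATURES`, and the function `build_display_result` are hypothetical; the application only names the three kinds of content to be displayed.

```python
import json
from typing import Dict, Union

# Hypothetical names for the features selected for display.
SPECIFIED_FEATURES = (
    "length",
    "entropy",
    "special_symbol_count",
    "string_confusion_keyword_count",
    "string_deformation_keyword_count",
    "command_execution_keyword_count",
    "readable",
    "consecutive_symbol_count",
)


def build_display_result(
    abstracted_code: str,
    features: Dict[str, Union[int, float, bool]],
    is_webshell: bool,
) -> dict:
    """Combine abstracted code, the specified features, and the detection result."""
    shown = {name: features[name] for name in SPECIFIED_FEATURES if name in features}
    return {
        "abstracted_code": abstracted_code,
        "features": shown,
        "detection_result": "abnormal" if is_webshell else "normal",
    }


payload = build_display_result(
    "eval(ARG);",
    {"length": 10, "entropy": 2.9, "internal_score": 0.97},  # internal_score is not displayed
    True,
)
print(json.dumps(payload, indent=2))
```

Filtering the features through a fixed list keeps the display limited to the specified features while the model remains free to use additional ones.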
Based on the same inventive concept, and referring to fig. 2, an embodiment of the present application further provides a Webshell code detection apparatus 200, including: an acquisition module 210 and a processing module 220.
The acquisition module 210 is configured to acquire code data to be detected. The processing module 220 is configured to: preprocess the code data to be detected to obtain preprocessed code data; perform semantic abstraction processing on the preprocessed code data to obtain code data subjected to the semantic abstraction processing, the code data after the semantic abstraction processing having a uniform code format; extract a plurality of preset features from the code data subjected to the semantic abstraction processing; and determine a detection result of the code data to be detected based on the plurality of preset features and a pre-trained detection model, the detection result being used to represent whether the code data to be detected is Webshell code data.
In this embodiment of the application, the processing module 220 is specifically configured to: delete useless information from the code data to be detected, the useless information comprising comment information and HTML code fragments; and extract the PHP code segments from the code data to be detected after the useless information is deleted, the PHP code segments being the preprocessed code data.
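As a non-limiting illustration, the preprocessing performed by the processing module can be sketched with regular expressions. This is a simplified approximation rather than a PHP parser: it assumes PHP code is delimited by `<?php` or `<?=` and `?>`, it does not handle heredoc strings, and a `?>` sequence inside a string literal would end a segment early. The names `PHP_BLOCK`, `TOKEN`, `strip_comments` and `preprocess` are hypothetical.

```python
import re

# A PHP segment runs from "<?php" or "<?=" to "?>" or to the end of the file.
PHP_BLOCK = re.compile(r"<\?(?:php\b|=)(.*?)(?:\?>|\Z)", re.DOTALL | re.IGNORECASE)

# Match string literals and comments in one pass so that comment markers
# inside strings (for example "http://...") are left untouched.
TOKEN = re.compile(
    r"""
    (?P<string>'(?:\\.|[^'\\])*'|"(?:\\.|[^"\\])*")   # quoted string literal
    |(?P<comment>/\*.*?\*/|(?://|\#)[^\n]*)           # block or line comment
    """,
    re.DOTALL | re.VERBOSE,
)


def strip_comments(php: str) -> str:
    """Remove comments while keeping string literals unchanged."""
    return TOKEN.sub(lambda m: m.group(0) if m.group("string") else "", php)


def preprocess(source: str) -> str:
    """Drop HTML outside PHP segments, then strip comments inside them."""
    segments = (strip_comments(m.group(1)).strip() for m in PHP_BLOCK.finditer(source))
    return "\n".join(segment for segment in segments if segment)


sample = '<html><!-- page --><?php // hidden\n$u = "http://x.test"; /* note */ eval($_POST["c"]); ?></html>'
print(preprocess(sample))
```

Extracting the PHP segments first removes the HTML fragments implicitly, so only comment removal has to be aware of string literals.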
In this embodiment of the application, the processing module 220 is further specifically configured to: replace the self-defined character string content in the preprocessed code data with first preset content; replace the variable declaration content in the preprocessed code data with second preset content; and replace the parameter value corresponding to the method calling content in the preprocessed code data with a preset parameter value.
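As a non-limiting illustration, the three replacements can be approximated with regular expressions. The placeholders `'STR'`, `$VAR` and `ARG` stand in for the first preset content, the second preset content and the preset parameter value; their concrete values are not given in the application and are assumptions here, as are the keyword exclusion list and the decision to rewrite only innermost call argument lists.

```python
import re

STRING = re.compile(r"'(?:\\.|[^'\\])*'" "|" r'"(?:\\.|[^"\\])*"')
VARIABLE = re.compile(r"\$[A-Za-z_]\w*")
# A name followed by a parenthesized argument list that contains no nested parentheses.
CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(([^()]+)\)")
# Control-flow keywords that look like calls but are not method calls (partial list).
KEYWORDS = {"if", "elseif", "while", "for", "foreach", "switch", "function", "catch"}


def abstract(php: str) -> str:
    """Rewrite strings, variables and call arguments into uniform placeholders."""
    php = STRING.sub("'STR'", php)      # first preset content (assumed)
    php = VARIABLE.sub("$VAR", php)     # second preset content (assumed)

    def replace_args(match: re.Match) -> str:
        name = match.group(1)
        if name.lower() in KEYWORDS:
            return match.group(0)
        return f"{name}(ARG)"           # preset parameter value (assumed)

    return CALL.sub(replace_args, php)


print(abstract('$k = "secret"; $r = system($_GET[\'cmd\']); if ($r) { echo md5($k, 16); }'))
```

Two samples that differ only in identifier names and literal values map to the same abstracted text, which is one way to obtain the uniform code format mentioned above.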
In this embodiment of the application, the processing module 220 is further specifically configured to: normalize the preset features to obtain normalized features, the feature values of the normalized features being between 0 and 1; and input the normalized features into the pre-trained detection model to obtain a detection result of the code data to be detected.
In this embodiment of the present application, the processing module 220 is further configured to: generate a front-end display result according to the code data subjected to the semantic abstraction processing, the preset features and the detection result; and display the front-end display result.
In this embodiment of the present application, the acquisition module 210 is further configured to acquire a training data set, the training data set comprising normal code data and abnormal code data, where the normal code data does not include Webshell code and the abnormal code data includes Webshell code. The processing module 220 is further configured to: preprocess the training data set to obtain a preprocessed training data set; perform semantic abstraction processing on the preprocessed training data set to obtain a training data set subjected to the semantic abstraction processing, in which the code data has a uniform code format; extract a plurality of preset features from the data subjected to the semantic abstraction processing; and train the initial detection model according to the preset features to obtain the trained detection model.
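As a non-limiting illustration, the final training step can be sketched with a small logistic regression fitted by batch gradient descent. This passage does not name the type of detection model, so logistic regression is purely a stand-in, and the feature rows below are invented values that are assumed to be already preprocessed, abstracted, extracted and normalized. The names `train` and `predict` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(rows: Sequence[Sequence[float]], labels: Sequence[int],
          epochs: int = 2000, lr: float = 0.5) -> Model:
    """Fit a stand-in logistic regression; label 1 = abnormal (Webshell), 0 = normal."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        # Average the gradients over the data set and take one descent step.
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict(model: Model, x: Sequence[float]) -> bool:
    """Return True when the sample is classified as Webshell code data."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5


# Invented normalized feature rows: [entropy, command-execution keyword count]
normal = [[0.10, 0.00], [0.20, 0.10], [0.15, 0.05]]
abnormal = [[0.90, 0.80], [0.80, 1.00], [0.95, 0.90]]
model = train(normal + abnormal, [0, 0, 0, 1, 1, 1])
print(predict(model, [0.9, 0.9]), predict(model, [0.1, 0.05]))
```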
The Webshell code detection apparatus 200 corresponds to a Webshell code detection method, and each functional module corresponds to each step of the detection method, so that the implementation of each functional module refers to the description in the foregoing embodiments, and the description is not repeated here.
Based on the same inventive concept, embodiments of the present application further provide a readable storage medium, where a computer program is stored on the readable storage medium, and when the computer program is executed by a computer, the detection method of the Webshell code described in the foregoing embodiments is executed.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A detection method of Webshell codes is characterized by comprising the following steps:
acquiring code data to be detected;
preprocessing the code data to be detected to obtain preprocessed code data;
performing semantic abstraction processing on the preprocessed code data to obtain code data subjected to the semantic abstraction processing; the code data after the semantic abstraction processing has a uniform code format;
extracting a plurality of preset features from the code data subjected to the semantic abstraction;
determining a detection result of the code data to be detected based on the plurality of preset features and a pre-trained detection model; the detection result is used for representing whether the code data to be detected is Webshell code data or not;
wherein the performing semantic abstraction processing on the preprocessed code data to obtain code data subjected to the semantic abstraction processing includes:
replacing the self-defined character string content in the preprocessed code data with first preset content;
replacing variable declaration content in the preprocessed code data with second preset content;
and replacing the parameter value corresponding to the method calling content in the preprocessed code data with a preset parameter value.
2. The detection method according to claim 1, wherein the preprocessing the code data to be detected to obtain preprocessed code data includes:
deleting useless information in the code data to be detected; the useless information comprises comment information and HTML code fragments;
extracting PHP code segments in the code data to be detected after the useless information is deleted; the PHP code segment is the preprocessed code data.
3. The detection method according to claim 1, wherein the preset features comprise:
a first preset feature extracted through a preset bag-of-words model.
4. The detection method according to claim 1 or 3, wherein the preset features comprise:
overall features of the code data after the semantic abstraction processing; the overall features include: length, entropy, number of special symbols, and number of preset special symbols;
sensitive keyword features contained in the code data after the semantic abstraction processing; the sensitive keyword features include: the number of character string confusion type keywords, the number of character string deformation processing type keywords, the number of command execution type keywords and the number of all sensitive keywords;
confusion and deformation type data features in the code data after the semantic abstraction processing; the confusion and deformation type data features include: the number of pure number parameters, the number of words repeated multiple times, the number of sensitive keywords among the 3 most frequently occurring words, whether the code is readable, the number of consecutive symbols and the length of the longest run of consecutive symbols.
5. The detection method according to claim 1, wherein the determining the detection result of the code data to be detected based on the plurality of preset features and a pre-trained detection model comprises:
normalizing the preset features to obtain normalized features; the characteristic value of the normalized characteristic is between 0 and 1;
and inputting the normalized features into the pre-trained detection model to obtain a detection result of the code data to be detected.
6. The detection method according to claim 1, further comprising:
generating a front-end display result according to the code data subjected to the semantic abstraction, the preset features and the detection result;
and displaying the front-end display result.
7. The detection method according to claim 1, further comprising:
acquiring a training data set; the training dataset comprises: normal code data and abnormal code data; the normal code data does not comprise a Webshell code, and the abnormal code data comprises the Webshell code;
preprocessing the training data set to obtain a preprocessed training data set;
performing semantic abstraction processing on the preprocessed training data set to obtain a training data set subjected to the semantic abstraction processing; the code data in the training data set after the semantic abstraction processing has a uniform code format;
extracting a plurality of preset features from the data subjected to the semantic abstraction;
and training the initial detection model according to the preset characteristics to obtain the trained detection model.
8. An apparatus for detecting Webshell code, comprising:
an acquisition module configured to acquire code data to be detected;
a processing module configured to: preprocess the code data to be detected to obtain preprocessed code data; perform semantic abstraction processing on the preprocessed code data to obtain code data subjected to the semantic abstraction processing, the code data after the semantic abstraction processing having a uniform code format; extract a plurality of preset features from the code data subjected to the semantic abstraction processing; and determine a detection result of the code data to be detected based on the plurality of preset features and a pre-trained detection model, the detection result being used for representing whether the code data to be detected is Webshell code data;
wherein the processing module is specifically configured to: replace the self-defined character string content in the preprocessed code data with first preset content; replace variable declaration content in the preprocessed code data with second preset content; and replace the parameter value corresponding to the method calling content in the preprocessed code data with a preset parameter value.
9. The detection apparatus according to claim 8, wherein the processing module is further configured to: delete useless information in the code data to be detected, the useless information comprising comment information and HTML code fragments; and extract PHP code segments in the code data to be detected after the useless information is deleted, the PHP code segments being the preprocessed code data.
10. A readable storage medium, having stored thereon a computer program which, when executed by a computer, performs the Webshell code detection method of any one of claims 1 to 7.
CN202111382367.0A 2021-11-22 2021-11-22 Webshell code detection method and device and readable storage medium Pending CN113821448A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111382367.0A CN113821448A (en) 2021-11-22 2021-11-22 Webshell code detection method and device and readable storage medium

Publications (1)

Publication Number Publication Date
CN113821448A true CN113821448A (en) 2021-12-21

Family

ID=78917952

Country Status (1)

Country Link
CN (1) CN113821448A (en)

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20211221)