CN113821448A - Webshell code detection method and device and readable storage medium

Info

Publication number: CN113821448A
Application number: CN202111382367.0A
Authority: CN
Inventors: 徐钟豪; 陈伟; 谢忱; 刘伟
Original assignee: Shanghai Douxiang Information Technology Co ltd
Current assignee: Shanghai Douxiang Information Technology Co ltd
Priority date: 2021-11-22
Filing date: 2021-11-22
Publication date: 2021-12-21

Abstract

The application provides a Webshell code detection method and device and a readable storage medium. The detection method comprises: acquiring code data to be detected; preprocessing the code data to be detected to obtain preprocessed code data; performing semantic abstraction processing on the preprocessed code data to obtain semantically abstracted code data, which has a uniform code format; extracting a plurality of preset features from the semantically abstracted code data; and determining a detection result of the code data to be detected based on the plurality of preset features and a pre-trained detection model, the detection result indicating whether the code data to be detected is Webshell code data. The detection method enables effective and accurate detection of Webshell code.

Description

Webshell code detection method and device and readable storage medium

Technical Field

The application relates to the technical field of network security, in particular to a Webshell code detection method and device and a readable storage medium.

Background

A Webshell is a command execution environment in the form of a web page file, such as an ASP (Active Server Pages), PHP (Hypertext Preprocessor), or JSP (Java Server Pages) file, and is also called a web page backdoor. After intruding into a website, an attacker usually mixes an ASP or PHP backdoor file with the normal web page files in the website server directory, and then accesses the backdoor through a browser to obtain a command execution environment and thereby control the website server. A Webshell backdoor is highly concealed and can pass through the server's firewall. Using a Webshell generally leaves no records in the system log and only some data submission records in the website log, so an inexperienced administrator is unlikely to find traces of the intrusion.

Most existing Webshell detection technologies are rule-based. Such methods suffer from high resource overhead, many restrictive conditions, a high false-negative rate, and ease of bypassing; for example, a rule-based detection method can easily be bypassed by adding obfuscating characters or symbols to the code. Therefore, existing Webshell detection technologies cannot detect Webshells effectively and accurately.

Disclosure of Invention

The embodiment of the application aims to provide a detection method and device for a Webshell code and a readable storage medium, which are used for realizing effective and accurate detection of the Webshell.

In a first aspect, an embodiment of the present application provides a method for detecting a Webshell code, including: acquiring code data to be detected; preprocessing the code data to be detected to obtain preprocessed code data; performing semantic abstraction processing on the preprocessed code data to obtain code data subjected to the semantic abstraction processing; the code data after the semantic abstraction processing has a uniform code format; extracting a plurality of preset features from the code data subjected to the semantic abstraction; determining a detection result of the code data to be detected based on the plurality of preset features and a pre-trained detection model; and the detection result is used for representing whether the code data to be detected is Webshell code data or not.

In the embodiment of the application, compared with the prior art, the Webshell is detected by using the code data, the code data is preprocessed and then subjected to semantic abstraction processing, and the code data with a uniform code format is obtained; and then, extracting a plurality of preset features from the processed code data, and determining a detection result by using a pre-trained detection model and the plurality of preset features. In the detection mode, the characteristics of the code data are combined with a detection model trained in advance, and the method has the advantages of flexible characteristics, low false alarm rate, low maintenance cost, difficulty in bypassing and the like. Therefore, effective and accurate detection of Webshell can be finally realized.

As a possible implementation manner, preprocessing the code data to be detected to obtain preprocessed code data includes: deleting useless information from the code data to be detected, the useless information comprising comment information and HTML (Hypertext Markup Language) code fragments; and extracting the PHP code segments from the code data to be detected after the useless information is deleted, the PHP code segments being the preprocessed code data.

In the embodiment of the application, the useless information in the code data is deleted, the PHP code segment is extracted, the code data is effectively processed, and further more flexible feature extraction can be realized based on the processed code data.

As a possible implementation manner, the performing semantic abstraction processing on the preprocessed code data to obtain the code data after the semantic abstraction processing includes: replacing the self-defined character string content in the preprocessed code data with first preset content; replacing variable declaration content in the preprocessed code data with second preset content; and replacing the parameter value corresponding to the method calling content in the preprocessed code data with a preset parameter value.

In the embodiment of the application, the user-defined character string content is replaced with the first preset content, the variable declaration content is replaced with the second preset content, and the parameter values of method calls are replaced with the preset parameter value. This achieves effective semantic abstraction of the code data, so that more flexible feature extraction can be performed on the processed code data.

As a possible implementation, the preset features include: a first preset feature extracted by a preset bag-of-words model.

In the embodiment of the application, effective extraction of some basic features can be realized through a preset bag-of-words model.

As a possible implementation, the preset features include: overall features of the semantically abstracted code data, the overall features including length, entropy, the number of special symbols, and the number of preset special symbols; sensitive keyword features contained in the semantically abstracted code data, the sensitive keyword features including the number of string obfuscation keywords, the number of string transformation keywords, the number of command execution keywords, and the total number of sensitive keywords; and obfuscation and transformation features of the semantically abstracted code data, including the number of purely numeric parameters, the number of words repeated multiple times, the number of sensitive keywords among the 3 most frequent words, whether the code is readable, the number of consecutive symbols, and the length of the longest run of consecutive symbols.

In the embodiment of the application, effective and accurate detection of Webshell can be realized by extracting the features.

As a possible implementation manner, the determining a detection result of the code data to be detected based on the plurality of preset features and a pre-trained detection model includes: normalizing the preset features to obtain normalized features; the characteristic value of the normalized characteristic is between 0 and 1; and inputting the normalized features into the pre-trained detection model to obtain a detection result of the code data to be detected.

In the embodiment of the application, the preset features are subjected to normalization processing and then input into the detection model trained in advance, so that the detection precision and the detection efficiency of the detection model are improved.

As a possible implementation manner, the detection method further includes: generating a front-end display result according to the code data subjected to the semantic abstraction, the preset features and the detection result; and displaying the front-end display result.

In the embodiment of the application, after the detection result is obtained, a corresponding display result can be generated and displayed according to data in the processing process, namely code data and preset features after semantic abstraction processing, and the detection result, so that flexible display of the front-end display result is realized.

As a possible implementation manner, the detection method further includes: acquiring a training data set; the training dataset comprises: normal code data and abnormal code data; the normal code data does not comprise a Webshell code, and the abnormal code data comprises the Webshell code; preprocessing the training data set to obtain a preprocessed training data set; performing semantic abstraction processing on the preprocessed training data set to obtain a training data set subjected to the semantic abstraction processing; the code data in the training data set after the semantic abstraction processing has a uniform code format; extracting a plurality of preset features from the data subjected to the semantic abstraction; and training the initial detection model according to the preset characteristics to obtain the trained detection model.

In the embodiment of the application, a series of processing is performed on the normal code data and the abnormal code data, corresponding features are extracted, and then the features are input into the detection model for training, so that the obtained trained detection model can realize effective and accurate detection of Webshell.

In a second aspect, an embodiment of the present application provides a detection apparatus for Webshell code, including functional modules for implementing the method of the first aspect and any one of the possible implementation manners of the first aspect.

In a third aspect, an embodiment of the present application provides a readable storage medium, where a computer program is stored on the readable storage medium, and when the computer program is executed by a computer, the method for detecting a Webshell code is performed as described in the first aspect and any one of the possible implementation manners of the first aspect.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.

Fig. 1 is a flowchart of a Webshell code detection method provided in an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a Webshell code detection device provided in an embodiment of the present application.

Reference numerals: 200 - Webshell code detection device; 210 - obtaining module; 220 - processing module.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

The technical solution provided by the embodiments of the application can be applied to various scenarios that require Webshell code detection, such as network security detection systems and internet security maintenance systems. The technical solution mainly detects PHP-type Webshell code, so the Webshell code mentioned in the subsequent embodiments should be understood as PHP-type Webshell code.

Based on the application scenario, the hardware operating environment corresponding to the technical solution may be hardware with data processing capability and data storage capability, for example: a server, a browser, a client, a server + browser, a server + client, and the like, which are not limited in the embodiments of the present application.

Based on the introduction of the application scenario, referring to fig. 1, a flowchart of a Webshell code detection method provided in an embodiment of the present application is shown, where the detection method includes:

Step 110: acquire code data to be detected.

Step 120: preprocess the code data to be detected to obtain preprocessed code data.

Step 130: perform semantic abstraction processing on the preprocessed code data to obtain semantically abstracted code data. The semantically abstracted code data has a uniform code format.

Step 140: extract a plurality of preset features from the semantically abstracted code data.

Step 150: determine a detection result of the code data to be detected based on the plurality of preset features and a pre-trained detection model. The detection result indicates whether the code data to be detected is Webshell code data.
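The flow of steps 110 to 150 can be sketched as a simple composition of four stages. The class name DetectionPipeline and its field names are hypothetical and do not appear in the application; each stage is passed in as a callable so that any concrete implementation can be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class DetectionPipeline:
    """Hypothetical wrapper that chains steps 120 to 150 on one code sample."""
    preprocess: Callable[[str], str]                     # step 120
    abstract: Callable[[str], str]                       # step 130
    extract_features: Callable[[str], Sequence[float]]   # step 140
    predict: Callable[[Sequence[float]], bool]           # step 150

    def detect(self, code: str) -> bool:
        """Return True if the sample is judged to be Webshell code."""
        cleaned = self.preprocess(code)
        unified = self.abstract(cleaned)
        features = self.extract_features(unified)
        return self.predict(features)


# Trivial stand-in stages, used only to show how the pieces connect.
demo = DetectionPipeline(
    preprocess=str.strip,
    abstract=str.lower,
    extract_features=lambda text: [float(text.count("eval"))],
    predict=lambda features: features[0] > 0,
)
```

With these stand-ins, demo.detect("  EVAL($x) ") returns True and demo.detect("echo 1;") returns False; a real deployment would substitute the preprocessing, abstraction, feature extraction, and trained model described in the following paragraphs.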

Next, a detailed embodiment of the detection method will be described.

In step 110, the acquired code data to be detected may be code data captured in the monitoring process of network security; or code data and the like which are uploaded by the user and need to be detected; the examples of the present application are not intended to be limiting.

In step 120, preprocessing the code data to be detected mainly removes useless information from it and extracts the PHP code mentioned in the foregoing embodiments. Thus, as an alternative embodiment, step 120 includes: deleting useless information from the code data to be detected, the useless information comprising comment information and HTML (Hypertext Markup Language) code fragments; and extracting the PHP code segments from the code data to be detected after the useless information is deleted, the PHP code segments being the preprocessed code data.

It can be understood that, no matter the comment information, the HTML code segment, or the PHP code segment, there is a fixed information representation form, or information format, so that the field format or field form corresponding to each preset type of information can be used to determine the information and process the information accordingly.

For example: when deleting the useless information, determining the useless information from the code data to be detected by using a preset field format or field information corresponding to the useless information, and then deleting the useless information. For another example: when the PHP code segment is extracted, the PHP code segment is determined from the code data to be detected after the useless information is deleted by using the field format or the field information corresponding to the preset PHP code segment, and then the PHP code segment is extracted from the code data to be detected after the useless information is deleted.

Furthermore, the PHP code segment finally extracted from the code data to be detected after deleting the garbage information is preprocessed code data, which corresponds to the detection of the Webshell code for the PHP code in the foregoing embodiment.
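A minimal sketch of this preprocessing is given below. The application only states that preset field formats are used, so the two regular expressions are assumptions: PHP segments are taken to be delimited by the standard PHP opening and closing tags, and comments are taken to be the /* */, // and # forms. In this sketch the HTML is dropped implicitly by keeping only the text inside PHP tags, and comment markers that occur inside string literals are not handled.

```python
import re

# Assumed "field formats"; the application does not specify them.
PHP_BLOCK = re.compile(r"<\?(?:php|=)?(.*?)(?:\?>|\Z)", re.DOTALL | re.IGNORECASE)
COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*|#[^\n]*", re.DOTALL)


def preprocess(source: str) -> str:
    """Keep only PHP code: drop HTML outside PHP tags, then drop comments."""
    # Everything outside <?php ... ?> is treated as HTML and discarded.
    segments = [match.group(1) for match in PHP_BLOCK.finditer(source)]
    php = "\n".join(segments)
    # Remove block comments and line comments from the PHP segments.
    php = COMMENT.sub("", php)
    # Drop the blank lines left behind by the removals.
    return "\n".join(line.strip() for line in php.splitlines() if line.strip())
```

For example, preprocess("<html><?php // greet\n$name = 'x'; ?></html>") keeps the assignment and discards both the HTML wrapper and the comment.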

In step 130, semantic abstraction processing is performed on the preprocessed code data to obtain code data with a uniform code format. Normal code contains many user-defined contents, such as variable names and character strings, which usually carry no meaning for detection, so the code data needs to be abstracted and converted into a standard, unified format.

As an alternative embodiment, step 130 includes: replacing the self-defined character string content in the preprocessed code data with first preset content; replacing variable declaration content in the preprocessed code data with second preset content; and replacing the parameter value corresponding to the method calling content in the preprocessed code data with a preset parameter value.

Character string content is mostly meaningless; therefore, custom string content can be replaced with a fixed value (i.e., the first preset content). The first preset content may be string.

Variable declaration content, such as $name, is also generally meaningless. Therefore, a variable declaration may be replaced with a fixed value (i.e., the second preset content). The second preset content may be user_define_variable.

For method call content, the original method name is kept and is not abstracted. The parameter values passed in the method call (i.e., the parameter values corresponding to the method call content), such as those in Add(1,2), may be replaced with a fixed value (i.e., the preset parameter value); for example, the parameters 1 and 2 are replaced with a fixed value, which may be user_define_param.
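The three replacements can be illustrated with a regex-based sketch. The placeholder values string, user_define_variable, and user_define_param come from the text above; the regular expressions, the decision to replace each argument separately, and the list of control keywords that are skipped are assumptions. Only innermost calls are rewritten here, and a production implementation would more likely work on a token stream or syntax tree.

```python
import re

SINGLE_QUOTED = r"'(?:\\.|[^'\\])*'"
DOUBLE_QUOTED = r'"(?:\\.|[^"\\])*"'
STRING = re.compile(SINGLE_QUOTED + "|" + DOUBLE_QUOTED, re.DOTALL)
VARIABLE = re.compile(r"\$[A-Za-z_][A-Za-z0-9_]*")
CALL = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\(([^()]*)\)")
# Control structures look like calls but are not method calls (assumed list).
CONTROL_KEYWORDS = {"if", "elseif", "while", "for", "foreach", "switch", "catch"}


def _abstract_call(match: re.Match) -> str:
    name, args = match.group(1), match.group(2)
    if name.lower() in CONTROL_KEYWORDS:
        return match.group(0)          # leave control structures untouched
    if not args.strip():
        return f"{name}()"             # no parameters to replace
    count = len(args.split(","))
    return f"{name}(" + ", ".join(["user_define_param"] * count) + ")"


def abstract(php: str) -> str:
    """Rewrite strings, variables and call arguments to fixed placeholders."""
    php = STRING.sub("string", php)                   # first preset content
    php = VARIABLE.sub("user_define_variable", php)   # second preset content
    return CALL.sub(_abstract_call, php)              # preset parameter value
```

For instance, abstract("$a = Add(1,2);") yields "user_define_variable = Add(user_define_param, user_define_param);", keeping the method name Add while hiding the concrete arguments.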

After the semantic abstraction process is implemented in step 130, a plurality of preset features are extracted from the semantically abstracted code data in step 140. It should be noted that the features are usually expressed in the form of feature vectors, and therefore, in the description of the following embodiments, the features should be understood as feature vectors.

In the embodiment of the present application, the preset features may include two parts, one part is features extracted by using a conventional feature extraction method; and another part is a specific feature.

As an optional implementation manner, the first partial feature is a feature extracted by a preset bag-of-words model. The bag-of-words model is a common feature extraction model in NLP (Natural Language Processing). The model can extract m sets of multidimensional vectors, which can be used as the features of the first part.
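As an illustration of the bag-of-words idea, the sketch below builds a vocabulary from a set of semantically abstracted code samples and turns each sample into a vector of token counts. The tokenization rule (identifier-like words) and the function names are assumptions; the application does not describe how its preset model is configured.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_vocabulary(documents):
    """Map every distinct token in the corpus to a fixed vector index."""
    tokens = sorted({tok for doc in documents for tok in TOKEN.findall(doc)})
    return {tok: index for index, tok in enumerate(tokens)}


def vectorize(document, vocabulary):
    """Count how often each vocabulary token occurs in one document."""
    counts = Counter(TOKEN.findall(document))
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:        # unseen tokens are ignored
            vector[vocabulary[token]] = count
    return vector
```

Given m samples, calling vectorize on each one produces m vectors whose dimension equals the vocabulary size.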

Of course, in practical applications, the extraction of the first partial feature may also be implemented by using other implementable feature models, which are not limited in the embodiments of the present application.

For the second partial features, as an optional implementation, the number of features is 14 in total, and all the finally extracted features may constitute m 14-dimensional feature vectors as the second partial features.

In the embodiments of the present application, the second partial feature includes three types:

The first class, overall features of the code, includes: length, entropy, the number of special symbols, and the number of preset special symbols. With reference to the foregoing embodiment, the length is the length of the code after the semantic abstraction processing, and the entropy is the entropy of the code after the semantic abstraction processing. The number of special symbols is the number of all special characters. The number of preset special symbols is the number of occurrences of designated special symbols (e.g., $).

The second class is sensitive keyword features. Webshell code typically uses techniques such as string obfuscation and transformation to hide command execution statements, so these features need to be extracted from the code. Specifically, the sensitive keyword features include: the number of string obfuscation keywords, the number of string transformation keywords, the number of command execution keywords, and the total number of sensitive keywords.

The third class, obfuscation and transformation features, includes: the number of purely numeric parameters, the number of words repeated multiple times, the number of sensitive keywords among the 3 most frequent words, whether the code is readable, the number of consecutive symbols, and the length of the longest run of consecutive symbols.

Whether the code is readable can be represented by 0 or 1: if the code is readable, the value of this feature is 1; otherwise, it is 0.
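Some of these features can be computed with a few lines of code. The sketch below covers the four overall features and the two symbol-run features. Several definitions are assumptions, because the application does not fix them: entropy is taken as character-level Shannon entropy in bits, a special symbol is any character that is not a letter, digit, underscore, or whitespace, and a run of consecutive symbols is two or more such characters in a row.

```python
import math
import re
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits (assumed definition)."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def overall_features(code: str, preset_symbols: str = "$") -> dict:
    """Length, entropy, and symbol counts of the abstracted code."""
    special = [ch for ch in code if not (ch.isalnum() or ch.isspace() or ch == "_")]
    return {
        "length": len(code),
        "entropy": shannon_entropy(code),
        "special_symbols": len(special),
        "preset_special_symbols": sum(ch in preset_symbols for ch in code),
    }


def symbol_run_features(code: str) -> dict:
    """Number of runs of consecutive symbols and the length of the longest run."""
    runs = re.findall(r"[^\w\s]{2,}", code)
    return {
        "consecutive_symbol_runs": len(runs),
        "longest_symbol_run": max((len(run) for run in runs), default=0),
    }
```

Heavily obfuscated code tends to score higher on entropy and on symbol runs than ordinary application code, which is why such values are useful inputs for a classifier.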

In addition, most of the above features are counts, which can be obtained by statistics. For example, the number of string obfuscation keywords can be determined by counting the string obfuscation keywords in the code data.

To compute these statistics, each object to be counted must be identified first; for example, the string obfuscation keywords in the code data must be identified before they can be counted. As an optional implementation manner, each object may be identified by its preset field format or field form; for details, refer to the way the useless information and the PHP code segments are determined in the foregoing embodiment, which is not repeated here.
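The counting of sensitive keywords might look like the following sketch. The application does not list its keywords, so the three sets below are illustrative assumptions built from common PHP function names; a real deployment would maintain its own lists.

```python
import re
from collections import Counter

# Illustrative keyword lists (assumed, not taken from the application).
SENSITIVE_KEYWORDS = {
    "string_obfuscation": {"base64_decode", "str_rot13", "gzinflate", "gzuncompress"},
    "string_transformation": {"str_replace", "strrev", "substr", "chr"},
    "command_execution": {"eval", "system", "exec", "shell_exec", "passthru", "assert"},
}
WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def keyword_features(code: str) -> dict:
    """Count sensitive keywords per category, plus the overall total."""
    words = Counter(word.lower() for word in WORD.findall(code))
    features = {
        category: sum(words[keyword] for keyword in keywords)
        for category, keywords in SENSITIVE_KEYWORDS.items()
    }
    features["all_sensitive"] = sum(features.values())
    return features
```

Because the method names are preserved by the semantic abstraction, these counts can be taken directly on the abstracted code.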

Based on the two parts of features, the two parts of features are extracted according to respective corresponding feature extraction modes, and then the two parts of features are combined to obtain the final preset features.

After the plurality of preset features are obtained in step 140, in step 150, a detection result of the code data to be detected is determined based on the plurality of preset features and a pre-trained detection model.

For the pre-trained detection model, it may be a model based on a supervised machine learning algorithm, including but not limited to: random forest algorithm, support vector machine algorithm, logistic regression algorithm, etc.

In selecting a particular algorithm, the selection may be made by evaluating the effect of the algorithm, for example: and selecting an optimal algorithm in a cross validation mode and the like.
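Selecting an algorithm by cross validation can be sketched as follows. To keep the example self-contained, two toy classifiers (a majority-class baseline and a nearest-centroid classifier) stand in for the random forest, support vector machine, and logistic regression algorithms named above; they are not part of the application. The fold layout and function names are likewise assumptions, and k is expected not to exceed the number of samples.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds over n samples."""
    start = 0
    for fold in range(k):
        size = n // k + (1 if fold < n % k else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n) if not start <= i < start + size]
        yield train, test
        start += size


def cross_val_accuracy(make_model, X, y, k=5):
    """Mean accuracy of a freshly built model over k folds."""
    fold_scores = []
    for train, test in k_fold_indices(len(X), k):
        model = make_model()
        model.fit([X[i] for i in train], [y[i] for i in train])
        predictions = model.predict([X[i] for i in test])
        correct = sum(p == y[i] for p, i in zip(predictions, test))
        fold_scores.append(correct / len(test))
    return sum(fold_scores) / len(fold_scores)


class MajorityClass:
    """Toy baseline: always predicts the most frequent training label."""
    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]

    def predict(self, X):
        return [self.label for _ in X]


class NearestCentroid:
    """Toy classifier: predicts the label whose mean feature vector is closest."""
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, target in zip(X, y) if target == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(self, X):
        def squared_distance(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.centroids, key=lambda c: squared_distance(x, self.centroids[c]))
                for x in X]


def select_best(candidates, X, y, k=5):
    """Return the name of the best-scoring candidate and all scores."""
    scores = {name: cross_val_accuracy(make, X, y, k) for name, make in candidates.items()}
    return max(scores, key=scores.get), scores
```

The candidate with the highest mean accuracy across folds is kept; the same loop applies unchanged when the candidates are real learning algorithms that expose fit and predict methods.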

As an alternative embodiment, the training process of the detection model includes: acquiring a training data set; the training dataset includes: normal code data and abnormal code data; the normal code data does not include Webshell codes, and the abnormal code data includes Webshell codes; preprocessing a training data set to obtain a preprocessed training data set; performing semantic abstraction processing on the preprocessed training data set to obtain a training data set subjected to the semantic abstraction processing; the code data in the training data set after the semantic abstraction processing has a uniform code format; extracting a plurality of preset features from the data subjected to semantic abstraction processing; and training the initial detection model according to a plurality of preset characteristics to obtain the trained detection model.

The normal code data and the abnormal code data can be collected through GitHub (a code hosting platform): the abnormal code data can be malicious Webshell source code collected through GitHub, and the normal code data can be code from open source projects collected through GitHub.

Based on the normal code data and the abnormal code data, a training data set may be composed. The subsequent preprocessing, semantic abstraction, and preset feature extraction processes based on the training data set are the same as those in the previous embodiments from step 120 to step 140, and will not be described again.

Furthermore, a plurality of preset features are input into the initial detection model for training, and the trained detection model can be obtained.

In the training process of the detection model, some measures may be adopted to improve the accuracy of the detection model. For example, a number of training iterations is preset, and training stops once that number is reached. For another example, after training, the model is tested to measure its detection accuracy, and the detection model is adjusted or further trained based on that accuracy.

Based on the above description of the training process of the detection model, in step 150, a plurality of preset features are input into the trained detection model, so as to obtain the detection result output by the detection model.

In the embodiment of the application, the features may be normalized so that the detection model can learn them better. Thus, as an alternative embodiment, step 150 includes: normalizing the plurality of preset features to obtain normalized features, the value of each normalized feature being between 0 and 1; and inputting the normalized features into the pre-trained detection model to obtain the detection result of the code data to be detected.

The normalization process can be implemented by using a normalization process technology mature in the field, and is not described in detail in the embodiments of the present application.
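One common way to map every feature into the range 0 to 1 is min-max scaling, sketched below. The choice of min-max scaling is an assumption, since the application leaves the normalization technique open; the function names are hypothetical. The bounds are learned from the training features and then reused for the code data to be detected, with values outside the training range clipped.

```python
def fit_min_max(rows):
    """Learn the (min, max) of every feature column from the training rows."""
    return [(min(column), max(column)) for column in zip(*rows)]


def apply_min_max(row, bounds):
    """Scale one feature vector into [0, 1] using previously learned bounds."""
    scaled = []
    for value, (low, high) in zip(row, bounds):
        if high == low:
            scaled.append(0.0)         # constant feature carries no information
        else:
            ratio = (value - low) / (high - low)
            scaled.append(min(1.0, max(0.0, ratio)))   # clip unseen extremes
    return scaled
```

For example, with training rows [[10, 3], [30, 9]], the vector [20, 9] becomes [0.5, 1.0].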

Similarly, when the detection model is trained, the preset features input into the initial detection model may also be normalized features.

The detection result obtained in step 150 may represent whether the code to be detected is Webshell code data, and if the code to be detected is Webshell code data, the code to be detected is an abnormal code; and if the code data is not Webshell code data, the code to be detected is normal code.

To make the results of the detection model interpretable, the embodiment of the present application can also provide corresponding front-end feedback. As an optional implementation manner, the server sends the corresponding display content to the front end (e.g., a browser or a client), and the front end then displays it. As another optional implementation manner, if the hardware operating environment corresponding to the detection method is itself a front end, the front end may display the corresponding display content directly.

Therefore, as an optional implementation manner, the detection method further includes: generating a front-end display result according to the code data subjected to the semantic abstraction, the plurality of preset features and the detection result; and displaying the front-end display result.

As an optional implementation, the front-end display result includes: code data after semantic abstraction, designated features in a plurality of preset features and detection results.

Wherein the specified feature of the plurality of preset features may include: a length; entropy; the number of special symbols; the number of character string confusion type keywords; the number of the deformation processing keywords; the number of the command execution type keywords; code readability; the number of consecutive symbols.

Of course, the specific feature in the preset features may be other features, and is not limited in the embodiment of the present application.

And based on the content to be displayed in the front-end display result, generating a corresponding front-end display result according to a preset display form or a preset display mode, and then displaying at the front end.
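As one possible display form, the front-end display result could be serialized as JSON and sent from the server to the browser or client. The sketch below is an assumption in every detail beyond the three ingredients named above (abstracted code, specified features, detection result): the function name, the JSON layout, and the feature key names are all hypothetical.

```python
import json

# Hypothetical keys for the specified features chosen for display.
DISPLAY_FEATURES = (
    "length",
    "entropy",
    "special_symbols",
    "string_obfuscation",
    "string_transformation",
    "command_execution",
    "readable",
    "consecutive_symbol_runs",
)


def build_display_result(abstracted_code, features, is_webshell):
    """Bundle the abstracted code, selected features and verdict as JSON."""
    payload = {
        "code": abstracted_code,
        "features": {key: features[key] for key in DISPLAY_FEATURES if key in features},
        "result": "webshell" if is_webshell else "normal",
    }
    return json.dumps(payload, sort_keys=True)
```

Showing the selected feature values next to the verdict lets a reviewer see which properties of the code contributed to the classification.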

Based on the same inventive concept, referring to Fig. 2, an embodiment of the present application further provides a Webshell code detection device 200, including: an obtaining module 210 and a processing module 220.

The obtaining module 210 is configured to acquire code data to be detected. The processing module 220 is configured to: preprocess the code data to be detected to obtain preprocessed code data; perform semantic abstraction processing on the preprocessed code data to obtain semantically abstracted code data, which has a uniform code format; extract a plurality of preset features from the semantically abstracted code data; and determine a detection result of the code data to be detected based on the plurality of preset features and a pre-trained detection model, the detection result indicating whether the code data to be detected is Webshell code data.

In this embodiment of the application, the processing module 220 is specifically configured to: delete useless information from the code data to be detected, the useless information comprising comment information and HTML code fragments; and extract the PHP code segments from the code data to be detected after the useless information is deleted, the PHP code segments being the preprocessed code data.

In this embodiment of the application, the processing module 220 is further specifically configured to: replacing the self-defined character string content in the preprocessed code data with first preset content; replacing variable declaration content in the preprocessed code data with second preset content; and replacing the parameter value corresponding to the method calling content in the preprocessed code data with a preset parameter value.

In this embodiment of the application, the processing module 220 is further specifically configured to: normalizing the preset features to obtain normalized features; the characteristic value of the normalized characteristic is between 0 and 1; and inputting the normalized features into the pre-trained detection model to obtain a detection result of the code data to be detected.

In an embodiment of the present application, the processing module 220 is further configured to: generating a front-end display result according to the code data subjected to the semantic abstraction, the preset features and the detection result; and displaying the front-end display result.

In this embodiment of the present application, the obtaining module 210 is further configured to: acquiring a training data set; the training dataset comprises: normal code data and abnormal code data; the normal code data does not comprise a Webshell code, and the abnormal code data comprises the Webshell code; the processing module 220 is further configured to: preprocessing the training data set to obtain a preprocessed training data set; performing semantic abstraction processing on the preprocessed training data set to obtain a training data set subjected to the semantic abstraction processing; the code data in the training data set after the semantic abstraction processing has a uniform code format; extracting a plurality of preset features from the data subjected to the semantic abstraction; and training the initial detection model according to the preset characteristics to obtain the trained detection model.

The Webshell code detection apparatus 200 corresponds to a Webshell code detection method, and each functional module corresponds to each step of the detection method, so that the implementation of each functional module refers to the description in the foregoing embodiments, and the description is not repeated here.

Based on the same inventive concept, embodiments of the present application further provide a readable storage medium, where a computer program is stored on the readable storage medium, and when the computer program is executed by a computer, the detection method of the Webshell code described in the foregoing embodiments is executed.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.

The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A Webshell code detection method, characterized by comprising the following steps:

acquiring code data to be detected;

preprocessing the code data to be detected to obtain preprocessed code data;

performing semantic abstraction processing on the preprocessed code data to obtain code data subjected to the semantic abstraction processing; the code data after the semantic abstraction processing has a uniform code format;

extracting a plurality of preset features from the code data subjected to the semantic abstraction;

determining a detection result of the code data to be detected based on the plurality of preset features and a pre-trained detection model; the detection result indicates whether the code data to be detected is Webshell code data;

wherein the performing semantic abstraction processing on the preprocessed code data to obtain code data subjected to the semantic abstraction processing comprises:

replacing the user-defined character string content in the preprocessed code data with first preset content;

replacing variable declaration content in the preprocessed code data with second preset content;

and replacing the parameter value corresponding to the method calling content in the preprocessed code data with a preset parameter value.

2. The detection method according to claim 1, wherein the preprocessing the code data to be detected to obtain preprocessed code data includes:

deleting useless information in the code data to be detected; the useless information comprises comment information and HTML code fragments;

extracting PHP code segments in the code data to be detected after the useless information is deleted; the PHP code segment is the preprocessed code data.

3. The detection method according to claim 1, wherein the preset features comprise:

a first preset feature extracted through a preset bag-of-words model.

4. The detection method according to claim 1 or 3, wherein the preset features comprise:

overall features of the code data after the semantic abstraction processing; the overall features include: length, entropy, number of special symbols, and number of preset special symbols;

sensitive keyword features contained in the code data after the semantic abstraction processing; the sensitive keyword features include: the number of string obfuscation keywords, the number of string deformation processing keywords, the number of command execution keywords, and the total number of sensitive keywords;

obfuscation and deformation data features in the code data after the semantic abstraction processing; the obfuscation and deformation data features include: the number of purely numeric parameters, the number of words repeated multiple times, the number of sensitive keywords among the 3 most frequent words, whether the code is readable, the number of consecutive symbols, and the length of the longest run of consecutive symbols.

5. The detection method according to claim 1, wherein the determining the detection result of the code data to be detected based on the plurality of preset features and a pre-trained detection model comprises:

normalizing the preset features to obtain normalized features; the value of each normalized feature is between 0 and 1;

and inputting the normalized features into the pre-trained detection model to obtain a detection result of the code data to be detected.

6. The detection method according to claim 1, further comprising:

generating a front-end display result according to the code data subjected to the semantic abstraction processing, the preset features and the detection result;

and displaying the front-end display result.

7. The detection method according to claim 1, further comprising:

acquiring a training data set; the training data set comprises normal code data and abnormal code data; the normal code data does not contain Webshell code, and the abnormal code data contains Webshell code;

preprocessing the training data set to obtain a preprocessed training data set;

performing semantic abstraction processing on the preprocessed training data set to obtain a training data set subjected to the semantic abstraction processing; the code data in the training data set after the semantic abstraction processing has a uniform code format;

extracting a plurality of preset features from the training data set subjected to the semantic abstraction processing;

and training an initial detection model according to the preset features to obtain the trained detection model.

8. An apparatus for detecting Webshell code, comprising:

an acquisition module, configured to acquire code data to be detected;

a processing module, configured to: preprocess the code data to be detected to obtain preprocessed code data; perform semantic abstraction processing on the preprocessed code data to obtain code data subjected to the semantic abstraction processing, wherein the code data after the semantic abstraction processing has a uniform code format; extract a plurality of preset features from the code data subjected to the semantic abstraction processing; and determine a detection result of the code data to be detected based on the plurality of preset features and a pre-trained detection model, wherein the detection result indicates whether the code data to be detected is Webshell code data;

the processing module is specifically configured to: replace the user-defined character string content in the preprocessed code data with first preset content; replace the variable declaration content in the preprocessed code data with second preset content; and replace the parameter values corresponding to the method calling content in the preprocessed code data with preset parameter values.

9. The detection apparatus according to claim 8, wherein the processing module is further configured to: delete useless information in the code data to be detected, wherein the useless information comprises comment information and HTML code fragments; and extract PHP code segments from the code data to be detected after the useless information is deleted, wherein the PHP code segments are the preprocessed code data.

10. A readable storage medium, having stored thereon a computer program which, when executed by a computer, performs the Webshell code detection method of any one of claims 1 to 7.