CN111695117B - Webshell script detection method and device - Google Patents
- Publication number
- CN111695117B CN202010534994.0A
- Authority
- CN
- China
- Prior art keywords
- webshell
- script
- features
- normal web
- script set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
- G06F21/563—Static detection by source code analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Abstract
The application provides a webshell script detection method and device. Features of a plurality of different set types are extracted to ensure the diversity of the features used to train an SVM model; the extracted features are screened with a Fisher scoring algorithm, and the SVM model is trained on the screened features, which improves the classification accuracy of the SVM model. On this basis, features of the plurality of different set types are extracted from the script to be detected, screened with the Fisher scoring algorithm, and input into the SVM model, which improves the accuracy of the classification result output by the SVM model.
Description
Technical Field
The application relates to the technical field of information security, in particular to a webshell script detection method and device.
Background
A webshell script is a backdoor program installed on a successfully compromised computer. An attacker can use it to maintain persistent access to the compromised computer and carry out a range of malicious actions, such as executing system commands, stealing or tampering with user data, and modifying website home pages. Detecting webshell scripts is therefore highly necessary.
However, the accuracy of current webshell script detection methods still needs to be improved.
Disclosure of Invention
To solve the above technical problem, embodiments of the present application provide a webshell script detection method and device that improve the accuracy of webshell script detection. The technical solution is as follows:
a webshell script detection method, the method comprising:
extracting features from the script to be detected according to a preset feature extraction template;
screening out, from the features, the features that conform to a set rule, and taking the screened features as features to be used;
inputting the features to be used into a pre-trained SVM model to obtain a classification result output by the SVM model, wherein the pre-trained SVM model is obtained by training by utilizing webshell training features and normal web training features;
the obtaining process of the webshell training feature and the normal web training feature comprises the following steps:
acquiring a webshell script set and a normal web script set;
extracting features from webshell scripts in the webshell script set and normal web scripts in the normal web script set respectively according to the preset feature extraction templates to obtain to-be-processed webshell features and to-be-processed normal web features;
and screening out the features conforming to the set rule from the to-be-processed webshell features and the to-be-processed normal web features respectively, and taking the screened features as the webshell training features and the normal web training features respectively.
Preferably, the feature extraction template includes:
extracting templates of a plurality of features of different setting types;
the extracting the features from the script to be detected according to the preset feature extraction template comprises the following steps:
extracting a plurality of features of different setting types from the script to be detected according to templates for extracting the features of a plurality of different setting types;
extracting features from the webshell scripts in the webshell script set and the normal web scripts in the normal web script set according to the preset feature extraction templates respectively, wherein the feature extraction comprises the following steps:
and extracting features of the plurality of different setting types from the webshell scripts in the webshell script set and the normal web scripts in the normal web script set according to the templates for extracting features of the plurality of different setting types.
Preferably, the setting rule includes:
a rule that the feature ranking is above a set ranking threshold;
the feature ranking is the ranking obtained by scoring the features with a Fisher scoring algorithm and sorting the features from low score to high score.
Preferably, the extracting the features from the webshell scripts in the webshell script set and the normal web scripts in the normal web script set according to the preset feature extraction templates respectively includes:
denoising the webshell script set and the normal web script set respectively to obtain a first webshell script set to be processed and a first normal web script set to be processed;
performing redundancy elimination processing on the first webshell script set to be processed and the first normal web script set to be processed respectively to obtain a second webshell script set to be processed and a second normal web script set to be processed;
clustering the second webshell script set to be processed and the second normal web script set to be processed respectively to obtain at least one webshell target script set and at least one normal web script set;
and extracting features from each webshell target script set and each normal web script set according to the preset feature extraction templates respectively.
Preferably, the denoising processing for the webshell script set and the normal web script set respectively includes:
taking the scripts in the webshell script set and in the normal web script set whose script length meets a set script length threshold as noise scripts, and removing the noise scripts;
and cleaning, with a de-obfuscation technique, the BASE64 encoding in the scripts that remain in the webshell script set and in the normal web script set after the noise scripts are removed.
A webshell script detection device, comprising:
the extraction module is used for extracting features from the script to be detected according to a preset feature extraction template;
the screening module is used for screening out, from the features, the features that conform to a set rule, and taking the screened features as the features to be used;
the classification module is used for inputting the characteristics to be used into a pre-trained SVM model to obtain a classification result output by the SVM model; the pre-trained SVM model is obtained by training a training module through webshell training features and normal web training features;
the obtaining process of the webshell training feature and the normal web training feature comprises the following steps:
acquiring a webshell script set and a normal web script set;
extracting features from webshell scripts in the webshell script set and normal web scripts in the normal web script set respectively according to the preset feature extraction templates to obtain to-be-processed webshell features and to-be-processed normal web features;
and screening out the features conforming to the set rule from the to-be-processed webshell features and the to-be-processed normal web features respectively, and taking the screened features as the webshell training features and the normal web training features respectively.
Preferably, the feature extraction template includes:
extracting templates of a plurality of features of different setting types;
the extraction module is specifically configured to: extracting a plurality of features of different setting types from the script to be detected according to templates for extracting the features of a plurality of different setting types;
the training module is specifically configured to:
extracting features of the plurality of different setting types from the webshell scripts in the webshell script set and the normal web scripts in the normal web script set according to the templates for extracting features of the plurality of different setting types.
Preferably, the setting rule includes:
a rule that the feature ranking is above a set ranking threshold;
the feature ranking is the ranking obtained by scoring the features with a Fisher scoring algorithm and sorting the features from low score to high score.
Preferably, the training module is specifically configured to:
denoising the webshell script set and the normal web script set respectively to obtain a first webshell script set to be processed and a first normal web script set to be processed;
performing redundancy elimination processing on the first webshell script set to be processed and the first normal web script set to be processed respectively to obtain a second webshell script set to be processed and a second normal web script set to be processed;
clustering the second webshell script set to be processed and the second normal web script set to be processed respectively to obtain at least one webshell target script set and at least one normal web script set;
and extracting features from each webshell target script set and each normal web script set according to the preset feature extraction templates respectively.
Preferably, the training module is specifically configured to:
taking the scripts in the webshell script set and in the normal web script set whose script length meets a set script length threshold as noise scripts, and removing the noise scripts;
and cleaning, with a de-obfuscation technique, the BASE64 encoding in the scripts that remain in the webshell script set and in the normal web script set after the noise scripts are removed.
Compared with the prior art, the application has the beneficial effects that:
In the present application, the script to be detected is processed in the same way as the data used to train the SVM model, namely by preprocessing, feature extraction and feature screening, to obtain the features to be used. The features to be used are then input into the pre-trained SVM model, so that the SVM model can classify them more accurately, which improves the accuracy of webshell script detection.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of the webshell script detection method provided in embodiment 1 of the present application;
Fig. 2 is a flowchart of the process of obtaining the webshell training features and the normal web training features provided in embodiment 1 of the present application;
Fig. 3 is a flowchart of the webshell script detection method provided in embodiment 2 of the present application;
Fig. 4 is a flowchart of the process of obtaining the webshell training features and the normal web training features provided in embodiment 2 of the present application;
Fig. 5 is a flowchart of the webshell script detection method provided in embodiment 3 of the present application;
Fig. 6 is a flowchart of the process of obtaining the webshell training features and the normal web training features provided in embodiment 3 of the present application;
Fig. 7 is a schematic diagram of the logical structure of the webshell script detection device provided by the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort fall within the scope of protection of the present application.
To make the above objects, features and advantages of the present application easier to understand, the application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flowchart of the webshell script detection method provided in embodiment 1 of the present application. As shown in Fig. 1, the method may include, but is not limited to, the following steps:
Step S11, extracting features from the script to be detected according to a preset feature extraction template.
The feature extraction template may be set as needed, and is not limited in this embodiment.
It should be noted that extracting features from the script to be detected according to a preset feature extraction template can improve the efficiency of feature extraction.
Step S12, screening out, from the features, the features that conform to a set rule, and taking the screened features as the features to be used.
The set rule may be set as needed, and is not limited in this embodiment.
Screening out the features that conform to the set rule and using only the screened features as the features to be used at least reduces the workload of webshell script detection.
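The set rule is left open in this embodiment, but elsewhere the application names Fisher-score ranking as one such rule. A minimal Python sketch of that idea follows. The textbook two-class Fisher score (between-class scatter divided by within-class scatter, per feature) is assumed here, since the application gives no formula, and the function names and the parameter `k` (the number of features kept) are hypothetical.

```python
from statistics import fmean, pvariance


def fisher_scores(X, y):
    """Return one Fisher score per feature column.

    X is a list of feature rows and y the matching class labels.
    score_j = sum_c n_c * (mean_cj - mean_j)^2 / sum_c n_c * var_cj
    (an assumed, textbook form; the application does not state a formula).
    """
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall_mean = fmean(column)
        between = 0.0
        within = 0.0
        for c in classes:
            values = [row[j] for row, label in zip(X, y) if label == c]
            between += len(values) * (fmean(values) - overall_mean) ** 2
            within += len(values) * pvariance(values)
        if within > 0:
            scores.append(between / within)
        else:
            # Zero within-class variance: perfectly separating or constant.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def select_top_features(X, y, k):
    """Return the indices of the k highest-scoring features, in column order."""
    scores = fisher_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])
```

The same index list would be applied to the training features and to the features of the script to be detected, so that both pass through identical screening.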
Step S13, inputting the features to be used into a pre-trained SVM model to obtain a classification result output by the SVM model, wherein the pre-trained SVM model is obtained by training with webshell training features and normal web training features.
For the process of obtaining the webshell training features and the normal web training features, refer to Fig. 2. The process may include:
S131, acquiring a webshell script set and a normal web script set.
In this embodiment, the process of obtaining the webshell script set and the normal web script set may include:
S1311, collecting a webshell script set and a normal web script set from the network (e.g., GitHub) using a crawler tool.
Of course, the process of obtaining the webshell script set and the normal web script set may also include:
S1312, collecting webshell script sets and normal web script sets from the network (e.g., GitHub) using a crawler tool.
S1313, screening out, from the collected webshell script sets, the webshell script set that meets a set condition.
The set condition may be set as needed, and is not limited in this embodiment. For example, the set condition may be that the script language is a set language (e.g., PHP).
S1314, screening out, from the collected normal web script sets, the normal web script set that meets the set condition.
The set condition in this step is the same as that in step S1313, and is not described again here.
S132, extracting features from the webshell scripts in the webshell script set and the normal web scripts in the normal web script set respectively according to the preset feature extraction template to obtain the to-be-processed webshell features and the to-be-processed normal web features.
That is, features are extracted from the webshell scripts in the webshell script set according to the preset feature extraction template to obtain the to-be-processed webshell features, and features are extracted from the normal web scripts in the normal web script set according to the same template to obtain the to-be-processed normal web features.
The preset feature extraction template in this step is the same as the feature extraction template in step S11, and is not described again here.
S133, screening out the features conforming to the set rule from the to-be-processed webshell features and the to-be-processed normal web features respectively, and taking the screened features as the webshell training features and the normal web training features respectively.
That is, the features conforming to the set rule are screened out from the to-be-processed webshell features and taken as the webshell training features, and the features conforming to the set rule are screened out from the to-be-processed normal web features and taken as the normal web training features.
The set rule in this step is the same as the set rule in step S12, and is not described again here.
In the present application, the script to be detected is processed in the same way as the data used to train the SVM model, namely by preprocessing, feature extraction and feature screening, to obtain the features to be used. The features to be used are then input into the pre-trained SVM model, so that the SVM model can classify them more accurately, which improves the accuracy of webshell script detection.
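The training and classification steps of embodiment 1 can be illustrated with a small, dependency-free sketch. The application does not specify the SVM kernel or the training procedure, so what follows is an assumption: a linear SVM trained by subgradient descent on the regularised hinge loss, with hypothetical function names and hyperparameters (`epochs`, `lr`, `reg`). Feature vectors are assumed to be already screened and roughly centred and scaled; a production system would more likely rely on an established SVM library.

```python
def train_linear_svm(X, y, epochs=200, lr=0.01, reg=0.01):
    """Train a linear SVM on rows X with labels y in {+1, -1}.

    +1 stands for webshell and -1 for normal web script (an assumed
    encoding). Each step follows the subgradient of
    reg/2 * ||w||^2 + max(0, 1 - y * (w.x + b)).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Sample violates the margin: move the hyperplane towards it.
                w = [wj - lr * (reg * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Margin satisfied: only the regularisation term contributes.
                w = [wj - lr * reg * wj for wj in w]
    return w, b


def classify(w, b, x):
    """Map the signed decision value to a classification result."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return "webshell" if score >= 0 else "normal"
```

Training would be called once on the webshell and normal web training features; `classify` would then be applied to the features to be used of each script to be detected.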
As another alternative embodiment of the present application, Fig. 3 is a flowchart of the webshell script detection method provided in embodiment 2 of the present application. This embodiment is mainly a refinement of the webshell script detection method described in embodiment 1 above. As shown in Fig. 3, the method may include, but is not limited to, the following steps:
Step S21, extracting features of a plurality of different setting types from the script to be detected according to templates for extracting features of a plurality of different setting types.
The templates for extracting features of a plurality of different setting types are one implementation of the preset feature extraction template in embodiment 1.
Step S21 is a specific implementation of step S11 in embodiment 1.
The features of the different setting types may include, but are not limited to: lexical features, syntactic features, and abstract features.
Lexical features can be understood as follows: a webshell receives various commands and exchanges information with the victim server, so the global variables that a webshell needs are analysed, and the number of occurrences of each global variable that carries received information is taken as a feature. The global variables may include, but are not limited to, PHP superglobals such as $_GET (F1), $_POST (F2), $_COOKIE (F3), $_REQUEST (F4), $_FILES (F5) and $_SESSION (F6).
Syntactic features can be understood as follows: features determined by analysing the expression patterns an attacker uses when writing webshell scripts, which reflect that a webshell automatically adapts to various operating systems and automatically attempts to obtain the permissions of related software. Syntactic features may include, but are not limited to: the conditional statement proportion, i.e. the percentage of conditional statements among all statements of the script, e.g. if (F7), else (F8), else (F9) and case (F10) in the script; and/or the loop statement proportion, i.e. the percentage of loop statements among all statements of the script, e.g. for (F11), while (F12) and foreach (F13) in the script.
Abstract features can be understood as the sensitive-function matching degree, which is determined by checking whether a sensitive function exists in the script to be detected. If the script to be detected contains a sensitive function, the matching degree may be set to 1; if it does not, the matching degree may be set to 0. The sensitive-function matching degree reflects the use of certain keywords in the PHP language that webshells often use to carry out suspicious behaviour, such as disguised execution functions (eval), file acquisition functions (wget, curl, lynx, get, fetch), reverse connection functions (perl, python, gcc, chmod, nohup, nc) and information collection functions (uname, id, ver, sysctl, whoami, $OSTYPE, pwd). Whether a script contains each of these function types is used as a feature: contains a disguised execution function (F14), contains a file acquisition function (F15), contains a reverse connection function (F16), and contains an information collection function (F17). In addition, three common features may be added: the maximum word length in the script (F18), the maximum line length in the PHP source code (F19), and the information entropy (F20).
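To make the feature template F1-F20 concrete, the following Python sketch computes one possible version of each feature from a PHP script given as a string. Several details are assumptions rather than part of the application: the number of statements is approximated by the number of non-blank lines, words are whitespace-separated tokens, information entropy is character-level Shannon entropy, and, because the source lists `else` for both F8 and F9, the sketch treats F9 as `elseif`. All function and key names are hypothetical.

```python
import math
import re
from collections import Counter

SUPERGLOBALS = ["$_GET", "$_POST", "$_COOKIE", "$_REQUEST", "$_FILES", "$_SESSION"]
CONDITIONALS = ["if", "else", "elseif", "case"]  # "elseif" is an assumption
LOOPS = ["for", "while", "foreach"]
SENSITIVE = {
    "disguised_exec": ["eval"],
    "file_fetch": ["wget", "curl", "lynx", "get", "fetch"],
    "reverse_connect": ["perl", "python", "gcc", "chmod", "nohup", "nc"],
    "info_collect": ["uname", "id", "ver", "sysctl", "whoami", "$OSTYPE", "pwd"],
}


def count_word(word, text):
    """Count whole-word occurrences of word (which may start with '$')."""
    pattern = r"(?<![\w$])" + re.escape(word) + r"(?!\w)"
    return len(re.findall(pattern, text))


def shannon_entropy(text):
    """Character-level Shannon entropy in bits."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in Counter(text).values())


def extract_features(script):
    """Return a dict with 20 features in the spirit of F1-F20."""
    lines = script.splitlines()
    # Assumption: one non-blank line is treated as one statement.
    n_statements = max(sum(1 for ln in lines if ln.strip()), 1)
    features = {}
    for name in SUPERGLOBALS:  # F1-F6: occurrence counts
        features[name] = count_word(name, script)
    for kw in CONDITIONALS + LOOPS:  # F7-F13: share of statements
        features[kw + "_ratio"] = count_word(kw, script) / n_statements
    for group, names in SENSITIVE.items():  # F14-F17: 1 if present, else 0
        features[group] = int(any(count_word(fn, script) for fn in names))
    features["max_word_len"] = max((len(w) for w in script.split()), default=0)  # F18
    features["max_line_len"] = max((len(ln) for ln in lines), default=0)  # F19
    features["entropy"] = shannon_entropy(script)  # F20
    return features
```

Whole-word matching keeps short names such as `id` or `nc` from matching inside longer identifiers, although such short keywords can still produce false positives on ordinary code.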
Step S22, screening out, from the features, the features that conform to the set rule, and taking the screened features as the features to be used.
Step S23, inputting the features to be used into a pre-trained SVM model to obtain a classification result output by the SVM model, wherein the pre-trained SVM model is obtained by training with webshell training features and normal web training features.
For the detailed procedure of steps S22-S23, refer to the related description of steps S12-S13 in embodiment 1, which is not repeated here.
In this embodiment, for the process of obtaining the webshell training features and the normal web training features, refer to Fig. 4. The process may include:
Step S231, acquiring a webshell script set and a normal web script set.
For the detailed process of step S231, refer to the related description of step S131 in embodiment 1, which is not repeated here.
Step S232, extracting features of the plurality of different setting types from the webshell scripts in the webshell script set and the normal web scripts in the normal web script set according to the templates for extracting features of the plurality of different setting types.
The templates for extracting features of the plurality of different setting types are the same as those in step S21, and are not described again here.
In this embodiment, the process of extracting features from the webshell scripts in the webshell script set and the normal web scripts in the normal web script set according to the preset feature extraction templates respectively may include:
S2321, denoising the webshell script set and the normal web script set respectively to obtain a first webshell script set to be processed and a first normal web script set to be processed.
S2322, performing redundancy elimination processing on the first webshell script set to be processed and the first normal web script set to be processed respectively to obtain a second webshell script set to be processed and a second normal web script set to be processed.
Performing redundancy elimination on the first webshell script set to be processed and the first normal web script set to be processed respectively can be understood as follows: a standardisation operation is applied to each set, which includes deleting all code comments and blank lines, computing text similarity with a TF-IDF model, and, among scripts whose similarity exceeds a threshold, retaining only one.
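The redundancy elimination described above might be sketched as follows. This is an illustration under stated assumptions, not the application's implementation: comments are stripped with simple regular expressions (which would also cut `//` or `#` inside string literals), TF-IDF uses a smoothed inverse document frequency, similarity is cosine similarity, and the similarity `threshold` of 0.9 is a hypothetical value that the application does not specify.

```python
import math
import re
from collections import Counter


def normalize(script):
    """Delete PHP comments and blank lines (rough, regex-based)."""
    script = re.sub(r"/\*.*?\*/", "", script, flags=re.S)  # block comments
    script = re.sub(r"(//|#).*$", "", script, flags=re.M)  # line comments
    return "\n".join(ln for ln in script.splitlines() if ln.strip())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (token -> weight) per document."""
    tokenised = [re.findall(r"[A-Za-z_$][\w$]*", d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vectors.append({
            t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(scripts, threshold=0.9):
    """Keep the first script of every group of near-duplicates."""
    cleaned = [normalize(s) for s in scripts]
    vectors = tfidf_vectors(cleaned)
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) <= threshold for j in kept):
            kept.append(i)
    return [cleaned[i] for i in kept]
```

The function would be applied separately to the first webshell script set to be processed and to the first normal web script set to be processed, yielding the two second sets.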
S2323, clustering the second webshell script set to be processed and the second normal web script set to be processed respectively to obtain at least one webshell target script set and at least one normal web script set.
Because functional modules may overlap between webshell scripts or between normal web scripts, the second webshell script set to be processed and the second normal web script set to be processed may each be clustered using the idea of transitive closure. Specifically, webshell scripts or normal web scripts whose functional-module similarity exceeds a set clustering threshold are clustered into one class.
More specifically, the second webshell script set to be processed and the second normal web script set to be processed may each be clustered by script family. Scripts in the same family are highly similar, while scripts in different families are not. For example, one webshell family is called C99: it contains many webshell scripts whose code and functions are similar (i.e. their functional modules are close), and two scripts in the family may differ by only a few lines of code. Another family is called r57, and the webshells in the r57 family are likewise similar to one another. However, the webshell scripts in r57 differ greatly from those in C99.
For example, if scripts m1 and m2 share some functional modules, and scripts m2 and m3 share some functional modules, it can be inferred that m1, m2 and m3 all belong to one family. This transitive property makes it possible to obtain all members of the same family. In the process, the functions of the webshells in a family also gradually become clear.
In this embodiment, the set clustering threshold may be set to 30%, that is, 30% of the functional modules are identical and are grouped into one class.
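The transitive-closure clustering with a 30% threshold can be sketched with a union-find structure. The sketch makes two assumptions that the application leaves open: each script is represented by a set of functional-module identifiers (for example, hashes of function bodies), and similarity is the number of shared modules divided by the size of the smaller set. The function names are hypothetical.

```python
def module_similarity(a, b):
    """Share of the smaller module set that also appears in the other set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def cluster_by_transitive_closure(modules, threshold=0.30):
    """Group scripts into families.

    modules maps a script name to its set of functional-module identifiers.
    Two scripts are linked when their similarity reaches the threshold, and
    families are the connected components of that link relation, so m1 and
    m3 end up together if both are linked to m2.
    """
    names = list(modules)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the root, compressing the path on the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if module_similarity(modules[a], modules[b]) >= threshold:
                parent[find(a)] = find(b)

    families = {}
    for name in names:
        families.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in families.values())
```

With the m1, m2, m3 example from the text, m1 and m3 need not share any module directly; they are merged because each is linked to m2.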
Steps S2321-S2323 may be understood as data preprocessing of the webshell scripts in the webshell script set and the normal web scripts in the normal web script set.
S2324, extracting features from each webshell target script set and each normal web script set according to the preset feature extraction templates respectively.
The detailed process of step S2324 may refer to the related description of step S21, which is not described herein.
In this embodiment, the process of extracting features from the webshell scripts in the webshell script set and the normal web scripts in the normal web script set according to the preset feature extraction templates respectively may also include:
S2325, taking the scripts in the webshell script set and in the normal web script set whose script length meets a set script length threshold as noise scripts, and removing the noise scripts.
The script length threshold may be set as needed, and is not limited in this embodiment. For example, the set script length threshold may be set to, but is not limited to, 3 megabytes.
S2326, cleaning, with a de-obfuscation technique, the BASE64 encoding in the scripts that remain in the webshell script set and in the normal web script set after the noise scripts are removed, so as to obtain a first webshell script set to be processed and a first normal web script set to be processed.
This cleaning can be understood as follows: a de-obfuscation technique is used to restore the BASE64 encoding in the scripts remaining in the webshell script set and in the normal web script set after the noise scripts are removed to the original PHP script.
Because BASE64-encoded content carries no semantic information, features cannot be extracted from it; it therefore needs to be restored to PHP script so that features can be extracted.
Steps S2325-S2326 are a specific embodiment of step S2321.
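As a non-limiting illustration, steps S2325-S2326 might be sketched in Python as follows. The sketch assumes that a script "meets" the length threshold when its UTF-8 size exceeds it, that the anti-aliasing technology amounts to de-obfuscation, and that the BASE64 code appears as a string literal passed to PHP's base64_decode. The names remove_noise_scripts and restore_base64 are hypothetical; none of these details is specified in this embodiment.

```python
import base64
import re

# Example value only; the text names 3 megabytes as one possible threshold.
MAX_SCRIPT_BYTES = 3 * 1024 * 1024

# Matches base64_decode('...') or base64_decode("...") with a literal argument.
_B64_CALL = re.compile(r"""base64_decode\(\s*(['"])([A-Za-z0-9+/=]+)\1\s*\)""")


def remove_noise_scripts(scripts, max_bytes=MAX_SCRIPT_BYTES):
    """Drop scripts whose UTF-8 size exceeds max_bytes (assumed direction)."""
    return [s for s in scripts if len(s.encode("utf-8")) <= max_bytes]


def restore_base64(script):
    """Replace literal base64_decode(...) calls with the decoded text."""
    def _decode(match):
        try:
            return base64.b64decode(match.group(2), validate=True).decode("utf-8")
        except ValueError:
            # Covers malformed BASE64 and non-UTF-8 payloads: leave them as-is.
            return match.group(0)
    return _B64_CALL.sub(_decode, script)
```

Payloads that cannot be decoded are left untouched, so that a failed restoration does not discard the script.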
S2327, performing redundancy elimination processing on the first webshell script set to be processed and the first normal web script set to be processed respectively to obtain a second webshell script set to be processed and a second normal web script set to be processed.
S2328, clustering the second webshell script set to be processed and the second normal web script set to be processed respectively to obtain at least one webshell target script set and at least one normal web script set.
S2329, extracting features from each webshell target script set and each normal web script set according to the preset feature extraction templates respectively.
The detailed process of steps S2327-S2329 may be referred to in the description of steps S2322-S2324, and will not be described herein.
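As a non-limiting illustration, the redundancy elimination of step S2327 might be sketched in Python as follows. The sketch assumes that two scripts are redundant when they are identical after whitespace is collapsed; this criterion and the name remove_redundant_scripts are hypothetical. The clustering of step S2328 is not covered.

```python
import hashlib


def _fingerprint(script):
    """Hash the script after collapsing runs of whitespace (assumed criterion)."""
    normalized = " ".join(script.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def remove_redundant_scripts(scripts):
    """Keep the first occurrence of each distinct fingerprint, preserving order."""
    seen = set()
    unique = []
    for script in scripts:
        digest = _fingerprint(script)
        if digest not in seen:
            seen.add(digest)
            unique.append(script)
    return unique
```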
S233, screening out the features conforming to the set rule from the webshell features to be processed and the normal web features to be processed respectively, and taking the screened features as webshell training features and normal web training features respectively.
The detailed process of step S233 can be referred to the related description of step S133 in embodiment 1, and will not be repeated here.
In this embodiment, extracting features of a plurality of different setting types ensures the diversity of the features used for training the SVM model; the extracted features of the plurality of different setting types are screened, and the SVM model is trained by using the screened features, so that the classification accuracy of the SVM model can be further improved. On this basis, the features of the plurality of different setting types are extracted from the script to be detected, the extracted features are screened, and the screened features are input into the SVM model, so that the accuracy of the classification result output by the SVM model can be improved.
As another alternative embodiment of the present application, fig. 5 is a flowchart of the webshell script detection method provided in embodiment 3 of the present application. This embodiment is mainly a refinement of the webshell script detection method described in embodiment 2 above. As shown in fig. 5, the method may include, but is not limited to, the following steps:
S31, extracting a plurality of features of different setting types from the script to be detected according to templates for extracting the features of the plurality of different setting types.
The detailed process of step S31 can be referred to the related description of step S21 in embodiment 2, and will not be repeated here.
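As a non-limiting illustration, two of the feature types used by this method, namely the lexical feature (the number of global variables representing received information) and the abstract feature (the sensitive function matching degree), might be extracted in Python as follows. The lists of variables and functions, the definition of the matching degree as a fraction, and all function names are hypothetical assumptions; the syntactic feature is omitted.

```python
import re

# Hypothetical lists; the text does not enumerate the variables or functions.
RECEIVING_GLOBALS = ("$_GET", "$_POST", "$_REQUEST", "$_COOKIE", "$_FILES")
SENSITIVE_FUNCTIONS = ("eval", "assert", "system", "exec", "shell_exec", "passthru")


def count_receiving_globals(script):
    """Lexical feature: occurrences of globals that carry received data."""
    return sum(script.count(name) for name in RECEIVING_GLOBALS)


def sensitive_function_match_degree(script):
    """Abstract feature: fraction of the listed functions that are called."""
    called = 0
    for name in SENSITIVE_FUNCTIONS:
        # The lookbehind stops "exec" from matching inside "shell_exec".
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(name) + r"\s*\("
        if re.search(pattern, script):
            called += 1
    return called / len(SENSITIVE_FUNCTIONS)


def extract_features(script):
    """Return the feature vector [lexical, abstract] for one script."""
    return [count_receiving_globals(script), sensitive_function_match_degree(script)]
```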
S32, screening out, from the features, the features which meet the rule that the feature ranking is higher than the set ranking threshold, and taking the screened features as the features to be used.
The feature ranking in this rule is the ranking obtained by scoring the features based on a Fisher scoring algorithm and sorting the features from low score to high score.
The ranking threshold may be set as needed, and is not limited in this embodiment.
The process of screening features from the features that meet a rule that the feature ranks above a set ranking threshold may include:
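S321, scoring the features based on the following relational expression of the Fisher scoring algorithm:
$$F(F_i) = \frac{\sum_{k=1}^{c} n_k \left(\mu_i^{k} - \mu_i\right)^2}{\sum_{k=1}^{c} \sum_{j=1}^{n_k} \left(x_{i,j}^{k} - \mu_i^{k}\right)^2}$$
wherein $\mu_i$ represents the mean value of the i-th feature on the data set, $\mu_i^{k}$ represents the mean value of the i-th feature in the k-th class, $n_k$ represents the number of samples in the k-th class, $x_{i,j}^{k}$ represents the value at the j-th position of the i-th feature in the k-th class, and $c$ is the number of classes.
The larger $F(F_i)$ is, the smaller the variance of the values of the corresponding feature within the same class relative to the separation between the classes, and the better the feature.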
S322, sorting the scores from low to high to obtain feature ranks, and selecting features with feature ranks higher than a set ranking threshold.
The set ranking threshold may be set as, but is not limited to: 16.
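As a non-limiting illustration, steps S321-S322 might be sketched in Python as follows. The sketch assumes the usual form of the Fisher score, namely the between-class scatter divided by the within-class scatter, and reads "rank higher than the threshold" as a position, counted from 1 in the low-to-high ordering, that exceeds the threshold. The function names and this reading are assumptions.

```python
def fisher_score(values_by_class):
    """Fisher score of one feature.

    values_by_class maps a class label to the list of that feature's values
    in the class. Returns between-class scatter over within-class scatter.
    """
    all_values = [v for vals in values_by_class.values() for v in vals]
    overall_mean = sum(all_values) / len(all_values)
    between = 0.0
    within = 0.0
    for vals in values_by_class.values():
        class_mean = sum(vals) / len(vals)
        between += len(vals) * (class_mean - overall_mean) ** 2
        within += sum((v - class_mean) ** 2 for v in vals)
    # A feature that is constant inside every class has zero within-class scatter.
    return float("inf") if within == 0 else between / within


def select_features(scores, rank_threshold):
    """Sort feature indices by ascending score; keep ranks above the threshold."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # rank 1 = lowest
    return [i for rank, i in enumerate(order, start=1) if rank > rank_threshold]
```

With a low-to-high ordering, the retained features are those with the highest scores.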
This embodiment performs a calculation on the data set of features selected by the Fisher scoring algorithm to verify that these features distinguish normal scripts from malicious scripts well. The specific verification method may be as follows. For each dimension of the features, first calculate the center point of each class of script (normal and malicious), denoted $c_n$ and $c_m$ respectively. Then calculate the average Euclidean distance from the corresponding dimension data in the data samples of the normal class and of the malicious class to the center points $c_n$ and $c_m$, denoted $r_n$ and $r_m$ respectively. Finally, calculate the Euclidean distance between the normal and malicious center points, denoted $dc_{m,n}$. If $r_n$ and $r_m$ are much smaller than $dc_{m,n}$, the features distinguish normal scripts from malicious scripts well.
Step S33, inputting the features to be used into a pre-trained SVM model to obtain a classification result output by the SVM model, wherein the pre-trained SVM model is obtained by training with webshell training features and normal web training features.
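As a non-limiting illustration, the verification described above might be sketched in Python for a single feature dimension as follows. In one dimension the Euclidean distance reduces to an absolute difference. The text does not quantify "much smaller", so the ratio parameter, like the function names, is a hypothetical assumption.

```python
def separation_stats(normal_values, malicious_values):
    """Return (r_n, r_m, dc) for one feature dimension."""
    c_n = sum(normal_values) / len(normal_values)        # center of normal class
    c_m = sum(malicious_values) / len(malicious_values)  # center of malicious class
    r_n = sum(abs(v - c_n) for v in normal_values) / len(normal_values)
    r_m = sum(abs(v - c_m) for v in malicious_values) / len(malicious_values)
    return r_n, r_m, abs(c_n - c_m)


def is_well_separated(normal_values, malicious_values, ratio=0.5):
    """Assumed reading of 'much smaller': both radii are below ratio * dc."""
    r_n, r_m, dc = separation_stats(normal_values, malicious_values)
    return r_n < ratio * dc and r_m < ratio * dc
```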
The detailed process of step S33 can be referred to the related description of step S23 in embodiment 2, and will not be described herein.
In this embodiment, the obtaining process of the webshell training feature and the normal web training feature may refer to fig. 6, and the obtaining process may include:
Step S331, acquiring a webshell script set and a normal web script set.
Step S332, extracting the features of the plurality of different setting types from the webshell scripts in the webshell script set and the normal web scripts in the normal web script set according to templates for extracting the features of the plurality of different setting types.
The detailed process of steps S331 to S332 can be referred to in the related description of steps S231 to S232 in embodiment 2, and will not be described herein.
Step S333, screening out features meeting the rule that feature ranking is higher than a set ranking threshold from the webshell features to be processed and the normal web features to be processed, and taking the screened features as webshell training features and normal web training features, respectively.
The rule that the feature rank is higher than the set rank threshold in this step may refer to the related description of the set rule in step S32, which is not described herein.
For the detailed process of screening out, from the webshell features to be processed and the normal web features to be processed respectively, the features that meet the rule that the feature ranking is higher than the set ranking threshold, and taking the screened features as the webshell training features and the normal web training features respectively, reference may be made to the detailed description of step S32, which is not repeated here.
In this embodiment, the diversity of the features used for training the SVM model is ensured by extracting features of a plurality of different setting types; the features of the plurality of different setting types are screened based on the Fisher scoring algorithm, and the SVM model is trained by using the screened features, so that the classification accuracy of the SVM model can be further improved. On this basis, the features of the plurality of different setting types are extracted from the script to be detected, the extracted features are screened by using the Fisher scoring algorithm, and the screened features are input into the SVM model, so that the accuracy of the classification result output by the SVM model can be improved.
The webshell script detection device provided by the present application is introduced below; the webshell script detection device introduced below and the webshell script detection method introduced above may be referred to in correspondence with each other.
Referring to fig. 7, the webshell script detecting device includes: the device comprises an extraction module 11, a screening module 12, a classification module 13, a training module 14 and a feature obtaining module 15.
The extracting module 11 is configured to extract features from the script to be detected according to a preset feature extraction template.
And the screening module 12 is used for screening out the characteristics conforming to the set rule from the characteristics, and taking the screened characteristics as the characteristics to be used.
The classification module 13 is used for inputting the characteristics to be used into a pre-trained SVM model to obtain a classification result output by the SVM model; the pre-trained SVM model is trained using the webshell training features and the normal web training features using the training module 14.
The feature obtaining module 15 is configured to:
acquiring a webshell script set and a normal web script set;
extracting features from webshell scripts in the webshell script set and normal web scripts in the normal web script set respectively according to the preset feature extraction templates to obtain to-be-processed webshell features and to-be-processed normal web features;
And respectively screening out the characteristics conforming to the set rule from the webshell characteristics to be processed and the normal web characteristics to be processed, and respectively taking the screened characteristics as webshell training characteristics and normal web training characteristics.
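As a non-limiting illustration, the cooperation of the extraction module 11, the screening module 12 and the classification module 13 at detection time might be sketched in Python as follows. The class name WebshellDetector, the representation of the screening result as a list of retained feature indices, and the use of a plain callable in place of the trained SVM model are all hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class WebshellDetector:
    """Wires the extraction, screening and classification modules together."""
    extract: Callable[[str], List[float]]   # role of extraction module 11
    selected: Sequence[int]                 # indices kept by screening module 12
    classify: Callable[[List[float]], str]  # role of classification module 13

    def detect(self, script: str) -> str:
        features = self.extract(script)
        to_use = [features[i] for i in self.selected]  # features to be used
        return self.classify(to_use)
```

In use, the classify callable would wrap the prediction function of the pre-trained SVM model, and selected would hold the indices retained during training, so that training and detection use the same features.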
In this embodiment, the feature extraction template may include:
extracting templates of a plurality of features of different setting types;
accordingly, the extraction module 11 may be specifically configured to: extracting a plurality of features of different setting types from the script to be detected according to templates for extracting the features of a plurality of different setting types;
the feature obtaining module 15 may specifically be configured to:
and extracting the characteristics of the different setting types from the webshell scripts in the webshell script set and the normal web scripts in the normal web script set according to templates for extracting the characteristics of the different setting types.
In this embodiment, the setting rule may include:
rules for feature ranking above a set ranking threshold;
the feature ranking is ranking obtained by scoring the features based on a Fisher scoring algorithm and sorting the features from low score to high score.
In this embodiment, the process of extracting features from the webshell scripts in the webshell script set and the normal web scripts in the normal web script set by the feature obtaining module 15 according to the preset feature extraction template respectively may include:
Denoising the webshell script set and the normal web script set respectively to obtain a first webshell script set to be processed and a first normal web script set to be processed;
performing redundancy elimination processing on the first webshell script set to be processed and the first normal web script set to be processed respectively to obtain a second webshell script set to be processed and a second normal web script set to be processed;
clustering the second webshell script set to be processed and the second normal web script set to be processed respectively to obtain at least one webshell target script set and at least one normal web script set;
and extracting features from each webshell target script set and each normal web script set according to the preset feature extraction templates respectively.
In this embodiment, the process of the feature obtaining module 15 performing denoising processing on the webshell script set and the normal web script set respectively may include:
respectively taking the script with the script length in the webshell script set and the normal web script set which meet a set script length threshold as a noise script, and removing the noise script;
and cleaning BASE64 codes in scripts remaining after the noise scripts are removed in the webshell script set and the normal web script set respectively by using an anti-aliasing technology.
It should be noted that each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to each other. Because the device embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant points, refer to the description of the method embodiments.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in the same piece or pieces of software and/or hardware when implementing the present application.
From the above description of embodiments, it will be apparent to those skilled in the art that the present application may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.
The webshell script detection method and device provided by the present application have been described in detail above. Specific examples are used herein to illustrate the principle and implementation of the application, and the description of the above embodiments is only intended to help in understanding the method and core ideas of the application. Meanwhile, a person skilled in the art may, in accordance with the ideas of the present application, make changes to the specific embodiments and the application scope. In view of the above, the content of this description should not be construed as limiting the present application.
Claims (8)
1. The webshell script detection method is characterized by comprising the following steps:
extracting features from the script to be detected according to a preset feature extraction template; the feature extraction template comprises: extracting templates of a plurality of features of different setting types;
the extracting the features from the script to be detected according to the preset feature extraction template comprises the following steps: extracting a plurality of features of different setting types from the script to be detected according to templates for extracting the features of a plurality of different setting types; the features of the different setting types include: lexical features, syntactic features, and abstract features; the lexical feature is characterized in that global variables required by the webshell are analyzed according to the characteristic that the webshell receives various commands and performs information interaction with a victim server, the number of the global variables representing the received information is used as a feature, the syntactic feature is a feature that the webshell can automatically adapt to various operating systems and automatically try to acquire the authority of related software, the abstract feature is a sensitivity function matching degree, and the sensitivity function matching degree is used for representing the application condition of keywords in PHP language;
screening out the characteristics which accord with a set rule from the characteristics, and taking the screened out characteristics as characteristics to be used;
Inputting the features to be used into a pre-trained SVM model to obtain a classification result output by the SVM model, wherein the pre-trained SVM model is obtained by training by utilizing webshell training features and normal web training features;
the obtaining process of the webshell training feature and the normal web training feature comprises the following steps:
acquiring a webshell script set and a normal web script set;
extracting features from webshell scripts in the webshell script set and normal web scripts in the normal web script set respectively according to the preset feature extraction templates to obtain to-be-processed webshell features and to-be-processed normal web features; extracting features from the webshell scripts in the webshell script set and the normal web scripts in the normal web script set according to the preset feature extraction templates respectively, wherein the feature extraction comprises the following steps: extracting the characteristics of a plurality of different setting types from the webshell scripts in the webshell script set and the normal web scripts in the normal web script set according to templates for extracting the characteristics of the plurality of different setting types;
and respectively screening out the characteristics conforming to the set rule from the webshell characteristics to be processed and the normal web characteristics to be processed, and respectively taking the screened characteristics as webshell training characteristics and normal web training characteristics.
2. The method of claim 1, wherein the setting the rule comprises:
rules for feature ranking above a set ranking threshold;
the feature ranking is ranking obtained by scoring the features based on a Fisher scoring algorithm and sorting the features from low score to high score.
3. The method according to any one of claims 1-2, wherein extracting features from the webshell scripts in the webshell script set and the normal web scripts in the normal web script set according to the predetermined feature extraction templates, respectively, comprises:
denoising the webshell script set and the normal web script set respectively to obtain a first webshell script set to be processed and a first normal web script set to be processed;
performing redundancy elimination processing on the first webshell script set to be processed and the first normal web script set to be processed respectively to obtain a second webshell script set to be processed and a second normal web script set to be processed;
clustering the second webshell script set to be processed and the second normal web script set to be processed respectively to obtain at least one webshell target script set and at least one normal web script set;
And extracting features from each webshell target script set and each normal web script set according to the preset feature extraction templates respectively.
4. A method according to claim 3, wherein said denoising said webshell script set and said normal web script set, respectively, comprises:
respectively taking the script with the script length in the webshell script set and the normal web script set which meet a set script length threshold as a noise script, and removing the noise script;
and cleaning BASE64 codes in scripts remaining after the noise scripts are removed in the webshell script set and the normal web script set respectively by using an anti-aliasing technology.
5. The webshell script detection device is characterized by comprising:
the extraction module is used for extracting features from the script to be detected according to a preset feature extraction template; the feature extraction template comprises:
extracting templates of a plurality of features of different setting types;
the extraction module is specifically configured to: extracting a plurality of features of different setting types from the script to be detected according to templates for extracting the features of a plurality of different setting types; the features of the different setting types include: lexical features, syntactic features, and abstract features; the lexical feature is characterized in that global variables required by the webshell are analyzed according to the characteristic that the webshell receives various commands and performs information interaction with a victim server, the number of the global variables representing the received information is used as a feature, the syntactic feature is a feature that the webshell can automatically adapt to various operating systems and automatically try to acquire the authority of related software, the abstract feature is a sensitivity function matching degree, and the sensitivity function matching degree is used for representing the application condition of keywords in PHP language;
The screening module is used for screening out the characteristics which accord with the set rule from the characteristics, and taking the screened characteristics as the characteristics to be used;
the classification module is used for inputting the characteristics to be used into a pre-trained SVM model to obtain a classification result output by the SVM model; the pre-trained SVM model is obtained by training a training module through webshell training features and normal web training features;
the feature obtaining module is used for:
acquiring a webshell script set and a normal web script set;
extracting features from webshell scripts in the webshell script set and normal web scripts in the normal web script set respectively according to the preset feature extraction templates to obtain to-be-processed webshell features and to-be-processed normal web features;
the feature obtaining module is specifically configured to:
extracting the characteristics of a plurality of different setting types from the webshell scripts in the webshell script set and the normal web scripts in the normal web script set according to templates for extracting the characteristics of the plurality of different setting types;
and respectively screening out the characteristics conforming to the set rule from the webshell characteristics to be processed and the normal web characteristics to be processed, and respectively taking the screened characteristics as webshell training characteristics and normal web training characteristics.
6. The apparatus of claim 5, wherein the setting the rule comprises:
rules for feature ranking above a set ranking threshold;
the feature ranking is ranking obtained by scoring the features based on a Fisher scoring algorithm and sorting the features from low score to high score.
7. The apparatus according to any of the claims 5-6, wherein the feature acquisition module is specifically configured to:
denoising the webshell script set and the normal web script set respectively to obtain a first webshell script set to be processed and a first normal web script set to be processed;
performing redundancy elimination processing on the first webshell script set to be processed and the first normal web script set to be processed respectively to obtain a second webshell script set to be processed and a second normal web script set to be processed;
clustering the second webshell script set to be processed and the second normal web script set to be processed respectively to obtain at least one webshell target script set and at least one normal web script set;
and extracting features from each webshell target script set and each normal web script set according to the preset feature extraction templates respectively.
8. The apparatus according to claim 7, wherein the feature acquisition module is specifically configured to:
respectively taking the script with the script length in the webshell script set and the normal web script set which meet a set script length threshold as a noise script, and removing the noise script;
and cleaning BASE64 codes in scripts remaining after the noise scripts are removed in the webshell script set and the normal web script set respectively by using an anti-aliasing technology.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010534994.0A CN111695117B (en) | 2020-06-12 | 2020-06-12 | Webshell script detection method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111695117A CN111695117A (en) | 2020-09-22 |
CN111695117B CN111695117B (en) | 2023-10-03 |
Family
ID=72480538
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010534994.0A Active CN111695117B (en) | 2020-06-12 | 2020-06-12 | Webshell script detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111695117B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113393063A (en) * | 2021-08-17 | 2021-09-14 | 深圳市信润富联数字科技有限公司 | Match result prediction method, system, program product and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111695117A (en) | 2020-09-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||