CN110427755A - Method and device for identifying script files
- Publication number
- CN110427755A (CN201811202216.0A)
- Authority
- CN
- China
- Prior art keywords
- script file
- feature
- sample
- word combination
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/566—Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/03—Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
- G06F2221/033—Test or assess software
Abstract
This application provides a method and device for identifying script files, relating to the technical field of network security. The method includes: obtaining multiple labeled sample script files; converting each sample script file into a machine instruction sequence; extracting feature word combinations from the machine instruction sequence of the sample script file to obtain a first feature set of the sample script file; for the first feature set, calculating, according to a preset term-frequency feature algorithm, the term-frequency feature of each feature word combination in the first feature set to obtain the feature vector of the sample script file; training a script identification model according to the feature vector of each sample script file and the label of each sample script file; and, when a script file to be identified is obtained, identifying it with the script identification model to determine whether it is a malicious script file. The application can improve the accuracy of identifying webshell files.
Description
Technical field
This application relates to the technical field of network security, and in particular to a method and device for identifying script files.
Background
A webshell is a command execution environment that exists in the form of a page script file such as ASP (Active Server Pages), PHP (Hypertext Preprocessor), JSP (Java Server Pages), Python, or CGI (Common Gateway Interface); it may also be called a web page backdoor. Webshells are typically used by intruders to obtain operating permissions on a website server. After intruding into the server of a website, a hacker usually mixes a webshell file with the normal web page files under the web directory of the website server, and then accesses the webshell file with a browser to obtain a webshell command execution environment, thereby obtaining a certain degree of operating permission on the server and achieving the purpose of controlling it. Therefore, to maintain the security of the website server, webshell files need to be detected and removed in time.
In the prior art, webshell files are detected as follows. Technical staff preset a feature library that contains a variety of features for identifying malicious script files (i.e., webshell files), such as dangerous functions, dangerous file suffixes, sensitive file names, and content keywords. The server matches the file content of the script file to be identified against the feature items in the feature library, and then determines from the matching result whether the script file is a malicious script. For example, if the number of feature items that match the file content is greater than a preset threshold, the script file is determined to be a malicious script; alternatively, if the number of times a particular feature item is matched exceeds a preset threshold, the script file is determined to be a malicious script.
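The prior-art matching rule described above can be sketched roughly as follows. This is only an illustration: the patterns in FEATURE_LIBRARY, the function name, and both threshold values are hypothetical and are not taken from the application.

```python
import re

# Hypothetical feature library: a few illustrative "dangerous function"
# patterns. A real library would be much larger and hand-maintained.
FEATURE_LIBRARY = [
    r"\beval\s*\(",
    r"\bassert\s*\(",
    r"\bsystem\s*\(",
    r"base64_decode",
]


def is_malicious_by_signature(content, library=FEATURE_LIBRARY,
                              item_threshold=2, hit_threshold=5):
    """Apply the two threshold rules described in the background section."""
    # Number of times each feature item matches the file content.
    hit_counts = [len(re.findall(pattern, content)) for pattern in library]
    # Rule 1: the number of distinct matching feature items exceeds a threshold.
    matched_items = sum(1 for count in hit_counts if count > 0)
    # Rule 2: a single feature item is matched more times than a threshold.
    return matched_items > item_threshold or any(
        count > hit_threshold for count in hit_counts
    )
```

A detector of this kind only finds what the library author anticipated, which is the limitation the application sets out to address.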
However, the feature library in the prior art is set in advance and uniformly by technical staff. The features it contains are not comprehensive enough and are of limited effectiveness, which makes the accuracy of existing webshell file detection methods low.
Summary of the invention
The purpose of the embodiments of the present application is to provide a method and device for identifying script files that can improve the accuracy of webshell file detection. The specific technical solutions are as follows:
In a first aspect, a method for identifying script files is provided, the method comprising:
obtaining multiple labeled sample script files, where the label is either a label indicating that a web page request is a malicious web page request or a label indicating that a web page request is a non-malicious web page request;
converting each sample script file into a machine instruction sequence;
for each sample script file, extracting feature word combinations from the machine instruction sequence of the sample script file using a preset feature extraction rule, to obtain a first feature set of the sample script file;
for the first feature set, calculating, according to a preset term-frequency feature algorithm, the term-frequency feature of each feature word combination in the first feature set, and determining the feature vector of the sample script file according to the term-frequency features of the feature word combinations in the first feature set;
training, based on a machine learning algorithm, a script identification model according to the feature vector of each sample script file and the label of each sample script file;
when a script file to be identified is obtained, identifying the script file to be identified using the script identification model, and determining whether the script file to be identified is a malicious script file.
Optionally, calculating, according to the preset term-frequency feature algorithm, the term-frequency feature of each feature word combination in the first feature set comprises:
for each feature word combination included in the first feature set, determining a first ratio of the number of occurrences of the feature word combination in the sample script file to the total number of feature word combinations included in the sample script file;
in a preset corpus, determining a first number of script files that contain the feature word combination, and determining a second ratio of the total number of script files included in the corpus to the first number, where the corpus includes the machine instruction sequences of multiple non-malicious script files and the machine instruction sequences of multiple malicious script files;
calculating the term-frequency feature of the feature word combination according to the first ratio and the second ratio.
Optionally, calculating the term-frequency feature of the feature word combination according to the first ratio and the second ratio comprises:
determining the first ratio using the following formula:
$\mathrm{TF}_w = \dfrac{n_w}{N}$
where $n_w$ is the number of occurrences of feature word combination $w$ in the sample script file, and $N$ is the total number of feature word combinations included in the sample script file;
in the preset corpus, determining the first number of script files that contain the feature word combination $w$, and determining the second ratio of the total number of script files included in the corpus to the first number;
determining the term-frequency feature of the feature word combination $w$ using the following formula:
$(\mathrm{TF\text{-}IDF})_w = \mathrm{TF}_w \times \mathrm{IDF}_w$
where $(\mathrm{TF\text{-}IDF})_w$ is the term-frequency feature of feature word combination $w$, $\mathrm{TF}_w$ denotes the first ratio of feature word combination $w$, and $\mathrm{IDF}_w$ is obtained from the second ratio.
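As a rough illustration of the calculation described above, the following minimal sketch computes a TF-IDF value for each feature word combination of one script. Several details are assumptions rather than statements of the application: the function and parameter names are hypothetical, N is taken as the total number of extracted feature word combinations counted with repetition, and the inverse document frequency is taken as the natural logarithm of the second ratio (the conventional definition), with a guard for combinations that appear in no corpus file.

```python
import math
from collections import Counter


def tf_idf(doc_ngrams, corpus_ngram_sets):
    """Return {feature word combination: TF-IDF value} for one script.

    doc_ngrams: list of feature word combinations extracted from the script
        (repetitions kept, so that occurrences can be counted).
    corpus_ngram_sets: one set of feature word combinations per corpus file.
    """
    counts = Counter(doc_ngrams)
    total = len(doc_ngrams)            # N in the formula above
    num_files = len(corpus_ngram_sets)
    features = {}
    for w, n_w in counts.items():
        tf = n_w / total               # first ratio
        # First number: corpus files that contain the combination.
        d_w = sum(1 for ngram_set in corpus_ngram_sets if w in ngram_set)
        # Assumption: log of the second ratio; max() avoids division by zero.
        idf = math.log(num_files / max(d_w, 1))
        features[w] = tf * idf
    return features
```

A combination that occurs in every corpus file gets a value of zero under this convention, so only combinations that discriminate between files contribute to the feature vector.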
Optionally, the method further comprises:
for each sample script file, extracting basic features of the sample script file using a preset basic feature extraction algorithm, where the basic features include one or more of information entropy, longest word length, index of coincidence, and compression ratio;
in which case determining the feature vector of the sample script file according to the term-frequency features of the feature word combinations in the first feature set comprises:
forming the feature vector of the sample script file from the basic features of the sample script file and the term-frequency features of the feature word combinations in the first feature set.
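The four basic features named above (information entropy, longest word length, index of coincidence, and compression ratio) can be computed in several ways, and the application does not fix one. The sketch below uses common byte-level definitions; the function names, the whitespace-based notion of a "word", and the use of zlib for compression are all assumptions made for illustration.

```python
import math
import re
import zlib
from collections import Counter


def information_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def longest_word_length(text: str) -> int:
    """Length of the longest whitespace-delimited token."""
    return max((len(word) for word in re.split(r"\s+", text) if word), default=0)


def index_of_coincidence(data: bytes) -> float:
    """Probability that two bytes drawn without replacement are equal."""
    n = len(data)
    if n < 2:
        return 0.0
    return sum(c * (c - 1) for c in Counter(data).values()) / (n * (n - 1))


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (zlib, default level)."""
    if not data:
        return 0.0
    return len(zlib.compress(data)) / len(data)


def basic_features(text: str) -> list:
    """Basic feature values in a fixed order, ready to join a feature vector."""
    data = text.encode("utf-8")
    return [
        information_entropy(data),
        float(longest_word_length(text)),
        index_of_coincidence(data),
        compression_ratio(data),
    ]
```

Encrypted or encoded payloads tend to show high entropy, very long tokens, and poor compressibility, which is why features of this kind are commonly combined with content-based features.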
Optionally, determining the feature vector of the sample script file according to the term-frequency features of the feature word combinations in the first feature set comprises:
forming the feature vector of the sample script file from the term-frequency features of the feature word combinations in the first feature set.
In a second aspect, a method for identifying script files is provided, the method comprising:
obtaining a first script file;
converting the first script file into a machine instruction sequence;
extracting feature word combinations from the machine instruction sequence of the first script file using a preset feature extraction rule, to obtain a first feature set of the first script file;
calculating, according to a preset term-frequency feature algorithm, the term-frequency feature of each feature word combination in the first feature set of the first script file, and determining the feature vector of the first script file according to those term-frequency features;
inputting the feature vector of the first script file into a script identification model to obtain the identification result of the first script file, where the script identification model is trained from the feature vectors of sample script files using a machine learning algorithm, and the feature vector of each sample script file is determined according to the term-frequency features of that sample script file.
Optionally, calculating, according to the preset term-frequency feature algorithm, the term-frequency feature of each feature word combination included in the first feature set of the first script file comprises:
for each feature word combination included in the first feature set of the first script file, determining a first ratio of the number of occurrences of the feature word combination in the first script file to the total number of feature word combinations included in the first script file;
in a preset corpus, determining a first number of script files that contain the feature word combination, and determining a second ratio of the total number of script files included in the corpus to the first number, where the corpus includes the machine instruction sequences of multiple non-malicious script files and the machine instruction sequences of multiple malicious script files;
calculating the term-frequency feature of the feature word combination according to the first ratio and the second ratio.
Optionally, the method further comprises:
extracting basic features of the first script file using a preset basic feature extraction algorithm, where the basic features include one or more of information entropy, longest word length, index of coincidence, and compression ratio;
in which case determining the feature vector of the first script file according to the term-frequency features of the feature word combinations in the first feature set of the first script file comprises:
forming the feature vector of the first script file from the basic features of the first script file and the term-frequency features of the feature word combinations in the first feature set of the first script file.
Optionally, determining the feature vector of the first script file according to the term-frequency features of the feature word combinations in the first feature set of the first script file comprises:
forming the feature vector of the first script file from the term-frequency features of the feature word combinations in the first feature set of the first script file.
In a third aspect, a device for identifying script files is provided, the device comprising:
an obtaining module, configured to obtain multiple labeled sample script files, where the label is either a label indicating that a web page request is a malicious web page request or a label indicating that a web page request is a non-malicious web page request;
a conversion module, configured to convert each sample script file into a machine instruction sequence;
a first extraction module, configured to, for each sample script file, extract feature word combinations from the machine instruction sequence of the sample script file using a preset feature extraction rule, to obtain a first feature set of the sample script file;
a first determining module, configured to, for the first feature set, calculate, according to a preset term-frequency feature algorithm, the term-frequency feature of each feature word combination in the first feature set, and determine the feature vector of the sample script file according to the term-frequency features of the feature word combinations in the first feature set;
a training module, configured to train, based on a machine learning algorithm, a script identification model according to the feature vector of each sample script file and the label of each sample script file;
a second determining module, configured to, when a script file to be identified is obtained, identify the script file to be identified using the script identification model, and determine whether the script file to be identified is a malicious script file.
Optionally, the first determining module is specifically configured to:
for each feature word combination included in the first feature set, determine a first ratio of the number of occurrences of the feature word combination in the sample script file to the total number of feature word combinations included in the sample script file;
in a preset corpus, determine a first number of script files that contain the feature word combination, and determine a second ratio of the total number of script files included in the corpus to the first number, where the corpus includes the machine instruction sequences of multiple non-malicious script files and the machine instruction sequences of multiple malicious script files;
calculate the term-frequency feature of the feature word combination according to the first ratio and the second ratio.
Optionally, the first determining module is specifically configured to:
determine the first ratio using the following formula:
$\mathrm{TF}_w = \dfrac{n_w}{N}$
where $n_w$ is the number of occurrences of feature word combination $w$ in the sample script file, and $N$ is the total number of feature word combinations included in the sample script file;
in the preset corpus, determine the first number of script files that contain the feature word combination $w$, and determine the second ratio of the total number of script files included in the corpus to the first number;
determine the term-frequency feature of the feature word combination $w$ using the following formula:
$(\mathrm{TF\text{-}IDF})_w = \mathrm{TF}_w \times \mathrm{IDF}_w$
where $(\mathrm{TF\text{-}IDF})_w$ is the term-frequency feature of feature word combination $w$, $\mathrm{TF}_w$ denotes the first ratio of feature word combination $w$, and $\mathrm{IDF}_w$ is obtained from the second ratio.
Optionally, the device further comprises:
a second extraction module, configured to, for each sample script file, extract basic features of the sample script file using a preset basic feature extraction algorithm, where the basic features include one or more of information entropy, longest word length, index of coincidence, and compression ratio;
the first determining module being specifically configured to:
form the feature vector of the sample script file from the basic features of the sample script file and the term-frequency features of the feature word combinations in the first feature set.
Optionally, the first determining module is specifically configured to:
form the feature vector of the sample script file from the term-frequency features of the feature word combinations in the first feature set.
In a fourth aspect, a device for identifying script files is provided, the device comprising:
an obtaining module, configured to obtain a first script file;
a conversion module, configured to convert the first script file into a machine instruction sequence;
a first extraction module, configured to extract feature word combinations from the machine instruction sequence of the first script file using a preset feature extraction rule, to obtain a first feature set of the first script file;
a determining module, configured to calculate, according to a preset term-frequency feature algorithm, the term-frequency feature of each feature word combination in the first feature set of the first script file, and determine the feature vector of the first script file according to those term-frequency features;
an input module, configured to input the feature vector of the first script file into a script identification model to obtain the identification result of the first script file, where the script identification model is trained from the feature vectors of sample script files using a machine learning algorithm, and the feature vector of each sample script file is determined according to the term-frequency features of that sample script file.
Optionally, the determining module is specifically configured to:
for each feature word combination included in the first feature set of the first script file, determine a first ratio of the number of occurrences of the feature word combination in the first script file to the total number of feature word combinations included in the first script file;
in a preset corpus, determine a first number of script files that contain the feature word combination, and determine a second ratio of the total number of script files included in the corpus to the first number, where the corpus includes the machine instruction sequences of multiple non-malicious script files and the machine instruction sequences of multiple malicious script files;
calculate the term-frequency feature of the feature word combination according to the first ratio and the second ratio.
Optionally, the device further comprises:
a second extraction module, configured to extract basic features of the first script file using a preset basic feature extraction algorithm, where the basic features include one or more of information entropy, longest word length, index of coincidence, and compression ratio;
the determining module being specifically configured to:
form the feature vector of the first script file from the basic features of the first script file and the term-frequency features of the feature word combinations in the first feature set of the first script file.
Optionally, the determining module is specifically configured to:
form the feature vector of the first script file from the term-frequency features of the feature word combinations in the first feature set of the first script file.
In a fifth aspect, a network device is provided, comprising a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement, when executing the program stored in the memory, the steps of the method for identifying script files described in the first aspect or the second aspect above.
In a sixth aspect, a machine-readable storage medium is provided, storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the steps of the method for identifying script files described in the first aspect or the second aspect above.
In a seventh aspect, a computer program product containing instructions is provided which, when run on a computer, causes the computer to execute the method for identifying script files described in the first aspect or the second aspect above.
According to the embodiments of the present application, a sample script file is converted into a machine instruction sequence that a machine can understand, the term-frequency features of the machine instruction sequence are analyzed to generate the feature vector of the sample script file, and the script identification model is then trained using a machine learning algorithm and the feature vectors of the sample script files. After a script file has been encrypted or made to use hidden calls, its file code may change greatly and show no regularity, but the term-frequency features of the machine instruction sequence it converts to still have a certain regularity. On this basis, malicious script files and non-malicious script files can be distinguished according to term-frequency features. In this solution, feature vectors are generated from the term-frequency features of machine instruction sequences and the script identification model is then trained on them, so the trained model can identify webshells that use encryption and hidden calls. This makes up for the shortcomings of feature library detection and offers strong adaptivity and a high recognition rate. In addition, identifying script files with the script identification model removes the conventional method's dependence on a feature library (technical staff do not need to set one), has low computational complexity and higher recognition accuracy, and meets the engineering needs of the current webshell detection field. Of course, a product or method implementing the present application does not necessarily need to achieve all of the above advantages at the same time.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a method for identifying script files provided by an embodiment of the present application;
Fig. 2 is an example flow chart of a method for training a script identification model provided by an embodiment of the present application;
Fig. 3 is a flow chart of a method for identifying script files provided by an embodiment of the present application;
Fig. 4 is an example flow chart of a method for identifying script files provided by an embodiment of the present application;
Fig. 5 is a structural schematic diagram of a device for identifying script files provided by an embodiment of the present application;
Fig. 6 is a structural schematic diagram of a device for identifying script files provided by an embodiment of the present application;
Fig. 7 is a structural schematic diagram of a device for identifying script files provided by an embodiment of the present application;
Fig. 8 is a structural schematic diagram of a device for identifying script files provided by an embodiment of the present application;
Fig. 9 is a structural schematic diagram of a network device provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.
An embodiment of the present application provides a method for identifying script files. The method can be applied to a network device, which may be the back-end server of a website or a security device of a website. The network device can be used to identify webshell files, that is, malicious script files.
As shown in Fig. 1, the processing flow of the method can be as follows.
Step 101: obtain multiple labeled sample script files.
The label is either a label indicating that a web page request is a malicious web page request or a label indicating that a web page request is a non-malicious web page request.
In the embodiments of the present application, the network device can obtain multiple sample script files that are input by technical staff or collected by a collection device. The sample script files may include non-malicious script files and malicious script files. The malicious script file samples may include script files of large Trojans ("big horses"), script files of small Trojans ("ponies"), script files of one-sentence Trojans, and so on. For each sample script file, technical staff can mark the label of the sample script file, which may be a label indicating that a web page request is a malicious web page request or a label indicating that a web page request is a non-malicious web page request, so that non-malicious script files and malicious script files can be distinguished.
Step 102: convert each sample script file into a machine instruction sequence.
In the embodiments of the present application, considering that many webshell files use encryption and hidden function calls, the network device can compile a sample script file into operation codes that a machine can understand (i.e., a machine instruction sequence). A file compilation algorithm can be stored in the network device in advance; this may be a file compilation algorithm from the prior art. Because different types of script files need to be converted with different file compilation algorithms, the network device can also store a correspondence between file types and file compilation algorithms.
For each sample script file, after obtaining the sample script file, the network device can identify its file type, determine the file compilation algorithm corresponding to that file type from the pre-stored correspondence between file types and file compilation algorithms, and then convert the sample script file into a machine instruction sequence using the determined algorithm. For example, if the sample script file is a PHP file, VLD (Vulcan Logic Dumper, an extension that hooks into the Zend engine to output the intermediate code, i.e. the execution units, generated for a PHP script) is used to convert the PHP file into a machine instruction sequence of the Opcode type. As another example, if the sample script file is a JSP file, JSPC (Java Server Pages compiler) is used to convert the JSP file into a machine instruction sequence of the Bytecode type. In this way, code written by technical staff can be translated into instructions that a machine can understand (i.e., a machine instruction sequence).
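The correspondence between file types and file compilation algorithms can be pictured as a simple lookup table. The sketch below is an assumption-laden illustration: the names make_dispatcher and fake_php_converter are hypothetical, the file type is inferred from the file extension only, and the converter is a stub that returns a fixed sequence instead of invoking a real tool such as VLD.

```python
import os
from typing import Callable, Dict, List

# A converter takes a file path and returns a machine instruction sequence.
Converter = Callable[[str], List[str]]


def make_dispatcher(converters: Dict[str, Converter]) -> Converter:
    """Build a function that picks a converter from the file extension."""
    def convert(path: str) -> List[str]:
        extension = os.path.splitext(path)[1].lower()
        if extension not in converters:
            raise ValueError(f"no converter registered for file type {extension!r}")
        return converters[extension](path)
    return convert


def fake_php_converter(path: str) -> List[str]:
    """Stub standing in for a real PHP opcode dumper (illustration only)."""
    return ["SEND_VAL", "DO_FCALL", "RETURN"]


# Stored correspondence between file type and compilation algorithm.
convert = make_dispatcher({".php": fake_php_converter})
```

In a real deployment each table entry would wrap the external compiler for that script type, and file type detection would likely inspect file content as well as the extension.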
For example, the code in the original file of a certain webshell is as follows:
<?php
$new_array = array_map("ass\x65rt", (array)$_REQUEST['op']);
?>
The Opcode sequence after conversion is: SEND_VAL FETCH_R FETCH_DIM_R CAST SEND_VAL DO_FCALL ASSIGN RETURN.
Step 103: for each sample script file, extract feature word combinations from the machine instruction sequence of the sample script file using a preset feature extraction rule, to obtain the first feature set of the sample script file.
In the embodiments of the present application, the feature extraction rule can be stored in the network device in advance. For example, an N-gram phrase extraction algorithm can be used to obtain an N-gram set (i.e., a feature set). Based on this feature extraction rule, word combinations containing a preset number (i.e., N) of consecutive words can be extracted from the machine instruction sequence.
The value of N can be set arbitrarily, for example 2, 4, 6, or 8, and is not limited by the embodiments of the present application. In addition, one or more feature extraction rules can be used for feature extraction in the embodiments of the present application; correspondingly, there can be one or more first feature sets.
Take the machine instruction sequence SEND_VAL FETCH_R FETCH_DIM_R as an example, where the character "_" and the space character are the separators between two words. Extraction with 2-grams yields the feature word combinations SEND VAL, VAL FETCH, FETCH R, R FETCH, FETCH DIM, and DIM R; extraction with 3-grams yields the feature word combinations SEND VAL FETCH, VAL FETCH R, FETCH R FETCH, R FETCH DIM, and FETCH DIM R.
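The worked example above can be reproduced with a few lines of code. This is a minimal sketch; the function names extract_ngrams and first_feature_set are hypothetical, and the deduplication step simply keeps the first occurrence of each combination.

```python
import re


def extract_ngrams(instruction_sequence, n):
    """Split on "_" and whitespace, then return every run of n consecutive words."""
    words = [w for w in re.split(r"[_\s]+", instruction_sequence) if w]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def first_feature_set(instruction_sequence, n):
    """Deduplicated feature word combinations, in order of first appearance."""
    return list(dict.fromkeys(extract_ngrams(instruction_sequence, n)))


sequence = "SEND_VAL FETCH_R FETCH_DIM_R"
print(extract_ngrams(sequence, 2))
# ['SEND VAL', 'VAL FETCH', 'FETCH R', 'R FETCH', 'FETCH DIM', 'DIM R']
print(extract_ngrams(sequence, 3))
# ['SEND VAL FETCH', 'VAL FETCH R', 'FETCH R FETCH', 'R FETCH DIM', 'FETCH DIM R']
```

The non-deduplicated list is still useful later, because counting occurrences for the term-frequency calculation needs the repetitions.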
For each sample script file, the network device can extract feature word combinations from the machine instruction sequence of the sample script file using the preset feature extraction rule, and then deduplicate the extracted feature word combinations to obtain the first feature set of the sample script file.
When there are multiple feature extraction rules, after converting the sample script file into a machine instruction sequence, the network device can extract feature word combinations from the machine instruction sequence according to each of the feature extraction rules, obtaining multiple first feature sets of the sample script file.
Step 104, according to a preset word-frequency feature algorithm, calculate the word-frequency feature of each feature word combination in the first feature set of the sample script file, and determine the feature vector of the sample script file according to the word-frequency features of the feature word combinations in its first feature set.
In the embodiments of the present application, a word-frequency feature algorithm can be stored in the network device in advance. The algorithm can be used to calculate the word-frequency feature of a given word or feature word combination. The embodiments of the present application use TF-IDF (term frequency-inverse document frequency) to calculate the word-frequency feature; the specific calculation is described in detail below.
For each sample script file, after the network device obtains the first feature set of the sample script file, the network device can, for each feature word combination in that first feature set, calculate the corresponding word-frequency feature according to the preset word-frequency feature algorithm. In this way, the network device can determine the word-frequency feature of every feature word combination in the first feature set of the sample script file, obtaining the word-frequency feature set of the sample script file. The network device can then determine the feature vector of the sample script file according to its word-frequency feature set. Where a sample script file has multiple first feature sets, the network device can correspondingly determine multiple word-frequency feature sets from the multiple first feature sets, and then determine the feature vector of the sample script file from the multiple word-frequency feature sets; the specific process is described in detail below.
In this way, extracting multiple feature sets with multiple feature extraction rules to form the feature vector of a script file can make script file identification more robust and improve identification accuracy.
Optionally, the specific process of calculating the word-frequency feature using TF-IDF is as follows.
Step 1: for each feature word combination in the first feature set of the sample script file, determine a first ratio of the number of occurrences of the feature word combination in the sample script file to the total number of feature word combinations that the sample script file contains.
In the embodiments of the present application, for each sample script file, after the network device obtains the first feature set of the sample script file, the network device can, for each feature word combination in that first feature set, count the number of occurrences of the feature word combination in the sample script file. The network device can also count the total number of feature word combinations that the sample script file contains, and then divide the number of occurrences of the feature word combination by that total number to obtain the first ratio. The specific calculation formula is as follows:
$$TF_w = \frac{n_w}{N}$$
where $n_w$ is the number of occurrences of feature word combination w in the sample script file, and $N$ is the total number of feature word combinations that the sample script file contains.
Taking the machine instruction sequence SEND_VAL FETCH_R FETCH_DIM_R as an example, 2-gram extraction yields the feature word combinations SEND VAL, VAL FETCH, FETCH R, R FETCH, FETCH DIM and DIM R. The feature word combination SEND VAL occurs once, and the total number of feature word combinations that the script file contains is 6, so the first ratio is 1/6.
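The first ratio in this example can be reproduced with the short sketch below. The function name `term_frequencies` is hypothetical, and exact fractions are used only so that the value 1/6 is visible as such.

```python
from collections import Counter
from fractions import Fraction


def term_frequencies(combinations):
    """First ratio for each feature word combination: occurrences / total count."""
    total = len(combinations)
    counts = Counter(combinations)
    return {w: Fraction(c, total) for w, c in counts.items()}


combos = ["SEND VAL", "VAL FETCH", "FETCH R", "R FETCH", "FETCH DIM", "DIM R"]
tf = term_frequencies(combos)
print(tf["SEND VAL"])
```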
Step 2, in a preset corpus, determine a first number of script files that contain the feature word combination, and determine a second ratio of the total number of script files in the corpus to the first number.
The corpus includes the machine instruction sequences of multiple non-malicious script files and the machine instruction sequences of multiple malicious script files.
In the embodiments of the present application, the machine instruction sequences of non-malicious script files and those of malicious script files can differ in some obvious ways; the embodiments of the present application therefore build the feature engineering on a massive set of positive and negative sample machine instruction sequences. A corpus can be provided in the network device in advance, and the corpus can include the machine instruction sequences of multiple non-malicious script files and the machine instruction sequences of multiple malicious script files. Considering that a malicious script file generally cannot be identified from a single machine instruction sequence, this application takes several kinds of script file into account: "big horse" script files (generally a multi-function Trojan with a large amount of code, full functionality and an interactive page), "pony" script files (generally a small Trojan with a moderate amount of code and a single function, equipped with a simple interactive page, that performs a single operation such as uploading files, packaging data or dumping a database to achieve a specific purpose), and "one-sentence Trojan" script files (usually a Trojan made up of one or a few lines of code that, together with other tools, can take control of a target host). That is, the machine instruction sequences of malicious script files in the corpus include those of big-horse script files, pony script files, one-sentence Trojan script files, and so on.
For each feature word combination in the first feature set of the sample script file, the network device can search the corpus for matches of the feature word combination to determine the script files whose machine instruction sequences contain it, and then count the number of such script files (i.e. the first number). The network device can also count, in real time, the total number of script files currently in the corpus, and then calculate the ratio of that total number to the first number (i.e. the second ratio).
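The counting described above can be sketched as follows. The corpus representation is an assumption made for illustration: each corpus entry is taken to be the set of feature word combinations already extracted from one script file's machine instruction sequence, and the opcode names in the sample data are invented.

```python
def document_counts(corpus, combination):
    """Return (total number of script files, number of files containing the combination)."""
    total = len(corpus)
    first_number = sum(1 for combos in corpus if combination in combos)
    return total, first_number


# Hypothetical corpus: one set of 2-gram feature word combinations per script file.
corpus = [
    {"SEND VAL", "VAL FETCH", "FETCH R"},
    {"FETCH R", "R FETCH"},
    {"INIT FCALL", "FCALL SEND"},
    {"FETCH R", "DO FCALL"},
]
total, first_number = document_counts(corpus, "FETCH R")
# The second ratio is undefined when no script file contains the combination.
second_ratio = total / first_number if first_number else None
print(total, first_number, second_ratio)
```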
Step 3, calculate the word-frequency feature of the feature word combination according to the first ratio and the second ratio.
In the embodiments of the present application, after calculating the first ratio and the second ratio, the network device can calculate the word-frequency feature of the feature word combination from them. The specific calculation formulas can be as follows:
$$(\text{TF-IDF})_w = TF_w \times IDF_w$$
$$IDF_w = \log\frac{N_1}{N_2 + 1}$$
where $(\text{TF-IDF})_w$ is the word-frequency feature of feature word combination w; $TF_w$ is the probability of occurrence of feature word combination w in the sample script file; $N_1$ is the total number of script files in the corpus; and $N_2$ is the first number, i.e. the number of script files that contain feature word combination w.
$IDF_w$ indicates how well feature word combination w distinguishes between files: the fewer script files contain w, the larger the value of $IDF_w$; conversely, the more script files contain w, the smaller the value of $IDF_w$.
For a given script file (such as the sample script file), if a feature word combination has a high probability of occurrence in that script file (for example, above a preset threshold, or the highest among all feature word combinations) and a low probability of occurrence across the entire corpus (for example, below a preset threshold), a large TF-IDF is calculated for that feature word combination. TF-IDF is therefore commonly used to filter out common words and retain important ones. It should be noted that adding 1 to the denominator in the above $IDF_w$ formula is optional and serves to keep the denominator from being 0; other positive numbers, such as 2 or 1/3, can of course be added instead.
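The behaviour described above can be checked with a small numeric sketch. It assumes a natural logarithm (the text does not fix the base) and exposes the positive number added to the denominator as a hypothetical `smoothing` parameter; the function names and the sample counts are invented for illustration.

```python
import math


def idf(total_files, files_with_combination, smoothing=1.0):
    """Log of the second ratio, with a positive number added to the denominator."""
    return math.log(total_files / (files_with_combination + smoothing))


def tf_idf(tf, total_files, files_with_combination, smoothing=1.0):
    """Word-frequency feature: first ratio multiplied by the IDF value."""
    return tf * idf(total_files, files_with_combination, smoothing)


# Same first ratio, but the combination is rare in one corpus and common in the other.
rare = tf_idf(tf=1 / 6, total_files=1000, files_with_combination=9)
common = tf_idf(tf=1 / 6, total_files=1000, files_with_combination=499)
print(rare, common)
```

With the same first ratio, the combination that appears in fewer corpus files receives the larger value. Note that in this sketch the value becomes negative once the first number plus the added constant exceeds the corpus size.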
In the embodiments of the present application, the network device can determine the feature vector of the sample script file according to the word-frequency features of the feature word combinations in its first feature set. This can be implemented in many ways; the embodiments of the present application provide two feasible implementations, as follows.
Mode one, the network device forms the feature vector of the sample script file from the word-frequency features of the feature word combinations in the first feature set of the sample script file.
In the embodiments of the present application, the network device can form the feature vector of the sample script file from the word-frequency features of the feature word combinations in the first feature set of the sample script file. For example, if the first feature set of the sample script file contains feature word combination 1, feature word combination 2 and feature word combination 3, with corresponding word-frequency features $(\text{TF-IDF})_1$, $(\text{TF-IDF})_2$ and $(\text{TF-IDF})_3$, the feature vector of the sample script file is $((\text{TF-IDF})_1, (\text{TF-IDF})_2, (\text{TF-IDF})_3)$. Where the sample script file has multiple first feature sets, the network device correspondingly obtains multiple word-frequency feature sets and can take each word-frequency feature set as one dimension of the feature vector. For example, if feature set 1 is extracted with 2-gram and its word-frequency features are $(\text{TF-IDF})_{11}$, $(\text{TF-IDF})_{12}$ and $(\text{TF-IDF})_{13}$, and feature set 2 is extracted with 4-gram and its word-frequency features are $(\text{TF-IDF})_{21}$, $(\text{TF-IDF})_{22}$ and $(\text{TF-IDF})_{23}$, then the feature vector is $A = (a_1, a_2)$, where $a_1 = ((\text{TF-IDF})_{11}, (\text{TF-IDF})_{12}, (\text{TF-IDF})_{13})$ and $a_2 = ((\text{TF-IDF})_{21}, (\text{TF-IDF})_{22}, (\text{TF-IDF})_{23})$.
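One possible way to lay such a vector out for a learning algorithm is sketched below. The text does not say how feature word combinations are ordered or how different script files are aligned, so the fixed per-rule vocabulary used here is an assumption, as are the name `build_feature_vector` and the sample scores.

```python
def build_feature_vector(scores_by_rule, vocabulary_by_rule):
    """Concatenate per-rule word-frequency features in a fixed vocabulary order."""
    vector = []
    for rule in sorted(vocabulary_by_rule):
        scores = scores_by_rule.get(rule, {})
        # Combinations that do not occur in this script file contribute 0.0.
        vector.extend(scores.get(w, 0.0) for w in vocabulary_by_rule[rule])
    return vector


# Hypothetical vocabularies for the 2-gram and 4-gram rules, and one file's scores.
vocabulary_by_rule = {2: ["SEND VAL", "VAL FETCH", "FETCH R"], 4: ["SEND VAL FETCH R"]}
scores_by_rule = {2: {"SEND VAL": 0.31, "FETCH R": 0.05}, 4: {"SEND VAL FETCH R": 0.42}}
vector = build_feature_vector(scores_by_rule, vocabulary_by_rule)
print(vector)
```

Keeping the order fixed means that the same position in the vector refers to the same feature word combination for every script file.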
Mode two, the network device can also calculate the basic features of the sample script file and determine the feature vector of the sample script file from the word-frequency features and the basic features. Specifically: for each sample script file, extract the basic features of the sample script file using a preset basic feature extraction algorithm, and form the feature vector of the sample script file from its basic features and the word-frequency features of the feature word combinations in its first feature set.
The basic features include one or more of information entropy, longest word length, index of coincidence and compression ratio.
In the embodiments of the present application, a basic feature extraction algorithm can also be stored in the network device in advance. The basic features can include one or more of information entropy, longest word length, index of coincidence and compression ratio; they can also include other features known in the prior art, which the embodiments of the present application do not limit. The network device can calculate basic features such as information entropy, longest word length, index of coincidence and compression ratio from the code (i.e. source code) of the sample script file, and then form the feature vector of the sample script file, which is a multi-dimensional vector, from these basic features and the word-frequency features of the feature word combinations in the first feature set of the sample script file. For example, for the longest word length, all words in the sample script file can be traversed to find the word with the most characters, i.e. the longest word; the total number of characters in the longest word is the longest word length. As another example, for the compression ratio, the sample script file can be compressed to determine its compressed file size, and the compressed file size is then divided by the original size of the sample script file to obtain the compression ratio. The calculation of information entropy, longest word length, index of coincidence and compression ratio belongs to the prior art and is not described further in the embodiments of the present application.
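For reference, the four basic features can be sketched with the Python standard library as follows. The details are assumptions rather than the application's own definitions: entropy and the index of coincidence are computed over bytes, words are split on whitespace, and zlib stands in for whichever compressor is actually used.

```python
import math
import zlib
from collections import Counter


def information_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in Counter(data).values())


def longest_word_length(text: str) -> int:
    """Number of characters in the longest whitespace-delimited word."""
    return max((len(w) for w in text.split()), default=0)


def index_of_coincidence(data: bytes) -> float:
    """Probability that two bytes drawn without replacement are equal."""
    total = len(data)
    if total < 2:
        return 0.0
    return sum(c * (c - 1) for c in Counter(data).values()) / (total * (total - 1))


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size."""
    if not data:
        return 0.0
    return len(zlib.compress(data)) / len(data)


def basic_features(source: str) -> list:
    """Basic features of one script file, computed from its source code."""
    data = source.encode("utf-8")
    return [information_entropy(data), longest_word_length(source),
            index_of_coincidence(data), compression_ratio(data)]


features = basic_features("<?php eval($_POST['cmd']); ?>")
print(features)
```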
Step 105, based on a machine learning algorithm, train a script identification model according to the feature vector of each sample script file and the label of each sample script file.
In the embodiments of the present application, the network device can, based on a machine learning algorithm, train the script identification model according to the feature vector of each sample script file and the label of each sample script file. For example, the script identification model can be trained using the gradient boosting decision tree (GBDT) algorithm, a support vector machine algorithm, a random forest algorithm, a logistic regression algorithm or the like; the embodiments of the present application do not limit this.
Step 106, when a script file to be identified is obtained, identify it using the script identification model to determine whether the script file to be identified is a malicious script file.
In the embodiments of the present application, when the network device obtains a script file to be identified, it can extract the feature vector of that script file and input the feature vector into the script identification model to determine whether the script file to be identified is a malicious script file. The specific identification process is described in detail below.
In the embodiments of the present application, the training process (e.g. steps 101 to 105) and the identification process (e.g. step 106) can be executed on different electronic devices or on the same electronic device.
The embodiments of the present application also provide an example flow chart of a training method for the identification model, as shown in Figure 2. In this example, the feature extraction rules are 2-gram, 3-gram and 4-gram, so that for any sample script file the network device can calculate the word-frequency feature set corresponding to 2-gram, the word-frequency feature set corresponding to 3-gram and the word-frequency feature set corresponding to 4-gram. The network device can also calculate the basic features of the sample script file, thereby obtaining the feature vector of the sample script file, which contains its basic features and its word-frequency feature sets, and then train the script identification model according to the feature vector. The specific process is similar to steps 101 to 105 above and is not repeated here.
The embodiments of the present application also provide another example of a training method for the script identification model, illustrated by training the script identification model with the C4.5 decision tree algorithm. The C4.5 decision tree algorithm is an algorithm for classification problems in machine learning and data mining. That is, given a data set in which every sample can be described by a set of features (i.e. a feature vector) and every sample belongs to one of a set of mutually exclusive classes, the goal of the C4.5 decision tree algorithm is to learn, through training, a mapping from feature vectors to classes; this mapping can then be used to classify objects to be identified whose class is unknown.
In a C4.5 decision tree, each internal node (i.e. non-leaf node) represents a test on one feature, each branch represents one outcome of the test, and each leaf node stores a class label (i.e. label). Once the decision tree has been built, for a sample without a given class label, a path can be traced from the root node to a leaf node, and the class label stored in that leaf node is the predicted class label (i.e. label) of the sample.
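The traversal from root node to leaf node can be sketched as follows. The `Node` layout is hypothetical, and the sketch assumes that every test is a binary comparison of one feature value against a threshold; the thresholds and labels in the sample tree are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: Optional[str] = None     # set only on leaf nodes
    feature: int = 0                # index of the tested feature in the feature vector
    threshold: float = 0.0          # threshold used by the test
    left: Optional["Node"] = None   # branch taken when value <= threshold
    right: Optional["Node"] = None  # branch taken when value > threshold


def predict(node, vector):
    """Follow test outcomes from the root until a leaf's class label is reached."""
    while node.label is None:
        node = node.left if vector[node.feature] <= node.threshold else node.right
    return node.label


tree = Node(feature=0, threshold=0.5,
            left=Node(label="non-malicious"),
            right=Node(feature=1, threshold=0.2,
                       left=Node(label="non-malicious"),
                       right=Node(label="malicious")))
print(predict(tree, [0.9, 0.3]))
```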
Assume that the set of sample script files is D and that the set of feature vectors determined from the sample script files is A; the trained decision tree T can then be output. The specific training process is as follows.
Step 1: obtain the sample feature vector file.
The set D of sample script files contains multiple labeled sample script files, where the label is either a label indicating that the script file is a malicious script file or a label indicating that the script file is a non-malicious script file. For each sample script file, after the feature vector of the sample script file is determined, the feature vector is marked with the label of the sample script file to obtain a labeled feature vector. In this way, a sample feature vector file containing multiple labeled feature vectors can be obtained.
Step 2: normalize the sample data.
Step 3: build the decision tree.
Building the decision tree can specifically proceed as follows.
(1) If set A is empty, generate a tree node whose information counts are all 0 and return.
(2) If the feature vectors in set A all belong to the same class $C_k$, generate a leaf node whose class label is $C_k$ and return.
(3) In cases other than (1) and (2) above, calculate the gain ratio of each feature in the feature vector. The gain ratio can be calculated as follows:
$$IGR = \frac{IG}{IV}$$
where IGR (information gain ratio) is the information gain ratio of the feature, IG (information gain) is the information gain of the feature, and IV (information value) is the split information of the feature.
(4) Determine the first feature, i.e. the feature with the largest gain ratio, and the decision rule corresponding to the first feature, and add the first feature to the decision tree.
The decision rule includes a decision threshold and the output path corresponding to each decision result.
(5) For the features other than the first feature, repeat step (4) to build the corresponding subtrees recursively.
(6) Output the decision tree T.
In a decision tree built by the above process, the larger the gain ratio of a feature, the more discriminative the feature and the closer its node is to the root node of the decision tree; conversely, the lower the gain ratio of a feature, the less discriminative the feature and the farther its node is from the root node of the decision tree.
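The gain ratio used to rank features can be illustrated with the sketch below, which evaluates a single numeric feature split at one decision threshold. It follows the usual C4.5 definitions (information gain divided by split information); the function names, threshold values and labels are invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Information gain divided by split information for a split at the threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing, so the split information would be 0
    total = len(labels)
    weights = [len(left) / total, len(right) / total]
    info_gain = entropy(labels) - (weights[0] * entropy(left) + weights[1] * entropy(right))
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info


values = [0.1, 0.2, 0.8, 0.9]
labels = ["non-malicious", "non-malicious", "malicious", "malicious"]
perfect = gain_ratio(values, labels, 0.5)   # separates the two classes completely
poor = gain_ratio(values, labels, 0.15)     # isolates a single sample only
print(perfect, poor)
```

The split that separates the classes completely obtains the larger gain ratio and would therefore be placed nearer the root node.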
Optionally, after the decision tree has been built, a preset number of sample script files can be input into the decision tree to test its decisions, and the built decision tree can be pruned using a pessimistic error rate estimation algorithm to improve its decision accuracy. The pruned decision tree is then output.
As shown in Figure 3, the embodiments of the present application also provide a flow chart of a method for identifying a script file, which specifically includes the following steps.
Step 301, obtain the first script file.
In the embodiments of the present application, the network device can obtain the first script file to be identified. For example, when the legitimacy of a script file (i.e. the first script file) needs to be checked, a technician can input the first script file into the network device, and the network device receives it; as another example, the network device can also receive a first script file sent by another network device. The network device can also perform script file identification periodically and automatically, in which case it can obtain a first script file currently stored locally, or obtain the first script file from a target network device to be checked.
Step 302, convert the first script file into a machine instruction sequence.
For the specific processing of this step, refer to the description of step 102; details are not repeated here.
Step 303, using the preset feature extraction rule, extract feature word combinations from the machine instruction sequence of the first script file to obtain the first feature set of the first script file.
For the specific processing of this step, refer to the description of step 103; details are not repeated here.
Step 304, according to the preset word-frequency feature algorithm, calculate the word-frequency feature of each feature word combination in the first feature set of the first script file, and determine the feature vector of the first script file according to those word-frequency features.
For the specific processing of this step, refer to the description of step 104; details are not repeated here.
In this way, extracting multiple feature sets with multiple feature extraction rules to form the feature vector of a script file can make script file identification more robust and improve identification accuracy.
Optionally, the specific process of calculating the word-frequency feature using TF-IDF is as follows.
Step 1, for each feature word combination in the first feature set of the first script file, determine a first ratio of the number of occurrences of the feature word combination in the first script file to the total number of feature word combinations that the first script file contains.
In the embodiments of the present application, for each feature word combination in the first feature set of the first script file, the network device can, after extracting feature word combinations from the first script file, count the occurrences of the feature word combination to obtain its number of occurrences in the first script file. The network device can also count the total number of extracted feature word combinations (i.e. the total number of feature word combinations that the first script file contains), and then divide the number of occurrences of the feature word combination by that total number to obtain the first ratio. The specific calculation formula is as follows:
$$TF_w = \frac{n_w}{N}$$
where $n_w$ is the number of occurrences of feature word combination w in the first script file, and $N$ is the total number of feature word combinations that the first script file contains.
For the specific processing of this step, refer to the description of step 104; details are not repeated here.
Step 2, in the preset corpus, determine a first number of script files that contain the feature word combination, and determine a second ratio of the total number of script files in the corpus to the first number.
For the specific processing of this step, refer to the description of step 104; details are not repeated here.
Step 3, calculate the word-frequency feature of the feature word combination according to the first ratio and the second ratio.
In the embodiments of the present application, after calculating the first ratio and the second ratio, the network device can calculate the word-frequency feature of the feature word combination from them. The specific calculation formulas can be as follows:
$$(\text{TF-IDF})_w = TF_w \times IDF_w$$
$$IDF_w = \log\frac{N_1}{N_2 + 1}$$
where $(\text{TF-IDF})_w$ is the word-frequency feature of feature word combination w; $TF_w$ is the probability of occurrence of feature word combination w in the first script file; $N_1$ is the total number of script files in the corpus; and $N_2$ is the first number, i.e. the number of script files that contain feature word combination w.
$IDF_w$ indicates how well feature word combination w distinguishes between files: the fewer script files contain w, the larger the value of $IDF_w$; conversely, the more script files contain w, the smaller the value of $IDF_w$.
For a given file (such as the first script file), if a feature word combination has a high probability of occurrence in that file (for example, above a preset threshold, or the highest among all feature word combinations) and a low probability of occurrence across the entire corpus (for example, below a preset threshold), a large TF-IDF is calculated for that feature word combination. TF-IDF therefore tends to filter out common words and retain important ones. It should be noted that adding 1 to the denominator in the above $IDF_w$ formula is optional and serves to keep the denominator from being 0; other positive numbers, such as 2 or 1/3, can of course be added instead.
In the embodiments of the present application, the network device can determine the feature vector of the first script file according to the word-frequency features of the feature word combinations in its first feature set. This can be implemented in many ways; the embodiments of the present application provide two feasible implementations, as follows.
Mode one, the network device forms the feature vector of the first script file from the word-frequency features of the feature word combinations in the first feature set of the first script file.
For the specific processing of this step, refer to the description of step 104; details are not repeated here.
Mode two, the network device can also calculate the basic features of the first script file and determine the feature vector of the first script file from the word-frequency features and the basic features. Specifically: extract the basic features of the first script file using the preset basic feature extraction algorithm, and form the feature vector of the first script file from its basic features and the word-frequency features of the feature word combinations in its first feature set.
The basic features include one or more of information entropy, longest word length, index of coincidence and compression ratio.
For the specific processing of this step, refer to the description of step 104; details are not repeated here.
Step 305, input the feature vector of the first script file into the script identification model to obtain the identification result of the first script file.
The script identification model can be the script identification model trained through steps 101 to 105 above.
In the embodiments of the present application, the network device can input the feature vector of the first script file into the preset script identification model, and the script identification model then outputs the identification result of the first script file, which can be either non-malicious script file or malicious script file.
Previous webshell detection methods, whether traditional static matching or discrimination models based on machine learning algorithms, essentially still rely on a feature library whose features depend on manual collection and enrichment, and they often fail when a webshell file uses encryption or hiding techniques. In the embodiments of the present application, by contrast, the script file is compiled into a machine instruction sequence; the regularities (i.e. word-frequency features) of the instruction sequences of positive and negative samples are then analyzed, the composition of the script file's instruction sequence is analyzed, and these are combined with features of the original script such as information entropy, longest word length, index of coincidence and compression ratio to form the features of each script file. Both kinds of features are considered together to form the feature vector of the script file to be identified, without relying on a manually built feature library, and with high identification accuracy. Moreover, the present application detects whether a script file is a webshell using a script identification model trained by machine learning; this method has low computational complexity and good adaptivity. In addition, after a script file has been encrypted or invoked in a hidden manner, its file code may change greatly and show no regularity, but the word-frequency features of the machine instruction sequence it is converted into still show a certain regularity; on this basis, malicious script files and non-malicious script files can be distinguished by their word-frequency features. In this scheme, feature vectors are generated from the word-frequency features of machine instruction sequences and the script identification model is then trained, so that the trained script identification model can identify webshell scripts that are encrypted or invoked through hiding methods, offers high real-time performance, and meets the engineering needs of the current webshell detection field.
The embodiments of the present application also provide an example of a method for identifying a script file, as shown in Figure 4. The specific process can be as follows.
Step 401, the first script file is obtained.
The concrete processing procedure of this step is referred to illustrating for step 101, and details are not described herein again.
Step 402, using preset foundation characteristic extraction algorithm, the foundation characteristic of the first script file is extracted.
Wherein, foundation characteristic includes comentropy, longest word length, is overlapped index and compression ratio etc..
The concrete processing procedure of this step is referred to illustrating for step 104, and details are not described herein again.
Step 403: convert the first script file into a machine instruction sequence.
For the specific processing of this step, refer to the description of step 102; details are not repeated here.
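As an analogy for step 403, the sketch below compiles a script and keeps only the names of the resulting instructions. It uses Python source and the standard `dis` module purely as a stand-in: the scripts targeted by this application are web scripts, and the actual conversion tool is not shown here. The function name `instruction_sequence` is hypothetical, and code nested inside function bodies is not traversed.

```python
import dis


def instruction_sequence(source: str) -> list[str]:
    """Compile Python source and return its top-level opcode names in order.

    Only the opcode names are kept; operands such as variable names and
    constants are discarded.
    """
    code = compile(source, "<script>", "exec")
    return [instruction.opname for instruction in dis.get_instructions(code)]
```

Because operands are dropped, renaming identifiers leaves the sequence unchanged, which illustrates why instruction-level features can stay regular even when the surface text of a script changes.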
Step 404: using a preset N-gram phrase extraction algorithm, extract feature word combinations from the machine instruction sequence of the first script file to obtain the first feature set of the first script file.
For the specific processing of this step, refer to the description of step 103; details are not repeated here.
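A minimal sketch of N-gram extraction over an instruction sequence follows. The value of N is not fixed by this passage, so `n = 2` is an assumed default, and the opcode names in the usage note are illustrative only.

```python
def ngrams(sequence: list[str], n: int = 2) -> list[tuple[str, ...]]:
    """Return every run of n consecutive instructions as a tuple.

    A sequence shorter than n yields an empty list.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]
```

For example, `ngrams(["INIT_FCALL", "SEND_VAL", "DO_FCALL"], 2)` yields the two feature word combinations `("INIT_FCALL", "SEND_VAL")` and `("SEND_VAL", "DO_FCALL")`.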
Step 405: calculate the TF-IDF feature (i.e. the word-frequency feature) of each feature word combination in the first feature set of the first script file.
For the specific processing of this step, refer to the description of step 104; details are not repeated here.
Step 406: construct the feature vector of the first script file from the word-frequency features and the basic features.
For the specific processing of this step, refer to the description of step 104; details are not repeated here.
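Step 406 can be sketched as a simple concatenation. This passage does not say how the word-frequency features are ordered inside the vector, so the fixed `vocabulary` list (the feature word combinations known from training, in a stable order) and the function name are assumptions.

```python
def build_feature_vector(
    tfidf_scores: dict,
    vocabulary: list,
    basic: list[float],
) -> list[float]:
    """Concatenate per-term TF-IDF scores (in vocabulary order) with basic features.

    Terms absent from the script get a score of 0.0 so that every script
    maps to a vector of the same length.
    """
    return [tfidf_scores.get(term, 0.0) for term in vocabulary] + list(basic)
```

Keeping the vocabulary fixed between training and recognition is what makes vectors from different scripts comparable.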
Step 407: input the feature vector of the first script file into the script identification model to obtain the recognition result of the first script file.
For the specific processing of this step, refer to the descriptions of step 106 and steps 301 to 304; details are not repeated here.
In the embodiments of the present application, the network device obtains the first script file and converts it into a machine instruction sequence; using a preset feature extraction rule, it extracts feature word combinations from the machine instruction sequence to obtain the first feature set of the first script file; then, according to a preset word-frequency feature algorithm, it calculates the word-frequency feature of each feature word combination in the first feature set, and determines the feature vector of the first script file from those word-frequency features. The network device inputs the feature vector of the first script file into the script identification model and obtains the recognition result of the first script file.
Previous webshell detection methods, whether traditional static matching or discrimination models based on machine learning algorithms, essentially rely on a feature database for discrimination; the features depend on manual collection and enrichment, and detection often fails when a webshell file uses encryption or hiding techniques. In the embodiments of the present application, by contrast, the script file is compiled into a machine instruction sequence; the regularity of the instruction sequences of positive and negative samples (i.e. the word-frequency features) is then analyzed, and this analysis is combined with basic features of the original script, such as information entropy, longest word length, index of coincidence and compression ratio, to form the features of each script file. The two kinds of features are considered together to form the feature vector of the script file to be identified, without relying on a manually built feature database, and the recognition accuracy is relatively high. Moreover, the present application detects whether a script file is a webshell by means of a script identification model trained through machine learning, an approach with low computational complexity and good adaptivity. In addition, after a script file has been encrypted or made to use hidden calls, its file code may change substantially and show no regularity, but the word-frequency features of the machine instruction sequence into which it is converted still exhibit a certain regularity; on this basis, malicious and non-malicious script files can be distinguished by their word-frequency features. In this solution, feature vectors are generated from the word-frequency features of machine instruction sequences and the script identification model is then trained on them, so that the trained model can identify webshell scripts that use encryption and hidden calls; it also offers high real-time performance and meets the engineering needs of the current webshell detection field.
Based on the same technical concept, as shown in Fig. 5, the embodiments of the present application also provide an apparatus for identifying a script file, the apparatus including:
an obtaining module 510, configured to obtain multiple labeled sample script files, where a label is either a label indicating that a web page request is a malicious web page request or a label indicating that a web page request is a non-malicious web page request;
a conversion module 520, configured to convert each sample script file into a machine instruction sequence;
a first extraction module 530, configured to, for each sample script file, extract feature word combinations from the machine instruction sequence of the sample script file using a preset feature extraction rule, to obtain the first feature set of the sample script file;
a first determining module 540, configured to, for the first feature set, calculate the word-frequency feature of each feature word combination in the first feature set according to a preset word-frequency feature algorithm, and determine the feature vector of the sample script file according to the word-frequency features of the feature word combinations in the first feature set;
a training module 550, configured to train the script identification model, based on a machine learning algorithm, according to the feature vector and the label of each sample script file;
a second determining module 560, configured to, when a script file to be identified is obtained, identify the script file to be identified using the script identification model and determine whether it is a malicious script file.
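The roles of the training module 550 and the second determining module 560 can be sketched as follows. The text only refers to "a machine learning algorithm", so the logistic regression trained by stochastic gradient descent below is a stand-in chosen for brevity, not the algorithm of this application; the function names, learning rate, epoch count and threshold are all assumptions.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(vectors: list[list[float]], labels: list[int],
          epochs: int = 500, lr: float = 0.5) -> tuple[list[float], float]:
    """Fit weights and bias on labeled feature vectors (1 = malicious, 0 = benign)."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            predicted = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = predicted - y
            # Gradient step for the log-loss of a single sample.
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def is_malicious(model: tuple[list[float], float], x: list[float],
                 threshold: float = 0.5) -> bool:
    """Apply the trained model to the feature vector of a script to be identified."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= threshold
```

In practice any classifier that accepts fixed-length numeric vectors could take the place of this stand-in; the point of the sketch is only the division between a training phase on labeled samples and a recognition phase on new scripts.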
Optionally, the first determining module 540 is specifically configured to:
for each feature word combination in the first feature set, determine the first ratio of the number of occurrences of the feature word combination in the sample script file to the total number of feature word combinations contained in the sample script file;
in a preset corpus, determine the first number of script files containing the feature word combination, and determine the second ratio of the total number of script files contained in the corpus to the first number, where the corpus includes the machine instruction sequences of multiple non-malicious script files and the machine instruction sequences of multiple malicious script files;
calculate the word-frequency feature of the feature word combination according to the first ratio and the second ratio.
Optionally, the first determining module 540 is specifically configured to:
determine the first ratio using the following formula:
$$TF_w = \frac{n_w}{N}$$
where $n_w$ is the number of occurrences of feature word combination $w$ in the sample script file, and $N$ is the total number of feature word combinations contained in the sample script file;
in the preset corpus, determine the first number of script files containing the feature word combination $w$, and determine the second ratio of the total number of script files contained in the corpus to the first number;
determine the word-frequency feature of the feature word combination $w$ using the following formula:
$$(TF\text{-}IDF)_w = TF_w \cdot IDF_w$$
where $(TF\text{-}IDF)_w$ is the word-frequency feature of feature word combination $w$; $TF_w$ denotes the probability of occurrence of feature word combination $w$ in the sample script file; and $IDF_w$ is determined from the second ratio $N_1/N_2$, where $N_1$ is the total number of script files contained in the corpus and $N_2$ is the first number of script files containing the feature word combination $w$.
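The word-frequency calculation can be sketched in a few lines. The first ratio and the product follow the definitions above; the exact form of the IDF term is not recoverable from this text, so the conventional natural logarithm of the second ratio, log(N1/N2), is used here as an assumption, as are the function names and the choice to return 0.0 for a combination that appears in no corpus file.

```python
import math
from collections import Counter


def term_frequencies(doc: list) -> dict:
    """First ratio: occurrences of each combination divided by the total count."""
    total = len(doc)
    if total == 0:
        return {}
    return {term: count / total for term, count in Counter(doc).items()}


def inverse_document_frequency(term, corpus: list[list]) -> float:
    """Assumed IDF: log of the second ratio N1 / N2."""
    n1 = len(corpus)                                  # total script files in the corpus
    n2 = sum(1 for doc in corpus if term in set(doc))  # files containing the term
    if n2 == 0:
        return 0.0  # assumption: unseen combinations contribute nothing
    return math.log(n1 / n2)


def tf_idf(doc: list, corpus: list[list]) -> dict:
    """Word-frequency feature of every feature word combination in doc."""
    return {
        term: tf * inverse_document_frequency(term, corpus)
        for term, tf in term_frequencies(doc).items()
    }
```

With this form, a combination that occurs in every corpus file scores zero, while one concentrated in a few files is weighted up, which matches the usual motivation for TF-IDF.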
Optionally, as shown in Fig. 6, the apparatus further includes:
a second extraction module 570, configured to, for each sample script file, extract the basic features of the sample script file using a preset basic feature extraction algorithm, the basic features including one or more of information entropy, longest word length, index of coincidence and compression ratio;
the first determining module 540 is specifically configured to:
construct the feature vector of the sample script file from the basic features of the sample script file and the word-frequency features of the feature word combinations in the first feature set.
Optionally, the first determining module 540 is specifically configured to:
construct the feature vector of the sample script file from the word-frequency features of the feature word combinations in the first feature set.
Based on the same technical concept, as shown in Fig. 7, the embodiments of the present application also provide an apparatus for identifying a script file, the apparatus including:
an obtaining module 710, configured to obtain a first script file;
a conversion module 720, configured to convert the first script file into a machine instruction sequence;
a first extraction module 730, configured to extract feature word combinations from the machine instruction sequence of the first script file using a preset feature extraction rule, to obtain the first feature set of the first script file;
a determining module 740, configured to calculate, according to a preset word-frequency feature algorithm, the word-frequency feature of each feature word combination in the first feature set of the first script file, and determine the feature vector of the first script file according to those word-frequency features;
an input module 750, configured to input the feature vector of the first script file into a script identification model to obtain the recognition result of the first script file, where the script identification model is trained according to the feature vectors of sample script files and a machine learning algorithm, and the feature vector of a sample script file is determined according to the word-frequency features of the sample script file.
Optionally, the determining module 740 is specifically configured to:
for each feature word combination in the first feature set of the first script file, determine the first ratio of the number of occurrences of the feature word combination in the first script file to the total number of feature word combinations contained in the first script file;
in a preset corpus, determine the first number of script files containing the feature word combination, and determine the second ratio of the total number of script files contained in the corpus to the first number, where the corpus includes the machine instruction sequences of multiple non-malicious script files and the machine instruction sequences of multiple malicious script files;
calculate the word-frequency feature of the feature word combination according to the first ratio and the second ratio.
Optionally, as shown in Fig. 8, the apparatus further includes:
a second extraction module 760, configured to extract the basic features of the first script file using a preset basic feature extraction algorithm, the basic features including one or more of information entropy, longest word length, index of coincidence and compression ratio;
the determining module 740 is specifically configured to:
construct the feature vector of the first script file from the basic features of the first script file and the word-frequency features of the feature word combinations in the first feature set of the first script file.
Optionally, the determining module 740 is specifically configured to:
construct the feature vector of the first script file from the word-frequency features of the feature word combinations in the first feature set of the first script file.
According to the embodiments of the present application, a sample script file is converted into a machine instruction sequence that a machine can understand; the word-frequency features of that machine instruction sequence are analyzed to generate the feature vector of the sample script file; and the script identification model is then trained using a machine learning algorithm and the feature vectors of the sample script files. After a script file has been encrypted or made to use hidden calls, its file code may change substantially and show no regularity, but the word-frequency features of the machine instruction sequence into which it is converted still exhibit a certain regularity; on this basis, malicious and non-malicious script files can be distinguished by their word-frequency features. In this solution, feature vectors are generated from the word-frequency features of machine instruction sequences and the script identification model is then trained on them, so that the trained model can identify webshells that use encryption and hidden calls. This makes up for the shortcomings of feature-database-based detection and offers strong adaptivity and a high recognition rate. Moreover, identifying script files with the script identification model removes the dependence of traditional methods on a feature database (technicians do not need to configure one), with low computational complexity and relatively high recognition accuracy, meeting the engineering needs of the current webshell detection field. Of course, a product or method implementing the present application need not achieve all of the above advantages at the same time.
The embodiments of the present application also provide a network device. As shown in Fig. 9, it includes a processor 901, a communication interface 902, a memory 903 and a communication bus 904, where the processor 901, the communication interface 902 and the memory 903 communicate with one another via the communication bus 904;
the memory 903 is configured to store a computer program;
the processor 901 is configured to, when executing the program stored in the memory 903, cause the network device to perform the steps of the above method for identifying a script file.
The communication bus mentioned for the above network device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus and so on. For ease of representation, only one thick line is shown in the figure, which does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the above network device and other devices.
The memory may include random access memory (RAM) and may also include non-volatile memory (NVM), for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP) and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In another embodiment provided by the present application, a computer-readable storage medium is also provided. A computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, it implements any of the above methods for identifying a script file.
In another embodiment provided by the present application, a computer program product containing instructions is also provided; when it runs on a computer, it causes the computer to perform the steps of any of the above methods for identifying a script file.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The foregoing are merely preferred embodiments of the present application and are not intended to limit its scope of protection. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present application shall fall within the scope of protection of the present application.
Claims (16)
1. A method for identifying a script file, characterized in that the method comprises:
obtaining multiple labeled sample script files, where a label is either a label indicating that a web page request is a malicious web page request or a label indicating that a web page request is a non-malicious web page request;
converting each sample script file into a machine instruction sequence;
for each sample script file, extracting feature word combinations from the machine instruction sequence of the sample script file using a preset feature extraction rule, to obtain the first feature set of the sample script file;
for the first feature set, calculating the word-frequency feature of each feature word combination in the first feature set according to a preset word-frequency feature algorithm, and determining the feature vector of the sample script file according to the word-frequency features of the feature word combinations in the first feature set;
training a script identification model, based on a machine learning algorithm, according to the feature vector and the label of each sample script file;
when a script file to be identified is obtained, identifying the script file to be identified using the script identification model, and determining whether the script file to be identified is a malicious script file.
2. The method according to claim 1, characterized in that calculating the word-frequency feature of each feature word combination in the first feature set according to the preset word-frequency feature algorithm comprises:
for each feature word combination in the first feature set, determining the first ratio of the number of occurrences of the feature word combination in the sample script file to the total number of feature word combinations contained in the sample script file;
in a preset corpus, determining the first number of script files containing the feature word combination, and determining the second ratio of the total number of script files contained in the corpus to the first number, where the corpus includes the machine instruction sequences of multiple non-malicious script files and the machine instruction sequences of multiple malicious script files;
calculating the word-frequency feature of the feature word combination according to the first ratio and the second ratio.
3. The method according to claim 2, characterized in that calculating the word-frequency feature of the feature word combination according to the first ratio and the second ratio comprises:
determining the first ratio using the following formula:
$$TF_w = \frac{n_w}{N}$$
where $n_w$ is the number of occurrences of feature word combination $w$ in the sample script file, and $N$ is the total number of feature word combinations contained in the sample script file;
in the preset corpus, determining the first number of script files containing the feature word combination $w$, and determining the second ratio of the total number of script files contained in the corpus to the first number;
determining the word-frequency feature of the feature word combination $w$ using the following formula:
$$(TF\text{-}IDF)_w = TF_w \cdot IDF_w$$
where $(TF\text{-}IDF)_w$ is the word-frequency feature of feature word combination $w$; $TF_w$ denotes the probability of occurrence of feature word combination $w$ in the sample script file; and $IDF_w$ is determined from the second ratio $N_1/N_2$, where $N_1$ is the total number of script files contained in the corpus and $N_2$ is the first number of script files containing the feature word combination $w$.
4. The method according to claim 1, characterized in that the method further comprises:
for each sample script file, extracting the basic features of the sample script file using a preset basic feature extraction algorithm, the basic features including one or more of information entropy, longest word length, index of coincidence and compression ratio;
and determining the feature vector of the sample script file according to the word-frequency features of the feature word combinations in the first feature set comprises:
constructing the feature vector of the sample script file from the basic features of the sample script file and the word-frequency features of the feature word combinations in the first feature set.
5. The method according to claim 1, characterized in that determining the feature vector of the sample script file according to the word-frequency features of the feature word combinations in the first feature set comprises:
constructing the feature vector of the sample script file from the word-frequency features of the feature word combinations in the first feature set.
6. A method for identifying a script file, characterized in that the method comprises:
obtaining a first script file;
converting the first script file into a machine instruction sequence;
extracting feature word combinations from the machine instruction sequence of the first script file using a preset feature extraction rule, to obtain the first feature set of the first script file;
calculating, according to a preset word-frequency feature algorithm, the word-frequency feature of each feature word combination in the first feature set of the first script file, and determining the feature vector of the first script file according to those word-frequency features;
inputting the feature vector of the first script file into a script identification model to obtain the recognition result of the first script file, where the script identification model is trained according to the feature vectors of sample script files and a machine learning algorithm, and the feature vector of a sample script file is determined according to the word-frequency features of the sample script file.
7. The method according to claim 6, characterized in that calculating, according to the preset word-frequency feature algorithm, the word-frequency feature of each feature word combination in the first feature set of the first script file comprises:
for each feature word combination in the first feature set of the first script file, determining the first ratio of the number of occurrences of the feature word combination in the first script file to the total number of feature word combinations contained in the first script file;
in a preset corpus, determining the first number of script files containing the feature word combination, and determining the second ratio of the total number of script files contained in the corpus to the first number, where the corpus includes the machine instruction sequences of multiple non-malicious script files and the machine instruction sequences of multiple malicious script files;
calculating the word-frequency feature of the feature word combination according to the first ratio and the second ratio.
8. The method according to claim 6, characterized in that the method further comprises:
extracting the basic features of the first script file using a preset basic feature extraction algorithm, the basic features including one or more of information entropy, longest word length, index of coincidence and compression ratio;
and determining the feature vector of the first script file according to the word-frequency features of the feature word combinations in the first feature set of the first script file comprises:
constructing the feature vector of the first script file from the basic features of the first script file and the word-frequency features of the feature word combinations in the first feature set of the first script file.
9. The method according to claim 6, characterized in that determining the feature vector of the first script file according to the word-frequency features of the feature word combinations in the first feature set of the first script file comprises:
constructing the feature vector of the first script file from the word-frequency features of the feature word combinations in the first feature set of the first script file.
10. An apparatus for identifying a script file, characterized in that the apparatus comprises:
an obtaining module, configured to obtain multiple labeled sample script files, where a label is either a label indicating that a web page request is a malicious web page request or a label indicating that a web page request is a non-malicious web page request;
a conversion module, configured to convert each sample script file into a machine instruction sequence;
a first extraction module, configured to, for each sample script file, extract feature word combinations from the machine instruction sequence of the sample script file using a preset feature extraction rule, to obtain the first feature set of the sample script file;
a first determining module, configured to, for the first feature set, calculate the word-frequency feature of each feature word combination in the first feature set according to a preset word-frequency feature algorithm, and determine the feature vector of the sample script file according to the word-frequency features of the feature word combinations in the first feature set;
a training module, configured to train a script identification model, based on a machine learning algorithm, according to the feature vector and the label of each sample script file;
a second determining module, configured to, when a script file to be identified is obtained, identify the script file to be identified using the script identification model and determine whether the script file to be identified is a malicious script file.
11. The apparatus according to claim 10, characterized in that the first determining module is specifically configured to:
for each feature word combination in the first feature set, determine the first ratio of the number of occurrences of the feature word combination in the sample script file to the total number of feature word combinations contained in the sample script file;
in a preset corpus, determine the first number of script files containing the feature word combination, and determine the second ratio of the total number of script files contained in the corpus to the first number, where the corpus includes the machine instruction sequences of multiple non-malicious script files and the machine instruction sequences of multiple malicious script files;
calculate the word-frequency feature of the feature word combination according to the first ratio and the second ratio.
12. The apparatus according to claim 11, characterized in that the first determining module is specifically configured to:
determine the first ratio using the following formula:
$$TF_w = \frac{n_w}{N}$$
where $n_w$ is the number of occurrences of feature word combination $w$ in the sample script file, and $N$ is the total number of feature word combinations contained in the sample script file;
in the preset corpus, determine the first number of script files containing the feature word combination $w$, and determine the second ratio of the total number of script files contained in the corpus to the first number;
determine the word-frequency feature of the feature word combination $w$ using the following formula:
$$(TF\text{-}IDF)_w = TF_w \cdot IDF_w$$
where $(TF\text{-}IDF)_w$ is the word-frequency feature of feature word combination $w$; $TF_w$ denotes the probability of occurrence of feature word combination $w$ in the sample script file; and $IDF_w$ is determined from the second ratio $N_1/N_2$, where $N_1$ is the total number of script files contained in the corpus and $N_2$ is the first number of script files containing the feature word combination $w$.
13. The apparatus according to claim 10, characterized in that the apparatus further comprises:
a second extraction module, configured to, for each sample script file, extract the basic features of the sample script file using a preset basic feature extraction algorithm, the basic features including one or more of information entropy, longest word length, index of coincidence and compression ratio;
the first determining module is specifically configured to:
construct the feature vector of the sample script file from the basic features of the sample script file and the word-frequency features of the feature word combinations in the first feature set;
or
the first determining module is specifically configured to:
construct the feature vector of the sample script file from the word-frequency features of the feature word combinations in the first feature set.
14. An apparatus for identifying a script file, characterized in that the apparatus comprises:
an obtaining module, configured to obtain a first script file;
a conversion module, configured to convert the first script file into a machine instruction sequence;
a first extraction module, configured to extract feature word combinations from the machine instruction sequence of the first script file using a preset feature extraction rule, to obtain the first feature set of the first script file;
a determining module, configured to calculate, according to a preset word-frequency feature algorithm, the word-frequency feature of each feature word combination in the first feature set of the first script file, and determine the feature vector of the first script file according to those word-frequency features;
an input module, configured to input the feature vector of the first script file into a script identification model to obtain the recognition result of the first script file, where the script identification model is trained according to the feature vectors of sample script files and a machine learning algorithm, and the feature vector of a sample script file is determined according to the word-frequency features of the sample script file.
15. A network device, characterized by comprising a processor, a communication interface, a memory, and a
communication bus, wherein the processor, the communication interface, and the memory communicate with one
another via the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the method steps of any one of claims 1-5 or claims 6-9 when
executing the program stored in the memory.
16. A machine-readable storage medium, characterized in that it stores machine-executable instructions which,
when called and executed by a processor, cause the processor to implement the method steps of any one of
claims 1-5 or claims 6-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811202216.0A CN110427755A (en) | 2018-10-16 | 2018-10-16 | A kind of method and device identifying script file |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811202216.0A CN110427755A (en) | 2018-10-16 | 2018-10-16 | A kind of method and device identifying script file |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110427755A (en) | 2019-11-08 |
Family
ID=68407286
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811202216.0A Pending CN110427755A (en) | 2018-10-16 | 2018-10-16 | A kind of method and device identifying script file |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110427755A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111163094A (en) * | 2019-12-31 | 2020-05-15 | 奇安信科技集团股份有限公司 | Network attack detection method, network attack detection device, electronic device, and medium |
CN111695117A (en) * | 2020-06-12 | 2020-09-22 | 国网浙江省电力有限公司信息通信分公司 | Webshell script detection method and device |
CN112016088A (en) * | 2020-08-13 | 2020-12-01 | 北京兰云科技有限公司 | Method and device for generating file detection model and method and device for detecting file |
CN113282917A (en) * | 2021-06-25 | 2021-08-20 | 深圳市联软科技股份有限公司 | Security process identification method and system based on machine instruction structure |
CN113761533A (en) * | 2021-09-08 | 2021-12-07 | 广东电网有限责任公司江门供电局 | Webshell detection method and system |
CN113761521A (en) * | 2021-09-02 | 2021-12-07 | 恒安嘉新(北京)科技股份公司 | Script file detection method, device, equipment and storage medium based on machine learning |
CN113761534A (en) * | 2021-09-08 | 2021-12-07 | 广东电网有限责任公司江门供电局 | Webshell file detection method and system |
CN114189714A (en) * | 2021-12-08 | 2022-03-15 | 安天科技集团股份有限公司 | Detection method, device, equipment and medium for spreading malicious software in video website |
CN115221516A (en) * | 2022-07-13 | 2022-10-21 | 中国电信股份有限公司 | Malicious application program identification method and device, storage medium and electronic equipment |
CN115801466A (en) * | 2023-02-08 | 2023-03-14 | 北京升鑫网络科技有限公司 | Method and device for detecting ore excavation script based on flow |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102663296A (en) * | 2012-03-31 | 2012-09-12 | 杭州安恒信息技术有限公司 | Intelligent detection method for Java script malicious code facing to the webpage |
CN103309862A (en) * | 2012-03-07 | 2013-09-18 | 腾讯科技(深圳)有限公司 | Webpage type recognition method and system |
CN103577756A (en) * | 2013-11-05 | 2014-02-12 | 北京奇虎科技有限公司 | Virus detection method and device based on script type judgment |
US20140215619A1 (en) * | 2013-01-28 | 2014-07-31 | Infosec Co., Ltd. | Webshell detection and response system |
CN105956472A (en) * | 2016-05-12 | 2016-09-21 | 宝利九章(北京)数据技术有限公司 | Method and system for identifying whether webpage includes malicious content or not |
- 2018-10-16: CN CN201811202216.0A (patent/CN110427755A/en), active, Pending
Non-Patent Citations (3)
Title |
---|
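Li Shaobo (李少波): "Manufacturing Big Data Technology and Applications" (《制造大数据技术与应用》), 31 January 2018, Huazhong University of Science and Technology Press (华中科技大学出版社) * |
Xu Xiaobo (胥小波): "WebShell Detection Method Based on a Multilayer Perceptron Neural Network" (基于多层感知器神经网络的WebShell检测方法), Communications Technology (《通信技术》) * |
Ma Huibin (马慧彬): "Research on Machine-Learning-Based Computer-Aided Diagnosis Algorithms for Breast Images" (《基于机器学习的乳腺图像辅助诊断算法研究》), 31 August 2016, Hunan Normal University Press (湖南师范大学出版社) * |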
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111163094B (en) * | 2019-12-31 | 2022-04-19 | 奇安信科技集团股份有限公司 | Network attack detection method, network attack detection device, electronic device, and medium |
CN111163094A (en) * | 2019-12-31 | 2020-05-15 | 奇安信科技集团股份有限公司 | Network attack detection method, network attack detection device, electronic device, and medium |
CN111695117A (en) * | 2020-06-12 | 2020-09-22 | 国网浙江省电力有限公司信息通信分公司 | Webshell script detection method and device |
CN111695117B (en) * | 2020-06-12 | 2023-10-03 | 国网浙江省电力有限公司信息通信分公司 | Webshell script detection method and device |
CN112016088A (en) * | 2020-08-13 | 2020-12-01 | 北京兰云科技有限公司 | Method and device for generating file detection model and method and device for detecting file |
CN113282917A (en) * | 2021-06-25 | 2021-08-20 | 深圳市联软科技股份有限公司 | Security process identification method and system based on machine instruction structure |
CN113761521A (en) * | 2021-09-02 | 2021-12-07 | 恒安嘉新(北京)科技股份公司 | Script file detection method, device, equipment and storage medium based on machine learning |
CN113761534A (en) * | 2021-09-08 | 2021-12-07 | 广东电网有限责任公司江门供电局 | Webshell file detection method and system |
CN113761533A (en) * | 2021-09-08 | 2021-12-07 | 广东电网有限责任公司江门供电局 | Webshell detection method and system |
CN114189714A (en) * | 2021-12-08 | 2022-03-15 | 安天科技集团股份有限公司 | Detection method, device, equipment and medium for spreading malicious software in video website |
CN114189714B (en) * | 2021-12-08 | 2023-11-10 | 安天科技集团股份有限公司 | Method, device, equipment and medium for detecting propagation of malicious software in video website |
CN115221516A (en) * | 2022-07-13 | 2022-10-21 | 中国电信股份有限公司 | Malicious application program identification method and device, storage medium and electronic equipment |
CN115221516B (en) * | 2022-07-13 | 2024-04-26 | 中国电信股份有限公司 | Malicious application program identification method and device, storage medium and electronic equipment |
CN115801466A (en) * | 2023-02-08 | 2023-03-14 | 北京升鑫网络科技有限公司 | Method and device for detecting ore excavation script based on flow |
CN115801466B (en) * | 2023-02-08 | 2023-05-02 | 北京升鑫网络科技有限公司 | Flow-based mining script detection method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110427755A (en) | A kind of method and device identifying script file | |
Jerlin et al. | A new malware detection system using machine learning techniques for API call sequences | |
CN107659570A (en) | Webshell detection methods and system based on machine learning and static and dynamic analysis | |
CN107451476A (en) | Webpage back door detection method, system, equipment and storage medium based on cloud platform | |
KR20110081177A (en) | Detection of confidential information | |
CN111931935B (en) | Network security knowledge extraction method and device based on One-shot learning | |
CN107341399A (en) | Assess the method and device of code file security | |
US20200159925A1 (en) | Automated malware analysis that automatically clusters sandbox reports of similar malware samples | |
US11550937B2 (en) | Privacy trustworthiness based API access | |
RU2722692C1 (en) | Method and system for detecting malicious files in a non-isolated medium | |
Yang et al. | Wtagraph: Web tracking and advertising detection using graph neural networks | |
CN112989348B (en) | Attack detection method, model training method, device, server and storage medium | |
CN110191096A (en) | A kind of term vector homepage invasion detection method based on semantic analysis | |
US11836331B2 (en) | Mathematical models of graphical user interfaces | |
CN112132238A (en) | Method, device, equipment and readable medium for identifying private data | |
Jisha et al. | Mobile applications recommendation based on user ratings and permissions | |
Le et al. | GuruWS: A hybrid platform for detecting malicious web shells and web application vulnerabilities | |
US11797617B2 (en) | Method and apparatus for collecting information regarding dark web | |
Zhang et al. | A php and jsp web shell detection system with text processing based on machine learning | |
Akram et al. | DroidMD: an efficient and scalable android malware detection approach at source code level | |
Hu et al. | Cross-site scripting detection with two-channel feature fusion embedded in self-attention mechanism | |
CN103838865B (en) | For excavating the method and device of ageing kind of subpage | |
KR102483004B1 (en) | Method for detecting harmful url | |
CN115906086A (en) | Method, system and storage medium for detecting webpage backdoor based on code attribute graph | |
CN114579965A (en) | Malicious code detection method and device and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191108 |