Intelligent detection method for malicious JavaScript code in target web pages
Technical field
The present invention relates to an intelligent method for detecting malicious JavaScript code.
Background art
Malicious code is one of the major threats to computer security. In essence it is a piece of computer code or a program (a sequence of instructions) that carries out a series of operations with malicious intent on behalf of an attacker. It may take the form of executable instructions, a script, a word-processor macro, or other types of code. Typical malicious code includes viruses, worms, and Trojan horses.
The present invention concerns JavaScript scripts that can be embedded in web pages. JavaScript is an object-based, event-driven client-side scripting language. It enables a real-time, dynamic, interactive relationship between a web page and its user, so that pages can contain more active elements and richer content. It also makes it easier for attackers to write and run malicious code. For example, a script can automatically load other malicious scripts from the network, manipulate the page's Document object, alter the HTML interface the user sees, obtain or prompt the user for valuable data such as account names and passwords, and send data requests to arbitrary servers. Attackers can also use JavaScript to exploit browser vulnerabilities, which may cause browser crashes, memory leaks, and similar failures. In view of these problems, there is an urgent need to study JavaScript security in depth and to improve the ability to detect malicious JavaScript scripts, so as to keep Internet applications secure.
Malicious code detection has become an important branch of information security, and a great deal of research has been published. Depending on the object analyzed, detection techniques fall into two classes: static detection, which analyzes the textual features of the code, and dynamic detection, which analyzes the behavior of the code when it is executed.
The typical static method is signature-based detection, which relies on pattern matching: a unique signature is generated for each known piece of malicious code, and the signatures are collected in a malicious code database. The signatures are extracted manually by industry experts who analyze virus samples, and each signature identifies the distinctive properties of one particular piece of malicious code. A signature-based method is implemented as follows:
(1) collect samples of known malicious code;
(2) extract signature features from the malicious code samples;
(3) add the signatures to the malicious code database;
(4) scan files. If a file under inspection contains a signature from the malicious code database, the file is judged to be malicious code or to be infected by malicious code.
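The four steps above can be illustrated with a minimal Python sketch. It assumes that a signature is simply a byte pattern and that detection is a substring search; the signature names and patterns below are made up for illustration and are not taken from any real antivirus database.

```python
from typing import Dict, List

# Hypothetical signature database: signature name -> byte pattern.
# The entries are illustrative only.
SIGNATURES: Dict[str, bytes] = {
    "demo-eval-unescape": b"eval(unescape(",
    "demo-hidden-iframe": b"<iframe width=0 height=0",
}


def scan(data: bytes, signatures: Dict[str, bytes] = SIGNATURES) -> List[str]:
    """Return the sorted names of all signatures whose pattern occurs in the data."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


def is_malicious(data: bytes) -> bool:
    """Step (4): a file is flagged as soon as it contains any known signature."""
    return bool(scan(data))
```

A file that contains none of the stored patterns passes unflagged, which is exactly the weakness against new or obfuscated code discussed in the surrounding text.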
Signature-based detection is currently the most convenient and most widely used method, and many commercial antivirus products adopt it. Its advantages are high detection speed and accurate detection, with a low false-positive rate, of malicious code already in the virus database. Its drawback is that it is powerless against newly emerging viruses: the product must be updated continually to add the signatures of new viruses to the database.
The typical dynamic method is behavior-based detection, which generally requires executing the code, either directly or in a virtual environment, and uses behavior characteristic of viruses to detect them. Years of research on viruses have shown that certain behaviors are common to malicious code, are quite distinctive, and rarely appear in normal code. Some typical malicious behaviors are as follows:
(1) Hooking interrupt INT 13H. A boot-sector virus attacks the boot sector or the master boot sector and places the code it needs there. When the system starts, the boot sector or master boot sector invokes INT 13H functions, and the viral code is loaded.
(2) Modifying the total amount of system memory. To carry out infection, destruction, and similar functions, a virus reduces the total system memory so that the system and other applications cannot occupy its space and the virus can remain resident in memory.
(3) Writing to specific files. Because a virus is parasitic, when it executes it attaches its own code to the infected file, which causes abnormal write operations on that file.
(4) Monitoring system call sequences. System calls are the only interface between user applications and the operating system, and certain system call sequences reflect malicious semantics to some degree.
Behavior-based detection can therefore detect some newly emerging, unknown viruses. Its difficulties are that malicious behavior features are hard to extract and that the system overhead is relatively high.
In summary, static detection is efficient but cannot detect new malicious code; dynamic detection can detect new malicious code but is inefficient, and behavior features are hard to extract, which makes it difficult to apply in practice. Researchers have therefore turned to the question of how to detect newly emerging malicious code automatically and efficiently, and automatic classifiers have become a popular technique in the antivirus field. Applying data mining to malicious code detection has produced good experimental results, and malicious code detection based on data mining and machine learning is attracting growing attention and has become a new research focus.
However, although machine learning has produced many research results in malicious code detection, current work concentrates mainly on executable files for Windows systems. Detection of web-borne malicious code, of which JavaScript scripts are the representative and currently fastest-spreading form, has not been studied in depth. Moreover, code obfuscation is increasingly applied when writing JavaScript scripts, for example code compression, substitution, restructuring, redundant interference code, and encryption. Scripts produced by dedicated obfuscation techniques often evade static detection tools based on signatures. Efficient new methods for detecting malicious JavaScript scripts that combine static and dynamic detection are therefore likely to be a trend.
Summary of the invention
The technical problem to be solved by the present invention is as follows. Code obfuscation is widespread in malicious JavaScript scripts. To overcome the inability of static detection to detect new malicious code, and to address the low efficiency and difficult feature extraction of dynamic behavior detection, the invention provides a novel, general, and robust intelligent detection method that is based on updatable preferred samples, requires neither static code signatures nor dynamic behavior signatures, and can detect new malicious code.
The technical solution adopted by the present invention to solve this problem is as follows:
An intelligent detection method for malicious JavaScript code in target web pages is provided. The method comprises three processes, namely preferred sample, safety detection, and preferred update, specifically:
Preferred sample: using the N-gram statistical language method and the KNN machine learning algorithm, machine learning is performed on the JavaScript scripts in a JavaScript script training library to generate the JavaScript detection sample library used for safety detection;
Safety detection: the JavaScript scripts are extracted from the URL of the web page to be inspected, and, using the JavaScript detection sample library built by the preferred sample process, the KNN classification algorithm determines whether the specified web page contains malicious JavaScript code;
Preferred update: the accuracy of safety detection is tracked. If the detection accuracy stays within the set range, the preferred JavaScript detection sample library continues to be used for safety detection. If the detection accuracy falls outside the preset range, all inspected JavaScript scripts that caused the drop in accuracy are inserted into the JavaScript script training library, and the preferred sample process is run again to obtain an updated detection sample library. Throughout this process the size of the preferred detection sample library is kept constant to preserve the efficiency of safety detection.
In the present invention, the JavaScript detection sample library comprises malicious-code N-gram samples and benign-code N-gram samples.
In the present invention, the preferred sample process determines the following parameters by analyzing the training scripts: $P$, the accuracy of JavaScript safety detection; $N$, the N-gram size parameter; $N_f$, the N-gram frequency threshold, meaning that the $N_f$ N-grams with the highest frequency of occurrence are kept; and $N^\circ$, the number of malicious samples, and likewise of benign samples, in the preferred detection sample library. The process comprises the following steps:
(1) collect currently representative malicious and benign JavaScript scripts to form a script training library on the order of ten thousand scripts;
(2) compile each JavaScript script with Google's open-source JavaScript engine V8 to obtain V8 machine code, and then extract the operation sequence of the machine code;
(3) taking the operation function as the basic unit, compute the N-grams of the operation sequence of each malicious and each benign training script, and keep the $N_f$ most frequent N-grams. Let $n_m$ and $n_b$ denote the numbers of malicious and benign scripts in the JavaScript script training library, so that the total number of training scripts is $n = n_m + n_b$. The set of $N_f$ N-grams computed for each training script is denoted [...] ($i = 1, 2, \dots, n_m$) for the malicious scripts and [...] ($i = 1, 2, \dots, n_b$) for the benign scripts, and the frequency of occurrence of each N-gram is denoted [...] ($i = 1, 2, \dots, n_m$) and [...] ($i = 1, 2, \dots, n_b$) respectively. For an N-gram $s'$ that does not belong to the set [...] or [...], that is, [...], let [...], $i = 1, 2, \dots, n_m$;
(4) select a KNN classifier (with $K = 1$). The classification algorithm is as follows: compute the $N_f$ most frequent N-grams of the machine-code operation sequence of the JavaScript script to be classified, denote this set $S_f$, and denote the frequency of occurrence of each N-gram $f(s)$, $s \in S_f$. Find the $i$ that satisfies [...], $i = 1, 2, \dots, n_m$, and denote it $i = j_m$; find the $i$ that satisfies [...], $i = 1, 2, \dots, n_b$, and denote it $i = j_b$. If $d_m < d_b$, the script is judged to be malicious code, and the $j_m$-th malicious training script is counted as having been selected once as a malicious sample; otherwise the script is judged to be benign code, and the $j_b$-th benign training script is counted as having been selected once as a benign sample;
(5) for the training script library of total size $n$, run cross-validation tests of the KNN classification. Specifically, the malicious and benign training scripts are divided into [...] and [...] parts respectively ($n_m$ and $n_b$ are both chosen as multiples of $N^\circ$). One part of each is selected at random as KNN training data, and all the remaining parts serve as test data. Whenever a test result is correct, the cumulative number of times each training script has been selected as a sample by the KNN classifier is recorded. Finally, in descending order of this count, the top $N^\circ$ malicious scripts and the top $N^\circ$ benign scripts are selected, and their N-grams form the detection sample library. The malicious-code and benign-code N-gram sets in the detection sample library are denoted [...] ($i = 1, 2, \dots, N^\circ$) and [...] ($i = 1, 2, \dots, N^\circ$), and the frequency of each N-gram in each set is denoted [...] ($i = 1, 2, \dots, N^\circ$) and [...] ($i = 1, 2, \dots, n_b$) respectively.
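Steps (3) and (4) above can be sketched in Python under stated assumptions. The operation sequence is taken as an already extracted list of operation names (extraction from V8 machine code is not shown), frequencies are relative frequencies, and the distance between two profiles is the sum of absolute frequency differences with an absent N-gram counted as frequency 0. The distance expression of the method is not reproduced in the text above, so this metric is only an illustrative choice, and the names `ngram_profile`, `distance`, and `classify_1nn` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]
Profile = Dict[NGram, float]


def ngram_profile(ops: Sequence[str], n: int = 3, n_f: int = 500) -> Profile:
    """Relative frequencies of the n_f most frequent N-grams of an operation sequence."""
    grams = [tuple(ops[i:i + n]) for i in range(len(ops) - n + 1)]
    if not grams:
        return {}
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).most_common(n_f)}


def distance(a: Profile, b: Profile) -> float:
    """Assumed metric: sum of absolute frequency differences (absent N-gram = 0)."""
    return sum(abs(a.get(g, 0.0) - b.get(g, 0.0)) for g in set(a) | set(b))


def classify_1nn(query: Profile,
                 malicious: List[Profile],
                 benign: List[Profile]) -> Tuple[str, int]:
    """K = 1 decision: return the label and the index of the nearest sample
    in the winning class. Both sample lists must be non-empty."""
    d_m, j_m = min((distance(query, p), i) for i, p in enumerate(malicious))
    d_b, j_b = min((distance(query, p), i) for i, p in enumerate(benign))
    return ("malicious", j_m) if d_m < d_b else ("benign", j_b)
```

The returned index plays the role of $j_m$ or $j_b$: it identifies which training script was selected as the nearest sample, which is the quantity counted in step (5).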
In the present invention, the safety detection process comprises the following steps:
(1) for the specified web page URL, extract the embedded JavaScript code as the script to be inspected;
(2) run Google's open-source JavaScript engine V8 to obtain the JavaScript machine code, and then extract the operation sequence;
(3) compute the $N_f$ most frequent N-grams of the operation sequence of the script to be inspected; denote the N-gram set $S_f$ and the frequency of occurrence of each N-gram $f(s)$, $s \in S_f$;
(4) use the KNN classification algorithm with $K = 1$ to determine whether $S_f$ consists of the N-grams of malicious code. The basic process is to compute
[...], $i = 1, 2, \dots, N^\circ$,
[...], $i = 1, 2, \dots, N^\circ$,
which yield $d_m$ and $d_b$; here the function $\min()$ takes the minimum value. If $d_m < d_b$, the script is judged to be malicious code; otherwise it is judged to be benign code.
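Step (1) of the safety detection process can be sketched with Python's standard `html.parser` module. This is a minimal illustration that assumes the page HTML has already been downloaded as a string; fetching the URL, following external `src` scripts, and handling event-handler attributes are left out. The names `ScriptExtractor` and `extract_inline_scripts` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class ScriptExtractor(HTMLParser):
    """Collects the text content of every inline <script> element."""

    def __init__(self) -> None:
        super().__init__()
        self.scripts: List[str] = []
        self._in_script = False
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._buf = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_script:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            code = "".join(self._buf).strip()
            if code:  # skip empty elements such as <script src="..."></script>
                self.scripts.append(code)
            self._in_script = False


def extract_inline_scripts(html: str) -> List[str]:
    """Return the inline JavaScript snippets of a page, in document order."""
    parser = ScriptExtractor()
    parser.feed(html)
    parser.close()
    return parser.scripts
```

Each returned snippet would then be passed on to steps (2) to (4) as a script to be inspected.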
In the present invention, the preferred update process comprises the following steps:
(1) record each JavaScript script for which safety detection produced a false negative or a false positive, and first add its N-grams directly to the preferred detection sample library, so that the JavaScript detection sample library grows;
(2) accumulate the error rate $P_f$ after each safety detection. When $P_f > 2(1 - P)$ (where $P$ is the set malicious code detection accuracy), add all $n_f$ misclassified scripts to the existing training library of $n$ scripts, so that the size of the script training library becomes $n + n_f$, and run the preferred sample process again to obtain a new JavaScript detection sample library of size $2N^\circ$ ($N^\circ$ malicious detection samples and $N^\circ$ benign detection samples), so that neither the accuracy $P$ nor the execution efficiency of the classification algorithm declines;
(3) if $P_f > 2(1 - P)$ in step (2) does not hold, repeat step (1).
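The trigger in steps (2) and (3) can be sketched as a small Python class. It assumes that $P_f$ is the fraction of misclassified scripts among all scripts inspected so far and that the correctness of each result becomes known through some external confirmation; neither detail is specified in the text. The class name `UpdateMonitor` and its fields are hypothetical, and the direct addition of N-grams in step (1) and the re-selection itself are not shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UpdateMonitor:
    """Tracks the accumulated error rate P_f and signals when re-selection is due."""
    target_accuracy: float = 0.95                 # P
    checked: int = 0                              # scripts inspected so far
    failed_scripts: List[str] = field(default_factory=list)  # the n_f wrong scripts

    def error_rate(self) -> float:
        """P_f, assumed here to be wrong results divided by all results."""
        return len(self.failed_scripts) / self.checked if self.checked else 0.0

    def record(self, script: str, correct: bool) -> bool:
        """Record one detection result; return True when P_f > 2(1 - P)."""
        self.checked += 1
        if not correct:
            self.failed_scripts.append(script)
        return self.error_rate() > 2 * (1 - self.target_accuracy)
```

When `record` returns `True`, the scripts in `failed_scripts` would be merged into the training library and the preferred sample process run again; otherwise detection simply continues, as in step (3).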
The beneficial effects of the present invention are mainly as follows:
(1) An efficient method that effectively combines static and dynamic detection is proposed. It integrates the classical N-gram statistical model with the KNN classifier: building N-gram features of the JavaScript machine-code operation sequence provides a dynamic behavioral analysis of the code, and selecting the preferred samples on which the KNN classifier relies greatly improves the efficiency of static classification.
(2) Three relatively independent parts are proposed: preferred sample, safety detection, and preferred update. They make the proposed intelligent detection method practical to operate: preferred sample ensures that classification efficiency does not fall as the training script library grows, safety detection ensures that detection based on the preferred samples runs efficiently, and preferred update ensures that detection accuracy does not fall as new malicious scripts appear.
(3) As the number of training scripts grows and JavaScript technology develops, the main parameters of the proposed method, such as $N$, $N_f$, $N^\circ$, and $P$, can be adjusted appropriately through machine learning and experimental analysis, so that the method is better able to detect new malicious scripts and can be tuned dynamically during continuous operation.
(4) By studying the features of machine-code operation sequences, the method is effectively insulated from arbitrary JavaScript obfuscation techniques. Because the JavaScript engine V8 can produce machine code for any JavaScript script, the proposed method also supports safety detection of JavaScript code fragments extracted from a web page URL.
(5) All the algorithms and implementation steps of the disclosed intelligent detection method for malicious JavaScript code are simple, practical, efficient, and low in overhead, and are easy to develop and deploy as modules on all kinds of platforms.
Brief description of the drawings
Fig. 1 shows the intelligent detection flow for malicious JavaScript code in target web pages;
Fig. 2 shows the basic preferred sample process based on N-grams of the JavaScript machine-code operation sequence and the KNN classifier;
Fig. 3 shows the machine code obtained by compiling a JavaScript script with the JavaScript engine V8 and the extraction of the machine-code operation sequence.
Detailed description of the embodiments
It should first be noted that the present invention involves software technologies such as search engines and is an application of computer technology to the Internet field. Implementing the invention may involve several software function modules. The applicant believes that, after reading the application documents carefully and understanding the principle and purpose of the invention accurately, a person skilled in the art can fully implement the invention with ordinary software programming skills in combination with existing known techniques. Everything mentioned in the present application documents falls within this category, and the applicant does not enumerate it item by item.
The preferred sample process uses the N-gram statistical language method and the KNN machine learning algorithm (both are well-established data analysis techniques). By analyzing the training scripts it determines the following parameters: $P$ (the accuracy of JavaScript safety detection), $N$ (the N-gram size parameter), $N_f$ (the N-gram frequency threshold, meaning that the $N_f$ N-grams with the highest frequency of occurrence are kept), and $N^\circ$ (the number of malicious samples, and likewise of benign samples, in the preferred JavaScript detection sample library). It finally generates a preferred JavaScript detection sample library of size $2N^\circ$, consisting mainly of malicious-code sample N-grams and benign-code sample N-grams. The basic process is shown in Fig. 2 and comprises the following steps:
(1) collect currently representative malicious and benign JavaScript scripts to form a JavaScript script training library on the order of ten thousand scripts;
(2) compile each JavaScript script with Google's JavaScript engine V8 to obtain V8 machine code, and then extract the operation sequence of the machine code (as shown in Fig. 3);
(3) taking the operation function as the basic unit, compute the N-grams of the machine-code operation sequence of each script (malicious and benign) in the script training library, and keep the $N_f$ most frequent N-grams. Let $n_m$ and $n_b$ denote the numbers of malicious and benign scripts, so that the total number of scripts is $n = n_m + n_b$. The set of $N_f$ N-grams computed for each script is denoted [...] ($i = 1, 2, \dots, n_m$) for the malicious scripts and [...] ($i = 1, 2, \dots, n_b$) for the benign scripts, and the frequency of occurrence of each N-gram is denoted [...] ($i = 1, 2, \dots, n_m$) and [...] ($i = 1, 2, \dots, n_b$) respectively. For an N-gram $s'$ that does not belong to the set [...] or [...], that is, [...], it is stipulated that [...], $i = 1, 2, \dots, n_m$.
(4) select a KNN classifier (with $K = 1$). The classification algorithm is as follows: compute the $N_f$ most frequent N-grams of the machine-code operation sequence of the JavaScript script to be classified, denote this set $S_f$, and denote the frequency of occurrence of each N-gram $f(s)$, $s \in S_f$. Find the $i$ that satisfies [...], $i = 1, 2, \dots, n_m$, and denote it $i = j_m$; find the $i$ that satisfies [...], $i = 1, 2, \dots, n_b$, and denote it $i = j_b$. If $d_m < d_b$, the script is judged to be malicious code, and the $j_m$-th malicious script in the script training library is counted as having been selected once as a malicious detection sample; otherwise the script is judged to be benign code, and the $j_b$-th benign script is counted as having been selected once as a benign detection sample.
(5) for the training script library of total size $n$, run cross-validation tests of the KNN classification. Specifically, the malicious and benign training scripts are divided into [...] and [...] parts respectively ($n_m$ and $n_b$ are both chosen as multiples of $N^\circ$). One part of each is selected at random as KNN training data, and all the remaining parts serve as test data. Whenever a test result is correct, the cumulative number of times each training script has been selected as a sample by the KNN classifier is recorded. Finally, in descending order of this count, the top $N^\circ$ malicious scripts and the top $N^\circ$ benign scripts are selected as the malicious and benign samples of the detection sample library and are stored as N-gram sets, denoted [...] ($i = 1, 2, \dots, N^\circ$) (malicious) and [...] ($i = 1, 2, \dots, N^\circ$) (benign). The frequency of each N-gram in these two sets is denoted [...] ($i = 1, 2, \dots, N^\circ$) and [...] ($i = 1, 2, \dots, n_b$) respectively.
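The selection in step (5) can be sketched in Python as repeated random splits with hit counting. Several details are assumptions made for illustration: the number of parts (`folds`) and the number of repetitions (`rounds`) are free parameters, profiles are plain dictionaries from N-gram to frequency, and the distance is the sum of absolute frequency differences with absent N-grams counted as 0. The names `pick_part` and `select_samples` are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List, Set, Tuple

Profile = Dict[str, float]


def distance(a: Profile, b: Profile) -> float:
    """Assumed metric: sum of absolute frequency differences (absent N-gram = 0)."""
    return sum(abs(a.get(g, 0.0) - b.get(g, 0.0)) for g in set(a) | set(b))


def pick_part(size: int, folds: int, rng: random.Random) -> Set[int]:
    """Shuffle the indices 0..size-1, cut them into `folds` parts, return one part."""
    idx = list(range(size))
    rng.shuffle(idx)
    return set(idx[rng.randrange(folds)::folds])


def select_samples(malicious: List[Profile], benign: List[Profile], n_keep: int,
                   folds: int = 2, rounds: int = 50,
                   seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return the indices of the n_keep most often selected scripts per class."""
    if min(len(malicious), len(benign)) < folds:
        raise ValueError("each class needs at least `folds` scripts")
    rng = random.Random(seed)
    hits_m: Counter = Counter()
    hits_b: Counter = Counter()
    for _ in range(rounds):
        ref_m = pick_part(len(malicious), folds, rng)   # KNN training data
        ref_b = pick_part(len(benign), folds, rng)
        tests = [(p, True) for i, p in enumerate(malicious) if i not in ref_m]
        tests += [(p, False) for i, p in enumerate(benign) if i not in ref_b]
        for query, is_malicious in tests:
            d_m, j_m = min((distance(query, malicious[i]), i) for i in ref_m)
            d_b, j_b = min((distance(query, benign[i]), i) for i in ref_b)
            predicted = d_m < d_b
            if predicted == is_malicious:               # count correct results only
                if predicted:
                    hits_m[j_m] += 1
                else:
                    hits_b[j_b] += 1
    top_m = [i for i, _ in hits_m.most_common(n_keep)]
    top_b = [i for i, _ in hits_b.most_common(n_keep)]
    return top_m, top_b
```

The returned index lists correspond to the $N^\circ$ malicious and $N^\circ$ benign scripts kept in the detection sample library; fixing `seed` makes the selection reproducible.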
As an experimental demonstration of parameter selection for the intelligent detection method, we obtained 5000 benign scripts and 5000 malicious scripts from sites such as http://code.google.com and http://vx.netlux.org/ to form a training library. Experimental analysis showed that with $N = 3$, $N_f = 500$, and $N^\circ = 100$ the classification accuracy reaches $P > 95\%$, a satisfactory result, while maintaining high execution efficiency. In a preferred scheme the parameters are therefore set to $N = 3$, $N_f = 500$, $N^\circ = 100$, $P = 95\%$.
Safety detection uses the detection samples built by the preferred sample process, [...] ($i = 1, 2, \dots, N^\circ$) and [...] ($i = 1, 2, \dots, N^\circ$), to determine whether the specified web page contains malicious JavaScript code. The main steps are as follows:
(1) for the specified web page URL, extract the embedded JavaScript code as the script to be inspected;
(2) run the engine V8 to obtain the JavaScript machine code, and then extract the operation sequence;
(3) compute the $N_f$ most frequent N-grams of the operation sequence of the script to be inspected; denote the N-gram set $S_f$ and the frequency of occurrence of each N-gram $f(s)$, $s \in S_f$;
(4) use the KNN classification algorithm with $K = 1$ to determine whether $S_f$ consists of the N-grams of a malicious script. The basic process is to compute
[...], $i = 1, 2, \dots, N^\circ$,
[...], $i = 1, 2, \dots, N^\circ$,
which yield $d_m$ and $d_b$. If $d_m < d_b$, the script is judged to be malicious code; otherwise it is judged to be benign code.
Preferred update uses the results of safety detection to reselect the detection sample library of size $2N^\circ$ ($N^\circ$ malicious samples and $N^\circ$ benign samples, denoted as the sets [...] ($i = 1, 2, \dots, N^\circ$) and [...] ($i = 1, 2, \dots, N^\circ$) respectively), so as to maintain the accuracy and execution efficiency of safety detection. The main steps are as follows:
(1) record each JavaScript script for which safety detection made an error (a false negative or a false positive), and first add its N-grams directly to the preferred detection sample library;
(2) accumulate the error rate $P_f$ after each safety detection. When $P_f > 2(1 - P)$, add all $n_f$ misclassified scripts to the existing training library of $n$ scripts, so that the size of the script library becomes $n + n_f$, and run the preferred sample process again to obtain a new detection sample library of size $2N^\circ$, so that neither the accuracy $P$ nor the execution efficiency of the classification algorithm declines;
(3) if $P_f > 2(1 - P)$ in step (2) does not hold, repeat step (1).
Here $P = 95\%$, so $P_f > 2(1 - P)$ means that the current detection error rate exceeds 10%. This indicates that, with the current preferred detection samples, the detection performance of the KNN classifier on new JavaScript scripts has degraded noticeably. The new scripts that caused detection errors are therefore fed back into the script training library as training scripts, the preferred sample process is run again, and a new detection sample library for KNN classification is obtained, which maintains the efficiency and precision of safety detection.
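For the preferred value $P = 95\%$, the threshold in the update condition works out as follows:

```latex
\[
P_f > 2(1 - P) = 2(1 - 0.95) = 2 \times 0.05 = 0.10
\]
```

In other words, re-selection is triggered once the accumulated error rate is more than twice the error rate $1 - P = 5\%$ that the detection sample library was selected to achieve.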