CN108573148A

CN108573148A - Method for identifying obfuscated and encrypted scripts based on lexical analysis

Info

Publication number: CN108573148A
Application number: CN201710140949.5A
Authority: CN
Inventors: 聂眉宁; 应凌云; 苏璞睿
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2017-03-10
Filing date: 2017-03-10
Publication date: 2018-09-25
Anticipated expiration: 2037-03-10
Also published as: CN108573148B

Abstract

The present invention relates to a method for identifying obfuscated and encrypted scripts based on lexical analysis. Starting from a human-language word list, the method first trains on a large set of non-malicious script files randomly collected from the network to generate a script dictionary. It then uses this dictionary to measure the lexical coverage rate of a second large set of non-malicious script files randomly collected from the network and determines the minimum threshold of the lexical coverage rate; at the same time it computes the ratio of comment volume to code volume for this set and determines the maximum threshold of the comment-to-code ratio. Finally, in the actual detection stage, lexical analysis and comment-volume analysis are performed on the sample under test, and the sample is judged to be an obfuscated and encrypted malicious script if its lexical coverage rate is below the threshold or its comment ratio is above the threshold. A sample judged not to be obfuscated or encrypted is further checked with other existing detection methods to determine whether it is a malicious script. The invention offers high detection efficiency and high detection accuracy.

Description

Method for identifying obfuscated and encrypted scripts based on lexical analysis

Technical field

The invention belongs to the technical field of malicious code detection, and specifically relates to a lexical-analysis-based method for detecting obfuscated and encrypted scripts.

Background art

With the development of the information society, computers and networks are used ever more widely in every field, and the importance of information systems grows accordingly. At the same time, the harm caused by malicious code attacks is becoming more serious. Malicious code hidden in scripts such as vbs and js is organized entirely as text, so it spreads widely, runs across platforms, is easy to transform and encrypt, and is hard to detect by signature; it has become the most common form of malicious code today. For example, much of the ransomware that has broken out in recent years and spread through channels such as malicious email uses obfuscated and encrypted malicious scripts as its carrier, posing a serious threat to public safety and data security. Detecting malicious scripts, and obfuscated and encrypted malicious scripts in particular, is therefore an important problem that the information security field urgently needs to solve.

Obfuscation and encryption methods for malicious scripts currently fall into two broad classes. The first transforms function names, variable names, line-break and indentation style, and so on, to eliminate the script's static features. The second inserts large numbers of comments to flood or interrupt the malicious code fragments, reducing the weight of their static features. Current malicious script detection techniques generally use the following methods:

1. Malicious code detection based on fingerprint recognition. The script file is scanned statically at the binary level and compared with the known malicious code features in a feature database. This method can only detect known malicious code features, whereas a malicious script is written in a text-based interpreted language and has very flexible obfuscation and encryption options that easily bypass this kind of detection: for example, replacing characteristic variable and function names, splitting sensitive feature strings into combinations of characters, or interspersing large amounts of meaningless data, as comments, within the feature code.

2. Malicious code detection based on dynamic debugging. A debugger traces the script host as it interprets and executes the malicious script, captures the system behaviors that occur during this process, and analyzes whether any of them is malicious. With this method it is hard to tell whether a behavior originates from the script file or from the script host, and the analysis requires extensive expert manual intervention, so it is better suited to analysis than to detection. Applied to detection, it is difficult to operate in practice and its accuracy is low.

3. Malicious code detection based on virtual execution. The script file is placed in a virtual environment such as a sandbox and executed in simulation; its run is analyzed dynamically, and the relevant behavioral features are extracted and compared with a behavior whitelist. Because this method actually runs the script file, detection usually has to be combined with virtual machine technology such as VMWare to keep malicious code from reaching the real environment. This class of methods has two problems. First, the behavior of much current malicious code depends on special trigger conditions, for example terminating execution when a virtual environment or an unrealistic network environment is detected, which makes detection harder. Second, script-based malicious code is lightweight, so it spreads widely and breaks out explosively, and in real-world scenarios detection based on virtual execution still struggles to meet performance requirements.

In summary, the main shortcomings of current malicious script detection methods are as follows: for obfuscated and encrypted malicious code, fast static detection capability is insufficient; dynamic detection and virtual-execution detection both require extensive expert manual analysis, and they face the problems that behaviors are hard to trigger and that detection performance cannot meet the demands of real detection scenarios.

Summary of the invention

The present invention is a malicious script detection method based on lexical analysis; the core problem it solves is the rapid identification of obfuscated and encrypted malicious scripts.

Starting from a ready-made (existing) human-language word list, the invention first trains on a large set of non-malicious script files randomly collected from the network to generate a word dictionary suited to scripts. It then uses this dictionary to measure the lexical coverage rate of a second large set of non-malicious script files randomly collected from the network and determines the minimum threshold of the lexical coverage rate for non-malicious script files; at the same time it computes the ratio of comment volume to code volume for this set and determines the maximum threshold of the comment ratio for non-malicious script files. Finally, in the actual detection stage, lexical analysis and comment-volume analysis are performed on the sample file, and the sample is judged to be obfuscated or encrypted if its lexical coverage rate is below the threshold or its comment ratio is above the threshold. The method holds that obfuscation and encryption are an effective and principal means by which malicious scripts bypass static detection, and therefore treats obfuscated and encrypted samples as malicious scripts. Samples judged not to be obfuscated or encrypted are further checked with other existing static detection methods to determine whether they are malicious scripts.

The steps of the lexical-analysis-based method for detecting obfuscated and encrypted malicious scripts are as follows:

1) Parse a dictionary file and compile a list of human-language words to form the initial detection dictionary. Then use a crawler tool to collect a large number of normal script files (js, vbs, and so on) from major mainstream websites on the Internet as the dictionary training set. Count the word types and word occurrences in the dictionary training set. For each word not in the dictionary, count how many script files it appears in; if the count exceeds a threshold (for example, more than 1/3 of the script files), the word is a common script word and is added to the dictionary. After training, the final script dictionary is formed.

2) Use a crawler tool to collect a large number of normal script files (js, vbs, and so on) from major mainstream websites on the Internet as the threshold training set. For each sample in the training set, compute the ratio of the number of words that are in the script dictionary to the number of words that are not, i.e. the lexical coverage rate. Once the lexical coverage rate of every sample has been calculated, take the minimum value as the minimum threshold of the lexical coverage rate for non-malicious script files.

3) For each sample in the threshold training set, compute the ratio of the number of words used in comments to the number of words used in code, i.e. the comment-to-code ratio. Once the comment-to-code ratio of every sample has been calculated, take the maximum value as the maximum threshold of the comment-to-code ratio for non-malicious script files.

4) In the actual detection stage, calculate the lexical coverage rate of the script file under test. If it is lower than the minimum threshold of the lexical coverage rate for non-malicious script files, judge the script file to be an obfuscated and encrypted malicious script file.

5) In the actual detection stage, calculate the comment-to-code ratio of the script file under test. If it is higher than the maximum threshold of the comment-to-code ratio for non-malicious script files, judge the script file to be an obfuscated and encrypted malicious script file.

6) In the actual detection stage, if neither of the two preceding steps judges the script file under test to be a malicious script, the script is a plaintext script that has not been obfuscated or encrypted, so existing static detection methods are used to determine whether it is a malicious script file. The invention holds that obfuscation and encryption are an effective and principal means by which malicious scripts bypass static detection, and therefore treats obfuscated and encrypted samples as malicious scripts.

7) Record the detection result and report it to the user.
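The quantities used in steps 2) to 5) can be written compactly as follows. The symbols are introduced here for illustration only and do not appear in the original text. Step 2) words the coverage rate as the ratio of in-dictionary to out-of-dictionary word counts; because the experiments report the threshold as a percentage, the sketch below assumes the fraction-of-all-words form instead.

```latex
% Assumed notation: T is the benign threshold training set, s a training sample, x a sample under test.
\begin{align*}
C(s) &= \frac{N_{\text{in}}(s)}{N_{\text{in}}(s) + N_{\text{out}}(s)}, &
R(s) &= \frac{N_{\text{comment}}(s)}{N_{\text{code}}(s)}, \\
C_{\min} &= \min_{s \in T} C(s), &
R_{\max} &= \max_{s \in T} R(s), \\
\text{obfuscated}(x) &\iff C(x) < C_{\min} \;\lor\; R(x) > R_{\max}.
\end{align*}
```

Here $N_{\text{in}}$ and $N_{\text{out}}$ count the words inside and outside the script dictionary, and $N_{\text{comment}}$ and $N_{\text{code}}$ count the words in comments and in code.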

The advantages of the present invention are as follows:

1. The invention detects whether a sample file has been obfuscated or encrypted on the basis of lexical analysis. The detection is independent of the technical sophistication of the obfuscation or encryption technique and has high detection accuracy.

2. The invention uses comment-to-code ratio analysis to detect whether a sample file hides sensitive instructions within a mass of junk information, which improves the accuracy of detection methods based on static features.

3. The invention is based on static scanning. It needs neither to monitor the dynamic behavior of the script engine nor to execute the script code symbolically or in simulation, so it has high detection efficiency, and its performance can meet the demand of detecting large outbreaks of malicious scripts in real-world scenarios.

4. The invention analyzes whether a sample is obfuscated or encrypted before existing static malicious script detection methods are applied, which greatly improves the detection accuracy of those existing methods.

Description of the drawings

Fig. 1 is the dictionary training flowchart of the lexical-analysis-based method for identifying obfuscated and encrypted scripts of the present invention.

Fig. 2 is the threshold training flowchart of the lexical-analysis-based method for identifying obfuscated and encrypted scripts of the present invention.

Fig. 3 is the sample detection flowchart of the lexical-analysis-based method for identifying obfuscated and encrypted scripts of the present invention.

Fig. 4 shows the test results for the false positive rate and the false negative rate of the present invention.

Detailed description of the embodiments

The technical solution of the invention is described in detail below with reference to the accompanying drawings:

The lexical-analysis-based method for identifying obfuscated and encrypted scripts mainly comprises three stages: the earlier dictionary training stage and threshold training stage, and the later sample detection stage, when the system is in actual use.

The detailed steps of the dictionary training stage are shown in Fig. 1 and are as follows:

1. Prepare a dictionary file; this file is a ready-made list of human-language words.

2. Use a crawler tool to collect a large number of normal script files (js, vbs, and so on) from major mainstream websites on the Internet as the dictionary training set.

3. Select a training sample, analyze the word types in it, and record every word that is not in the dictionary file.

4. Repeat step 3 until all training sample files prepared in step 2 have been analyzed.

5. Select a word that is not in the dictionary file and count the number of script files that contain it, i.e. the frequency of occurrence of the non-dictionary word.

6. If the number of script files counted in step 5 exceeds 1/3 of the total number of samples in the training set, add the word selected in step 5 to the dictionary file.

7. Repeat steps 5 and 6 until all recorded non-dictionary words have been analyzed.

8. At this point, the dictionary file suited to scripts, i.e. the script dictionary, has been formed.
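The dictionary training steps above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names, the `min_fraction` parameter, and the tokenizer (a word is taken to be a run of ASCII letters, lowercased) are assumptions, since the text does not specify how words are extracted.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters; the source does not define tokenization.
WORD_RE = re.compile(r"[A-Za-z]+")


def tokenize(script_text):
    """Split a script into lowercase words (hypothetical tokenizer)."""
    return [w.lower() for w in WORD_RE.findall(script_text)]


def train_script_dictionary(base_words, benign_scripts, min_fraction=1 / 3):
    """Extend a human-language word list with words common in benign scripts.

    A non-dictionary word is added when it appears in more than
    `min_fraction` of the training files (steps 5 and 6).
    """
    dictionary = {w.lower() for w in base_words}
    file_counts = Counter()
    for text in benign_scripts:
        # Count each unknown word once per file, not once per occurrence.
        file_counts.update(set(tokenize(text)) - dictionary)
    cutoff = len(benign_scripts) * min_fraction
    dictionary.update(w for w, n in file_counts.items() if n > cutoff)
    return dictionary
```

Counting each word once per file matches step 5, which counts the script files containing the word rather than its total occurrences.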

The detailed steps of the threshold training stage are shown in Fig. 2 and are as follows:

9. Use a crawler tool to collect a large number of normal script files (js, vbs, and so on) from major mainstream websites on the Internet as the threshold training set.

10. Select a training sample, count how many of its words are in the script dictionary and how many are not, calculate the ratio of the two counts as the lexical coverage rate of the sample, and record it.

11. For the training sample selected in step 10, count how many of its words are used in comments and how many are used in code, calculate the ratio of the two counts as the comment-to-code ratio (or comment ratio) of the sample, and record it.

12. Repeat steps 10 and 11 until all training samples prepared in step 9 have been analyzed.

13. Find the minimum of the recorded lexical coverage rates and use it as the minimum threshold of the lexical coverage rate for non-malicious script files.

14. Find the maximum of the recorded comment-to-code ratios and use it as the maximum threshold of the comment-to-code ratio for non-malicious script files.
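A rough sketch of the threshold training steps follows. Several details are assumptions made for illustration: all names are hypothetical, only JavaScript-style comments are recognized (vbs comments and comment markers inside string literals are not handled), and the coverage rate is computed as the fraction of all words found in the dictionary, which is the form consistent with the percentage threshold reported in the experiments.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+")
# Assumption: JavaScript-style comments only; "//" inside string literals is not handled.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def words(text):
    """Return the lowercase words of a text (hypothetical tokenizer)."""
    return [w.lower() for w in WORD_RE.findall(text)]


def lexical_coverage(text, dictionary):
    """Fraction of words that are in the script dictionary (assumed form)."""
    ws = words(text)
    if not ws:
        return 1.0  # assumption: a file without words is treated as fully covered
    return sum(w in dictionary for w in ws) / len(ws)


def comment_code_ratio(text):
    """Number of words in comments divided by number of words in code."""
    comment_words = sum(len(words(c)) for c in COMMENT_RE.findall(text))
    code_words = len(words(COMMENT_RE.sub(" ", text)))
    if code_words == 0:
        return float("inf") if comment_words else 0.0
    return comment_words / code_words


def train_thresholds(benign_scripts, dictionary):
    """Return (minimum coverage rate, maximum comment-to-code ratio)."""
    coverages = [lexical_coverage(t, dictionary) for t in benign_scripts]
    ratios = [comment_code_ratio(t) for t in benign_scripts]
    return min(coverages), max(ratios)
```

Because the thresholds are the extreme values over the training set, a single unusual benign file (for example, one consisting almost entirely of a license comment) moves a threshold, so the choice of training files matters.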

The detailed steps of the sample detection stage are shown in Fig. 3 and are as follows:

15. Feed into the system the script dictionary formed in the dictionary training stage and the lexical coverage rate threshold and comment-to-code ratio threshold formed in the threshold training stage, and prepare the set of script samples to be detected.

16. Select a script sample file to be detected, count how many of its words are in the script dictionary and how many are not, and calculate the ratio of the two counts as the lexical coverage rate of the sample.

17. If the lexical coverage rate calculated in step 16 is lower than the lexical coverage rate threshold, judge the sample to be an obfuscated and encrypted malicious script file and report the detection result to the user.

18. If the lexical coverage rate calculated in step 16 is higher than the lexical coverage rate threshold, count how many words in the sample are used in comments and how many are used in code, and calculate the ratio of the two counts as the comment-to-code ratio of the sample.

19. If the comment-to-code ratio calculated in step 18 is higher than the comment-to-code ratio threshold, judge the sample to be an obfuscated and encrypted malicious script file and report the detection result to the user.

20. If the lexical coverage rate calculated in step 16 is higher than the lexical coverage rate threshold and the comment-to-code ratio calculated in step 18 is lower than the comment-to-code ratio threshold, judge the sample to be a plaintext script; an existing malicious script detection method, invoked as a plug-in, then examines it, and the detection result is reported to the user.

21. Repeat steps 16 to 20 until all sample files to be detected that were prepared in step 15 have been analyzed, then end the detection process.
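The per-sample decision in steps 16 to 20 can be sketched as below. The names `Verdict`, `detect`, `coverage_fn`, `ratio_fn`, and `fallback_detector` are hypothetical; the two measurement functions and the existing detector are passed in as callables so that the sketch stays independent of how they are implemented. The text does not say what happens when a value equals its threshold, so equality is treated here as passing the check.

```python
from enum import Enum


class Verdict(Enum):
    OBFUSCATED_MALICIOUS = "obfuscated and encrypted malicious script"
    MALICIOUS = "malicious (flagged by the existing detector)"
    NOT_DETECTED = "plaintext script, not flagged"


def detect(text, coverage_fn, ratio_fn, min_coverage, max_ratio, fallback_detector):
    """Classify one script sample following steps 16 to 20.

    coverage_fn and ratio_fn map script text to a number;
    fallback_detector maps script text to True when it finds malicious code.
    """
    # Step 17: too few dictionary words means the sample is treated as obfuscated.
    if coverage_fn(text) < min_coverage:
        return Verdict.OBFUSCATED_MALICIOUS
    # Steps 18 and 19: the comment ratio is only computed if the coverage check passes.
    if ratio_fn(text) > max_ratio:
        return Verdict.OBFUSCATED_MALICIOUS
    # Step 20: plaintext script, hand over to an existing detector (plug-in).
    if fallback_detector(text):
        return Verdict.MALICIOUS
    return Verdict.NOT_DETECTED
```

Passing the existing detector as a parameter mirrors the plug-in arrangement described in step 20: any static scanner can be substituted without changing the obfuscation checks.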

The malicious script detection method and system based on lexical analysis proposed by the present invention address the core problem of rapidly identifying obfuscated and encrypted malicious scripts. Those skilled in the art can choose the original dictionary file, the dictionary training set, and the threshold training set according to their needs. During detection, for plaintext script samples that the system judges not to be obfuscated or encrypted, they can likewise select (or add) existing malicious script detection modules as needed, such as various traditional antivirus products, so that malicious script detection is carried out with high efficiency and high accuracy.

In the experiment, 1000 white (benign) samples were crawled from the Sina and Tencent websites; 500 were used to train the word dictionary and 500 to train the thresholds. In this experiment the trained lexical coverage rate threshold was 50% and the trained comment-to-code ratio threshold was 4. Next, 900 black (malicious) samples were collected for the false negative rate test, and 900 white samples were crawled from NetEase and Baidu (www.baidu.com) for the false positive rate test. To make the test results clearer, the "threat level" is defined as "(1 - lexical coverage rate) * 100%", and the threat level of any sample whose comment ratio exceeds the threshold is set directly to 60%. Therefore, with the lexical coverage rate threshold at 50%, a sample whose threat level is above 50% is judged to be an obfuscated and encrypted malicious sample, and a sample whose threat level is below 50% is judged to be a normal sample.
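The threat level used to present the results can be expressed as a small helper. This is an illustrative sketch with hypothetical function names; the default values 4 and 0.5 are the thresholds reported for this experiment, and the coverage rate is assumed to be a fraction between 0 and 1.

```python
def threat_level(coverage, comment_ratio, max_comment_ratio=4.0):
    """Threat level in percent, as defined for the experiment.

    Samples whose comment ratio exceeds the threshold are pinned to 60;
    otherwise the level is (1 - coverage) * 100.
    """
    if comment_ratio > max_comment_ratio:
        return 60.0
    return (1.0 - coverage) * 100.0


def is_obfuscated(level, coverage_threshold=0.5):
    """A sample counts as obfuscated when its threat level exceeds the cutoff."""
    return level > (1.0 - coverage_threshold) * 100.0
```

Pinning high-comment samples to 60% places them above the 50% cutoff, so both detection criteria can be read off a single axis.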

The test results are shown in Fig. 4. In the false positive rate test, the maximum threat level among the white samples was 45%, which means that the minimum lexical coverage rate in this batch was 55%, still 5 percentage points above the 50% threshold; that is, on the test samples the method produced no false positives. In the false negative rate test, only 28 black samples had a threat level below 50%, meaning that for these 28 samples the comment ratio did not exceed the threshold of 4 and the lexical coverage rate was also above the 50% threshold. Manual analysis showed that all 28 samples were malicious code that had not been obfuscated or encrypted and could be identified by traditional antivirus software such as Kaspersky; that is, on the test samples the method produced no false negatives.

Although specific embodiments and drawings of the present invention are disclosed for the purpose of illustration, to help readers understand the content of the invention and implement it accordingly, those skilled in the art will appreciate that various substitutions, changes, and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to what is disclosed in the preferred embodiment and the drawings; the scope of protection claimed is defined by the claims.

Claims

1. A method for identifying obfuscated and encrypted scripts based on lexical analysis, the steps of which comprise:

1) before detection, starting from an existing human-language word set, training on a large set of non-malicious script files randomly collected from the network to generate a word dictionary suited to scripts;

2) before detection, measuring the lexical coverage rate of another large set of non-malicious script files randomly collected from the network and determining the minimum threshold of the lexical coverage rate of non-malicious script files, while computing the ratio of comment volume to code volume for this set of script files and determining the maximum threshold of the comment ratio of non-malicious script files;

3) during actual detection, performing lexical analysis and comment-volume analysis on the sample under test, and judging whether it is an obfuscated and encrypted malicious script by assessing whether its lexical coverage rate is lower than the minimum threshold or whether its comment ratio is higher than the maximum threshold.

2. The method according to claim 1, characterized in that a plaintext script that has not been obfuscated or encrypted is further checked with an existing static detection method to determine whether it is a malicious script.

3. The method according to claim 1, characterized in that the existing human-language word set in step 1) is a collection of dictionaries containing meaningful words, including the Oxford dictionary; and the large sets of script files in step 1) and step 2) are collected from portal websites by a crawler.

4. The method according to claim 1, characterized in that the training method in step 1) is: adding to the dictionary each word that is not in the dictionary and that appears in more than a threshold number of script files.

5. The method according to claim 1, characterized in that the lexical coverage rate in step 2) and step 3) is calculated as follows: analyzing each word in the sample file and calculating the ratio of the number of words in the script dictionary to the number of words not in the script dictionary, i.e. the lexical coverage rate.

6. The method according to claim 1, characterized in that the lexical coverage rate threshold in step 2) is determined as follows: calculating the lexical coverage rate of every sample in the training set and selecting the minimum value as the threshold.

7. The method according to claim 1, characterized in that the comment ratio in step 2) and step 3) is calculated as follows: analyzing each word in the sample file and calculating the ratio of the number of words used in comments to the number of words used in code, i.e. the comment ratio.

8. The method according to claim 1, characterized in that the comment ratio threshold in step 2) is determined as follows: calculating the comment ratio of every sample in the training set and selecting the maximum value as the threshold.

9. The method according to claim 1, characterized in that obfuscation and encryption by code transformation is detected in step 3) as follows: calculating the lexical coverage rate of the sample and comparing it with the lexical coverage rate threshold; if the lexical coverage rate of the sample is lower than the threshold, the sample contains a large number of meaningless words, i.e. it was not written directly by a human programmer, and the sample is therefore judged to be obfuscated and encrypted.

10. The method according to claim 1, characterized in that obfuscation and encryption by feature flooding is detected in step 3) as follows: calculating the comment ratio of the sample and comparing it with the comment ratio threshold; if the comment-to-code ratio of the sample is higher than the threshold, the sample contains a large number of comments and is attempting to flood the feature code with large numbers of words and strings, and the sample is therefore judged to be obfuscated and encrypted.