CN105488413A

CN105488413A - Malicious code detection method and system based on information gain

Info

Publication number: CN105488413A
Application number: CN201510344523.2A
Authority: CN
Inventors: 常安琪; 李柏松
Original assignee: Harbin Antiy Technology Co Ltd
Current assignee: Harbin Antiy Technology Co Ltd
Priority date: 2015-06-19
Filing date: 2015-06-19
Publication date: 2016-04-13

Abstract

The invention discloses a malicious code detection method based on information gain. The method comprises: collecting samples to form a training sample set; selecting splitting criteria to form an attribute set; drawing samples from the training sample set by random sampling with replacement to form a test sample set; randomly extracting features from the test sample set to form a feature sample set; selecting the splitting criterion of each splitting node from the attribute set according to the information-gain-maximization criterion; splitting the feature sample set layer by layer until no further split is possible; combining the splitting criteria of all splitting nodes to form a decision tree; repeating these steps to obtain the required number of decision trees; having each decision tree give a judgment on the data to be detected; and combining all judgments into the final detection result. The invention also discloses a malicious code detection system based on information gain. With this technical scheme, unknown malicious code can be effectively recognized and detection efficiency can be improved.

Description

Malicious code detection method and system based on information gain

Technical field

The present invention relates to the field of information security technology, and in particular to a malicious code detection method and system based on information gain.

Background art

Although malicious code detection technology continues to develop, its capabilities still lag behind the evolution of malicious code; detecting unknown malicious code in particular poses a major challenge. Commonly used detection techniques today include signature-based pattern matching and detection based on malicious behavior rules. Signature-based pattern matching compares the signature of the file under inspection against the malicious code signature strings in a signature database: a successful match indicates that the file contains malicious code, and otherwise the file is considered free of malicious code. The drawback of this technique is that analysts must first discover and obtain a malicious code sample and extract its signature manually before it can be added to the signature database. Detection based on malicious behavior rules detects malicious code according to common behavior rules predefined by experts. This method lags behind the attack: as computers run ever faster, by the time the malicious behavior is detected it has often already caused irreparable damage to the system. Both techniques are after-the-fact detection techniques: they can detect only known malicious code, or can detect malicious code only after it has executed, by which time the damage has been done. Some other malicious code detection methods impose a high computational overhead and often slow down normal system operation.

Summary of the invention

The technical solution of the invention collects malicious and non-malicious samples to form a training sample set, randomly draws a certain number of samples each time, and then randomly extracts features from them for use in split training. The splitting criterion of each splitting node is selected according to the information-gain-maximization criterion, and the procedure is repeated to form multiple decision trees, which are then used to judge whether the data to be detected is malicious. By training on samples quickly, the technical solution improves the detection rate for unknown malicious code and makes detection results more accurate.

The present invention is implemented by the following method: a malicious code detection method based on information gain, comprising:

Step 1: collecting malicious samples and non-malicious samples to form a training sample set;

Step 2: selecting various splitting criteria to form an attribute set;

Step 3: drawing a preset number of samples from the training sample set by random sampling with replacement to form a test sample set;

Step 4: randomly extracting features from the test sample set to form a feature sample set;

Step 5: selecting the splitting criterion of each splitting node from the attribute set according to the information-gain-maximization criterion, splitting the feature sample set layer by layer until it can no longer be split, and finally combining the splitting criteria of the splitting nodes to form a decision tree;

Step 6: repeating steps 3 to 5 to obtain the required number of decision trees;

Step 7: judging the data to be detected with each decision tree, and combining all judgments to give the final detection result.

Further, selecting the splitting criterion of each splitting node from the attribute set according to the information-gain-maximization criterion comprises:

computing the information gain Gain of each splitting criterion in the attribute set at the current splitting node:

$$\mathrm{Gain} = \mathrm{Info}(D) - \sum_{i=1}^{K} \frac{D_i}{D}\,\mathrm{Info}(D_i)$$

$$\mathrm{Info}(D_i) = -\sum_{j=1}^{m} p_j \log_2 p_j$$

where D is the total number of samples entering the current splitting node, K is the number of partitions obtained by applying the chosen splitting criterion, D_i is the number of samples in the i-th partition, m = 2, p_1 is the probability that a sample among the D_i samples is malicious, and p_2 is the probability that a sample among the D_i samples is non-malicious; Info(D) is computed in the same way over all D samples;

and selecting, from the attribute set, the splitting criterion that maximizes the information gain Gain as the splitting criterion of this splitting node.
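The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the feature names `is_packed` and `has_signature`, the representation of a splitting criterion as a function from a sample to a partition key, and the helper names are all assumptions made for the example. The labels Y and N follow the embodiment's convention for malicious and non-malicious samples.

```python
import math
from collections import Counter


def entropy(labels):
    """Info(.) for a list of 'Y'/'N' labels: -sum(p_j * log2(p_j))."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(samples, labels, criterion):
    """Gain of partitioning the node's samples by criterion(sample)."""
    partitions = {}
    for sample, label in zip(samples, labels):
        partitions.setdefault(criterion(sample), []).append(label)
    total = len(labels)
    # Weighted entropy of the K partitions: sum(D_i / D * Info(D_i)).
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


def best_criterion(samples, labels, attribute_set):
    """Name of the splitting criterion with the largest Gain."""
    return max(attribute_set,
               key=lambda name: information_gain(samples, labels,
                                                 attribute_set[name]))


# Hypothetical attribute set: each criterion maps a sample to a partition key.
ATTRIBUTE_SET = {
    "is_packed": lambda s: s["is_packed"],
    "has_signature": lambda s: s["has_signature"],
}
SAMPLES = [
    {"is_packed": 1, "has_signature": 0},
    {"is_packed": 1, "has_signature": 1},
    {"is_packed": 0, "has_signature": 1},
    {"is_packed": 0, "has_signature": 0},
]
LABELS = ["Y", "Y", "N", "N"]
```

In this toy data `is_packed` separates the two classes completely, so its gain equals the node entropy of 1 bit, while `has_signature` leaves both partitions evenly mixed and gains nothing; `best_criterion(SAMPLES, LABELS, ATTRIBUTE_SET)` therefore picks `is_packed`.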

Further, after the splitting criterion of a splitting node has been selected from the attribute set according to the information-gain-maximization criterion, the method also comprises deleting that splitting criterion from the attribute set.
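A layer-by-layer split that deletes each chosen criterion from the attribute set might look like the sketch below. Two points are assumptions rather than statements from the text: the criterion is removed only for the subtree below the node that used it (the text does not say whether the deletion is per branch or global), and splitting stops when a node is pure or no criteria remain. The feature names and the dictionary-based tree layout are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(labels).values())


def gain(rows, criterion):
    """Information gain of partitioning (features, label) rows by criterion."""
    groups = {}
    for features, label in rows:
        groups.setdefault(criterion(features), []).append(label)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([label for _, label in rows]) - remainder


def build_tree(rows, attribute_set):
    """Split rows until they are pure or no splitting criterion is left."""
    labels = [label for _, label in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attribute_set:
        return majority  # leaf: the malicious / non-malicious label
    best = max(attribute_set, key=lambda n: gain(rows, attribute_set[n]))
    # Delete the chosen criterion so that deeper nodes cannot reuse it.
    remaining = {n: f for n, f in attribute_set.items() if n != best}
    groups = {}
    for row in rows:
        groups.setdefault(attribute_set[best](row[0]), []).append(row)
    return {"criterion": best, "default": majority,
            "children": {key: build_tree(group, remaining)
                         for key, group in groups.items()}}


# Hypothetical criteria and samples, for illustration only.
ATTRIBUTES = {
    "is_packed": lambda f: f["is_packed"],
    "large_file": lambda f: f["size_kb"] > 500,
}
ROWS = [
    ({"is_packed": 1, "size_kb": 900}, "Y"),
    ({"is_packed": 1, "size_kb": 100}, "Y"),
    ({"is_packed": 0, "size_kb": 800}, "N"),
    ({"is_packed": 0, "size_kb": 50}, "N"),
]
TREE = build_tree(ROWS, ATTRIBUTES)
```

Because the attribute set shrinks by one criterion at every level, the recursion is guaranteed to terminate even when a split fails to separate the classes. The caller's attribute set is copied rather than mutated, so the same set can be reused when further trees are built.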

The present invention can be implemented by the following system: a malicious code detection system based on information gain, comprising:

a training sample set generation module, for collecting malicious samples and non-malicious samples to form a training sample set;

an attribute set generation module, for selecting various splitting criteria to form an attribute set;

a test sample set generation module, for drawing a preset number of samples from the training sample set by random sampling with replacement to form a test sample set;

a feature sample set generation module, for randomly extracting features from the test sample set to form a feature sample set;

a decision tree generation module, for selecting the splitting criterion of each splitting node from the attribute set according to the information-gain-maximization criterion, splitting the feature sample set layer by layer until it can no longer be split, and finally combining the splitting criteria of the splitting nodes to form a decision tree;

a detection rule generation module, for repeatedly invoking the test sample set generation module, the feature sample set generation module, and the decision tree generation module to obtain the required number of decision trees;

a determination module, for judging the data to be detected with each decision tree, and combining all judgments to give the final detection result.

In the system, the information gain Gain of each splitting criterion in the attribute set at the current splitting node is computed in the same way as described for the method.

In summary, the invention provides a malicious code detection method and system based on information gain. The technical idea is to collect malicious and non-malicious samples and label each as malicious or not; randomly draw a preset number of samples from them to form a test sample set; randomly extract features of the test sample set to form a feature sample set; and split the feature sample set layer by layer until it can no longer be split, with the splitting criterion of each splitting node selected by maximizing information gain. Repeating these operations yields multiple decision trees, which are finally used to examine the data to be detected and give the final detection result.

The beneficial effects are as follows: because both the samples of the test sample set and the extracted features are chosen at random, the technical solution learns efficiently, effectively reduces overfitting to the feature data, and improves the detection rate and accuracy.

Brief description of the drawings

To illustrate the technical solution of the invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of an embodiment of the malicious code detection method based on information gain provided by the invention;

Fig. 2 is a structural diagram of an embodiment of the malicious code detection system based on information gain provided by the invention.

Detailed description of the embodiments

The invention provides embodiments of a malicious code detection method and system based on information gain. To help those skilled in the art better understand the technical solutions in the embodiments, and to make the above objects, features, and advantages of the invention more apparent, the technical solutions are described in further detail below with reference to the drawings:

The invention first provides an embodiment of a malicious code detection method based on information gain which, as shown in Fig. 1, comprises:

S101: collecting malicious samples and non-malicious samples to form a training sample set; each sample is labeled as malicious or not, for example with N or Y, where N denotes a non-malicious sample and Y denotes a malicious sample;

S102: selecting various splitting criteria to form an attribute set; the splitting criteria include, but are not limited to, criteria that may be used when judging whether code is malicious;

S103: drawing a preset number of samples from the training sample set by random sampling with replacement to form a test sample set; the preset number can be set as required;

S104: randomly extracting features from the test sample set to form a feature sample set;

The two random extractions effectively address the problem that a detection method performs well on the training sample set but poorly in real detection, improving its learning ability and its ability to detect unknown malicious code;
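The two random extractions in S103 and S104 could be sketched as follows. This is only an illustration under stated assumptions: "randomly extracting features" is interpreted here as keeping a random subset of feature columns, which the text does not spell out, and the function names, feature names, and the parameters `n` and `k` are hypothetical.

```python
import random


def bootstrap_sample(training_set, n, rng):
    """S103: draw n samples at random WITH replacement (duplicates allowed)."""
    return [rng.choice(training_set) for _ in range(n)]


def random_feature_subset(feature_names, k, rng):
    """S104 (assumed reading): pick k distinct feature names at random."""
    return rng.sample(feature_names, k)


def make_feature_sample_set(training_set, n, k, rng):
    """Build the feature sample set: bootstrap rows, then project features."""
    test_sample_set = bootstrap_sample(training_set, n, rng)
    all_features = sorted(training_set[0][0])
    chosen = random_feature_subset(all_features, k, rng)
    rows = [({name: features[name] for name in chosen}, label)
            for features, label in test_sample_set]
    return rows, chosen


# Hypothetical labeled training sample set: (features, 'Y' or 'N').
TRAINING_SET = [
    ({"is_packed": 1, "imports_net": 1, "size_kb": 120}, "Y"),
    ({"is_packed": 0, "imports_net": 0, "size_kb": 900}, "N"),
    ({"is_packed": 1, "imports_net": 0, "size_kb": 45}, "Y"),
    ({"is_packed": 0, "imports_net": 1, "size_kb": 300}, "N"),
]

rows, chosen = make_feature_sample_set(TRAINING_SET, n=6, k=2,
                                       rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps each draw reproducible, which is convenient when comparing trees built from different test sample sets.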

S105: selecting the splitting criterion of each splitting node from the attribute set according to the information-gain-maximization criterion, splitting the feature sample set layer by layer until it can no longer be split, and finally combining the splitting criteria of the splitting nodes to form a decision tree;

Because the splitting criterion of each splitting node is chosen by maximizing information gain, the resulting decision tree, or split structure, can handle both continuous and discrete values, is easy to compute and efficient, effectively reduces noise, and effectively addresses the problem of low detection accuracy;

S106: determining whether the number of decision trees has reached the required number; if so, performing S107, otherwise returning to S103;

S107: judging the data to be detected with each decision tree, and combining all judgments to give the final detection result.

A single decision tree performs relatively weakly on the data to be detected and may overfit, so its accuracy on unknown malicious code is not high; combining the judgments of multiple decision trees effectively improves detection accuracy. Combining all judgments to give the final detection result may be, but is not limited to, choosing the judgment with the largest share among all judgments as the final detection result.
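Choosing the judgment with the largest share amounts to a majority vote, sketched below. Representing each decision tree as a callable that returns Y or N is an assumption made to keep the example short, and the stub trees and feature names are hypothetical. The text does not say how ties are resolved; this sketch falls back to the judgment that was seen first.

```python
from collections import Counter


def combine_verdicts(verdicts):
    """Return the judgment that accounts for the largest share of votes."""
    return Counter(verdicts).most_common(1)[0][0]


def detect(trees, data):
    """S107: let every tree judge the data, then combine the judgments."""
    return combine_verdicts([tree(data) for tree in trees])


# Hypothetical stand-ins for trained decision trees.
STUB_TREES = [
    lambda d: "Y" if d["is_packed"] else "N",
    lambda d: "Y" if d["imports_net"] else "N",
    lambda d: "N",
]
```

For a sample with both `is_packed` and `imports_net` set, two of the three stub trees vote Y, so `detect` reports Y even though one tree disagrees.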

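Preferably, selecting the splitting criterion of each splitting node from the attribute set according to the information-gain-maximization criterion comprises computing the information gain Gain of each splitting criterion in the attribute set at the current splitting node, as defined in the summary of the invention, and selecting the splitting criterion that maximizes Gain.

Preferably, after the splitting criterion of a splitting node has been selected from the attribute set according to the information-gain-maximization criterion, the method also comprises deleting that splitting criterion from the attribute set.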

The invention also provides an embodiment of a malicious code detection system based on information gain which, as shown in Fig. 2, comprises:

a training sample set generation module 201, for collecting malicious samples and non-malicious samples to form a training sample set;

an attribute set generation module 202, for selecting various splitting criteria to form an attribute set;

a test sample set generation module 203, for drawing a preset number of samples from the training sample set by random sampling with replacement to form a test sample set;

a feature sample set generation module 204, for randomly extracting features from the test sample set to form a feature sample set;

a decision tree generation module 205, for selecting the splitting criterion of each splitting node from the attribute set according to the information-gain-maximization criterion, splitting the feature sample set layer by layer until it can no longer be split, and finally combining the splitting criteria of the splitting nodes to form a decision tree;

a detection rule generation module 206, for repeatedly invoking the test sample set generation module, the feature sample set generation module, and the decision tree generation module to obtain the required number of decision trees;

a determination module 207, for judging the data to be detected with each decision tree, and combining all judgments to give the final detection result.
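How modules 203 to 207 might fit together is sketched below as plain functions. It is a simplified illustration, not the patented system: each splitting criterion is reduced to "partition by the value of one feature", the chosen feature is dropped for the subtree beneath it, and all function names, parameters, and sample data are assumptions.

```python
import math
import random
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def gain(rows, feature):
    groups = {}
    for feats, label in rows:
        groups.setdefault(feats[feature], []).append(label)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([label for _, label in rows]) - remainder


def grow(rows, features):
    """Module 205: split by maximal information gain until no split is left."""
    labels = [label for _, label in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    best = max(features, key=lambda f: gain(rows, f))
    rest = [f for f in features if f != best]
    groups = {}
    for row in rows:
        groups.setdefault(row[0][best], []).append(row)
    return (best, majority, {v: grow(g, rest) for v, g in groups.items()})


def judge(tree, feats):
    """Walk one tree; unseen feature values fall back to the node majority."""
    while isinstance(tree, tuple):
        best, majority, children = tree
        tree = children.get(feats[best], majority)
    return tree


def train_forest(training_set, n_trees, n_samples, n_features, seed=0):
    """Module 206: repeat modules 203-205 until n_trees trees exist."""
    rng = random.Random(seed)
    names = sorted(training_set[0][0])
    forest = []
    for _ in range(n_trees):
        # Module 203: random sampling with replacement.
        test_set = [rng.choice(training_set) for _ in range(n_samples)]
        # Module 204: random feature extraction (assumed: a feature subset).
        chosen = rng.sample(names, n_features)
        forest.append(grow(test_set, chosen))
    return forest


def detect(forest, feats):
    """Module 207: combine the judgments of all trees by majority."""
    return Counter(judge(t, feats) for t in forest).most_common(1)[0][0]


# Hypothetical training sample set in which either feature predicts the label.
TRAINING = [
    ({"a": 1, "b": 1}, "Y"), ({"a": 1, "b": 1}, "Y"),
    ({"a": 0, "b": 0}, "N"), ({"a": 0, "b": 0}, "N"),
]
FOREST = train_forest(TRAINING, n_trees=5, n_samples=50, n_features=1, seed=1)
```

Keeping the sampling, feature extraction, tree growing, and voting stages as separate functions mirrors the module boundaries in Fig. 2, so any one stage can be swapped out without touching the others.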


As described above, the invention splits the samples in the feature sample set layer by layer until they can no longer be split; the final leaves of the split are the labels indicating whether each sample is malicious. The splitting criterion of each splitting node is selected by maximizing information gain. The resulting decision trees are then used to judge the data to be detected, and the individual judgments are combined into the final detection result.

In summary, selecting the splitting criterion of each splitting node by maximizing information gain achieves the best splitting effect and thereby improves the accuracy of judgments on the data to be detected. Randomly selecting samples to form the test sample set and randomly extracting features to form the feature sample set greatly improves the detection rate for unknown code, effectively addresses overfitting, and improves each decision tree's detection performance on the data to be detected.

The above embodiments are intended to illustrate, not to limit, the technical solution of the invention. Any modification or partial replacement that does not depart from the spirit and scope of the invention shall fall within the scope of the claims of the invention.

Claims

1. A malicious code detection method based on information gain, characterized by comprising:

Step 2: selecting various splitting criteria to form an attribute set;

Step 6: repeating steps 3 to 5 to obtain the required number of decision trees;

2. The method of claim 1, characterized in that selecting the splitting criterion of each splitting node from the attribute set according to the information-gain-maximization criterion comprises:

；

3. The method of claim 1, characterized in that, after the splitting criterion of a splitting node has been selected from the attribute set according to the information-gain-maximization criterion, the method also comprises deleting that splitting criterion from the attribute set.

4. A malicious code detection system based on information gain, characterized by comprising:

5. The system of claim 4, characterized in that selecting the splitting criterion of each splitting node from the attribute set according to the information-gain-maximization criterion comprises:

；

6. The system of claim 4, characterized in that, after the splitting criterion of a splitting node has been selected from the attribute set according to the information-gain-maximization criterion, that splitting criterion is also deleted from the attribute set.