CN105488413A - Malicious code detection method and system based on information gain - Google Patents

Malicious code detection method and system based on information gain

Info

Publication number
CN105488413A
Authority
CN
China
Prior art keywords
sample
information gain
splitting criterion
criterion
split node
Prior art date
Legal status
Pending
Application number
CN201510344523.2A
Other languages
Chinese (zh)
Inventor
常安琪
李柏松
Current Assignee
Harbin Antiy Technology Co Ltd
Original Assignee
Harbin Antiy Technology Co Ltd
Priority date
2015-06-19
Filing date
2015-06-19
Publication date
2016-04-13
Application filed by Harbin Antiy Technology Co Ltd
Priority to CN201510344523.2A
Publication of CN105488413A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software

Abstract

The invention discloses a malicious code detection method based on information gain. The method comprises the following steps: collecting samples to form a training sample set; selecting splitting criteria to form an attribute set; drawing samples from the training sample set by random sampling with replacement to form a test sample set; randomly extracting features from the test sample set to form a feature sample set; selecting the splitting criterion of each split node from the attribute set based on the information-gain maximization standard; splitting the feature sample set layer by layer until it can no longer be split; finally combining the splitting criteria of all split nodes to form a decision tree; repeating these steps to obtain the required number of decision trees; judging the data to be detected with each decision tree; and giving the final detection result by combining all judgment results. The invention also discloses a malicious code detection system based on information gain. With this technical scheme, unknown malicious code can be effectively recognized and detection efficiency can be improved.

Description

A malicious code detection method and system based on information gain
Technical field
The present invention relates to the field of information security technology, and in particular to a malicious code detection method and system based on information gain.
Background art
At present, although malicious code detection technology continues to develop, both the techniques and the capability of detection still lag behind the development of malicious code itself; the detection of unknown malicious code in particular poses a major challenge. Commonly used detection techniques include signature-based pattern matching and detection based on malicious-code behavior rules. Signature-based pattern matching compares the signature of the file under inspection against the malicious-code signature strings in a signature database; a successful match indicates that the file contains malicious code, and otherwise the file is considered free of malicious code. The shortcoming of this technique is that analysts must first discover and obtain a malicious code sample and extract its signature before it can be added to the signature database. Detection based on malicious-code behavior rules detects malicious code according to common behavior rules predefined by experts. This method suffers from latency: particularly as computers run ever faster, by the time malicious behavior is detected it has often already caused irreparable damage to the system. Both techniques are after-the-fact detection: they can detect only known malicious code, or can detect malicious code only after it has executed, by which time the damage has been done. In addition, some malicious code detection methods impose a high overhead on the computer and often slow down the normal operation of the system.
Summary of the invention
The technical solution of the present invention collects malicious and non-malicious samples to form a training sample set, randomly draws a certain number of samples each time, and then randomly extracts features from them for training and splitting. The splitting criterion of each split node is selected according to the information-gain maximization standard; repeating this process finally yields multiple decision trees, which are used to judge whether the data to be detected is malicious. By training on samples quickly in this way, the technical solution improves the detection rate for unknown malicious code and makes the detection result more accurate.
The present invention is realized by the following method: a malicious code detection method based on information gain, comprising:
Step 1: collect malicious samples and non-malicious samples to form a training sample set;
Step 2: select various splitting criteria to form an attribute set;
Step 3: draw a preset number of samples from the training sample set by random sampling with replacement to form a test sample set;
Step 4: randomly extract features from the test sample set to form a feature sample set;
Step 5: select the splitting criterion of each split node from the attribute set based on the information-gain maximization standard, split the feature sample set layer by layer until it can no longer be split, and finally combine the splitting criteria of all split nodes to form a decision tree;
Step 6: repeat steps 3 to 5 to obtain the required number of decision trees;
Step 7: judge the data to be detected with each decision tree, and combine all judgment results to give the final detection result.
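Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes each sample is a pair of a feature dictionary and a label ("Y" malicious, "N" non-malicious), and the feature names, the helper names `bootstrap_sample` and `random_feature_subset`, and the sample values are all hypothetical.

```python
import random

def bootstrap_sample(training_set, n, rng):
    """Step 3: draw n samples uniformly at random with replacement."""
    return [rng.choice(training_set) for _ in range(n)]

def random_feature_subset(sample_set, n_features, rng):
    """Step 4: keep a random subset of the feature columns of every sample."""
    all_features = sorted(sample_set[0][0])      # feature names of the first sample
    chosen = rng.sample(all_features, n_features)
    projected = [({f: feats[f] for f in chosen}, label)
                 for feats, label in sample_set]
    return chosen, projected

# Hypothetical training sample set: (features, label)
training_set = [
    ({"packed": 1, "writes_registry": 1, "signed": 0}, "Y"),
    ({"packed": 0, "writes_registry": 0, "signed": 1}, "N"),
    ({"packed": 1, "writes_registry": 0, "signed": 0}, "Y"),
    ({"packed": 0, "writes_registry": 1, "signed": 1}, "N"),
]

rng = random.Random(0)                            # fixed seed for repeatability
test_set = bootstrap_sample(training_set, 6, rng)
chosen, feature_set = random_feature_subset(test_set, 2, rng)
```

Because sampling is done with replacement, the test sample set may be larger than the training sample set and may contain the same sample more than once.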
Further, selecting the splitting criterion of each split node from the attribute set based on the information-gain maximization standard comprises:
computing the information gain Gain of each splitting criterion in the attribute set at this split node. The formula itself is not preserved in this text; the standard form consistent with the definitions that follow is:
$$\mathrm{Gain} = H - \sum_{k=1}^{K} \frac{D_k}{D}\, H_k, \qquad H_k = -\sum_{i=1}^{m} p_{k,i} \log_2 p_{k,i}$$
where $D$ is the total number of samples entering the current split node, $K$ is the number of branches obtained with the selected splitting criterion, $D_k$ is the number of samples in branch $k$, $m = 2$, $p_{k,1}$ is the probability that one of the $D_k$ samples is a malicious sample, and $p_{k,2}$ is the probability that one of the $D_k$ samples is a non-malicious sample; $H$ is the same entropy computed over all $D$ samples before the split;
selecting, from the attribute set, the splitting criterion that maximizes the information gain Gain as the splitting criterion of this split node.
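The selection rule can be sketched as below. The sketch assumes the standard entropy-based information gain with base-2 logarithms and treats each splitting criterion as a test on one named feature; the function names and the sample data are hypothetical and not taken from the patent.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(samples, criterion):
    """Entropy of the node minus the size-weighted entropy of its branches."""
    labels = [label for _, label in samples]
    branches = {}
    for feats, label in samples:
        branches.setdefault(feats[criterion], []).append(label)
    weighted = sum(len(b) / len(samples) * entropy(b)
                   for b in branches.values())
    return entropy(labels) - weighted

def best_criterion(samples, attribute_set):
    """Return the criterion in attribute_set with the largest information gain."""
    return max(attribute_set, key=lambda c: information_gain(samples, c))

# Hypothetical samples: "packed" separates the classes, "signed" does not.
samples = [
    ({"packed": 1, "signed": 0}, "Y"),
    ({"packed": 1, "signed": 1}, "Y"),
    ({"packed": 0, "signed": 1}, "N"),
    ({"packed": 0, "signed": 0}, "N"),
]
chosen = best_criterion(samples, ["signed", "packed"])
```

Here `packed` yields two pure branches and therefore the full one bit of gain, while `signed` yields none, so `packed` is selected.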
Further, after the splitting criterion of each split node has been selected from the attribute set based on the information-gain maximization standard, the method also comprises deleting that splitting criterion from the attribute set.
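A layer-by-layer split in which each selected criterion is removed from the attribute set before descending might look like the following sketch. It is an ID3-style recursion written under stated assumptions: criteria are tests on named features, splitting stops when a node is pure or no criterion remains, and the tree representation, function names and sample data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain(samples, criterion):
    branches = {}
    for feats, label in samples:
        branches.setdefault(feats[criterion], []).append(label)
    rest = sum(len(b) / len(samples) * entropy(b) for b in branches.values())
    return entropy([label for _, label in samples]) - rest

def build_tree(samples, attribute_set):
    """Split recursively; the chosen criterion is removed before descending."""
    labels = [label for _, label in samples]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or no criterion is left to split on.
    if len(set(labels)) == 1 or not attribute_set:
        return majority
    chosen = max(attribute_set, key=lambda c: gain(samples, c))
    remaining = [c for c in attribute_set if c != chosen]  # delete used criterion
    children = {}
    for value in {feats[chosen] for feats, _ in samples}:
        subset = [(f, l) for f, l in samples if f[chosen] == value]
        children[value] = build_tree(subset, remaining)
    return {"criterion": chosen, "children": children, "default": majority}

def classify(tree, feats):
    """Walk from the root to a leaf label ('Y' malicious, 'N' non-malicious)."""
    while isinstance(tree, dict):
        tree = tree["children"].get(feats[tree["criterion"]], tree["default"])
    return tree

# Hypothetical feature sample set
samples = [
    ({"packed": 1, "autorun": 1}, "Y"),
    ({"packed": 1, "autorun": 0}, "Y"),
    ({"packed": 0, "autorun": 1}, "Y"),
    ({"packed": 0, "autorun": 0}, "N"),
    ({"packed": 1, "autorun": 0}, "Y"),
]
tree = build_tree(samples, ["packed", "autorun"])
```

Removing a used criterion guarantees that the recursion terminates, because the attribute set shrinks by one at every level.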
The present invention can be realized by the following system: a malicious code detection system based on information gain, comprising:
a training sample set generation module, configured to collect malicious samples and non-malicious samples to form a training sample set;
an attribute set generation module, configured to select various splitting criteria to form an attribute set;
a test sample set generation module, configured to draw a preset number of samples from the training sample set by random sampling with replacement to form a test sample set;
a feature sample set generation module, configured to randomly extract features from the test sample set to form a feature sample set;
a decision tree generation module, configured to select the splitting criterion of each split node from the attribute set based on the information-gain maximization standard, split the feature sample set layer by layer until it can no longer be split, and finally combine the splitting criteria of all split nodes to form a decision tree;
a detection rule generation module, configured to repeatedly invoke the test sample set generation module, the feature sample set generation module and the decision tree generation module to obtain the required number of decision trees;
a determination module, configured to judge the data to be detected with each decision tree and combine all judgment results to give the final detection result.
Further, selecting the splitting criterion of each split node from the attribute set based on the information-gain maximization standard comprises:
computing the information gain Gain of each splitting criterion in the attribute set at this split node. The formula itself is not preserved in this text; the standard form consistent with the definitions that follow is:
$$\mathrm{Gain} = H - \sum_{k=1}^{K} \frac{D_k}{D}\, H_k, \qquad H_k = -\sum_{i=1}^{m} p_{k,i} \log_2 p_{k,i}$$
where $D$ is the total number of samples entering the current split node, $K$ is the number of branches obtained with the selected splitting criterion, $D_k$ is the number of samples in branch $k$, $m = 2$, $p_{k,1}$ is the probability that one of the $D_k$ samples is a malicious sample, and $p_{k,2}$ is the probability that one of the $D_k$ samples is a non-malicious sample; $H$ is the same entropy computed over all $D$ samples before the split;
selecting, from the attribute set, the splitting criterion that maximizes the information gain Gain as the splitting criterion of this split node.
Further, after the splitting criterion of each split node has been selected from the attribute set based on the information-gain maximization standard, the system also deletes that splitting criterion from the attribute set.
In summary, the present invention provides a malicious code detection method and system based on information gain. Its technical idea is to collect malicious and non-malicious samples and label each as malicious or not; randomly draw a preset number of samples from them to form a test sample set; randomly extract features of the test sample set to form a feature sample set; and then split the feature sample set layer by layer until it can no longer be split, with the splitting criterion of each split node selected by maximizing information gain. Repeating these operations yields multiple decision trees, which are finally used to detect the data to be detected and give the final detection result.
The beneficial effects are as follows: because both the samples of the test sample set and the features are selected at random, the technical solution learns efficiently, effectively reduces overfitting to the feature data, and improves the detection rate and accuracy.
Description of the drawings
In order to illustrate the technical solution of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. Evidently, the drawings described below are only some embodiments recorded in the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of an embodiment of the malicious code detection method based on information gain provided by the present invention;
Fig. 2 is a structural diagram of an embodiment of the malicious code detection system based on information gain provided by the present invention.
Detailed description of the embodiments
The present invention provides embodiments of a malicious code detection method and system based on information gain. To help those skilled in the art better understand the technical solutions in the embodiments, and to make the above objects, features and advantages of the invention more apparent, the technical solution is described in further detail below with reference to the drawings:
The present invention first provides an embodiment of a malicious code detection method based on information gain which, as shown in Fig. 1, comprises:
S101: collect malicious samples and non-malicious samples to form a training sample set, where each sample is labeled as malicious or not, for example N or Y, with N denoting a non-malicious sample and Y denoting a malicious sample;
S102: select various splitting criteria to form an attribute set, where the splitting criteria include, but are not limited to, the criteria that may be used when judging malicious code;
S103: draw a preset number of samples from the training sample set by random sampling with replacement to form a test sample set, where the preset number can be set as required;
S104: randomly extract features from the test sample set to form a feature sample set;
Here, the two rounds of random extraction effectively address the problem of a detection method performing well on the training sample set but poorly in real detection; they improve the method's learning ability and its ability to detect unknown malicious code;
S105: select the splitting criterion of each split node from the attribute set based on the information-gain maximization standard, split the feature sample set layer by layer until it can no longer be split, and finally combine the splitting criteria of all split nodes to form a decision tree;
Here, because the splitting criterion of each split node is selected based on the information-gain maximization standard, the resulting decision tree (that is, the split structure) can handle both continuous and discrete values, is easy to compute and efficient, effectively reduces noise, and effectively addresses the problem of low detection accuracy;
S106: judge whether the number of decision trees has reached the required number; if so, perform S107, otherwise perform S103;
S107: judge the data to be detected with each decision tree, and combine all judgment results to give the final detection result.
Here, a single decision tree performs relatively weakly when detecting the data to be detected and may overfit, so its judgment accuracy for unknown malicious code is not high; judging by a combination of multiple decision trees effectively improves detection accuracy. Combining all judgment results to give the final detection result may be, but is not limited to, selecting the judgment result with the largest share as the final detection result.
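Choosing the judgment result with the largest share amounts to a majority vote over the per-tree verdicts. A minimal sketch, in which the function name `combine_verdicts` and the list of verdicts are hypothetical:

```python
from collections import Counter

def combine_verdicts(verdicts):
    """Return the verdict given by the largest share of trees, and that share."""
    counts = Counter(verdicts)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(verdicts)

# Hypothetical verdicts from five decision trees ('Y' malicious, 'N' non-malicious)
verdicts = ["Y", "N", "Y", "Y", "N"]
final, share = combine_verdicts(verdicts)
```

With an even number of trees the vote can tie, in which case this sketch returns whichever label was seen first; using an odd number of trees avoids ties for the two-class case.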
Preferably, selecting the splitting criterion of each split node from the attribute set based on the information-gain maximization standard comprises:
computing the information gain Gain of each splitting criterion in the attribute set at this split node. The formula itself is not preserved in this text; the standard form consistent with the definitions that follow is:
$$\mathrm{Gain} = H - \sum_{k=1}^{K} \frac{D_k}{D}\, H_k, \qquad H_k = -\sum_{i=1}^{m} p_{k,i} \log_2 p_{k,i}$$
where $D$ is the total number of samples entering the current split node, $K$ is the number of branches obtained with the selected splitting criterion, $D_k$ is the number of samples in branch $k$, $m = 2$, $p_{k,1}$ is the probability that one of the $D_k$ samples is a malicious sample, and $p_{k,2}$ is the probability that one of the $D_k$ samples is a non-malicious sample; $H$ is the same entropy computed over all $D$ samples before the split;
selecting, from the attribute set, the splitting criterion that maximizes the information gain Gain as the splitting criterion of this split node.
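To make the selection rule concrete, a small worked example follows. It assumes the standard entropy-based form $\mathrm{Gain} = H - \sum_k \frac{D_k}{D} H_k$ with base-2 logarithms; the sample counts are invented for illustration and do not come from the patent.

```latex
% Hypothetical node: D = 10 samples, 5 malicious and 5 non-malicious.
% A candidate criterion yields K = 2 branches:
%   branch 1: D_1 = 4 samples (4 malicious, 0 non-malicious)
%   branch 2: D_2 = 6 samples (1 malicious, 5 non-malicious)
\begin{align*}
H   &= -\tfrac{5}{10}\log_2\tfrac{5}{10} - \tfrac{5}{10}\log_2\tfrac{5}{10} = 1 \\
H_1 &= 0 \quad \text{(pure branch, taking } 0\log_2 0 = 0\text{)} \\
H_2 &= -\tfrac{1}{6}\log_2\tfrac{1}{6} - \tfrac{5}{6}\log_2\tfrac{5}{6} \approx 0.650 \\
\mathrm{Gain} &= 1 - \left(\tfrac{4}{10}\cdot 0 + \tfrac{6}{10}\cdot 0.650\right) \approx 0.610
\end{align*}
```

A criterion that left both branches as mixed as the parent would score a gain near 0, so this candidate would be preferred over it.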
Preferably, after the splitting criterion of each split node has been selected from the attribute set based on the information-gain maximization standard, the method also comprises deleting that splitting criterion from the attribute set.
The present invention also provides an embodiment of a malicious code detection system based on information gain which, as shown in Fig. 2, comprises:
a training sample set generation module 201, configured to collect malicious samples and non-malicious samples to form a training sample set;
an attribute set generation module 202, configured to select various splitting criteria to form an attribute set;
a test sample set generation module 203, configured to draw a preset number of samples from the training sample set by random sampling with replacement to form a test sample set;
a feature sample set generation module 204, configured to randomly extract features from the test sample set to form a feature sample set;
a decision tree generation module 205, configured to select the splitting criterion of each split node from the attribute set based on the information-gain maximization standard, split the feature sample set layer by layer until it can no longer be split, and finally combine the splitting criteria of all split nodes to form a decision tree;
a detection rule generation module 206, configured to repeatedly invoke the test sample set generation module, the feature sample set generation module and the decision tree generation module to obtain the required number of decision trees;
a determination module 207, configured to judge the data to be detected with each decision tree and combine all judgment results to give the final detection result.
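How modules 203 to 207 fit together can be sketched end to end as below. This is an illustrative wiring under several assumptions, not the patented implementation: each splitting criterion is equated with one named binary feature, random feature extraction is modeled as choosing a random subset of those names, the final result is a majority vote, and every function name, parameter and data value is hypothetical.

```python
import math
import random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(samples, crit):
    groups = {}
    for feats, label in samples:
        groups.setdefault(feats[crit], []).append(label)
    rest = sum(len(g) / len(samples) * entropy(g) for g in groups.values())
    return entropy([l for _, l in samples]) - rest

def grow(samples, attrs):
    """Decision tree generation (module 205): split until pure or out of criteria."""
    labels = [l for _, l in samples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority
    crit = max(attrs, key=lambda a: gain(samples, a))
    rest = [a for a in attrs if a != crit]          # used criterion is removed
    kids = {v: grow([s for s in samples if s[0][crit] == v], rest)
            for v in {f[crit] for f, _ in samples}}
    return {"crit": crit, "kids": kids, "default": majority}

def judge(tree, feats):
    """Verdict of a single tree for one feature vector."""
    while isinstance(tree, dict):
        tree = tree["kids"].get(feats[tree["crit"]], tree["default"])
    return tree

def train(training_set, attribute_set, n_trees, n_samples, n_features, seed=0):
    """Detection rule generation (module 206): repeat modules 203 to 205."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        test_set = [rng.choice(training_set) for _ in range(n_samples)]  # 203
        attrs = rng.sample(attribute_set, n_features)                    # 204
        forest.append(grow(test_set, attrs))                             # 205
    return forest

def detect(forest, feats):
    """Determination (module 207): majority vote over all trees."""
    votes = Counter(judge(tree, feats) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical data: malicious samples show every behavior, benign ones none.
attribute_set = ["packed", "autorun", "hooks_keyboard", "self_deletes"]
training_set = ([({a: 1 for a in attribute_set}, "Y")] * 3
                + [({a: 0 for a in attribute_set}, "N")] * 3)
forest = train(training_set, attribute_set, n_trees=9, n_samples=12, n_features=2)
verdict = detect(forest, {a: 1 for a in attribute_set})
```

The toy data is deliberately separable by any single feature so that the example stays small; real feature sample sets would need many more trees and features for the vote to be meaningful.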
Preferably, selecting the splitting criterion of each split node from the attribute set based on the information-gain maximization standard comprises:
computing the information gain Gain of each splitting criterion in the attribute set at this split node. The formula itself is not preserved in this text; the standard form consistent with the definitions that follow is:
$$\mathrm{Gain} = H - \sum_{k=1}^{K} \frac{D_k}{D}\, H_k, \qquad H_k = -\sum_{i=1}^{m} p_{k,i} \log_2 p_{k,i}$$
where $D$ is the total number of samples entering the current split node, $K$ is the number of branches obtained with the selected splitting criterion, $D_k$ is the number of samples in branch $k$, $m = 2$, $p_{k,1}$ is the probability that one of the $D_k$ samples is a malicious sample, and $p_{k,2}$ is the probability that one of the $D_k$ samples is a non-malicious sample; $H$ is the same entropy computed over all $D$ samples before the split;
selecting, from the attribute set, the splitting criterion that maximizes the information gain Gain as the splitting criterion of this split node.
Preferably, after the splitting criterion of each split node has been selected from the attribute set based on the information-gain maximization standard, the system also deletes that splitting criterion from the attribute set.
As described above, the present invention splits the samples in the feature sample set layer by layer until they can no longer be split; the final leaves of the splitting are the labels indicating whether each sample is malicious. The splitting criterion of each split node is selected according to the information-gain maximization standard. The multiple decision trees thus formed are finally used to judge the data to be detected, and the multiple judgment results are combined to give the final detection result.
In summary, selecting the splitting criterion of each split node by always maximizing information gain achieves the best splitting effect and thereby improves the accuracy of judging the data to be detected. Forming the test sample set from randomly selected samples and the feature sample set from randomly extracted features greatly improves the detection rate for unknown code, effectively addresses overfitting, and improves the detection performance of each decision tree on the data to be detected.
The above embodiments are intended to illustrate, not to limit, the technical solution of the present invention. Any modification or partial replacement that does not depart from the spirit and scope of the present invention shall be covered by the claims of the present invention.

Claims (6)

1. A malicious code detection method based on information gain, characterized by comprising:
Step 1: collecting malicious samples and non-malicious samples to form a training sample set;
Step 2: selecting various splitting criteria to form an attribute set;
Step 3: drawing a preset number of samples from the training sample set by random sampling with replacement to form a test sample set;
Step 4: randomly extracting features from the test sample set to form a feature sample set;
Step 5: selecting the splitting criterion of each split node from the attribute set based on the information-gain maximization standard, splitting the feature sample set layer by layer until it can no longer be split, and finally combining the splitting criteria of all split nodes to form a decision tree;
Step 6: repeating steps 3 to 5 to obtain the required number of decision trees;
Step 7: judging the data to be detected with each decision tree, and combining all judgment results to give the final detection result.
2. The method of claim 1, characterized in that selecting the splitting criterion of each split node from the attribute set based on the information-gain maximization standard comprises:
computing the information gain Gain of each splitting criterion in the attribute set at this split node. The formula itself is not preserved in this text; the standard form consistent with the definitions that follow is:
$$\mathrm{Gain} = H - \sum_{k=1}^{K} \frac{D_k}{D}\, H_k, \qquad H_k = -\sum_{i=1}^{m} p_{k,i} \log_2 p_{k,i}$$
where $D$ is the total number of samples entering the current split node, $K$ is the number of branches obtained with the selected splitting criterion, $D_k$ is the number of samples in branch $k$, $m = 2$, $p_{k,1}$ is the probability that one of the $D_k$ samples is a malicious sample, and $p_{k,2}$ is the probability that one of the $D_k$ samples is a non-malicious sample; $H$ is the same entropy computed over all $D$ samples before the split;
selecting, from the attribute set, the splitting criterion that maximizes the information gain Gain as the splitting criterion of this split node.
3. The method of claim 1, characterized in that, after the splitting criterion of each split node has been selected from the attribute set based on the information-gain maximization standard, the method further comprises deleting that splitting criterion from the attribute set.
4. A malicious code detection system based on information gain, characterized by comprising:
a training sample set generation module, configured to collect malicious samples and non-malicious samples to form a training sample set;
an attribute set generation module, configured to select various splitting criteria to form an attribute set;
a test sample set generation module, configured to draw a preset number of samples from the training sample set by random sampling with replacement to form a test sample set;
a feature sample set generation module, configured to randomly extract features from the test sample set to form a feature sample set;
a decision tree generation module, configured to select the splitting criterion of each split node from the attribute set based on the information-gain maximization standard, split the feature sample set layer by layer until it can no longer be split, and finally combine the splitting criteria of all split nodes to form a decision tree;
a detection rule generation module, configured to repeatedly invoke the test sample set generation module, the feature sample set generation module and the decision tree generation module to obtain the required number of decision trees;
a determination module, configured to judge the data to be detected with each decision tree and combine all judgment results to give the final detection result.
5. The system of claim 4, characterized in that selecting the splitting criterion of each split node from the attribute set based on the information-gain maximization standard comprises:
computing the information gain Gain of each splitting criterion in the attribute set at this split node. The formula itself is not preserved in this text; the standard form consistent with the definitions that follow is:
$$\mathrm{Gain} = H - \sum_{k=1}^{K} \frac{D_k}{D}\, H_k, \qquad H_k = -\sum_{i=1}^{m} p_{k,i} \log_2 p_{k,i}$$
where $D$ is the total number of samples entering the current split node, $K$ is the number of branches obtained with the selected splitting criterion, $D_k$ is the number of samples in branch $k$, $m = 2$, $p_{k,1}$ is the probability that one of the $D_k$ samples is a malicious sample, and $p_{k,2}$ is the probability that one of the $D_k$ samples is a non-malicious sample; $H$ is the same entropy computed over all $D$ samples before the split;
selecting, from the attribute set, the splitting criterion that maximizes the information gain Gain as the splitting criterion of this split node.
6. The system of claim 4, characterized in that, after the splitting criterion of each split node has been selected from the attribute set based on the information-gain maximization standard, the system further deletes that splitting criterion from the attribute set.
CN201510344523.2A 2015-06-19 2015-06-19 Malicious code detection method and system based on information gain Pending CN105488413A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510344523.2A CN105488413A (en) 2015-06-19 2015-06-19 Malicious code detection method and system based on information gain

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510344523.2A CN105488413A (en) 2015-06-19 2015-06-19 Malicious code detection method and system based on information gain

Publications (1)

Publication Number Publication Date
CN105488413A true CN105488413A (en) 2016-04-13

Family

ID=55675387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510344523.2A Pending CN105488413A (en) 2015-06-19 2015-06-19 Malicious code detection method and system based on information gain

Country Status (1)

Country Link
CN (1) CN105488413A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106960154A (en) * 2017-03-30 2017-07-18 兴华永恒(北京)科技有限责任公司 A kind of rogue program dynamic identifying method based on decision-tree model
CN110336835A (en) * 2019-08-05 2019-10-15 深信服科技股份有限公司 Detection method, user equipment, storage medium and the device of malicious act
WO2020168718A1 (en) * 2019-02-20 2020-08-27 深圳大学 Classifier robustness testing method, apparatus, terminal and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530546A (en) * 2013-10-25 2014-01-22 东北大学 Identity authentication method based on mouse behaviors of user
CN104537010A (en) * 2014-12-17 2015-04-22 温州大学 Component classifying method based on net establishing software of decision tree

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
张涛等: "基于模糊模式与决策树融合的脚本病毒检测算法", 《电子与信息学报》 *
张福勇等: "基于C4.5决策树的嵌入型恶意代码检测方法", 《华南理工大学学报(自然科学版)》 *
朱立军等: "C4.5算法在未知恶意代码识别中的应用", 《沈阳化工大学学报》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106960154A (en) * 2017-03-30 2017-07-18 兴华永恒(北京)科技有限责任公司 A kind of rogue program dynamic identifying method based on decision-tree model
WO2020168718A1 (en) * 2019-02-20 2020-08-27 深圳大学 Classifier robustness testing method, apparatus, terminal and storage medium
CN110336835A (en) * 2019-08-05 2019-10-15 深信服科技股份有限公司 Detection method, user equipment, storage medium and the device of malicious act
CN110336835B (en) * 2019-08-05 2021-10-19 深信服科技股份有限公司 Malicious behavior detection method, user equipment, storage medium and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160413