CN108345794A

CN108345794A - Malware detection method and device

Info

Publication number: CN108345794A
Application number: CN201711477108.XA
Authority: CN
Inventors: 薛菲; 李俊韬; 苏庆华; 袁瑞萍; 沙宗轩; 阳樊; 汪婷婷; 王慧玲
Original assignee: Beijing Wuzi University
Current assignee: Beijing Wuzi University
Priority date: 2017-12-29
Filing date: 2017-12-29
Publication date: 2018-07-31

Abstract

An embodiment of the present invention provides a malware detection method and device, belonging to the field of information security. The method includes: extracting the static features and dynamic features of each software sample from a set of software samples of known type, and combining the extracted static and dynamic features of each sample to form a mixed feature data set; reducing the feature dimension and removing redundant features by means of principal component analysis and a feature weight selection method, to obtain an optimized mixed feature data set; training a support vector machine (SVM) model on the features in the optimized mixed feature data set to form a classification detection model; and detecting the software to be detected according to the classification detection model. Forming the classification detection model with a support vector machine model improves both the efficiency of classification and the accuracy of software detection.

Description

Malware detection method and device

Technical field

The present invention relates to a malware detection method in the field of information security, and more particularly to a method and device for detecting malware on the Android system.

Background art

With the widespread adoption and rapid development of intelligent terminals, their security problems have become increasingly urgent, and a large amount of malware targeting intelligent terminal systems has followed. In the six years before 2010, system vulnerabilities and malicious code in the Symbian operating system were the main security threats to smartphone terminals. After 2010, the smartphone market gradually matured and became widespread: Apple first released iOS 4 (Apple's mobile operating system), and Google then released Android versions 2.1, 2.2 and 2.3 in succession. As early as the end of 2009, the first worm on the iOS platform, Ikee, was discovered; it mainly attacked jailbroken iOS devices. On the Android platform, fraud software and advertising plug-ins appeared in early 2010, and the Fake Player malicious code is generally recognized as the first malware on Android; its main malicious behavior is sending SMS messages that incur charges. The second Trojan family to appear on Android, "Geinimi" (literally "giving you rice"), mutated rapidly into more than ten versions in less than two months and infected nearly one million phones. As the first Android Trojan produced in China, it used mainstream remote control attack techniques together with a variety of countermeasure techniques; in addition to intercepting and deleting the user's SMS messages and reply messages, it could also install other applications or make phone calls without the user's permission. Subsequently, Android root vulnerabilities and malware such as Zitmo, a cross-platform mobile banking Trojan, were also gradually disclosed.

In the course of realizing the present invention, the inventors found at least the following problems in the prior art. First, in feature selection, only static features or only dynamic features are chosen, so software behavior cannot be evaluated comprehensively. Second, when the number of features is large, the original features are optimized with a single method only, and the optimized feature set contributes little to distinguishing malware from normal software. Third, in malware detection and classification, traditional classification models achieve low detection accuracy, and their efficiency also needs improvement.

Summary of the invention

An embodiment of the present invention provides a malware detection method and device that improve both the accuracy of malware detection and the efficiency of software detection.

In one aspect, an embodiment of the present invention provides a malware detection method, the method including:

extracting the static features and dynamic features of each software sample from a set of software samples of known type, and combining the extracted static and dynamic features of each sample to form a mixed feature data set;

evaluating the features in the mixed feature data set according to an optimization method, reducing the feature dimension and removing redundant features, to obtain an optimized mixed feature data set;

training a support vector machine model on the features in the optimized mixed feature data set to form a classification detection model;

detecting the software to be detected according to the classification detection model.

In another aspect, an embodiment of the present invention provides a malware detection device, the device including:

an extraction unit, configured to extract the static features and dynamic features of each software sample from a set of software samples of known type, and to combine the extracted static and dynamic features of each sample to form a mixed feature data set;

an optimization unit, configured to evaluate the features in the mixed feature data set according to an optimization method, reduce the feature dimension and remove redundant features, to obtain an optimized mixed feature data set;

a training unit, configured to train a support vector machine model on the features in the optimized mixed feature data set to form a classification detection model;

a detection unit, configured to detect the software to be detected according to the classification detection model.

The above technical solution has the following advantages. Because the static features and dynamic features of each software sample in a set of software samples of known type are combined to generate a mixed feature data set, the features of the software samples are collected comprehensively. Because an optimization method is used to evaluate the features in the mixed feature data set and to select the features that contribute most to software classification, the redundant features in the initial feature data set are removed. Because a support vector machine model is trained on the optimized mixed feature data set to generate a classification detection model, which is then used to detect the software to be detected, the model training time is shortened and the accuracy of the classification model is improved.

Description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of the malware detection method according to an embodiment of the present invention;

Fig. 2 is a structural schematic diagram of the malware detection device according to an embodiment of the present invention;

Fig. 3 is a sub-flow chart of optimizing the mixed feature data set according to an embodiment of the present invention;

Fig. 4 is a sub-flow chart of forming the classification detection model according to an embodiment of the present invention;

Fig. 5 is a sub-flow chart of detecting the software to be detected according to the classification detection model in an embodiment of the present invention;

Fig. 6 is a structural schematic diagram of the optimization unit according to an embodiment of the present invention;

Fig. 7 is a structural schematic diagram of the training unit according to an embodiment of the present invention;

Fig. 8 is a schematic diagram of static feature vectorization according to an embodiment of the present invention;

Fig. 9 is a trend chart of classification accuracy;

Fig. 10 compares the detection results of the technical solution of the present invention and the prior art on the same software to be detected.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Referring to Fig. 1, which is a flow chart of the malware detection method according to an embodiment of the present invention, the method includes:

101, extracting the static features and dynamic features of each software sample from a set of software samples of known type, and combining the extracted static and dynamic features of each sample to form a mixed feature data set;

102, evaluating the features in the mixed feature data set according to an optimization method, reducing the feature dimension and removing redundant features, to obtain an optimized mixed feature data set;

103, training a support vector machine model on the features in the optimized mixed feature data set to form a classification detection model;

104, detecting the software to be detected according to the classification detection model.

Preferably, referring to Fig. 3, which is a sub-flow chart of optimizing the mixed feature data set according to an embodiment of the present invention, evaluating the features in the mixed feature data set according to an optimization method, reducing the feature dimension and removing redundant features to obtain the optimized mixed feature data set specifically includes:

102.1, reducing the dimension of the features in the mixed feature data set by principal component analysis, to obtain a dimension-reduced feature data set;

102.2, applying a feature weight selection algorithm to the features in the dimension-reduced mixed feature data set and deleting the features whose weight is below a set threshold, to obtain the optimized mixed feature data set.

Preferably, referring to Fig. 4, which is a sub-flow chart of forming the classification detection model according to an embodiment of the present invention, training a support vector machine model on the features in the optimized mixed feature data set to form the classification detection model specifically includes:

103.1, vectorizing the static features in the optimized mixed feature data set;

103.2, standardizing the dynamic features in the optimized mixed feature data set, mapping their values to the interval [0, 1];

103.3, forming a mixed feature vector file from the vectorized static features and the standardized dynamic features, and saving it;

103.4, training a support vector machine model on the mixed feature vector file to generate a classification model file.

Further preferably, vectorizing the static features in the optimized mixed feature data set specifically includes:

extracting the static features in the optimized mixed feature set and establishing a feature set T;

comparing the static features of each known software sample with the feature set T one by one;

if an identical static feature is matched in T, marking that static feature of the known software sample as 1;

otherwise, marking that static feature of the known software sample as 0.

Standardizing the dynamic features in the optimized mixed feature data set and mapping their values to the interval [0, 1] specifically includes:

extracting a dynamic feature in the optimized mixed feature set, denoted $m_i$;

mapping the dynamic feature into the interval [0, 1] according to the formula $m_i' = \frac{m_i - \min(m_i)}{\max(m_i) - \min(m_i)}$, where $\min(m_i)$ denotes the minimum value of dynamic feature $m_i$ and $\max(m_i)$ denotes its maximum value.

Preferably, referring to Fig. 5, which is a sub-flow chart of detecting the software to be detected according to the classification detection model in an embodiment of the present invention, detecting the software to be detected according to the classification detection model specifically includes:

104.1, extracting the static features and dynamic features of the software to be detected and combining them to generate a mixed feature data set;

104.2, converting the mixed feature data set to the format defined by the classification detection model, storing it, and inputting it to the classification detection model;

104.3, the classification detection model outputting the type of the software to be detected.

Referring to Fig. 2, which is a structural schematic diagram of the malware detection device according to an embodiment of the present invention, the device includes:

an extraction unit 21, configured to extract the static features and dynamic features of each software sample from a set of software samples of known type, and to combine the extracted static and dynamic features of each sample to form a mixed feature data set;

an optimization unit 22, configured to evaluate the features in the mixed feature data set according to an optimization method, reduce the feature dimension and remove redundant features, to obtain an optimized mixed feature data set;

a training unit 23, configured to train a support vector machine model on the features in the optimized mixed feature data set to form a classification detection model;

a detection unit 24, configured to detect the software to be detected according to the classification detection model.

Preferably, referring to Fig. 6, which is a structural schematic diagram of the optimization unit according to an embodiment of the present invention, the optimization unit 22 specifically includes:

a principal component analysis module 221, configured to reduce the dimension of the features in the mixed feature data set by principal component analysis, to obtain a dimension-reduced feature data set;

a feature selection module 222, configured to apply a feature weight selection algorithm to the features in the dimension-reduced mixed feature data set and delete the features whose weight is below a set threshold, to obtain the optimized mixed feature data set.

Preferably, referring to Fig. 8, which is a schematic diagram of static feature vectorization according to an embodiment of the present invention, the training unit 23 specifically includes:

a static feature preprocessing module 231, configured to vectorize the static features in the optimized mixed feature data set;

a dynamic feature preprocessing module 232, configured to standardize the dynamic features in the optimized mixed feature data set, mapping their values to the interval [0, 1];

a saving module 233, configured to form a mixed feature vector file from the vectorized static features and the standardized dynamic features, and to save it;

a classification training module 234, configured to train a support vector machine model on the mixed feature vector file to generate a classification model file.

Further preferably, the static feature preprocessing module specifically includes:

a feature set establishing submodule, configured to extract the static features in the optimized mixed feature set and establish a feature set T;

a comparison submodule, configured to compare the static features of each software sample with the feature set T one by one;

a marking submodule, configured to mark a static feature of a software sample as 1 if an identical static feature is matched in T;

the marking submodule being further configured to mark a static feature of a software sample as 0 if no identical static feature is matched in T.

Further preferably, the dynamic feature preprocessing module specifically includes:

a value submodule, configured to extract a dynamic feature in the optimized mixed feature set, denoted $m_i$;

an evaluation submodule, configured to map the dynamic feature into the interval [0, 1] according to the formula $m_i' = \frac{m_i - \min(m_i)}{\max(m_i) - \min(m_i)}$, where $\min(m_i)$ denotes the minimum value of dynamic feature $m_i$ and $\max(m_i)$ denotes its maximum value.

Preferably, the detection unit specifically includes:

a feature extraction module, configured to extract the static features and dynamic features of the software to be detected and combine them to generate a mixed feature data set;

an input module, configured to convert the mixed feature data set to the format defined by the classification detection model, store it, and input it to the classification detection model;

an output module, configured to output the type of the software to be detected.

The above technical solution has the following technical effects. The static features and dynamic features of each software sample extracted from the set of software samples of known type are combined to form a mixed feature data set. Principal component analysis and a feature weight selection algorithm select the features that contribute most to software classification, giving the optimized mixed feature data set. The static features of the optimized mixed feature set are vectorized, its dynamic features are standardized, and the resulting mixed feature vector file is input to a support vector machine model to form the classification detection model. The features of the software samples are thus collected comprehensively, the features that contribute most are selected for training the classification detection model, and the software to be detected is examined with that model, which improves both the efficiency and the accuracy of detection.

Intelligent terminals are easy to carry, and as their computing power grows ever stronger, people have come to depend on them more than on traditional feature phones and PCs. The greatest advantage of an intelligent terminal is that, while providing basic communication functions, it also lets users go online anytime and anywhere and supports more intelligent applications. In February 2014 the global technology research and consulting firm Gartner published the 2013 global sales of smartphone operating system terminals. Total global smartphone handset sales in 2013 reached 968 million units, up 42.3% from 2012; smartphone sales for the year exceeded feature phone sales for the first time, accounting for 53.6% of all mobile phone sales. The share of the Android system in the smartphone operating system market also rose by 12 percentage points from 2012, reaching 78.4% in 2013. The American market research company IDC expected global smartphone shipments to increase substantially in 2014: compared with 1.01 billion units the previous year, shipments in 2014 would reach 1.25 billion, a growth rate of 23.8%, and the figure was expected to reach 1.8 billion by 2018.

The Android system was first released in 2007 by the OHA (Open Handset Alliance, a global alliance organization) led by Google. Android developed very rapidly and continually eroded an intelligent terminal market dominated by Nokia and the Symbian system. In a very short time it became one of the two major systems, running neck and neck with Apple's iOS, and it has ranked first in market share ever since. The global distribution of smartphone operating systems in the second and third quarters of 2014, as published by the market analysis firm Strategy Analytics, is shown in Table 1: in the second and third quarters of 2014 the global market share of the Android operating system reached 84.6% and 83.6% respectively, while the shares of iOS, Windows Phone and other systems declined.

Table 1: Global distribution of smartphone operating systems in the second and third quarters of 2014

Security research on intelligent terminals focuses mainly on the Android operating system and follows three main directions. The first direction is to detect malicious behavior that may be present in the code before application software is installed on an Android device. This kind of detection is divided into static analysis and dynamic analysis, and mainly uses features such as known malicious behaviors or code in malware to analyze the harm that the malware may cause. The second direction is to modify the source code of the Android platform by inserting monitoring code at key application interfaces, so that the behavior of a malicious program is monitored while the application runs on the Android device. The third direction, used in enterprise security applications, is security isolation, which mainly uses virtualization to partition applications into zones and levels and thereby enforce strict access control.

Step 102, evaluating the features in the mixed feature data set according to an optimization method, reducing the feature dimension and removing redundant features to obtain the optimized mixed feature data set, is described in detail below:

An embodiment of the present invention proposes a feature selection algorithm, Relief-PCA, that fuses principal component analysis (PCA) with the feature weight selection algorithm Relief. Relief-PCA combines the advantages of both algorithms: in theory it is not only efficient but can also reduce the feature dimension and eliminate redundant features, thereby improving classification accuracy. Android malware detection is a two-class problem. Suppose the set of software samples of known type is $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, consisting of n samples, where $x_i \in R^m$ and each sample has m features, i.e. $x_i = (x_{i1}, x_{i2}, \ldots, x_{im})$. The label $y_i \in \{-1, 1\}$ marks the class of $x_i$, where 1 denotes normal software and -1 denotes malware. The specific steps of the Relief-PCA algorithm are as follows:

Input: sample set S, number of sampling iterations e, screening threshold σ, number of features finally retained r, and number of features after PCA screening t, with t > r.

Step 1: Center the sample set S according to formula (1), i.e. subtract the per-feature mean from every sample.

Step 2: Calculate the covariance matrix $AA^T$ of the samples and perform eigenvalue decomposition on $AA^T$; take the eigenvectors $F' = (f_1, f_2, \ldots, f_t)$ corresponding to the t largest eigenvalues, and use the eigenvector set $F'$ to obtain the dimension-reduced feature set S.

Step 3: Set the weight of each feature in the dimension-reduced feature set S obtained in Step 2 to 0, i.e. W(i) = 0, i = 1, 2, ..., t.

Step 4: Randomly select a sample R from the sample set S. From the samples of the same class as R, select the nearest neighbor of R, denoted H; from the samples of the other class, select the nearest neighbor of R, denoted M.

Step 5: Update each feature weight $W_d$ using the weight formula (2).

Here diff(d, R, H) denotes the distance between sample R and sample H with respect to feature d, calculated by formula (3).

value(d, R) denotes the value of the d-th feature of sample R. The distance between sample R and sample H with respect to feature d is the distance between the two samples computed on feature d alone, and is used to judge whether that feature is important. If diff(d, R, H) < diff(d, R, M), the feature helps to separate the nearest neighbor of the same class from the nearest neighbor of the other class, so its weight should be increased. Conversely, if diff(d, R, H) > diff(d, R, M), the feature makes no useful contribution to classification, so its weight should be reduced. The features that contribute most to classification can then be screened out by weight: the feature weights obtained in each iteration are compared with a preset threshold, features whose weight is below the threshold are deleted and features whose weight is above it are kept, which eliminates the features that contribute little to classification.

Step 6: Repeat Step 4 and Step 5 e times, then output the relevance weight $W_d$ of each feature. If a feature weight $W_d$ < σ, delete that feature.

Step 7: Retain the top r features obtained in Step 6, sorted in descending order of weight.

Output: the r features with the highest weights, which form the optimized mixed feature set.
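The Relief part of the procedure (Step 3 to Step 7) can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the PCA projection of Step 1 and Step 2 is omitted, so the input is assumed to be already dimension-reduced, and because formulas (2) and (3) are not reproduced in this text, the sketch assumes the standard Relief forms, diff(d, A, B) = |value(d, A) - value(d, B)| / (max_d - min_d) and W_d = W_d - diff(d, R, H)/e + diff(d, R, M)/e. The function names and the toy data are hypothetical.

```python
import random


def relief_weights(samples, labels, iterations, seed=0):
    """Estimate one Relief weight per feature for a two-class data set."""
    rng = random.Random(seed)
    n_features = len(samples[0])

    # Per-feature value range, used to normalise differences into [0, 1].
    spans = []
    for d in range(n_features):
        values = [s[d] for s in samples]
        spans.append((max(values) - min(values)) or 1.0)

    def diff(d, a, b):
        # Assumed form of formula (3): normalised absolute difference on feature d.
        return abs(a[d] - b[d]) / spans[d]

    def distance(a, b):
        return sum(diff(d, a, b) for d in range(n_features))

    weights = [0.0] * n_features                 # Step 3: all weights start at 0
    for _ in range(iterations):                  # Step 6: repeat e times
        i = rng.randrange(len(samples))          # Step 4: random sample R
        r = samples[i]
        hits = [s for j, s in enumerate(samples) if j != i and labels[j] == labels[i]]
        misses = [s for j, s in enumerate(samples) if labels[j] != labels[i]]
        h = min(hits, key=lambda s: distance(r, s))     # nearest same-class neighbour H
        m = min(misses, key=lambda s: distance(r, s))   # nearest other-class neighbour M
        for d in range(n_features):              # Step 5: assumed form of formula (2)
            weights[d] += (diff(d, r, m) - diff(d, r, h)) / iterations
    return weights


def select_features(weights, threshold, keep):
    """Drop features below the threshold, then keep the top `keep` by descending weight."""
    kept = [d for d, w in enumerate(weights) if w >= threshold]
    kept.sort(key=lambda d: weights[d], reverse=True)
    return kept[:keep]


# Toy data: feature 0 separates the two classes, feature 1 is noise.
samples = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.2], [1.0, 0.8], [0.8, 0.5]]
labels = [1, 1, 1, -1, -1, -1]
weights = relief_weights(samples, labels, iterations=20)
selected = select_features(weights, threshold=0.0, keep=2)
```

On this toy data the separating feature receives a positive weight and the noise feature a negative one, so only feature 0 survives the threshold. The sketch requires at least two samples per class so that every sample has a same-class neighbor.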

Step 103, training a support vector machine model on the features in the optimized mixed feature data set to form the classification detection model, is described in detail below:

The features in the optimized mixed feature data set are first preprocessed. Feature preprocessing is the process of mapping the features in the mixed feature data set to feature vectors before they are input to the classification algorithm. Its main purpose is to standardize the types of the feature attributes so that they are all expressed in the same form. The feature attributes extracted in an embodiment of the present invention include static features and dynamic features. Static features are mainly expressed as character strings, whereas dynamic features are expressed numerically and are continuous variables. Different preprocessing schemes are therefore used for these two categories of feature data.

A. Static feature vectorization

Because the extracted static features are expressed as character strings, they cannot be passed directly to the classification model. The static features are therefore preprocessed first, that is, feature mapping is used to turn this feature information into input data for the model. The mapping process is shown in Fig. 8.

First, the static features in the optimized mixed feature set, expressed as character strings, are extracted. A feature set T is then established as a "feature dictionary"; T contains the static features selected by the Relief-PCA algorithm. The static feature vector set is built using T as the standard: each feature is compared with the features in T and marked 1 if an identical feature is matched in T, and 0 otherwise. In this way the static features of each software sample of known type are converted from character strings into a vector of 0s and 1s, which completes the feature mapping.
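The dictionary-based mapping described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `vectorize_static` is hypothetical, and the Android permission strings are used only as example static features, since this text does not list the actual contents of T.

```python
def vectorize_static(sample_features, feature_dictionary):
    """Map a sample's string features to a 0/1 vector ordered like the dictionary T."""
    present = set(sample_features)
    # One position per dictionary entry: 1 if the sample has that feature, else 0.
    return [1 if feature in present else 0 for feature in feature_dictionary]


# Hypothetical feature dictionary T (illustrative entries only).
T = [
    "android.permission.SEND_SMS",
    "android.permission.INTERNET",
    "android.permission.READ_CONTACTS",
]

# Static features extracted from one sample; features outside T are ignored.
sample = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.CAMERA",
]
vector = vectorize_static(sample, T)
```

Fixing the order of T means every sample produces a vector of the same length, which is what allows the vectors to be stacked into one training matrix.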

B. Dynamic feature standardization

Because the dynamic features differ in units and in value range, they must be numerically normalized so that the feature values can be converted into the feature vector set used by the support vector machine model. An embodiment of the present invention applies a linear transformation to the raw data using the min-max standardization method, mapping the feature values into the interval [0, 1]. For a feature $m_i$, the min-max mapping is:

$m_i' = \frac{m_i - \min(m_i)}{\max(m_i) - \min(m_i)}$ (4)

In formula (4), $\min(m_i)$ denotes the minimum value of feature attribute $m_i$ and $\max(m_i)$ denotes its maximum value. According to the two feature preprocessing methods above, an embodiment of the present invention processes and saves the mixed feature vector of each sample in the form of Table 2.

Table 2: Mixed feature vector set of the samples
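As a small worked example of the min-max mapping in formula (4), the sketch below scales each dynamic feature column independently. The function names and the sample values are hypothetical, and mapping a constant column to 0.0 is an assumption made here to avoid division by zero; the text does not say how that case is handled.

```python
def min_max_scale(column):
    """Scale one feature column into [0, 1] using (v - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def scale_dynamic_features(rows):
    """Apply min-max scaling per feature (column) to a list of sample rows."""
    columns = list(zip(*rows))
    scaled_columns = [min_max_scale(col) for col in columns]
    return [list(row) for row in zip(*scaled_columns)]


# Three samples with two dynamic features in different units (illustrative values).
rows = [[10, 200.0], [20, 400.0], [30, 300.0]]
scaled = scale_dynamic_features(rows)
```

After scaling, both columns share the range [0, 1], so a feature measured in large units no longer dominates the distance computations inside the classifier.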

The mixed feature vectors are then used for training to generate the classification model file. An embodiment of the present invention uses the libsvm toolbox (a software package for support vector machine pattern recognition and regression) to carry out the SVM training. The toolbox encapsulates the complex implementation, and the SVM type and kernel function type can be adjusted through simple parameter configuration. The developer only needs to provide the attribute matrix and the labels, and training and construction of the classification model is completed by calling the svmtrain (SVM model training) method. The main process of using the libsvm tool in Python for SVM training is as follows:

A. Convert the training sample feature set, or the feature set of the software under test, into the format specified by libsvm and store it.

B. Download the libsvm toolkit and import the svmutil (support vector machine model tool) package in Python.

C. Use y, x = svm_read_problem (file reading) to read the hybrid static-dynamic feature vector file that has been converted to the libsvm input format. Here y stores the values of the first column (the label) of the file, with 1 and -1 denoting normal software and malware respectively, and x stores the feature values in the file.

D. Training and model generation are completed by the call model = svm_train(y, x, '-s 0 -t 2'). Here s denotes the support vector machine type, where 0 selects C-SVC, and t denotes the kernel type, where 2 selects the RBF (radial basis) function.

E. The classification model file classifyModel.model is finally generated and saved to a file with the svm_save_model (save support vector machine model) method, for use in predicting the class of unknown software.
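Step A above can be sketched in plain Python. The libsvm input format stores one sample per line as a label followed by index:value pairs with 1-based, ascending indices, and zero-valued features may be omitted. The helper names and the example file name are hypothetical, not taken from the source.

```python
from typing import Iterable, Sequence


def to_libsvm_line(label: int, features: Sequence[float]) -> str:
    """Encode one sample as '<label> <index>:<value> ...' with 1-based indices."""
    # Zero-valued features are skipped, which the sparse libsvm format allows.
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


def write_libsvm_file(path: str,
                      labels: Iterable[int],
                      samples: Iterable[Sequence[float]]) -> None:
    """Write labelled hybrid feature vectors to a libsvm-format text file."""
    with open(path, "w", encoding="utf-8") as fh:
        for label, features in zip(labels, samples):
            fh.write(to_libsvm_line(label, features) + "\n")


if __name__ == "__main__":
    # 1 = normal software, -1 = malware, following step C.
    write_libsvm_file("hybrid_features.txt", [1, -1], [[1, 0, 0.25], [0, 1, 0.8]])
```

Assuming the libsvm Python bindings are installed, a file written this way is what svm_read_problem in step C would read before svm_train and svm_save_model are called in steps D and E.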

To better illustrate the advantages of the technical solution of the present invention, the technical solution of the embodiments is described in detail below with reference to an application example:

The practical effect of the Android-based malware detection method was verified by experiment. The experimental environment was a Win7 (64-bit) host with 8 GB of memory and a 1 TB hard disk. Model training and verification used 1000 malicious samples provided by the Android Malware Genome Project and 1000 normal samples from Google Play.

(1) Experiment on the optimization quality of the Relief-PCA algorithm on the hybrid feature data set

First, the static features and dynamic features of these 2000 known Android-based software samples were extracted: 42321 static features in total, together with the dynamic features of each software sample over 30 minutes of running. Before the experiment the data set needs a simple screening: the total number of occurrences of each feature is counted, and features whose total count is 1 are deleted, because such features are neither widespread nor representative. After this simple screening the number of features becomes 30219. The static features and dynamic features are then effectively combined to form the hybrid feature data set. The Relief-PCA algorithm is applied to the hybrid feature data set and the weight of each feature is evaluated by the algorithm. To illustrate the superiority of the Relief-PCA algorithm in feature selection, the top five static features in each of the five static feature classes were selected, as shown in Table 3.

Table 3: Top five static features in each static feature class

As Table 3 shows, the features that contribute most to classification are mostly based on access to user privacy data. For example, SEND_SMS in the permission class denotes the permission to send SMS messages, location in the Hardware class denotes a request for the positioning hardware capability, SmsSendService in the Application class denotes a service that sends SMS messages, SIG_STR in the Intent class denotes obtaining the cellular signal strength, and SmsManager.getDefault in the API class obtains the default SMS manager. These behaviors, all of which involve access to private data, can serve as the main features for distinguishing malware from normal software. By contrast, common permission features such as INTERNET (internet access) and ACCESS_NETWORK_STATE (allows the program to access the network state) were not selected, mainly because features of this kind contribute little to distinguishing malware from normal software. It can be seen that the Relief-PCA algorithm has mined the static features that genuinely distinguish malware from normal software.
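The simple screening step used before this experiment (count how often each feature occurs and drop features that occur only once) can be sketched as follows. This is a minimal illustration that assumes each sample is represented as a set of feature names, so the count of a feature is the number of samples containing it; the function name and the min_count parameter are hypothetical.

```python
from collections import Counter
from typing import List, Set


def screen_rare_features(samples: List[Set[str]], min_count: int = 2) -> List[Set[str]]:
    """Remove features whose total occurrence count is below min_count."""
    # Count, over all samples, how many times each feature name appears.
    counts = Counter(feature for sample in samples for feature in sample)
    kept = {feature for feature, count in counts.items() if count >= min_count}
    # Keep only the surviving features in every sample.
    return [sample & kept for sample in samples]
```

With the default min_count of 2, a feature that appears in a single sample is removed, which matches the rule of deleting features whose total count is 1.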

(2) Performance test of the Relief-PCA algorithm

The experiment used the Weka (Waikato Environment for Knowledge Analysis) platform. The Relief-PCA method was compared with the traditional Relief method and with the information gain (Information Gain, IG) method commonly used in machine learning, to verify the advantage of Relief-PCA in optimizing the hybrid feature data set. The data set used in the experiment was the hybrid feature data set after simple screening, and the support vector machine model was chosen as the classification model. According to the results computed by each method, the features ranked in the top 50, 150, 250, 350 and 450 were selected as input to the classification model. The classification accuracy of the three feature selection algorithms on the Weka platform is compared in Table 4.

Table 4: Influence of different algorithms on accuracy

As Table 4 shows, the Relief-PCA algorithm improves classification accuracy considerably over the traditional Relief algorithm. The main reason is that the traditional Relief algorithm cannot remove redundant features, which distorts the feature ranking and lowers classification accuracy. Compared with the information gain algorithm, information gain is better than Relief-PCA on small data sets, whereas Relief-PCA has the advantage on large data sets. Because the Android-based malware detection method faces data sets of very large volume, the Relief-PCA algorithm is the better choice.

As shown in Figure 9, the classification accuracy of the three algorithms rises gradually as the feature dimension increases. Accuracy peaks when the top 350 features are selected as the data set and declines thereafter. The main cause of this trend is that adding irrelevant features interferes with the judgment of the classifier. The classification accuracy obtained with the traditional Relief method is relatively low. When fewer than the top 250 features are selected, information gain yields higher classification accuracy than the other two algorithms, but beyond that point Relief-PCA yields higher classification accuracy than the other two methods.
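Selecting the top-ranked features for each run of this comparison can be sketched as below. The sketch assumes the feature weights have already been computed by some ranking method (Relief-PCA, Relief or information gain) and are available as a dictionary. The function names are hypothetical, and the threshold variant mirrors the weight-threshold deletion described in the claims rather than any code from the source.

```python
from typing import Dict, List


def top_k_features(weights: Dict[str, float], k: int) -> List[str]:
    """Return the names of the k highest-weighted features, best first."""
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


def drop_below_threshold(weights: Dict[str, float], threshold: float) -> Dict[str, float]:
    """Keep only features whose weight is at least the set threshold."""
    return {name: w for name, w in weights.items() if w >= threshold}
```

In the experiment described above, k would take the values 50, 150, 250, 350 and 450 in turn.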

(3) Performance test of the support vector machine detection model

The technical solution provided by the present invention was used to test the accuracy and false positive rate of the classification detection model. Static features and dynamic features were extracted from 1000 normal software samples and 1000 malware samples, with the dynamic features collected from half an hour of data during software running. The hybrid feature data set was optimized with the Relief-PCA method, and the feature values ranked in the top 350 were retained as the experimental subject. The experiment used ten-fold cross-validation, and the mean of the 10 experimental results was taken as the final result, shown in Table 5.

Table 5: Influence of different feature selection algorithms on accuracy

As Table 5 shows, the support vector machine detection model formed by the embodiment of the present invention has an accuracy of 90.22%, a false positive rate of 9.58% and a detection rate of 88.39%. Representative research results in the field of malware detection were selected for comparison, as shown in Figure 10. The present method is higher than the other three research results in both accuracy and detection rate.
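The three reported measures and the ten-fold averaging can be sketched as follows. This section does not define the measures, so the sketch assumes the usual confusion-matrix definitions with malware as the positive class: accuracy = (TP + TN) / total, false positive rate = FP / (FP + TN), detection rate = TP / (TP + FN). The function names and the interleaved fold split are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

MALWARE, NORMAL = -1, 1  # label convention from the libsvm training steps


def detection_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (accuracy, false positive rate, detection rate) for one test fold."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == MALWARE and p == MALWARE)
    tn = sum(1 for t, p in pairs if t == NORMAL and p == NORMAL)
    fp = sum(1 for t, p in pairs if t == NORMAL and p == MALWARE)
    fn = sum(1 for t, p in pairs if t == MALWARE and p == NORMAL)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    detection_rate = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, false_positive_rate, detection_rate


def k_fold_indices(n: int, k: int = 10) -> List[List[int]]:
    """Split sample indices 0..n-1 into k disjoint, nearly equal folds."""
    return [list(range(i, n, k)) for i in range(k)]


def mean_over_folds(results: Sequence[Tuple[float, float, float]]) -> Tuple[float, ...]:
    """Average each metric over the per-fold results."""
    return tuple(sum(r[j] for r in results) / len(results) for j in range(3))
```

Each fold in turn serves as the test set while the remaining folds are used for training, and mean_over_folds gives the final figures of the kind reported in Table 5.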

In summary, the experiments show that detecting unknown software with the classification detection model generated from the support vector machine model has two benefits. On the one hand, diverse dynamic and static features are extracted from Android malware so that software behavior is evaluated comprehensively, while the Relief-PCA algorithm reduces the feature dimension and removes redundant features. On the other hand, training on the optimized hybrid feature set with the support vector machine model not only saves training cost but also improves classification precision.

It should be understood that the specific order or hierarchy of steps in the disclosed processes is an example of illustrative approaches. Based on design preferences, it should be understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of protection of the present disclosure. The accompanying method claims present the elements of the various steps in an exemplary order and are not meant to be limited to the specific order or hierarchy presented.

Those skilled in the art will also appreciate that the various illustrative logical blocks, units and steps listed in the embodiments of the present invention may be implemented by electronic hardware, computer software, or a combination of both. To clearly show the interchangeability of hardware and software, the various illustrative components, units and steps above have been described generally in terms of their functions. Whether such functions are implemented in hardware or software depends on the particular application and the design requirements of the overall system. Those skilled in the art may implement the described functions in various ways for each particular application, but such implementations should not be understood as going beyond the scope of protection of the embodiments of the present invention.

The various illustrative logical blocks or units described in the embodiments of the present invention may be implemented or operated by a general-purpose processor, a digital signal processor, an application-specific integrated circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of the above designed to perform the described functions. The general-purpose processor may be a microprocessor; alternatively, it may be any conventional processor, controller, microcontroller or state machine. A processor may also be implemented as a combination of computing devices, such as a digital signal processor and a microprocessor, multiple microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.

The steps of the method or algorithm described in the embodiments of the present invention may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium in the art. Illustratively, the storage medium may be connected to the processor so that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integrated into the processor. The processor and the storage medium may be located in an ASIC, and the ASIC may be located in a user terminal. Alternatively, the processor and the storage medium may be located in different components of the user terminal.

In one or more illustrative designs, the functions described in the embodiments of the present invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on a computer-readable medium, or transmitted over a computer-readable medium in the form of one or more instructions or code. Computer-readable media include computer storage media and communication media that facilitate the transfer of a computer program from one place to another. A storage medium may be any available medium that a general-purpose or special-purpose computer can access. For example, such computer-readable media may include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disc storage, magnetic disk storage or other magnetic storage devices, or any other medium that can carry or store program code in the form of instructions or data structures readable by a general-purpose or special-purpose computer or processor. In addition, any connection may properly be termed a computer-readable medium. For example, if software is transmitted from a website, server or other remote source over a coaxial cable, fiber-optic cable, twisted pair or digital subscriber line (DSL), or by wireless means such as infrared, radio and microwave, these are also included in the definition of computer-readable medium. Disk and disc, as used here, include compact disc, laser disc, optical disc, DVD, floppy disk and Blu-ray disc, where disks usually reproduce data magnetically and discs usually reproduce data optically with lasers. Combinations of the above may also be included in computer-readable media.

The specific embodiments described above explain the objectives, technical solutions and beneficial effects of the present invention in further detail. It should be understood that the foregoing is merely specific embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A malware detection method, characterized in that the method comprises:

extracting the static features and dynamic features of each software sample from a set of software samples of known software type, and effectively combining the extracted static features and dynamic features of each software sample to form a hybrid feature data set;

evaluating the features in the hybrid feature data set according to an optimization method, reducing the feature dimension and removing redundant features, to obtain an optimized hybrid feature data set;

training on the features in the optimized hybrid feature set with a support vector machine model to form a classification detection model;

2. The malware detection method according to claim 1, characterized in that evaluating the features in the hybrid feature data set according to the optimization method, reducing the feature dimension and removing redundant features, to obtain the optimized hybrid feature data set, specifically comprises:

reducing the dimension of the features in the hybrid feature data set by principal component analysis to obtain a dimension-reduced feature data set;

applying a feature weight selection algorithm to the features in the dimension-reduced hybrid feature data set and deleting the features whose feature weight value is below a set threshold, to obtain the optimized hybrid feature data set.

3. The malware detection method according to claim 1, characterized in that training on the features in the optimized hybrid feature set with the support vector machine model to form the classification detection model specifically comprises:

vectorizing the static features in the optimized hybrid feature data set;

standardizing the dynamic features in the optimized hybrid feature data set, mapping the values of the dynamic features into the interval [0,1];

forming the vectorized static features and the standardized dynamic features into a hybrid feature vector file and saving it;

training on the hybrid feature vector file with the support vector machine model to generate a classification model file.

4. The malware detection method according to claim 3, characterized in that vectorizing the static features in the optimized hybrid feature data set specifically comprises:

comparing the static features of each software sample one by one with a feature set T;

if an identical static feature is matched in T, marking that static feature of the software sample as 1;

otherwise, marking that static feature of the software sample as 0;

and standardizing the dynamic features in the optimized hybrid feature data set, mapping the values of the dynamic features into the interval [0,1], specifically comprises:

mapping the dynamic features into the interval [0,1] according to the formula $m_i' = \frac{m_i - \min(m_i)}{\max(m_i) - \min(m_i)}$, where min(m_i) denotes the minimum value of dynamic feature m_i and max(m_i) denotes the maximum value of dynamic feature m_i.

5. The malware detection method according to claim 1, characterized in that detecting the software to be detected according to the classification detection model specifically comprises:

extracting the static features and dynamic features of the software to be detected and effectively combining them to generate a hybrid feature data set;

converting the hybrid feature data set into the format specified by the classification detection model, storing it, and inputting it into the classification detection model;

outputting, by the classification detection model, the type of the software to be detected.

6. A malware detection device, characterized in that the device comprises:

an extraction unit, configured to extract the static features and dynamic features of each software sample from a set of software samples of known software type, and to effectively combine the extracted static features and dynamic features of each software sample to form a hybrid feature data set;

an optimization unit, configured to evaluate the features in the hybrid feature data set according to an optimization method, reduce the feature dimension and remove redundant features, to obtain an optimized hybrid feature data set;

a training unit, configured to train on the features in the optimized hybrid feature set with a support vector machine model to form a classification detection model;

7. The malware detection device according to claim 6, characterized in that the optimization unit specifically comprises:

a principal component analysis module, configured to reduce the dimension of the features in the hybrid feature data set by principal component analysis to obtain a dimension-reduced feature data set;

a feature selection module, configured to apply a feature weight selection algorithm to the features in the dimension-reduced hybrid feature data set and delete the features whose feature weight value is below a set threshold, to obtain the optimized hybrid feature data set.

8. The malware detection device according to claim 6, characterized in that the training unit specifically comprises:

a static feature preprocessing module, configured to vectorize the static features in the optimized hybrid feature data set;

a dynamic feature preprocessing module, configured to standardize the dynamic features in the optimized hybrid feature data set, mapping the values of the dynamic features into the interval [0,1];

a saving module, configured to form the vectorized static features and the standardized dynamic features into a hybrid feature vector file and save it;

a classification training module, configured to train on the hybrid feature vector file with the support vector machine model to generate a classification model file.

9. The malware detection device according to claim 8, characterized in that the static feature preprocessing module specifically comprises:

a feature set establishing submodule, configured to extract the static features in the optimized hybrid feature set and establish a feature set T;

a marking submodule, configured to mark a static feature from each software sample as 1 if an identical static feature is matched in T;

the marking submodule being further configured to mark the static feature from each software sample as 0 if no identical static feature is matched in T;

and the dynamic feature preprocessing module specifically comprises:

an evaluation submodule, configured to map the dynamic features into the interval [0,1] according to the formula $m_i' = \frac{m_i - \min(m_i)}{\max(m_i) - \min(m_i)}$, where min(m_i) denotes the minimum value of dynamic feature m_i and max(m_i) denotes the maximum value of dynamic feature m_i.

10. The malware detection device according to claim 6, characterized in that the detection unit specifically comprises:

a feature extraction module, configured to extract the static features and dynamic features of the software to be detected and effectively combine them to generate a hybrid feature data set;

an input module, configured to convert the hybrid feature data set into the format specified by the classification detection model, store it, and input it into the classification detection model;

an output module, configured to output the type of the software to be detected.