CN105320887A

CN105320887A - Static characteristic extraction and selection based detection method for Android malicious application

Info

Publication number: CN105320887A
Application number: CN201510661469.4A
Authority: CN
Inventors: 张大方; 赵凯; 苏欣
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2015-10-12
Filing date: 2015-10-12
Publication date: 2016-02-10

Abstract

The present invention discloses a detection method for Android malicious applications based on static feature extraction and selection. The method selects among the extracted static features according to their occurrence frequency, thereby increasing detection accuracy and recall while lowering the misjudgment rate and the time overhead. Compared with an existing Android malicious application detection method, the disclosed method increases accuracy by 21.4%, increases recall by 34.7%, and reduces the misjudgment rate by 22.6%.

Description

An Android malicious application detection method based on static feature extraction and selection

Technical field

The present invention relates to computer network technology, and in particular to an Android malicious application detection method based on static feature extraction and selection.

Background art

An Android malicious application detection system is an important means of defending the Android platform: static features of Android applications (for example, the permissions requested and the APIs called) are extracted and selected, and a machine learning classification algorithm is used to detect the malicious applications present in Android application markets.

As Android devices and applications grow in popularity, Android devices store more and more private user data, such as account information, phone numbers, and short messages. Malware authors target this platform and produce large numbers of malicious Android applications, seriously compromising the security of Android users. Detection methods for Android malicious applications fall roughly into two categories: static analysis and dynamic analysis. Static analysis, much like malware detection on the PC, has relied largely on manual means such as code analysis and empirical rules, which is inefficient and scales poorly. Dynamic analysis can monitor application behavior comprehensively and is therefore more accurate, but it consumes a certain amount of system resources.

Because of these problems, studies that apply data mining to malicious application detection have appeared in recent years within the static analysis category. Unlike traditional static analysis, data-mining-based techniques are distinguished chiefly by their predictive ability and strong scalability, and they avoid problems such as the low efficiency of manual analysis.

CPPM, proposed by Moonsamy [1], is an algorithm for finding differences between the permission patterns of benign and malicious applications. It considers both required permissions and used permissions (the latter extracted by Andrubis), and focuses on discovering sets of unique and common permission patterns that can distinguish benign from malicious applications. The experiments found that both benign and malicious applications are over-privileged.

Zhu [2] proposed a permission-based framework for detecting anomalous applications. It identifies potentially threatening applications by the reliability of their permission lists, judges reliability by combining an application's description text with its permission list, and uses the Multinomial Event Model (Bayes) algorithm to establish the relationship between the two. The permission prediction model was evaluated on 5685 free applications, with a true positive rate of 90% and a false positive rate of 30%.

Drebin, proposed by Arp [3], is a tool that can detect malicious applications directly on the phone. Through broad static analysis it collects as many application features as possible (permissions, APIs, IP addresses) and embeds them in a joint vector space; the resulting vectors serve as signatures for automatically identifying malicious applications. The SVM machine learning technique automatically mines feature patterns that indicate malicious applications, and these patterns are used to generate meaningful explanations of why an application was judged malicious. In a test on 123453 benign applications and 5560 malware samples, Drebin identified 94% of the malicious applications with a false alarm rate of only 1%; the average analysis time was 10 s on a smartphone and 1 s on a computer. One shortcoming of the tool is that it is ineffective against malicious applications that use code obfuscation or dynamic loading.

DroidMat, proposed by Wu [4], uses a feature-based static method to detect Android malicious applications. It extracts permissions, Intent messages, and components (Activity, Service, Receiver) as entry points for tracing API calls, uses the K-means clustering algorithm to improve its ability to model malicious applications, uses SVD (Singular Value Decomposition) to determine the number of clusters, and finally uses the KNN algorithm to divide applications into two classes. Experiments show that its recall and running time are both better than those of Androguard, with accuracy and recall of 97.87% and 87.39% respectively.

DroidAPIMiner, proposed by Aafer [5], aims to reduce the installation of malicious applications by providing a robust, lightweight classifier. It extracts relevant features through comprehensive analysis at the API level; DroidAPIMiner is built on top of Androguard to extract API features, and RapidMiner is used to build the classifier.

Unlike the studies above, Barrera [6] and Rasthofer [7] do not address malicious application detection directly; instead they apply data mining to the security analysis of applications themselves. Their experiments do not classify applications, but their results are of high reference value. Barrera uses the SOM algorithm to carry out an empirical analysis of the permission-based security model. The article analyzes the security of the Android system in detail and identifies key points for improving the Android permission model that increase its expressiveness without increasing the number or complexity of permissions; for example, it points out problems in the current Android permission namespace, where some permissions should be merged and others subdivided. The work focuses on how the Android permission model is used in practice and on its shortcomings. The SOM algorithm can visualize high-dimensional data in two dimensions and can identify relationships between permissions. An analysis of 1100 applications found that only a few permissions are used with high frequency, while most permissions are used by only a very small number of applications.

Susi, proposed by Rasthofer, is a tool that uses the SVM machine learning method to identify sources and sinks in any Android API code; given finer-grained reference information, it can also categorize the sources and sinks. Run on Android 4.2, Susi reached an accuracy of up to 92% and found a large number of sources and sinks missed by other information flow tracking tools, and an experiment on 11000 malware samples showed that most of these sources and sinks are actually used. Susi also classifies well on newer Android APIs (Google Glass and the Chromecast API).

These experimental results show that classifying applications as malicious or benign by permissions alone is not very accurate, and the related literature also confirms that some APIs marked as sources are not protected by any permission. In other words, not only is the permission-based security mechanism unreliable, but the accuracy of permission-based malicious application detection cannot be guaranteed either. If data mining is to be used for malicious application detection, how application features are chosen is therefore a key factor in the accuracy of the result. In addition, tools that are practical on the user side are still lacking.

Summary of the invention

The technical problem to be solved by the present invention is to address the shortcomings of the prior art by providing an Android malicious application detection method based on static feature extraction and selection.

To solve the above technical problem, the technical solution adopted by the present invention is an Android malicious application detection method based on static feature extraction and selection, comprising the following steps:

1) Using the automated tool AppExtractor, extract from all applications their respective feature sets F; each feature set F contains a class label element, label, whose value is 0 or 1, denoting a benign application and a malicious application respectively;

2) Process the feature set F to obtain an output F' that contains only a small part of the features; F' retains the class label element of F;

3) Based on the features in F', build a feature vector for every application and feed all feature vectors to a machine learning algorithm, which classifies all applications and assigns each of them a new class label, label'. If an application's label and label' are equal, the application has been classified correctly; otherwise it has been misclassified. label' takes the same values as label (0 or 1);

4) For an application of unknown class, repeat steps 1) to 3). If the label' of the application is 0, the application is benign; if label' is 1, the application is malicious.

Step 2) is implemented as follows:

1) Check whether a feature $f_i$ in the feature set F satisfies the following two conditions:

Condition 1: $r_c \geq \alpha_c$, where $r_c = N_c / (N_m + N_b)$, $c \in \{m, b\}$;

Condition 2: $N_c / T_c \geq \beta_c$;

where c takes the value m or b, denoting the malicious and benign classes respectively; $r_c$ is a ratio computed as shown in Condition 1; $N_m$ and $N_b$ are the numbers of malicious and benign applications that contain feature $f_i$; $T_c$ is either $T_m$ or $T_b$, the total numbers of malicious and benign applications respectively; $\alpha_c$ is either $\alpha_m$ or $\alpha_b$, and $\beta_c$ is either $\beta_m$ or $\beta_b$. These four values are thresholds, with $0.5 \leq \alpha_m \leq 1$, $0 \leq \beta_m \leq 1$, $0.5 \leq \alpha_b \leq 1$, $0 \leq \beta_b \leq 1$;

2) If feature $f_i$ satisfies both conditions, select $f_i$ into F'.

Compared with the prior art, the present invention has the following beneficial effects. As a feature selection algorithm applied to existing machine-learning-based malicious application detection systems, it significantly improves the accuracy and recall of the system while keeping the two balanced, and it reduces the size of the data set the machine learning algorithm must process, lowering the algorithm's space and time complexity. It is especially suited to settings that need to classify malicious applications, such as application markets. Experimental results show that, compared with an existing Android malicious application detection method (Androguard), the accuracy of the method of the present invention is 21.4% higher, its recall is 34.7% higher, and its misjudgment rate is 22.6% lower. Compared with existing mainstream Android malicious application detection tools, FEST also improves accuracy by 1.3% to 13.6%, and FEST takes an average of 6.5 seconds to analyze each application. The method of the present invention is therefore a fast Android malicious application detection method with high accuracy and recall.

Brief description of the drawings

Fig. 1: System structure diagram of the present invention;

Fig. 2: Implementation of the FrequenSel algorithm;

Fig. 3: Top 5 results selected by FrequenSel when $\alpha_m$, $\beta_m$, $\alpha_b$, $\beta_b$ are 0.5, 0.1, 0.5 and 0.5 respectively;

Fig. 4: Accuracy of the 4 learning algorithms for different numbers of features;

Fig. 5: Recall of the 4 learning algorithms for different numbers of features;

Fig. 6: Performance of the 3 feature selection algorithms combined with KNN;

Fig. 7: Performance of the 3 feature selection algorithms combined with SVM;

Fig. 8: Performance comparison of FEST and Androguard;

Fig. 9: Accuracy comparison of FEST and common security software;

Fig. 10: FEST used for an application market;

Fig. 11: FEST used by ordinary users.

Detailed description

The present invention uses machine learning classification techniques together with a purpose-built feature selection algorithm to detect Android malicious applications efficiently and accurately, bringing accuracy and recall to a high level at the same time. The results of the invention can be used by Android application markets to check the security of newly added applications: every 1000 applications take only 1.8 hours to analyze. The system runs in a client/server (C/S) model: users install the application provided by the invention on their mobile device, which uploads unknown applications on the device to the cloud for detection and finally returns the result to the device. The C/S model also significantly reduces resource consumption on the mobile device. Although some Android malicious application detection tools already exist, their accuracy or recall is inadequate. Through automated feature extraction and feature selection, the present invention substantially improves the accuracy and recall of machine-learning-based malicious application detection; experiments show that both metrics can reach close to 98%.

First, the present invention extracts the information contained in an Android application installation package about the application's functions and the operations it performs. This information includes the permissions the application uses, the APIs it calls, the Actions it uses, and the IP addresses or URLs it accesses. Together these give a comprehensive picture of the application's behavior, from which it can be judged whether the application contains malicious behavior. Second, from the large number of extracted features, the feature selection algorithm we designed and implemented chooses the most representative ones for machine learning (that is, classification). This not only greatly shortens the time spent in classification but also improves accuracy and recall. The present invention introduces feature selection into the field of Android malicious application detection for the first time and, based on the characteristics of Android application data sets, designs and implements a feature selection algorithm. Experiments show that introducing this algorithm substantially improves the accuracy and recall of machine-learning-based malicious application detection.

The method of the present invention extracts application features according to user-defined rules and, through the feature selection algorithm, reduces the dimension of the machine learning input data and the time and space the algorithm needs, while keeping accuracy and recall at a high level. It is therefore well suited to detecting Android malicious applications.

The method of the present invention is further described below.

For ease of description, several concepts are explained first.

1) Feature extraction is the process of decompiling an Android application installation package into bytecode and extracting from it the information about permissions, system calls (APIs), Actions and IPs/URLs. Every extracted feature belongs to one of these four categories; for example, android.permission.INTERNET, android.telephony.TelephonyManager, android.intent.action.USER_PRESENT and http://174.36.0.131/api belong to the four categories respectively.

2) Feature selection is needed because most of the features extracted by the automated tool may not help in detecting malicious applications accurately. To reduce the time and space consumed by the subsequent machine learning and classification, a small number of the most representative features must be selected from all features according to some principle (that is, an algorithm).

3) A feature vector is the computer representation of a specific application in the machine learning input data. It is formally defined as

$$V = \{v_1, v_2, \ldots, v_n\}, \qquad v_i = \begin{cases} 1, & f_i \in F \text{ and } f_i \in F^{a} \\ 0, & \text{otherwise} \end{cases}, \quad 1 \leq i \leq n$$

where

$$F = \{f_1, f_2, \ldots, f_n\}, \qquad F^{a} = \{f_1^{a}, f_2^{a}, \ldots, f_m^{a}\}$$

F denotes the set of n distinct features found across all applications, and $F^{a}$ denotes the set of m distinct features found in application a. Each element of the feature vector is either 0 or 1: 1 means the corresponding feature appears in the application, and 0 means it does not.
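The feature vector definition above can be illustrated with a minimal Python sketch. The function name `build_feature_vector` and the three example features are assumptions chosen for illustration; they are not taken from the patent's implementation.

```python
def build_feature_vector(all_features, app_features):
    """Return the 0/1 vector V for one application.

    all_features: ordered list of the n distinct features in F.
    app_features: iterable of the features found in application a (F^a).
    """
    app_set = set(app_features)
    # v_i is 1 when feature f_i of F also appears in F^a, otherwise 0.
    return [1 if feature in app_set else 0 for feature in all_features]


# Illustrative (hypothetical) global feature set F and per-app set F^a.
F = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.intent.action.USER_PRESENT",
]
F_a = {"android.permission.INTERNET", "android.intent.action.USER_PRESENT"}

V = build_feature_vector(F, F_a)
print(V)
```

Keeping F as an ordered list fixes the position of each feature, so vectors built for different applications are directly comparable.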

Having described these basic concepts, we first introduce the system structure of the present invention, shown in Fig. 1. The malicious application detection tool consists of three main parts: feature extraction (AppExtractor), feature selection (FrequenSel), and the application classifier (Classifier). AppExtractor is an automated feature extraction tool with configurable rules: given regular expressions, it extracts the features that match the rules from smali code or XML files. Each application finally yields a text file containing its features, and these features correspond to a set $F^{a}$, where a is the application ID. Taking the distinct features from the text files of all applications forms a new set, F. FrequenSel is a feature selection algorithm designed specifically to reduce the dimension of the feature vector V of each application. It takes the set F as input and outputs a new set F' consisting of the features that satisfy a series of conditions; F' is usually far smaller than F. Common machine learning algorithms usually require numeric input, so we map each Android application to a feature vector V and thereby classify applications indirectly. In the present invention, an application belongs either to the malicious class or to the benign class. Feature extraction and feature selection are described in detail below; application classification uses existing machine learning algorithms and is covered only in the experiments.
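As a rough illustration of rule-based extraction of this kind, the following Python sketch applies one regular expression per feature category to decompiled text. The patent does not disclose AppExtractor's actual rules, so the patterns, the `RULES` table and the `extract_features` function are all assumptions made for this example.

```python
import re

# Hypothetical extraction rules, one regular expression per feature category.
RULES = {
    "permission": re.compile(r"android\.permission\.[A-Z_]+"),
    "action": re.compile(r"android\.intent\.action\.[A-Z_]+"),
    "url": re.compile(r"https?://[^\s\"'<>]+"),
    # smali call sites look like: Landroid/telephony/TelephonyManager;->getDeviceId(
    "api": re.compile(r"L(android/[\w/$]+);->(\w+)\("),
}


def extract_features(text):
    """Return the set of distinct features (F^a) matched in one app's text."""
    features = set()
    for category, pattern in RULES.items():
        for match in pattern.finditer(text):
            if category == "api":
                # Convert the smali class path to a dotted class.method name.
                features.add(match.group(1).replace("/", ".") + "." + match.group(2))
            else:
                features.add(match.group(0))
    return features


# Illustrative input combining a manifest line and two smali lines.
sample = (
    '<uses-permission android:name="android.permission.INTERNET"/>\n'
    "invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;\n"
    'const-string v1, "http://174.36.0.131/api"\n'
)
print(sorted(extract_features(sample)))
```

The union of the sets returned for all applications would then give the global set F described above.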

Feature extraction is the first step of application classification, so it is very important to extract features that describe the behavior and functional characteristics of an application accurately and comprehensively. Earlier research usually relied on manual analysis to inspect the permissions and system calls an application uses, but this is too inefficient and can hardly cover all of the application's code. The present invention therefore designs and implements an automated feature extraction tool, AppExtractor. As mentioned above, the features of interest fall into four categories, and the extraction results show that each category contains features typically used by malicious applications. For example, in the permission category, we found that more than 93% of all uses of SMS-related permissions are requests by malicious applications, and malicious applications also request the permissions for installing applications and for process calls noticeably more often than benign applications. In the API category, more than 87% of the uses of bookmark-information APIs come from malicious applications, and more than 97% of the uses of package management APIs also come from malicious applications. In the Action category, we found that malicious applications frequently monitor changes in phone signal and battery level, whereas almost no benign applications do so. These figures show that permissions, APIs and Actions are three very important aspects when we try to infer an application's possible behavior from its code in order to judge whether it is malicious. As for IPs/URLs, an application that accesses known malicious websites or hosts is very likely to be malicious as well.

Feature selection is the core work of the present invention, and it is necessary for two reasons. First, AppExtractor extracts a total of 32247 features in the four categories; using them directly for machine learning would incur a large time overhead. Second, we found that because some existing malicious application detection and classification techniques do not use feature selection, a large number of useless features interfere with accurate classification, so the final accuracy and recall are not good enough. A further characteristic of the prior art (for example DroidAPIMiner) is that its ability to classify malicious applications correctly (that is, its recall) is usually far below its ability to classify benign applications correctly, which leaves accuracy and recall unbalanced. In other words, the prior art detects benign applications well, but its ability to detect malicious applications, which is what matters most, still needs improvement.

Before the present invention, no research had applied feature selection algorithms in the field of Android malicious application detection, so we first borrowed two common algorithms from other fields, the chi-square test and information gain, and tried them for feature selection. After analyzing the selection results, we found that these two algorithms are not suitable for Android malicious application detection. We therefore propose the feature selection algorithm FrequenSel to choose features suited to this task.

When used for feature selection, the chi-square test and information gain rank all features by the importance of each feature. Each algorithm defines the "importance" metric with its own mathematical formula, so the features ranked near the top are, in the sense of mathematical statistics, more important than those ranked lower. This raises a question: do features that are important in the statistical sense actually help in detecting and classifying Android malicious applications? To answer it, we analyzed the results of the two algorithms and found that the answer is no. The reasons are given below.
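For reference, the two ranking criteria named above can be sketched in Python for a binary feature, using the standard textbook formulas for the 2x2 chi-square statistic and for information gain. The counts in `table` are hypothetical and are not the patent's data; the sketch only shows that both criteria yield a ranking of all features, including one that occurs mainly in benign applications.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 table.

    a: malicious apps with the feature, b: benign apps with the feature,
    c: malicious apps without it,       d: benign apps without it.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(a, b, c, d):
    """Class entropy minus class entropy conditioned on feature presence."""
    n = a + b + c + d
    if n == 0:
        return 0.0
    h_class = entropy([a + c, b + d])
    h_cond = ((a + b) / n) * entropy([a, b]) + ((c + d) / n) * entropy([c, d])
    return h_class - h_cond


def rank(table, score):
    """Sort feature names by descending score; no cut-off is applied."""
    return sorted(table, key=lambda name: score(*table[name]), reverse=True)


# Hypothetical (a, b, c, d) counts for 50 malicious and 50 benign apps.
table = {
    "perfect_split": (50, 0, 0, 50),
    "no_signal": (25, 25, 25, 25),
    "mostly_benign": (10, 45, 40, 5),
}
print(rank(table, chi_square))
print(rank(table, information_gain))
```

Both functions return only an ordering of the whole feature set; deciding how many top-ranked features to keep is left to the user.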

We analyzed the input and output of the two algorithms and found two phenomena. The input is the full set of features we extracted, and the output is the same set after ranking.

Phenomenon one is "Distribution Bias". We found that benign applications use more than 93% of the features, whereas malicious applications use only 35% of them. This shows that if the importance of a feature is determined purely from the standpoint of statistical theory, the chi-square test and information gain are likely to overestimate the importance of a feature in classification merely because it occurs frequently in benign applications. For example, android.view.ViewGroup.addView is an API for designing Android application interfaces. This feature clearly does not help in judging whether an application behaves maliciously, because benign and malicious applications alike can use it freely. Yet both algorithms rank it second and treat it as extremely important. Relying on pure statistical theory to judge importance during feature selection is therefore not suitable for malicious application detection.

Phenomenon two is the "Long Tail Effect". Although the two algorithms cannot tell us exactly which features are truly important, one conclusion can still be drawn from their results: the vast majority of features clearly do not help classification. We found that these useless features make up most of the data set, more than 75%, like a long "tail". Both the chi-square test and information gain merely rank all features; neither directly states in its output which features are important and should be selected. Both algorithms therefore ultimately need manual intervention to decide where to cut off the "tail" and obtain the final feature set.

The "Distribution Bias" phenomenon means that a feature selection algorithm cannot weigh features from a statistical standpoint alone, so a new approach is needed, such as predefining some conditions so that only the features that satisfy them are finally chosen. The benefit is that, by changing these conditions, the selected features automatically carry class information. That is, we can separately select the features that favor the classification of benign applications and those that favor the classification of malicious applications. This addresses the problem in existing research of high accuracy but low recall, because we concentrate on selecting features that make the detection of malicious applications more accurate.

The "Long Tail Effect" phenomenon means that a feature selection algorithm should not merely rank its input features but should directly reduce the dimension (that is, the size) of the output and directly provide the important features without manual intervention. This reduces the effect that differences in operators' experience have on accuracy.

To address these two phenomena, the present invention proposes FrequenSel, a feature selection algorithm suited to malicious application classification. The algorithm is described in detail below.

First, let $T_m$ and $T_b$ denote the total numbers of malicious and benign applications respectively, where m and b stand for malicious and benign here and below. We then check whether a feature $f_i$ satisfies the following two conditions:

Condition 1: $r_c \geq \alpha_c$, where $r_c = N_c / (N_m + N_b)$, $c \in \{m, b\}$

Condition 2: $N_c / T_c \geq \beta_c$

Take c = m as an example (c = b is analogous). $N_m$ and $N_b$ denote the numbers of malicious and benign applications that contain feature $f_i$, and $\alpha_m$ and $\beta_m$ are two thresholds with $0.5 \leq \alpha_m \leq 1$ and $0 \leq \beta_m \leq 1$. In this case, Condition 1 states that the feature is used relatively more often in malicious applications, and Condition 2 states that the fraction of all malicious applications in which the feature appears must reach a certain threshold. Only a feature $f_i$ that satisfies both conditions is selected by FrequenSel. The code implementing the algorithm is shown in Fig. 2, where $\alpha_m$, $\beta_m$, $\alpha_b$, $\beta_b$ are the thresholds in Conditions 1 and 2, F is the set of candidate features, and F' is the set of selected features.

As described above, FrequenSel differs from the chi-square test and information gain in that no manual feature selection is needed once F' is obtained: F' contains only the features that satisfy both conditions, and it is usually much smaller than F. FrequenSel thus reduces the input dimension automatically, reduces human intervention, and overcomes the "Long Tail Effect". FrequenSel also overcomes, to some extent, the problem caused by "Distribution Bias". In Fig. 2, lines 1 to 8 of the code select the features used more in malicious applications, and the code after them selects the features used more in benign applications. This amounts to introducing the notion of class when selecting features, which is very important for improving the ability to classify malicious applications correctly, that is, recall. Experiments show that this feature selection strategy helps improve accuracy and recall at the same time.
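The two conditions can be illustrated with a short Python sketch. It is not the code of Fig. 2, which is not reproduced in this text; the function `frequensel`, the dictionary layout and the example counts are assumptions. It also assumes that both conditions must hold for the same class c, first checked for c = m and then for c = b.

```python
def frequensel(counts, t_m, t_b, alpha_m, beta_m, alpha_b, beta_b):
    """Select features that satisfy Condition 1 and Condition 2 for one class.

    counts: dict mapping each feature to (N_m, N_b), the numbers of malicious
            and benign applications that contain the feature.
    t_m, t_b: total numbers of malicious and benign applications.
    """
    selected = []
    for feature, (n_m, n_b) in counts.items():
        total = n_m + n_b
        if total == 0:
            continue  # the feature occurs in no application
        # c = m: used mostly by malicious apps, and by enough of them.
        if n_m / total >= alpha_m and n_m / t_m >= beta_m:
            selected.append(feature)
        # c = b: used mostly by benign apps, and by enough of them.
        elif n_b / total >= alpha_b and n_b / t_b >= beta_b:
            selected.append(feature)
    return selected


# Hypothetical counts for 100 malicious and 100 benign applications.
counts = {
    "sms_permission": (60, 4),   # r_m = 0.94, N_m/T_m = 0.60 -> selected (c = m)
    "rare_api": (2, 1),          # r_m = 0.67 but N_m/T_m = 0.02 -> rejected
    "common_ui_api": (40, 45),   # r_b = 0.53 but N_b/T_b = 0.45 -> rejected
    "benign_heavy": (5, 80),     # r_b = 0.94, N_b/T_b = 0.80 -> selected (c = b)
}
F_prime = frequensel(counts, 100, 100, 0.5, 0.1, 0.5, 0.5)
print(F_prime)
```

The thresholds in the usage example are the values 0.5, 0.1, 0.5 and 0.5 mentioned in the text. Because the output is a filtered set and not a ranking, no manual cut-off is needed afterwards.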

For the same input set F, the size of FrequenSel's output set F' is determined jointly by the parameters $\alpha_m$, $\beta_m$, $\alpha_b$, $\beta_b$. Although these 4 parameters are set by hand, they hardly need to change for the same application scenario or data set type, which is entirely different from the manual intervention required by the chi-square test and information gain. Fig. 3 shows some of the features FrequenSel selects when the 4 parameters are 0.5, 0.1, 0.5 and 0.5 respectively. As Fig. 3 shows, most of the features FrequenSel selects are the ones we care about and are closely related to application functionality, for example 4 SMS-related permission features, the permission to install application packages, the API for sending SMS messages, and the Actions for monitoring phone device state and data reception. Judging from the selection results, the FrequenSel algorithm designed in the present invention is therefore more favorable to the detection and classification of malicious applications than the other feature selection algorithms. The experiments below examine the differences in classification accuracy and recall when FrequenSel, the chi-square test and information gain are used for feature selection.

The application data set used in the experiments contains 7972 applications, 50% of which are malicious. To find machine learning algorithms suited to malicious application detection and classification, we first compared NaiveBayes, J48, KNN and SVM. The metrics compared are accuracy and recall, defined as follows:

$$\mathrm{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

$$\mathrm{recall} = \frac{TP}{TP + FN}$$

where TP, FP, TN and FN denote the numbers of true positive, false positive, true negative and false negative samples respectively.

Fig. 4 and Fig. 5 show the differences in accuracy and recall among the 4 machine learning algorithms when different numbers of features are used for classification. Fig. 4 shows that as the number of features grows, the accuracy of all 4 algorithms rises, and the accuracy of SVM and KNN is clearly higher than that of the other two. Fig. 5 shows that recall behaves almost the same as accuracy. This indicates that after FrequenSel is used for feature selection, recall improves markedly and comes very close to accuracy, which was not achieved in previous research; it is precisely by introducing feature selection that the present invention balances recall and accuracy for the first time. The following experiments use both SVM and KNN: although KNN's accuracy is on average 1% lower than SVM's, its modeling and classification time is far shorter. The learning algorithm can therefore be chosen according to which aspect matters more in practice; for example, if classification performance matters more than time, SVM should be chosen.
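The accuracy and recall definitions above translate directly into code. The following Python sketch uses hypothetical confusion-matrix counts that are not taken from the patent's experiments.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    """Fraction of malicious samples that were detected."""
    return tp / (tp + fn)


# Hypothetical counts: 100 malicious and 100 benign test samples.
tp, fp, tn, fn = 90, 5, 95, 10
print(accuracy(tp, fp, tn, fn))  # (90 + 95) / 200
print(recall(tp, fn))            # 90 / 100
```

With these counts accuracy is higher than recall, which is the kind of imbalance the text attributes to earlier tools.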

With SVM and KNN as the learning algorithms, we can compare the impact of the 3 feature selection algorithms on accuracy and recall. Fig. 6 and Fig. 7 show the performance of KNN and SVM, respectively, in combination with the 3 selection algorithms. It can be seen that, whichever machine learning algorithm is used, learning after feature selection with FrequenSel yields accuracy and recall clearly higher than with the chi-square test or information gain. In the field of malicious application detection, feature selection with FrequenSel brings accuracy and recall close to 98% at best, on average 3% higher than the other feature selection algorithms.

To explain why FrequenSel is better suited to malicious application detection, we further compare the features selected by FrequenSel, the chi-square test and information gain. We take the top 398 features from the ranking results of the chi-square test and of information gain, and compare each with the 398 features selected by FrequenSel. Only 20.9% of the features selected by FrequenSel are the same as those selected by the chi-square test, and only 15.6% are the same as those selected by information gain. By contrast, the chi-square test and information gain share as much as 88.4% of their features. This is why, in Fig. 6 and Fig. 7, the chi-square test and information gain perform similarly to each other and both perform worse than FrequenSel. Moreover, looking at how the feature sets obtained by the 3 feature selection algorithms are distributed across feature categories, 6.8% of the features obtained by FrequenSel belong to the permission category and 91.5% belong to the API category, whereas for the other two selection algorithms these two ratios are 1.3% and 98.5% respectively. The result selected by FrequenSel is therefore more evenly distributed across feature categories, which is more conducive to machine learning classification.
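The overlap and category-share figures above are simple set ratios. The sketch below shows one way to compute them, assuming the two selected feature sets have the same size (398 each in the experiment); the feature names and the category mapping used in any call are hypothetical.

```python
def overlap_ratio(selected_a, selected_b):
    """Fraction of features in selected_a that also appear in selected_b.

    Assumption: both sets have the same size, so the ratio is symmetric.
    """
    a, b = set(selected_a), set(selected_b)
    return len(a & b) / len(a)


def category_share(selected, category_of, category):
    """Fraction of the selected features that belong to the given category.

    category_of is a hypothetical mapping from feature name to a category
    string such as "permission" or "api".
    """
    selected = set(selected)
    hits = sum(1 for f in selected if category_of.get(f) == category)
    return hits / len(selected)
```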

In summary, FrequenSel is a feature selection algorithm well suited to malicious application detection. After it screens the original feature data set, a machine learning algorithm can classify applications with higher accuracy and recall at lower cost in space and time. As a malicious application detection technique suited to application markets, FrequenSel combined with a machine learning algorithm achieves better accuracy and recall than traditional static detection techniques. In Fig. 8, Fest is the tool implemented in the present invention that uses FrequenSel and SVM for malicious application detection, and Androguard is a currently popular static analysis tool for Android application risk; the accuracy and recall of the method of the present invention are both far higher than those of Androguard. Fig. 9 shows that the accuracy of the present invention is comparable to that of some commonly used security software, and even higher than that of some of them. The results of such security software carry higher confidence, whereas in the present invention the judgment of whether an application is malicious is a prediction. In the application-market setting, however, a large number of new applications are usually added every day, and detecting them with traditional static analysis or antivirus software is less efficient and more costly than the automated detection process proposed by the present invention. The technique of the present invention can therefore serve as the first line of defense for detecting malicious applications in an application market.

Core step 1: The automation tool AppExtractor extracts from every application its corresponding feature set F. Each feature set F contains a class label element, label, which indicates whether F belongs to the set of benign applications or the set of malicious applications. The value of label is usually 0 or 1 (this label represents the actual class of the application, benign or malicious).

Core step 2: The feature selection algorithm FrequenSel proposed by the present invention processes the feature set F and produces an output F' that contains only a small fraction of the features (compared with the number of features in F). F' retains the label element of F.
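Core step 2 can be sketched using the two frequency conditions stated in claim 2: r_c = N_c / (N_m + N_b) ≥ α_c and N_c / T_c ≥ β_c. The text shown here does not say how the two conditions are combined, so the sketch assumes a feature is kept when both conditions hold for at least one class c ∈ {m, b}. The function name, the input layout and the feature names used in any call are illustrative.

```python
def frequensel(apps, alpha_m, beta_m, alpha_b, beta_b):
    """Select features by their occurrence frequency per class.

    apps: list of (feature_set, label) pairs; label 1 = malicious, 0 = benign.
    ASSUMPTION: a feature is selected if Condition 1 and Condition 2 both
    hold for class m or for class b.
    """
    totals = {"m": 0, "b": 0}   # T_m, T_b
    counts = {}                 # feature -> {"m": N_m, "b": N_b}
    for features, label in apps:
        c = "m" if label == 1 else "b"
        totals[c] += 1
        for f in features:
            counts.setdefault(f, {"m": 0, "b": 0})[c] += 1

    alpha = {"m": alpha_m, "b": alpha_b}
    beta = {"m": beta_m, "b": beta_b}
    selected = set()
    for f, n in counts.items():
        for c in ("m", "b"):
            if totals[c] == 0:
                continue
            r_c = n[c] / (n["m"] + n["b"])     # Condition 1 ratio
            coverage = n[c] / totals[c]        # Condition 2 ratio
            if r_c >= alpha[c] and coverage >= beta[c]:
                selected.add(f)
                break
    return selected
```

Under this reading, a feature that appears equally often in both classes is dropped whenever the α thresholds are above 0.5, and a feature that appears in only a handful of applications is dropped by the β thresholds.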

Core step 3: Based on the features in F', a feature vector is built for every application. All feature vectors are used as the input of a machine learning algorithm (for example SVM or KNN), which classifies all applications. The learning algorithm assigns each application a new class label, label' (this label' represents the predicted class), which takes the same values as label. If label and label' of an application are equal, the application has been classified correctly; otherwise it has been misclassified. After these three core steps, a mathematical model for detecting malicious applications has been established and can be used to detect unknown applications. For an application of unknown class, no accurate class label needs to be assigned to label, because the class is unknown and has to be predicted. Only the value of label' matters: if label' is 0, the model predicts that the application is benign; if label' is 1, the model predicts that the application is malicious. This achieves the goal of malicious application detection.
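Core step 3 can be illustrated with binary feature vectors and a small k-nearest-neighbour classifier. This is only a sketch: the original text does not specify the vector encoding, the distance measure or the value of k, so the 0/1 encoding, Hamming distance and k = 3 used here are assumptions, and a real system would more likely use an existing SVM or KNN implementation.

```python
from collections import Counter


def to_vector(app_features, selected):
    """Encode an application as 0/1 entries over the ordered features of F'."""
    return [1 if f in app_features else 0 for f in selected]


def knn_predict(train_vectors, train_labels, query, k=3):
    """Return label' for query by majority vote among the k nearest vectors.

    ASSUMPTION: Hamming distance between binary vectors; k should be odd
    so that a two-class vote cannot tie.
    """
    ranked = sorted(
        (sum(a != b for a, b in zip(vector, query)), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

For an unknown application, only the returned label' is inspected: 0 means the model predicts benign, 1 means it predicts malicious.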

Extension step: In practice, the whole system can be deployed in two different application scenarios: an application market detecting newly added applications, and ordinary users scanning the applications on their mobile devices.

For detection in an application market, the system can be deployed on a single server, as in Fig. 10. Feature extraction, feature selection and classification with a machine learning algorithm are carried out in batches on all newly added applications. Applications whose predicted label' is 1 are barred from being listed, which helps application market administrators stop malicious applications from spreading through the market and keeps the market environment clean.

For ordinary users who want to use the model to detect malicious applications, the whole detection system is usually deployed as a service with a C/S (or B/S) architecture, as in Fig. 11. That is, the detection system is deployed on a web server and exposes an external call interface whose input is an application or the hash value of an application and whose output is whether that application is malicious. Users of the detection system only need to call this interface to obtain the result. Whether it is used by application market administrators or offered to individual users through a client with a friendly interface, the malicious application detection system provides a unified service interface, which makes the system easier to extend and maintain.
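The call interface described above can be sketched as a small service object that accepts either an application or its hash. Everything concrete here is an assumption made for illustration: the class and method names, the use of SHA-256 as the hash, the in-memory result store and the shape of the returned dictionary are not specified in the original text, and the web-server layer is left out.

```python
import hashlib


class DetectionService:
    """Hypothetical unified interface: application or hash in, verdict out."""

    def __init__(self, classify):
        # classify: callable taking application bytes and returning label' (0 or 1)
        self._classify = classify
        self._verdicts = {}  # hash -> label'

    def submit_app(self, app_bytes):
        """Classify an uploaded application and remember the result by hash."""
        digest = hashlib.sha256(app_bytes).hexdigest()  # ASSUMPTION: SHA-256
        if digest not in self._verdicts:
            self._verdicts[digest] = self._classify(app_bytes)
        return {"hash": digest, "malicious": self._verdicts[digest] == 1}

    def query_hash(self, digest):
        """Look up a previously classified application by its hash.

        Returns malicious=None when the hash has not been seen before.
        """
        verdict = self._verdicts.get(digest)
        if verdict is None:
            return {"hash": digest, "malicious": None}
        return {"hash": digest, "malicious": verdict == 1}
```

Both the market-side batch job and a user-facing client could call the same two methods, which is the point of exposing a single service interface.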

Claims

1. An Android malicious application detection method based on static feature extraction and selection, characterized by comprising the following steps:

2. The Android malicious application detection method based on static feature extraction and selection according to claim 1, characterized in that step 2) is specifically implemented as follows:

Condition 1: r_c ≥ α_c, where r_c = N_c / (N_m + N_b), c ∈ {m, b};

Condition 2: N_c / T_c ≥ β_c;

wherein c takes the value m or b, and m and b denote the two classes of malicious applications and benign applications, respectively; r_c is a ratio; N_m and N_b denote the number of malicious applications and the number of benign applications that contain feature f_i, respectively; T_c takes the value T_m or T_b, and T_m and T_b denote the total numbers of malicious applications and benign applications, respectively; α_c takes the value α_m or α_b, and β_c takes the value β_m or β_b; these four are thresholds, with 0.5 ≤ α_m ≤ 1, 0 ≤ β_m ≤ 1, 0.5 ≤ α_b ≤ 1 and 0 ≤ β_b ≤ 1; N_c takes the value N_m or N_b;