CN109753800A - Android malicious application detection method and system integrating frequent itemsets and random forest algorithm - Google Patents

Android malicious application detection method and system integrating frequent itemsets and random forest algorithm

Info

Publication number
CN109753800A
Authority
CN
China
Prior art keywords
feature
frequent
permission
features
malicious
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910002795.2A
Other languages
Chinese (zh)
Other versions
CN109753800B (en
Inventor
景小荣
王丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN201910002795.2A priority Critical patent/CN109753800B/en
Publication of CN109753800A publication Critical patent/CN109753800A/en
Application granted granted Critical
Publication of CN109753800B publication Critical patent/CN109753800B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an Android malicious application detection method that fuses the frequent itemset (Apriori) algorithm with the random forest algorithm, and relates to the technical field of information processing. Android application samples are decompiled, and static features are extracted from each decompiled file according to permissions and function calls. Frequent 3-itemsets of the malicious samples and the normal samples are mined with the Apriori algorithm to obtain the association relationships between permissions in the sample set, and are then combined with sensitive application programming interface (API) function calls to generate features. A random forest classifier learns from and classifies the features, thereby detecting Android malicious applications. Malware detection performed on Android application software with the invention consumes few system resources and achieves high detection accuracy.

Description

Android malicious application detection method and system integrating frequent itemsets and random forest algorithm
Technical field
The present invention relates to the fields of network security and information security detection, and in particular to an Android malicious application detection method.
Background art
Android is currently the most popular smart-terminal operating system in the world and is widely used worldwide because its platform is open and free. As a result, many authors of malicious code have made the Android platform their target. As technology advances, the cost of producing Android malicious programs keeps falling, so the amount of Android malware grows by the day. According to data published by the 360 Internet Security Center, 7.573 million new malware samples were intercepted on the Android platform in 2017, an average of 31,000 new samples per day. Malware frequently launches attacks using new techniques and methods such as mining Trojans and botnets, including stealing users' personal information and maliciously incurring charges, and causes heavy losses to users. Faced with malicious attacks on this scale, how to detect Android malicious applications effectively has become the foremost problem in Android platform security.
Malware detection for Android applications is currently divided mainly into static detection and dynamic detection. Static detection does not run the application software; instead it uses reverse-engineering techniques such as decompilation to analyze the source program, extracts features such as signatures and permissions, and analyzes characteristic behavior directly. Static detection techniques mainly extract features from the application description file (AndroidManifest.xml) and the smali code files. Guo et al. parse the information tags in AndroidManifest.xml and the smali code files to extract an application's classes, permissions, components, signatures, the various data it processes, and its startup information. Rashidi B et al. use permissions and application programming interface (API) function calls as the feature set and detect malicious applications with the support vector machine (SVM) and K-nearest neighbor (KNN) algorithms, but many misjudgments remain. Machine learning can automate the detection of Android application software and improve analysis efficiency, but it depends on the application features that are extracted.
Dynamic detection of Android malicious applications obtains an application's features while the software is running, through techniques such as injection and hooking. Its drawback is that the software must be run, and system resource consumption is excessive. In dynamic detection research, Mahindru et al. use a tracer (strace) to collect behavior data from application software and send it to an analysis server, train a classifier on these behavior samples, and finally use the K-nearest neighbor algorithm to judge whether an application contains malicious behavior. Singh L et al. use API hooking to hook sensitive APIs on the Android platform: whenever the system or an application calls a specific API, the call is intercepted and redirected to a proxy function that records the details, so that behavior information can be obtained.
Summary of the invention
The technical problem to be solved by the present invention is to address the above shortcomings of the prior art. By invoking machine learning algorithms to learn from and detect Android applications, the invention reduces the complexity of Android malicious application detection and saves system resources and, in solving the problems of high-dimensional features and automated classification detection, further improves the accuracy of malware detection.
The technical solution of the present invention to the above technical problem is an Android malicious application detection method fusing the frequent itemset (Apriori) algorithm and the random forest algorithm, comprising the following steps: batch-decompiling Android application software to obtain the static features of application permissions and sensitive API functions; mining the frequent itemsets of the permission features to reduce the dimensionality of the permission features, obtaining the frequent 3-itemsets of permissions and thereby the association relationships between permissions in the sample set; mining the frequent 3-itemsets of the malicious samples and the normal samples and using them together with the sensitive API functions as features to construct a feature set, then screening and scoring the feature attributes in the feature set with an information gain algorithm, extracting the important features, and constructing the corresponding vector space; and learning from and performing classification detection on the vector space with the random forest algorithm, labeling the vector spaces of normal samples and malicious samples with a normal or malicious attribute.
The present invention further comprises, before feature extraction, decompiling the application software with a static analysis tool to obtain files including resource files (res), the .so files of third-party software development kits (lib), smali files, and AndroidManifest.xml; these files contain the various resource files, source code, and other static code features of the application software.
The present invention further comprises extracting features with a Python script: XML files such as AndroidManifest.xml are parsed to extract all permissions requested by the application, giving the permission features; the Python function os.walk() is used to traverse all smali files; and the sensitive API functions of each sample are extracted by regular-expression matching.
The present invention further comprises that mining the frequent 3-itemsets of the permission features specifically includes: extracting permissions from the malicious samples and the normal samples, respectively, to construct permission sets; mining the frequent 1-itemsets of the permission set by computing the support S of each permission in the permission set and pruning the 1-itemsets that do not satisfy the minimum support min_s to obtain the candidate set L1, then joining the elements of L1; and taking the joined candidate set as the new sample set and mining the frequent 2-itemsets by pruning the 2-itemsets that do not satisfy the minimum support min_s to form a new candidate set L2. This is repeated until the frequent 3-itemsets are obtained.
The present invention further comprises that using the information gain (IG) algorithm specifically includes: computing the difference between the entropy of a feature and its conditional entropy to obtain the IG value of the feature, where a larger IG value indicates a higher degree of correlation; retaining the important features according to the degree of correlation; and matching the important features against each application software in the system to construct the corresponding vector space for each. Constructing the vector space specifically includes: constructing a feature set X containing the different features (x1, x2, …, xn) of the application software, and invoking the mapping ν: s → {0,1}^|X| to construct the vector space ν from the features in set X, where s denotes an application software and each dimension of ν corresponds to one feature in X; if s contains a given feature, the identifier value corresponding to that feature in ν is 1, and otherwise it is 0.
The present invention also proposes an Android malicious application detection system fusing the Apriori algorithm and the random forest algorithm, comprising a feature extraction module, a feature processing module, and a random forest classification algorithm module. The feature extraction module performs feature extraction on batch-decompiled Android application software to obtain the static features of application permissions and sensitive API functions. The feature processing module mines the frequent itemsets of the permission features, reduces the dimensionality of the permission features, and obtains the frequent 3-itemsets of permissions, thereby obtaining the association relationships between permissions in the sample set; it mines the frequent 3-itemsets of the malicious samples and the normal samples, uses them together with the sensitive API functions as features to construct a feature set, screens and scores the feature attributes in the feature set with an information gain algorithm, extracts the important features, and constructs the corresponding vector space. The random forest classification algorithm module learns from and performs classification detection on the vector space, labeling the vector spaces of normal samples and malicious samples with a normal or malicious attribute.
The present invention extracts application data features by static detection, then applies the Apriori algorithm to those features to mine the frequent 3-itemsets of permissions in normal software and malware, merges them with the sensitive API function calls, and creates a random forest classifier to learn from and classify them. Further, the IG algorithm obtains the IG value of each feature by computing the difference between the feature's entropy and its conditional entropy; the important features are retained, and a matching algorithm constructs the corresponding vector space for each application software in the system. Because the present invention mines frequent 3-itemsets from the high-dimensional permission features, it consumes fewer system resources.
Brief description of the drawings
Fig. 1 shows the Android malicious application detection model that fuses the Apriori algorithm and the random forest algorithm.
Detailed description
The specific implementation of the present invention is described in detail below with reference to the accompanying drawing.
Fig. 1 is a schematic diagram of the detection system model used by the present invention. To detect malicious applications on the Android system, the present invention fuses Apriori mining of frequent 3-itemsets with random forest classification and proposes an Android malicious application detection system comprising a feature extraction module, a feature processing module, and a random forest classification algorithm module.
First, the collected sample set of normal software and malware is decompiled in batches. From the decompiled application description file AndroidManifest.xml and the smali files, the application's permissions (Android permissions) and sensitive API function calls are extracted. Frequent 3-itemsets of permissions are then mined from the permission features to find the combination relationships between permissions in the normal samples and the malicious samples, and are combined with the sensitive API functions as learning features. The IG algorithm is used to optimize feature selection, the retained important features are embedded into feature vectors to form the vector space, and finally the random forest algorithm is trained on the vector space and classifies it, thereby detecting Android malicious applications.
Each part is described below.
(1) Feature extraction module. A Python script decompiles the sample set in batches and extracts features from the decompiled AndroidManifest.xml and smali files; the extracted features are mainly permission features and sensitive API functions. For permission features, the AndroidManifest.xml file is parsed and all permissions the application requests are extracted. When a user uses certain functions of the system or accesses certain sensitive data, the application must have requested the corresponding permission; for example, the permission android.permission.READ_PHONE_STATE declared in AndroidManifest.xml indicates that the application requests permission to read the phone state. For sensitive API functions, smali is the bytecode representation used for the Android virtual machine (Dalvik): each smali file represents one Java class and contains the various system API functions the application calls. The Python function os.walk() is used to traverse all smali files; within each file, lines beginning with a call instruction (invoke) are located, the API functions that occur are found by string matching, and the sensitive API functions of each sample are extracted from among all the functions. The sensitive API functions extracted in this way reveal the potentially malicious behavior of the application software. Because malware must call the corresponding API functions to carry out malicious behavior, all sensitive API functions called in the sample set are used as learning features for the random forest algorithm, which after training is used to detect malicious applications.
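The extraction step described above can be sketched as follows. This is a minimal illustration rather than the patent's own script: the function names, the regular expression, and the idea of passing in a ready-made set of sensitive API names are assumptions, and the manifest is assumed to be the plain-text XML produced by decompilation.

```python
import os
import re
import xml.etree.ElementTree as ET

# Namespace prefix that ElementTree puts on android:* attributes.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Matches smali call instructions such as
#   invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;
# and captures "Lclass;->method" (without the argument signature).
INVOKE_RE = re.compile(r"invoke-[\w/]+\s+\{[^}]*\},\s*(L[^;]+;->[^\s(]+)")


def extract_permissions(manifest_path):
    """Return the set of permission names declared in a decoded manifest."""
    root = ET.parse(manifest_path).getroot()
    names = (el.get(ANDROID_NS + "name") for el in root.iter("uses-permission"))
    return {name for name in names if name}


def extract_api_calls(decompiled_dir, sensitive_apis):
    """Walk all .smali files and return the sensitive APIs that are invoked."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(decompiled_dir):
        for filename in filenames:
            if not filename.endswith(".smali"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="ignore") as handle:
                for line in handle:
                    match = INVOKE_RE.search(line)
                    if match and match.group(1) in sensitive_apis:
                        found.add(match.group(1))
    return found
```

Which API functions count as sensitive is not listed in the text, so the caller supplies that set (hypothetical here) as `sensitive_apis`.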
Before features are extracted from each sample, the sample set must be decompiled. The decompilation tool Apktool can be used to decompile files with the .apk suffix, yielding files including resource files (res), the .so files of third-party SDKs (lib), smali files, and AndroidManifest.xml; these files contain the various resource files, source code, and other static features.
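Batch decompilation with Apktool might be driven from Python roughly as below. This is a sketch under assumptions: it presumes an `apktool` executable is on the PATH, and the directory layout (one output folder per APK, named after the file) and the function names are illustrative choices, not something the text specifies.

```python
import subprocess
from pathlib import Path


def apktool_command(apk_path, out_dir):
    """Build the Apktool decode command: d = decode, -o = output dir, -f = overwrite."""
    return ["apktool", "d", str(apk_path), "-o", str(out_dir), "-f"]


def decompile_all(apk_dir, out_root):
    """Decompile every .apk in apk_dir into out_root/<apk name>/ and return those dirs."""
    out_dirs = []
    for apk in sorted(Path(apk_dir).glob("*.apk")):
        out_dir = Path(out_root) / apk.stem
        # check=True raises CalledProcessError if Apktool fails on a sample.
        subprocess.run(apktool_command(apk, out_dir), check=True)
        out_dirs.append(out_dir)
    return out_dirs
```

Each returned directory then holds the decoded AndroidManifest.xml and the smali tree for one sample.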
In terms of permissions, malware usually requests certain dangerous permission combinations before carrying out malicious behavior; the permissions in such a combination depend on one another to produce the malicious behavior. Malware therefore requests not just a single dangerous permission but combinations of dangerous permissions. For example, malicious samples commonly request the combination READ_SMS (read SMS messages), READ_PHONE_STATE (read phone state), and WRITE_SMS (edit SMS messages), which together enable malicious operations such as reading the user's private data and sending it elsewhere, whereas this combination is rare in normal software. Because the different dangerous permissions in a combination cooperate to enable potentially malicious behavior, the combination can be used to judge whether an application is malware.
The Apriori algorithm, proposed by Agrawal et al., mines frequent itemsets for Boolean association rules. Here it is used to mine the frequent 3-itemsets of permissions. Feature extraction yields a large number of permission features and sensitive API functions, but the dimensionality of the permission features is usually very large and the computational complexity high. Apriori-based mining of the frequent itemsets of the permission features is therefore used to reduce the dimensionality of the permission features and obtain the frequent 3-itemsets of permissions, and thereby the association relationships between permissions in the sample set. The specific steps are as follows.
To mine the frequent 3-itemsets of permissions with the Apriori algorithm, the permission set P requested by the normal software samples and the permission set M requested by the malicious samples are extracted from all samples, where P = {p1, p2, …, pn} is the permission set of the normal software samples, denoting the n permissions requested across all normal software samples, and M = {m1, m2, …, mx} is the permission set of the malicious samples, denoting the x permissions requested across all malicious samples. Frequent 3-itemsets are mined separately from the permission sets of the normal samples and the malicious samples, using the following method:
Mining the frequent 1-itemsets of the sample permission set: compute the support S of each permission in the sample permission set, that is, the probability that the permission occurs across the sample set; prune the 1-itemsets that do not satisfy the minimum support min_s to obtain the set that satisfies the condition, take it as the candidate set L1, and join the elements of L1. The joined candidate set, which now contains all 2-itemsets, is taken as the new sample set; the frequent 2-itemsets are mined from it by pruning the 2-itemsets that do not satisfy min_s, forming a new candidate set L2. These steps are repeated until the frequent 3-itemsets of the sample permission set are obtained.
Joining: within a set of frequent n-itemsets, start from one itemset (say the i-th) and search downward for an itemset (say the j-th) whose first n-1 items are identical to it; then join all the elements of i with the n-th element of j to form an (n+1)-itemset.
For the normal software sample permission set P, the frequency with which each of p1, p2, …, pn occurs is computed as the support S of that element. The minimum support is the lowest occurrence frequency accepted for an element of P and lies between 0 and 1. After the frequent 1-itemsets are mined, pruning and joining are carried out according to the minimum support, finally yielding the frequent 3-itemsets of the normal samples.
For the malicious sample permission set M, the frequency with which each of m1, m2, …, mx occurs is computed as the support S of that element. After the frequent 1-itemsets are mined, pruning and joining are carried out according to the minimum support, finally yielding the frequent 3-itemsets of the malicious samples.
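The prune-and-join loop above can be sketched in a few lines of Python. The sketch makes two assumptions that the text does not state: each sample is represented as a set of permission names, and the join step is simplified to taking the union of any two frequent (k-1)-itemsets that differ in one item, followed by the usual Apriori check that every (k-1)-subset is itself frequent (this yields the same candidates as the prefix-based join). The function name and sample data are illustrative.

```python
from itertools import combinations


def frequent_itemsets(samples, min_s, max_k=3):
    """Return {k: {itemset: support}} for k = 1..max_k (samples must be non-empty)."""
    n = len(samples)

    def support(itemset):
        # Fraction of samples that request every permission in the itemset.
        return sum(1 for s in samples if itemset <= s) / n

    # Frequent 1-itemsets: prune single permissions below the minimum support.
    singles = {frozenset([p]) for s in samples for p in s}
    levels = {1: {c: support(c) for c in singles if support(c) >= min_s}}

    for k in range(2, max_k + 1):
        prev = levels[k - 1]
        # Join: combine frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Apriori property: every (k-1)-subset of a candidate must be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        ]
        # Prune: keep only candidates that meet the minimum support.
        scored = {c: support(c) for c in candidates}
        levels[k] = {c: s for c, s in scored.items() if s >= min_s}
    return levels


malicious = [
    {"READ_SMS", "READ_PHONE_STATE", "WRITE_SMS", "INTERNET"},
    {"READ_SMS", "READ_PHONE_STATE", "WRITE_SMS"},
    {"READ_SMS", "READ_PHONE_STATE", "WRITE_SMS", "CAMERA"},
    {"INTERNET", "CAMERA"},
]
levels = frequent_itemsets(malicious, min_s=0.6)
```

With this toy data, `levels[3]` contains only the READ_SMS, READ_PHONE_STATE, WRITE_SMS combination, with support 0.75. The same function would be run once on the normal samples and once on the malicious samples.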
(2) Feature processing
After the Apriori algorithm has mined the frequent 3-itemsets of the malicious samples and the normal samples, they are used together with the sensitive API functions as features, and the information gain (IG) algorithm screens and scores the feature attributes. The IG algorithm obtains the IG value of a feature by computing the difference between the information entropy and the feature's conditional entropy; a larger value indicates a higher degree of correlation. Entropy: the information entropy H(C) of the sample set is calculated from the probabilities P(Ci) with which normal software and malware respectively occur in the sample set. Conditional entropy: the conditional entropy H(Y|Xi) is calculated for the i-th feature. The IG value of the i-th feature is then IGi = H(C) - H(Y|Xi). To select, from the many features, those that help classify software as normal or malicious, that is, those that reduce uncertainty the most, features whose IG value is 0 are rejected and the remaining features, whose IG value is not 0, are retained as important features.
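The IG screening can be illustrated with a short sketch. The formulas themselves do not survive in this text, so the code assumes the standard Shannon definitions with base-2 logarithms, H(C) = -Σ P(Ci) log2 P(Ci) and H(Y|Xi) = Σ_v P(Xi = v) · H(Y | Xi = v); the function names and the small tolerance used in place of an exact comparison with 0 are also assumptions.

```python
import math


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)


def information_gain(feature_values, labels):
    """IG = H(C) - H(Y|X) for one feature column and the class labels."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def select_features(columns, labels, eps=1e-12):
    """Keep the names of features whose IG is greater than 0 (up to eps)."""
    return [name for name, col in columns.items()
            if information_gain(col, labels) > eps]


labels = ["malware", "malware", "normal", "normal"]
columns = {
    "sms_triple": [1, 1, 0, 0],   # present only in malware: IG = 1
    "internet":   [1, 0, 1, 0],   # equally common in both classes: IG = 0
}
kept = select_features(columns, labels)
```

Here `kept` is `["sms_triple"]`: the feature that appears equally often in both classes carries no information about the class and is rejected.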
Define the set X as the feature set retained for the application software; it contains the different features (x1, x2, …, xn), where n is the number of important features. The vector space ν is constructed from the features in X according to the mapping ν: s → {0,1}^|X|, where s denotes an application software and each dimension of ν corresponds to one feature in X. If s contains the feature, the identifier value corresponding to that feature in ν is 1, and otherwise it is 0; the identifier value thus records whether the feature is present.
Following this method, a matching algorithm constructs the corresponding vector space ν for each application software in the system. After feature screening, a feature set containing n features is built, a different vector space ν is generated for each sample, and the vectors are stored in a MySQL database as the input to the random forest classification module.
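The mapping ν: s → {0,1}^|X| amounts to a membership test per feature. The sketch below assumes a representation the text does not spell out: a permission feature is a frozenset holding a frequent 3-itemset, which an application "contains" when it requests all three permissions, and an API feature is a string that must appear among the application's sensitive calls. Storage in MySQL is left out.

```python
def has_feature(feature, permissions, api_calls):
    """True if the application exhibits the feature."""
    if isinstance(feature, frozenset):
        # Permission feature: the whole 3-itemset must be requested.
        return feature <= permissions
    # API feature: the sensitive function must be called.
    return feature in api_calls


def build_vector(feature_set, permissions, api_calls):
    """Map one application to a 0/1 vector, one dimension per feature in X."""
    return [1 if has_feature(f, permissions, api_calls) else 0
            for f in feature_set]


X = [
    frozenset({"READ_SMS", "READ_PHONE_STATE", "WRITE_SMS"}),
    "Landroid/telephony/SmsManager;->sendTextMessage",
    "Landroid/telephony/TelephonyManager;->getDeviceId",
]
vector = build_vector(
    X,
    permissions={"READ_SMS", "READ_PHONE_STATE", "WRITE_SMS", "INTERNET"},
    api_calls={"Landroid/telephony/TelephonyManager;->getDeviceId"},
)
```

For this hypothetical application the result is `[1, 0, 1]`. The order of X must be fixed once and reused for every sample so that each dimension always refers to the same feature.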
(3) Random forest classification
Once the feature vectors are obtained, detection is essentially a classification problem. Because the detection result falls into two classes, normal and malicious, detection is a binary classification problem, which the random forest algorithm is well suited to solve. Classification is performed with the random forest classification algorithm on the vector space ν obtained above.
Specifically, supervised classification can be used: for each application software in the collected sample set of known normal and malicious software, a normal or malicious attribute label is appended to the vector space corresponding to that application software, according to whether it is normal software or malware, as described by the following formula.
Here V(S) denotes the set of all application software, normal indicates that the application software is normal software, and malware indicates that it is malware.
After the vector space of the training sample set is obtained, it is used to train a random forest classifier. The software under test goes through feature extraction and feature processing to obtain its vector space ν, which at this point carries no normal or malware identifier; a blank or '?' takes the place of that value. The random forest classifier trained on the training samples then performs classification detection on the vector space of the software under test, and the result gives the string normal or malware to indicate whether the software under test is normal software or malware, thereby detecting malware.
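Training and detection could look like the following sketch. The text does not name a library; scikit-learn's RandomForestClassifier is used here purely as an assumption, and the number of trees, the seed, and the toy vectors are illustrative values, not figures from the patent.

```python
from sklearn.ensemble import RandomForestClassifier


def train_detector(vectors, labels, n_trees=100, seed=0):
    """Fit a random forest on labeled 0/1 feature vectors."""
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    clf.fit(vectors, labels)
    return clf


# Toy training set: the first two dimensions are features that occur only
# in malware; the third occurs equally often in both classes.
train_vectors = [
    [1, 1, 1], [1, 1, 0], [1, 1, 1], [1, 1, 0],
    [0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 0, 1],
]
train_labels = ["malware"] * 4 + ["normal"] * 4

detector = train_detector(train_vectors, train_labels)

# Vectors of software under test carry no label; predict() fills it in
# with the string "normal" or "malware".
predicted = list(detector.predict([[1, 1, 0], [0, 0, 1]]))
```

The labels are kept as the strings normal and malware so that the prediction has the same form as the detection result described above.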
The present invention uses decompilation to batch-decompile the application software sample set and extracts the permissions and API functions from the resulting files. Faced with high-dimensional permission features, it reduces their dimensionality with the Apriori algorithm to obtain the frequent 3-itemsets of permissions, combines these with the sensitive API functions, and screens the features by information gain to obtain the important features. The important features are mapped into a vector space and represented by 0 or 1, and normal applications and malicious applications are labeled, finally giving a labeled vector space. The random forest algorithm is then used to learn from and classify the sample set.

Claims (12)

1. An Android malicious application detection method fusing a frequent itemset algorithm and a random forest algorithm, characterized by comprising the following steps: batch-decompiling Android application software to obtain a sample set, and obtaining the static features of application permissions and sensitive application programming interface (API) functions; mining the frequent itemsets of the permission features to reduce the dimensionality of the permission features, obtaining the frequent 3-itemsets of permissions so as to obtain the association relationships between permissions in the sample set; mining the frequent 3-itemsets of the malicious samples and the normal samples, using the frequent 3-itemsets of the malicious samples and of the normal samples, respectively, together with the sensitive API functions as features to construct a feature set, screening and scoring the feature attributes in the feature set with an information gain algorithm, extracting the important features, and constructing the corresponding vector space; and learning from and performing classification detection on the vector space with a random forest classifier, labeling the vector spaces of normal samples and malicious samples with a normal or malicious attribute.
2. The method according to claim 1, characterized in that, before feature extraction, a static analysis tool is used to decompile the application software, obtaining files including resource files (res), the .so files of third-party software development kits (lib), smali files, and the application description file AndroidManifest.xml, which contain the various resource files, source code, and other static code features of the application software.
3. The method according to claim 1, characterized in that features are extracted with a Python script: the AndroidManifest.xml file is parsed to extract all requested permissions, giving the permission features; the Python function os.walk() is used to traverse all smali files; and the sensitive API functions of all samples in the sample set are extracted by regular-expression matching.
4. The method according to claim 1, characterized in that mining the frequent 3-itemsets of the permission features specifically comprises: extracting permissions from the malicious samples or the normal samples, respectively, to construct permission sets; mining the frequent 1-itemsets of the permission set by computing the support S of each permission in the permission set and pruning the 1-itemsets that do not satisfy the minimum support min_s to obtain the candidate set L1, then joining the elements of L1; taking the joined candidate set as the new 2-itemsets and mining the frequent 2-itemsets by pruning the 2-itemsets that do not satisfy the minimum support min_s to form a new candidate set L2; and repeating until the frequent 3-itemsets are obtained.
5. The method according to claim 1, characterized in that using the information gain (IG) algorithm specifically comprises: computing the information entropy H(C) of the sample set from the probabilities P(Ci) with which normal software or malware respectively occur in the sample set; computing the conditional entropy H(Y|Xi) of the i-th feature; computing the IG value of the i-th feature according to the formula IGi = H(C) - H(Y|Xi), where a larger IG value indicates a higher degree of correlation; retaining the important features according to the degree of correlation; and matching the important features against each application software in the system to construct the corresponding vector space for each.
6. The method according to claim 5, characterized in that constructing the vector space specifically comprises: rejecting the features whose IG value is 0 and retaining the remaining features, whose value is not 0, as important features; constructing a feature set X containing the different features (x1, x2, …, xn) of the application software samples; and invoking the mapping ν: s → {0,1}^|X| to construct the vector space ν from the features in set X, where s denotes an application software and each dimension of ν corresponds to one feature in X; if s contains that feature, the identifier value corresponding to it in ν is 1, and otherwise it is 0.
7. An Android malicious application detection system fusing a frequent itemset algorithm and a random forest algorithm, comprising a feature extraction module, a feature processing module, and a random forest classification algorithm module, characterized in that: the feature extraction module performs feature extraction on batch-decompiled Android application software to obtain the static features of application permissions and sensitive API functions; the feature processing module mines the frequent itemsets of the permission features, reduces the dimensionality of the permission features, and obtains the frequent 3-itemsets of permissions so as to obtain the association relationships between permissions in the sample set, mines the frequent 3-itemsets of the malicious samples and the normal samples, uses them together with the sensitive API functions as features to construct a feature set, screens and scores the feature attributes in the feature set with an information gain algorithm, extracts the important features, and constructs the corresponding vector space; and the random forest classification algorithm module learns from and performs classification detection on the vector space, using a random forest classifier to label the vector spaces of normal samples and malicious samples with a normal or malicious attribute.
8. The detection system according to claim 7, characterized in that a static analysis tool is used to decompile the application software, obtaining files including res, lib, smali, and AndroidManifest.xml, which contain the various resource files, source code, and other static code features of the application software.
9. The detection system according to claim 7, characterized in that features are extracted with a Python script: the AndroidManifest.xml file is parsed to extract all requested permissions, giving the permission features; the os.walk() function is used to traverse all smali files; and the sensitive API functions of each sample are extracted by regular-expression matching.
10. The detection system according to claim 7, characterized in that mining the frequent 3-itemsets of the permission features specifically comprises: extracting permissions from the malicious samples or the normal samples, respectively, to construct permission sets; mining the frequent 1-itemsets of the permission set by computing the support S of each permission in the permission set and pruning the 1-itemsets that do not satisfy the minimum support min_s to obtain the candidate set L1, then joining the elements of L1; taking the joined candidate set as the new sample set and mining the frequent 2-itemsets by pruning the 2-itemsets that do not satisfy the minimum support min_s to form a new candidate set L2; and repeating until the frequent 3-itemsets are obtained.
11. The detection system according to claim 7, characterized in that using the IG algorithm specifically comprises: computing the difference between the entropy of a feature and its conditional entropy to obtain the IG value of the feature, namely computing the information entropy H(C) of the sample set from the probabilities P(Ci) with which normal software or malware respectively occur in the sample set, computing the conditional entropy H(Y|Xi) of the i-th feature, and computing the IG value of the i-th feature according to the formula IGi = H(C) - H(Y|Xi), where a larger IG value indicates a higher degree of correlation; retaining the important features according to the degree of correlation; and matching the important features against each application software in the system to construct the corresponding vector space for each.
12. The detection system according to claim 11, characterized in that a larger IG value indicates a higher degree of correlation, the important features are retained according to the degree of correlation and matched against each application software in the system, and the corresponding vector space is constructed for each; constructing the vector space specifically comprises: rejecting the features whose IG value is 0 and retaining the remaining features, whose value is not 0, as important features; constructing a feature set X containing the different features (x1, x2, …, xn) of the application software samples; and invoking the mapping ν: s → {0,1}^|X| to construct the vector space ν from the features in set X, where s denotes an application software and each dimension of ν corresponds to one feature in X; if s contains that feature, the identifier value corresponding to it in ν is 1, and otherwise it is 0.
The detection system according to claim 11, wherein the larger the IG value is, the greater the degree of correlation is, the important features are retained according to the degree of correlation, the important features are matched with each application software in the system, and the The corresponding vector space, the construction of the vector space includes, remove the features whose IG value is 0, and keep the other features whose value is not 0 as important features, and construct different feature vectors (x 1 , x 2 , ...,x n ) feature set X, call the formula ν: s→{0,1} |X| , construct a vector space ν according to the feature vectors in the set X, where s represents a certain application software, and each The dimension corresponds to a certain feature in X. If s contains this certain feature, the value of the identifier corresponding to this feature in the vector space ν is 1, otherwise it is 0.
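The mining, scoring, and vectorization steps recited in the claims above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the two entropy formulas are not reproduced in this text, so the sketch assumes the standard Shannon definitions H(C) = -Σ P(C_i) log2 P(C_i) and H(Y|X_i) = Σ_x P(X_i = x) H(Y | X_i = x); min_s is assumed to be a fraction of the samples; and all function names, permission names, API names, and sample data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_s, k=3):
    """Apriori-style mining; returns the frequent k-itemsets as frozensets."""
    n = len(transactions)

    def support(itemset):
        # Fraction of permission sets that contain every item of `itemset`.
        return sum(itemset <= t for t in transactions) / n

    # Frequent 1-itemsets: drop permissions whose support is below min_s.
    items = {p for t in transactions for p in t}
    level = {frozenset([p]) for p in items if support(frozenset([p])) >= min_s}
    for size in range(2, k + 1):
        # Join frequent (size-1)-itemsets, then prune by minimum support.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        level = {c for c in candidates if support(c) >= min_s}
    return level


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels (assumed definition)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels):
    """IG = H(C) - H(Y|X) for one binary feature column."""
    n = len(labels)
    conditional = 0.0
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def has_feature(app, feature):
    """A permission 3-itemset matches if it is a subset of the app's permissions."""
    perms, apis, _ = app
    if isinstance(feature, frozenset):
        return feature <= perms
    return feature in apis


# Toy data (hypothetical): (permissions, sensitive APIs, label), 1 = malicious.
S, R, I, C = "SEND_SMS", "READ_CONTACTS", "INTERNET", "CAMERA"
apps = [
    ({S, R, I}, {"sendTextMessage", "getSystemService"}, 1),
    ({S, R, I, C}, {"sendTextMessage", "getSystemService"}, 1),
    ({S, R, I}, {"getDeviceId", "getSystemService"}, 1),
    ({I, C}, {"getDeviceId", "getSystemService"}, 0),
    ({I}, {"getSystemService"}, 0),
    ({I, C, R}, {"getDeviceId", "getSystemService"}, 0),
]
labels = [label for _, _, label in apps]

# Mine frequent 3-itemsets separately for malicious and normal samples.
itemsets = set()
for cls in (1, 0):
    class_perms = [perms for perms, _, label in apps if label == cls]
    itemsets |= frequent_itemsets(class_perms, min_s=0.6)

# Feature set: frequent 3-itemsets plus sensitive API names.
api_names = {api for _, api_set, _ in apps for api in api_set}
features = sorted(itemsets, key=sorted) + sorted(api_names)

# Score every feature, drop those with IG of 0, and build 0/1 vectors.
ig = {f: information_gain([int(has_feature(a, f)) for a in apps], labels)
      for f in features}
selected = [f for f in features if ig[f] > 1e-12]  # tolerance for float zero
vectors = [[int(has_feature(a, f)) for f in selected] for a in apps]
```

In this toy data the API present in every sample carries no information about the class, so its IG is 0 and it is dropped, while the single frequent permission 3-itemset separates the two classes perfectly. The resulting 0/1 vectors are the kind of input that a random forest classifier would then be trained on.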
CN201910002795.2A 2019-01-02 2019-01-02 Android malicious application detection method and system fusing frequent item set and random forest algorithm Active CN109753800B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910002795.2A CN109753800B (en) 2019-01-02 2019-01-02 Android malicious application detection method and system fusing frequent item set and random forest algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910002795.2A CN109753800B (en) 2019-01-02 2019-01-02 Android malicious application detection method and system fusing frequent item set and random forest algorithm

Publications (2)

Publication Number Publication Date
CN109753800A true CN109753800A (en) 2019-05-14
CN109753800B CN109753800B (en) 2023-04-07

Family

ID=66405239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910002795.2A Active CN109753800B (en) 2019-01-02 2019-01-02 Android malicious application detection method and system fusing frequent item set and random forest algorithm

Country Status (1)

Country Link
CN (1) CN109753800B (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110851834A (en) * 2019-11-18 2020-02-28 北京工业大学 Android malicious application detection method based on multi-feature classification
CN111324893A (en) * 2020-02-17 2020-06-23 电子科技大学 Android malware detection method and background system based on sensitive mode
CN111460452A (en) * 2020-03-30 2020-07-28 中国人民解放军国防科技大学 Android malicious software detection method based on frequency fingerprint extraction
CN111723371A (en) * 2020-06-22 2020-09-29 上海斗象信息科技有限公司 Method for constructing detection model of malicious file and method for detecting malicious file
WO2020233322A1 (en) * 2019-05-21 2020-11-26 暨南大学 Description-entropy-based intelligent detection method for big data mobile software similarity
CN112000954A (en) * 2020-08-25 2020-11-27 莫毓昌 A Malware Detection Method Based on Feature Sequence Mining and Reduction
CN112035836A (en) * 2019-06-04 2020-12-04 四川大学 Malicious code family API sequence mining method
CN112100621A (en) * 2020-09-11 2020-12-18 哈尔滨工程大学 Android malicious application detection method based on sensitive permission and API
CN112287345A (en) * 2020-10-29 2021-01-29 中南大学 Trusted edge computing system based on intelligent risk detection
CN112446026A (en) * 2019-09-03 2021-03-05 中移(苏州)软件技术有限公司 Malicious software detection method and device and storage medium
CN112464232A (en) * 2020-11-21 2021-03-09 西北工业大学 Android system malicious software detection method based on mixed feature combination classification
CN112632539A (en) * 2020-12-28 2021-04-09 西北工业大学 Dynamic and static mixed feature extraction method in Android system malicious software detection
CN112651024A (en) * 2020-12-29 2021-04-13 重庆大学 Method, device and equipment for malicious code detection
CN113378167A (en) * 2021-06-30 2021-09-10 哈尔滨理工大学 Malicious software detection method based on improved naive Bayes algorithm and gated recurrent unit mixing
CN113378171A (en) * 2021-07-12 2021-09-10 东北大学秦皇岛分校 Android ransomware detection method based on convolutional neural network
CN113592103A (en) * 2021-07-26 2021-11-02 东方红卫星移动通信有限公司 Software malicious behavior identification method based on integrated learning and dynamic analysis
CN113949514A (en) * 2020-07-16 2022-01-18 中国电信股份有限公司 Application override detection method, device and storage medium
CN115249048A (en) * 2022-09-16 2022-10-28 西南民族大学 Adversarial sample generation method
CN115878421A (en) * 2022-12-09 2023-03-31 国网湖北省电力有限公司信息通信公司 A data center equipment-level fault prediction method, system, and medium based on log time-series correlation feature mining
CN117708813A (en) * 2023-11-30 2024-03-15 四川大学 A security detection method and system for software development environment

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138916A (en) * 2015-08-21 2015-12-09 中国人民解放军信息工程大学 Multi-track malicious program feature detecting method based on data mining
CN105530265A (en) * 2016-01-28 2016-04-27 李青山 Mobile Internet malicious application detection method based on frequent itemset description
CN105550583A (en) * 2015-12-22 2016-05-04 电子科技大学 Random forest classification method based detection method for malicious application in Android platform
CN105740712A (en) * 2016-03-09 2016-07-06 哈尔滨工程大学 Android malicious act detection method based on Bayesian network
CN106845240A (en) * 2017-03-10 2017-06-13 西京学院 A kind of Android malware static detection method based on random forest
CN106845220A (en) * 2015-12-07 2017-06-13 深圳先进技术研究院 A kind of Android malware detecting system and method
CN107169355A (en) * 2017-04-28 2017-09-15 北京理工大学 A kind of worm homology analysis method and apparatus
CN107180192A (en) * 2017-05-09 2017-09-19 北京理工大学 Android malicious application detection method and system based on multi-feature fusion
US20180046796A1 (en) * 2016-08-12 2018-02-15 Duo Security, Inc. Methods for identifying compromised credentials and controlling account access
CN108108616A (en) * 2017-12-19 2018-06-01 努比亚技术有限公司 Malicious act detection method, mobile terminal and storage medium
US20180322287A1 (en) * 2016-05-05 2018-11-08 Cylance Inc. Machine learning model for malware dynamic analysis
CN108958215A (en) * 2018-06-01 2018-12-07 天泽信息产业股份有限公司 A kind of engineering truck failure prediction system and its prediction technique based on data mining

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138916A (en) * 2015-08-21 2015-12-09 中国人民解放军信息工程大学 Multi-track malicious program feature detecting method based on data mining
CN106845220A (en) * 2015-12-07 2017-06-13 深圳先进技术研究院 A kind of Android malware detecting system and method
CN105550583A (en) * 2015-12-22 2016-05-04 电子科技大学 Random forest classification method based detection method for malicious application in Android platform
CN105530265A (en) * 2016-01-28 2016-04-27 李青山 Mobile Internet malicious application detection method based on frequent itemset description
CN105740712A (en) * 2016-03-09 2016-07-06 哈尔滨工程大学 Android malicious act detection method based on Bayesian network
US20180322287A1 (en) * 2016-05-05 2018-11-08 Cylance Inc. Machine learning model for malware dynamic analysis
US20180046796A1 (en) * 2016-08-12 2018-02-15 Duo Security, Inc. Methods for identifying compromised credentials and controlling account access
CN106845240A (en) * 2017-03-10 2017-06-13 西京学院 A kind of Android malware static detection method based on random forest
CN107169355A (en) * 2017-04-28 2017-09-15 北京理工大学 A kind of worm homology analysis method and apparatus
CN107180192A (en) * 2017-05-09 2017-09-19 北京理工大学 Android malicious application detection method and system based on multi-feature fusion
CN108108616A (en) * 2017-12-19 2018-06-01 努比亚技术有限公司 Malicious act detection method, mobile terminal and storage medium
CN108958215A (en) * 2018-06-01 2018-12-07 天泽信息产业股份有限公司 A kind of engineering truck failure prediction system and its prediction technique based on data mining

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
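ALI IDRI et al.: "A data mining-based approach for cardiovascular dysautonomias diagnosis and treatment", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY》 *
杨宏宇 et al.: "Android malware detection based on an improved random forest algorithm", 《通信学报》 (Journal on Communications) *
赵弋: "Research on static detection methods for malicious applications on the Android platform", 《中国优秀博硕士学位论文全文数据库(硕士) 信息科技辑》 (China Master's and Doctoral Theses Full-text Database (Master's), Information Science and Technology) *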

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020233322A1 (en) * 2019-05-21 2020-11-26 暨南大学 Description-entropy-based intelligent detection method for big data mobile software similarity
CN112035836A (en) * 2019-06-04 2020-12-04 四川大学 Malicious code family API sequence mining method
CN112446026A (en) * 2019-09-03 2021-03-05 中移(苏州)软件技术有限公司 Malicious software detection method and device and storage medium
CN110851834B (en) * 2019-11-18 2024-02-27 北京工业大学 Android malicious application detection method integrating multi-feature classification
CN110851834A (en) * 2019-11-18 2020-02-28 北京工业大学 Android malicious application detection method based on multi-feature classification
CN111324893B (en) * 2020-02-17 2022-05-10 电子科技大学 Android malware detection method and background system based on sensitive mode
CN111324893A (en) * 2020-02-17 2020-06-23 电子科技大学 Android malware detection method and background system based on sensitive mode
CN111460452A (en) * 2020-03-30 2020-07-28 中国人民解放军国防科技大学 Android malicious software detection method based on frequency fingerprint extraction
CN111460452B (en) * 2020-03-30 2022-09-09 中国人民解放军国防科技大学 An Android malware detection method based on frequency fingerprint extraction
CN111723371A (en) * 2020-06-22 2020-09-29 上海斗象信息科技有限公司 Method for constructing detection model of malicious file and method for detecting malicious file
CN111723371B (en) * 2020-06-22 2024-02-20 上海斗象信息科技有限公司 Method for constructing malicious file detection model and detecting malicious file
CN113949514B (en) * 2020-07-16 2024-01-26 中国电信股份有限公司 Application override detection method, device and storage medium
CN113949514A (en) * 2020-07-16 2022-01-18 中国电信股份有限公司 Application override detection method, device and storage medium
CN112000954A (en) * 2020-08-25 2020-11-27 莫毓昌 A Malware Detection Method Based on Feature Sequence Mining and Reduction
CN112000954B (en) * 2020-08-25 2024-01-30 华侨大学 Malicious software detection method based on feature sequence mining and simplification
CN112100621A (en) * 2020-09-11 2020-12-18 哈尔滨工程大学 Android malicious application detection method based on sensitive permission and API
CN112100621B (en) * 2020-09-11 2022-05-20 哈尔滨工程大学 An Android malicious application detection method based on sensitive permissions and API
CN112287345A (en) * 2020-10-29 2021-01-29 中南大学 Trusted edge computing system based on intelligent risk detection
CN112287345B (en) * 2020-10-29 2024-04-16 中南大学 Trusted edge computing system based on intelligent risk detection
CN112464232B (en) * 2020-11-21 2024-04-09 西北工业大学 Android system malicious software detection method based on mixed feature combination classification
CN112464232A (en) * 2020-11-21 2021-03-09 西北工业大学 Android system malicious software detection method based on mixed feature combination classification
CN112632539A (en) * 2020-12-28 2021-04-09 西北工业大学 Dynamic and static mixed feature extraction method in Android system malicious software detection
CN112632539B (en) * 2020-12-28 2024-04-09 西北工业大学 Dynamic and static hybrid feature extraction method in Android system malicious software detection
CN112651024A (en) * 2020-12-29 2021-04-13 重庆大学 Method, device and equipment for malicious code detection
CN113378167A (en) * 2021-06-30 2021-09-10 哈尔滨理工大学 Malicious software detection method based on improved naive Bayes algorithm and gated recurrent unit mixing
CN113378171B (en) * 2021-07-12 2022-06-21 东北大学秦皇岛分校 An Android ransomware detection method based on convolutional neural network
CN113378171A (en) * 2021-07-12 2021-09-10 东北大学秦皇岛分校 Android ransomware detection method based on convolutional neural network
CN113592103A (en) * 2021-07-26 2021-11-02 东方红卫星移动通信有限公司 Software malicious behavior identification method based on integrated learning and dynamic analysis
CN115249048A (en) * 2022-09-16 2022-10-28 西南民族大学 Adversarial sample generation method
CN115878421A (en) * 2022-12-09 2023-03-31 国网湖北省电力有限公司信息通信公司 A data center equipment-level fault prediction method, system, and medium based on log time-series correlation feature mining
CN115878421B (en) * 2022-12-09 2023-11-14 国网湖北省电力有限公司信息通信公司 Data center equipment level fault prediction method, system and medium
CN117708813A (en) * 2023-11-30 2024-03-15 四川大学 A security detection method and system for software development environment
CN117708813B (en) * 2023-11-30 2024-06-21 四川大学 A security detection method and system for software development environment

Also Published As

Publication number Publication date
CN109753800B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN109753800A (en) Android malicious application detection method and system integrating frequent itemsets and random forest algorithm
Aslan et al. A new malware classification framework based on deep learning algorithms
Yakura et al. Malware analysis of imaged binary samples by convolutional neural network with attention mechanism
CN113821804B (en) Cross-architecture automatic detection method and system for third-party components and security risks thereof
CN111639337B (en) Unknown malicious code detection method and system for massive Windows software
CN106503558B (en) An Android malicious code detection method based on community structure analysis
CN103106365B (en) The detection method of the malicious application software on a kind of mobile terminal
AU2019357365B2 (en) Analysis function imparting device, analysis function imparting method, and analysis function imparting program
CN111259388A (en) Malicious software API (application program interface) calling sequence detection method based on graph convolution
CN113139192B (en) Third party library security risk analysis method and system based on knowledge graph
CN101751530B (en) Method for detecting loophole aggressive behavior and device
CN106529294B (en) A method of determine for mobile phone viruses and filters
Martín et al. A new tool for static and dynamic Android malware analysis
CN113297580B (en) Code semantic analysis-based electric power information system safety protection method and device
CN112817877B (en) Abnormal script detection method and device, computer equipment and storage medium
US20210334371A1 (en) Malicious File Detection Technology Based on Random Forest Algorithm
KR102120200B1 (en) Malware Crawling Method and System
CN113901465A (en) Heterogeneous network-based Android malicious software detection method
Sanz et al. Instance-based anomaly method for Android malware detection
CN111460452A (en) Android malicious software detection method based on frequency fingerprint extraction
CN115292674A (en) Fraud application detection method and system based on user comment data
CN108229168B (en) Heuristic detection method, system and storage medium for nested files
CN108959922A (en) A kind of malice document detection method and device based on Bayesian network
CN113626810A (en) Android malicious software detection method and system based on sensitive subgraph
CN118228258A (en) Application program detection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant