CN111460452A - Android malicious software detection method based on frequency fingerprint extraction - Google Patents
- Publication number
- CN111460452A (application CN202010237052.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computer Security & Cryptography (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Hardware Design (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Virology (AREA)
- General Health & Medical Sciences (AREA)
- Debugging And Monitoring (AREA)
- Stored Programmes (AREA)
Abstract
The invention discloses an Android malware detection method based on frequency fingerprint extraction, with the aim of detecting malware accurately. The technical solution is to build an Android malware detection system based on frequency fingerprint extraction, consisting of a sample preprocessing module, a frequency fingerprint generation module, and a detection module. Malicious and benign software is collected as samples to construct a benchmark test set D. Each sample in D is decompressed to obtain its AndroidManifest.xml, classes.dex, and .so library files; permission, API, smali opcode, and arm opcode features are extracted; and the presence and frequency of these features are counted to form four feature vectors of different types, which are concatenated end to end into a frequency fingerprint. The detection module is trained and optimized on the frequency fingerprints of the samples in D to become a classifier, which then examines a sample under test and outputs whether it is malware. The invention effectively integrates information from every component of an Android application, and detection is both accurate and fast.
Description
Technical field
The invention relates to the field of Android malware detection, and in particular to a method for detecting Android malware using an extracted frequency fingerprint.
Background
In recent years, with the growing development and spread of Internet and mobile communication technologies, mobile terminals such as smartphones have brought great convenience to daily life and become an indispensable communication tool. Among the many mobile operating systems, Android is popular with users for its openness, rich third-party applications, friendly interface, and good user experience, and it holds a large share of the global smart mobile device market. At the same time, the number of Android applications has grown rapidly: as of February 2020, Google Play hosted 2.86 million applications, and the number is still growing.
Besides Google Play, the official Android application market, there are many third-party application markets. These markets are numerous and of uneven quality, lack unified and effective management, and have weak release review mechanisms, so bad actors can publish Android applications at will. Malicious applications inevitably find their way into such markets and, once downloaded, pose a serious risk to users' information security. Worse, the volume of software in these markets is huge and growing fast. Where security mechanisms and detection methods are inadequate, malware persists in these markets for a long time and is hard to find and remove, which poses a serious threat to the healthy development of the Android ecosystem.
Typical Android malware detection techniques fall into two types: static detection and dynamic detection. Static detection mainly uses disassembly and decompilation, or control-flow and data-flow analysis on smali intermediate code, to detect malicious code. Its advantage is high code coverage; its drawback is that it cannot handle code obfuscation, encryption, or dynamically loaded malicious code. Dynamic analysis monitors the application's variables at run time, traces its behavior paths, and collects the logs it produces. Its advantage is that it overcomes the obfuscation and encryption problems that static methods face; its drawbacks are low code coverage and the fact that some malicious programs detect an emulator and then crash or change their behavior. In practice, when large numbers of malicious samples must be examined, most methods prefer static detection for its faster speed and higher code coverage.
M. Ganesh et al. extract the permissions listed in an Android application's manifest as features for detecting malicious applications. They arrange the permissions into a 12×12 array and feed it to a convolutional neural network for training, which can then determine whether an application is malicious. M. Amin et al. extract opcode sequences from bytecode files as features for detecting Android malware. They assemble an application's opcodes into one long sequence, treat it as ordered text, and train a BiLSTM neural network to assess maliciousness. R. Nix et al. study detection based on Android API (Application Programming Interface) call sequences. They encode each API call as a bit vector, split and recombine the vectors into an n×m matrix used as input to a convolutional neural network, and use the trained classifier to judge whether an application is malicious.
These methods have achieved some success in Android malware detection, but two problems remain. First, feature extraction gives too little attention to correlating multiple kinds of features. Most existing methods extract a single type of feature to characterize application behavior rather than analyzing several types together, so the resulting feature representation is of a single type and detection accuracy suffers. Second, the trained neural network models are complex and involve tuning many parameters, which is inefficient: obtaining a well-trained model takes a great deal of time.
Therefore, given the large volume of Android malware, accurate and efficient detection is a problem that deserves close attention.
Summary of the invention
The technical problem to be solved by the invention is to generate, for an Android application, a frequency fingerprint that uniquely identifies it, and to train and optimize a multi-kernel support vector machine model on such fingerprints, so that Android malware is detected accurately and detection speed is improved.
The technical solution of the invention is as follows. An Android malware detection system based on frequency fingerprint extraction is built from a sample preprocessing module, a frequency fingerprint generation module, and a detection module. Malicious and benign Android applications are collected as samples to construct a benchmark test set. Each sample in the set is decompressed to obtain its AndroidManifest.xml, classes.dex, and .so library files; permission, API, smali opcode, and arm opcode features are extracted; and the presence and frequency of these four kinds of features are counted to form four feature vectors of different types, which are concatenated end to end into one long vector that serves as the application's frequency fingerprint. Using the frequency fingerprints of the many samples in the benchmark test set, the detection module (a multi-kernel support vector machine model) is trained and optimized into a classifier, which examines a sample under test and outputs whether it is malware.
The invention comprises the following steps:
Step 1, build the Android malware detection system based on frequency fingerprint extraction. The system is installed on the server of Google's official Android application market or of a third-party market, and consists of a sample preprocessing module, a frequency fingerprint generation module, and a detection module.
The sample preprocessing module is connected to the frequency fingerprint generation module. It receives samples from the benchmark test set built by developers and samples to be tested submitted by ordinary users, preprocesses them to produce three types of files (AndroidManifest.xml, smali files, and arm instruction files), and outputs these to the frequency fingerprint generation module.
The frequency fingerprint generation module is connected to the sample preprocessing module and the detection module. It receives the AndroidManifest.xml, smali, and arm instruction files from the sample preprocessing module, performs feature screening and frequency fingerprint calculation (a frequency fingerprint is a vector that can serve as the identity of an Android application), and outputs the resulting frequency fingerprint to the detection module. The frequency fingerprint generation module consists of a feature screening module and a frequency fingerprint calculation module. The feature screening module is connected to the sample preprocessing module and the frequency fingerprint calculation module; it receives the AndroidManifest.xml, smali, and arm instruction files from the sample preprocessing module, screens these three kinds of files to obtain the permission, API, smali opcode, and arm opcode features, and sends those features to the frequency fingerprint calculation module. The frequency fingerprint calculation module is connected to the sample preprocessing module, the feature screening module, and the detection module; it receives the permission, API, smali opcode, and arm opcode features from the feature screening module and the AndroidManifest.xml, smali, and arm instruction files from the sample preprocessing module, computes the frequency fingerprint, and sends it to the detection module.
The detection module is connected to the frequency fingerprint generation module. It is a multi-kernel support vector machine model. It receives the frequency fingerprints of the benchmark test set D and of the software to be tested from the frequency fingerprint generation module, is trained and optimized on the frequency fingerprints of D to become a classifier suited to the software to be tested, and then classifies the software to be tested according to its frequency fingerprint, producing a verdict on whether it is malware.
Step 2, construct the benchmark test set D as follows:
Step 2.1, obtain N1 Android malware applications from the open-source Drebin, Genome, and AMD datasets as malicious samples, where N1 is a positive integer and N1>1000.
Step 2.2, obtain benign software by crawling the Google Play and Apkpure app stores, and filter it using local antivirus software and the VirusTotal online antivirus service, yielding N2 benign samples, where N2 is a positive integer and N2>1000.
Step 2.3, label the malicious and benign samples to form the benchmark test set D. N is the total number of samples in D, N=N1+N2. Let x(i) be the i-th sample in D and y(i) the label of x(i): y(i) equal to 1 means x(i) is a malicious sample, and y(i) equal to -1 means x(i) is a benign sample, 1≤i≤N.
Step 2.4, store D in storage that both the sample preprocessing module and the frequency fingerprint generation module can read.
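The labeling convention of step 2.3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Sample` class, the `build_benchmark` function, and the use of file paths to stand for samples are hypothetical and do not come from the patent; only the labels (1 for malicious, -1 for benign) and the size constraints N1>1000 and N2>1000 are taken from the text.

```python
from dataclasses import dataclass
from typing import List

MALICIOUS = 1   # y(i) = 1 in step 2.3
BENIGN = -1     # y(i) = -1 in step 2.3


@dataclass
class Sample:
    """One entry x(i) of the benchmark test set D (hypothetical structure)."""
    path: str   # where the APK is stored (assumption: samples are files)
    label: int  # 1 for malicious, -1 for benign


def build_benchmark(malicious_paths: List[str],
                    benign_paths: List[str],
                    min_per_class: int = 1000) -> List[Sample]:
    """Label both groups and join them into D, enforcing N1, N2 > 1000."""
    if len(malicious_paths) <= min_per_class:
        raise ValueError("need more than %d malicious samples" % min_per_class)
    if len(benign_paths) <= min_per_class:
        raise ValueError("need more than %d benign samples" % min_per_class)
    dataset = [Sample(p, MALICIOUS) for p in malicious_paths]
    dataset += [Sample(p, BENIGN) for p in benign_paths]
    return dataset
```

The returned list has N=N1+N2 entries, with the malicious samples first; the ordering is an arbitrary choice of this sketch.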
Step 3, the sample preprocessing module preprocesses the N samples in D to obtain N AndroidManifest.xml files, N smali files, and N arm instruction files.
Step 3.1, let the variable i=1.
Step 3.2, take the i-th sample x(i) from D.
Step 3.3, preprocess x(i) with the sample preprocessing method to obtain the AndroidManifest.xml file, smali file, and arm instruction file of x(i), as follows:
Step 3.3.1, use a decompression tool (for example Gzip or 7zip) to decompress x(i) and extract its AndroidManifest.xml, classes.dex, and .so runtime library files.
Step 3.3.2, use AXMLPrinter2, a decompilation tool dedicated to AndroidManifest.xml files (download: https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/android4me/AXMLPrinter2.jar, version 2.0 or later), to convert the AndroidManifest.xml file from binary form to text form.
Step 3.3.3, use the dex decompilation tool baksmali (https://bitbucket.org/JesusFreke/smali/downloads/baksmali-2.4.0.jar, version 2.4.0 or later) to decompile classes.dex into smali files. If several smali files are produced, merge them into a single smali file and go to step 3.3.4; if only one smali file is produced, go directly to step 3.3.4.
Step 3.3.4, use the arm disassembly toolchain gcc-arm-none-eabi (https://developer.arm.com/-/media/Files/downloads/gnu-rm/9-2019q4/gcc-arm-none-eabi-9-2019-q4-major-x86_64-linux.tar.bz2, version 9-2019-q4-major or later) to disassemble the .so runtime library files into a text-form arm instruction file. If several arm instruction files are produced, merge them into a single arm instruction file and go to step 3.4; if no arm instruction file is produced, create an empty arm instruction file and go to step 3.4.
Step 3.4, let i=i+1. If i≤N, go to step 3.2. If i>N, the N samples have yielded N AndroidManifest.xml files, N smali files, and N arm instruction files; send these files to the feature screening module and go to step 4.
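Step 3.3.1 can be sketched with Python's standard `zipfile` module, relying on the fact that an APK is a ZIP archive. This is a minimal sketch, not the patent's tooling: the function name `extract_components`, the returned dictionary layout, and the assumption that native libraries sit under a `lib/` directory inside the archive are illustrative choices. Decoding the manifest, running baksmali, and disassembling the `.so` files (steps 3.3.2 to 3.3.4) are left to the external tools named in the text.

```python
import zipfile
from pathlib import Path


def extract_components(apk_path, out_dir):
    """Extract AndroidManifest.xml, classes.dex and .so libraries from one APK.

    Returns a dict with the extracted file paths; 'manifest' and 'dex' are
    None when the archive lacks them, and 'so' may be an empty list.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = {"manifest": None, "dex": None, "so": []}
    with zipfile.ZipFile(apk_path) as apk:
        for name in apk.namelist():
            if name == "AndroidManifest.xml":
                extracted["manifest"] = apk.extract(name, str(out))
            elif name == "classes.dex":
                extracted["dex"] = apk.extract(name, str(out))
            elif name.startswith("lib/") and name.endswith(".so"):
                # Assumption: native runtime libraries live under lib/<abi>/.
                extracted["so"].append(apk.extract(name, str(out)))
    return extracted
```

An empty `so` list corresponds to the case in step 3.3.4 where no arm instruction file is produced and an empty one has to be created.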
Step 4, the feature screening module screens the N AndroidManifest.xml files, N smali files, and N arm instruction files received from the sample preprocessing module for the N samples of D, and obtains the permission features, API features, smali opcode features, and arm opcode features suited to classifying D.
Step 4.1, select the 167 android.manifest.permission system permissions defined in the Android developer documentation (https://developer.android.com/reference/android/Manifest.permission) and use them as features, called permission features.
Step 4.2, select 256 APIs from the APIs in the PScout list (https://security.csl.toronto.edu/pscout/?mdocs-file=67) as follows:
Step 4.2.1, create a list L_api and add all 32437 APIs in the PScout list to it. The v-th API is denoted L_api[v], 1≤v≤32437.
Step 4.2.2, create a two-dimensional array Z_api with 32437 rows and N columns. The element Z_api[v][i] in row v and column i is restricted to 1 or 0: 1 means the v-th API of L_api appears in the i-th sample of D, and 0 means it does not.
Step 4.2.3, initialize all elements of Z_api to 0 and initialize the variable i=1.
Step 4.2.4, scan the smali file of the i-th sample of D line by line to find the APIs from L_api that appear in the i-th sample, and assign the elements of column i of Z_api. Let str[u] denote the string on line u of the smali file and U the total number of lines, 1≤u≤U. The method is:
Step 4.2.4.1, initialize u=1.
Step 4.2.4.2, if str[u] is an API string, go to step 4.2.4.2.1; otherwise go to step 4.2.4.3.
Step 4.2.4.2.1, initialize the variable v=1.
Step 4.2.4.2.2, if str[u] contains L_api[v] as a substring, set Z_api[v][i]=1 and go to step 4.2.4.3; otherwise go to step 4.2.4.2.3.
Step 4.2.4.2.3, let v=v+1. If v≤32437, go to step 4.2.4.2.2; if v>32437, go to step 4.2.4.3.
Step 4.2.4.3, let u=u+1. If u≤U, go to step 4.2.4.2; if u>U, the smali file of the i-th sample has been fully scanned, so go to step 4.2.5.
Step 4.2.5, let i=i+1. If i≤N, go to step 4.2.4; if i>N, the assignment of the two-dimensional array Z_api is complete, so go to step 4.2.6.
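The scanning loop of steps 4.2.3 to 4.2.5 can be sketched as below. The patent does not say how a line is recognized as "an API string"; this sketch assumes that such a line is a smali `invoke-` instruction, which is an illustrative guess. The function name `build_presence_matrix` and the use of in-memory strings for smali files are also hypothetical.

```python
def build_presence_matrix(api_list, smali_texts):
    """Return Z with Z[v][i] = 1 if api_list[v] occurs in sample i, else 0.

    api_list    -- candidate API strings (the role of L_api)
    smali_texts -- one merged smali text per sample of D
    """
    # Step 4.2.3: every element starts at 0.
    z = [[0] * len(smali_texts) for _ in api_list]
    for i, text in enumerate(smali_texts):
        for line in text.splitlines():
            # Assumption: an "API string" is a line holding an invoke-* call.
            if "invoke-" not in line:
                continue
            for v, api in enumerate(api_list):
                if api in line:
                    z[v][i] = 1
                    break  # step 4.2.4.2.2: move on to the next line
    return z
```

As in the text, the inner loop stops at the first matching API for a line, so a line that matches several list entries only marks the first one. The nested loops cost on the order of 32437 substring checks per call line; an index keyed on the method signature would be faster but departs from the procedure as written.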
Step 4.2.6, compute the information gain IG of each API in the list L_api with respect to the benchmark test set D. The information gain of the v-th API with respect to D is denoted IG(D|L_api[v]).
Step 4.2.6.1, let v=1.
Step 4.2.6.2, let i=1. Let the first variable M11=0, the second variable M12=0, the third variable M21=0, and the fourth variable M22=0.
Step 4.2.6.3, if Z_api[v][i] equals 1 and y(i) equals 1, let M11=M11+1; if Z_api[v][i] equals 1 and y(i) equals -1, let M12=M12+1; if Z_api[v][i] equals 0 and y(i) equals 1, let M21=M21+1; if Z_api[v][i] equals 0 and y(i) equals -1, let M22=M22+1. (Benign samples carry the label -1, as defined in step 2.3.)
Step 4.2.6.4, let i=i+1. If i≤N, go to step 4.2.6.3; if i>N, go to step 4.2.6.5.
Step 4.2.6.5, compute IG(D|L_api[v]) as follows:
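$$IG(D \mid L_{api}[v]) = H(D) - H(D \mid L_{api}[v]) \tag{1}$$
where H(D) is the empirical entropy of the benchmark test set D, computed from the class counts M11+M21 (malicious) and M12+M22 (benign):
$$H(D) = -\frac{M_{11}+M_{21}}{N}\log_2\frac{M_{11}+M_{21}}{N} - \frac{M_{12}+M_{22}}{N}\log_2\frac{M_{12}+M_{22}}{N} \tag{2}$$
H(D|L_api[v]) is the empirical conditional entropy of D given the v-th API of the list L_api:
$$H(D \mid L_{api}[v]) = -\frac{1}{N}\left[M_{11}\log_2\frac{M_{11}}{M_{11}+M_{12}} + M_{12}\log_2\frac{M_{12}}{M_{11}+M_{12}} + M_{21}\log_2\frac{M_{21}}{M_{21}+M_{22}} + M_{22}\log_2\frac{M_{22}}{M_{21}+M_{22}}\right] \tag{3}$$
Terms with a zero count are taken to be 0.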
Step 4.2.6.6, let v=v+1. If v≤32437, go to step 4.2.6.2. If v>32437, the information gain with respect to D has been computed for every API in L_api; sort the APIs in L_api by IG(D|L_api[v]) in descending order, take the first 256 APIs after sorting as the API features, and go to step 4.3.
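The information-gain ranking of step 4.2.6 can be sketched as follows, using the standard definitions of empirical entropy and empirical conditional entropy over the four counts M11, M12, M21, and M22. The function names (`entropy`, `information_gain`, `top_k_apis`) are hypothetical, and the convention that zero counts contribute nothing to the entropy is an assumption of this sketch.

```python
import math


def entropy(counts):
    """Empirical entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(row, labels):
    """IG of one API: row[i] is Z_api[v][i], labels[i] is y(i) in {1, -1}."""
    m11 = m12 = m21 = m22 = 0
    for z, y in zip(row, labels):
        if z == 1 and y == 1:
            m11 += 1          # API present, malicious
        elif z == 1:
            m12 += 1          # API present, benign
        elif y == 1:
            m21 += 1          # API absent, malicious
        else:
            m22 += 1          # API absent, benign
    n = m11 + m12 + m21 + m22
    if n == 0:
        return 0.0
    h_d = entropy([m11 + m21, m12 + m22])
    h_cond = ((m11 + m12) / n) * entropy([m11, m12]) \
        + ((m21 + m22) / n) * entropy([m21, m22])
    return h_d - h_cond


def top_k_apis(api_list, z, labels, k=256):
    """Sort APIs by information gain, descending, and keep the first k."""
    scored = [(information_gain(z[v], labels), api)
              for v, api in enumerate(api_list)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [api for _, api in scored[:k]]
```

An API that appears in every malicious sample and in no benign one reaches the maximum gain H(D), while an API spread evenly across both classes scores 0 and drops to the bottom of the ranking.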
Step 4.3, the Android Dalvik virtual machine predefines smali opcodes that are 8 bits long (https://developer.android.com/reference/dalvik/bytecode/Opcodes.html). Including reserved, undefined values, there are at most 256 of them. These 256 smali opcodes are used as features, called smali opcode features.
Step 4.4, the feature screening module selects the 197 arm instruction opcodes listed in the ARM instruction quick reference card (http://infocenter.arm.com/help/topic/com.arm.doc.qrc0001mc/QRC0001_UAL.pdf) as features, called arm opcode features.
Step 4.5, send the permission features, API features, smali opcode features, and arm opcode features to the frequency fingerprint calculation module.
第五步,确定频率指纹格式。The fifth step is to determine the frequency fingerprint format.
将167种权限特征、256种API特征、256种smali操作码特征和197种arm操作码特征分别按字母顺序排列形成向量,分别称之为安卓软件的权限向量、API向量、smali操作码向量和arm操作码向量。The 167 permission features, 256 API features, 256 smali opcode features, and 197 arm opcode features are arranged in alphabetical order to form vectors, which are called Android software permission vectors, API vectors, smali opcode vectors and arm opcode vector.
一个安卓软件的权限向量由167个整数构成,每个整数取值为1或0。若第pa位置上的整数取值为1,说明筛选的167种权限中的第pa种在该安卓软件中被申请;若第pa位置上的整数取值为0,说明筛选的167种权限中的第pa种在该安卓软件中未被申请。pa为整数,1≤pa≤167。The permission vector of an Android software consists of 167 integers, each of which takes the value 1 or 0. If the integer value in the pa-th position is 1, it means that the pa-th type of the 167 kinds of permissions screened is applied for in the Android software; if the integer value of the pa-th position is 0, it means that the 167 kinds of permissions screened are in the pa-th kind. The type pa is not applied for in this Android software. pa is an integer, 1≤pa≤167.
一个安卓软件的API向量由256个小数组成,第pb位置上的小数说明筛选的256种API的第pb种在该安卓软件中出现的频率。pb为整数,1≤pb≤256。The API vector of an Android software consists of 256 decimals, and the decimal in the pbth position indicates the frequency of the pbth type of the 256 kinds of APIs screened in the Android software. pb is an integer, 1≤pb≤256.
一个安卓软件的smali操作码向量由256个小数组成,第pc位置上的小数说明筛选的256种smali操作码的第pc种在该安卓软件中出现的频率。pc为整数,1≤pc≤256。The smali opcode vector of an Android software consists of 256 decimals, and the decimal at the pcth position indicates the frequency of the pcth of the 256 smali opcodes screened in the Android software. pc is an integer, 1≤pc≤256.
一个安卓软件的arm操作码向量由197个小数组成,第pd位置上的小数说明筛选的197种arm操作码的第pd种在该安卓软件中出现的频率。pd为整数,1≤pd≤197。The arm opcode vector of an Android software consists of 197 decimals, and the decimal at the pd-th position indicates the frequency of the pd-th type of the selected 197 arm opcodes appearing in the Android software. pd is an integer, 1≤pd≤197.
四种向量首尾相接,形成一个长度为876的向量,作为样本的身份标识,称为频率指纹。频率指纹中含有的167个整数和709个小数,均称作频率指纹的元素。The four vectors are connected end to end to form a vector with a length of 876, which is used as the identity of the sample, which is called the frequency fingerprint. The 167 integers and 709 decimals contained in the frequency fingerprint are called the elements of the frequency fingerprint.
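The fingerprint layout can be sketched in a few lines of Python. This is an illustration only: the helper names `build_fingerprint` and `segment_slices` are hypothetical, and the segment order (permission, API, smali, arm) is assumed from the order in which the text lists the four vectors.

```python
# Segment lengths taken from the text: 167 + 256 + 256 + 197 = 876.
# The order of the segments is an assumption (the order the text lists them in).
SEGMENTS = (("perm", 167), ("api", 256), ("smali", 256), ("arm", 197))


def build_fingerprint(perm, api, smali, arm):
    """Concatenate the four feature vectors end to end (hypothetical helper)."""
    parts = {"perm": perm, "api": api, "smali": smali, "arm": arm}
    fingerprint = []
    for name, expected in SEGMENTS:
        if len(parts[name]) != expected:
            raise ValueError(f"{name} vector must have {expected} elements")
        fingerprint.extend(parts[name])
    return fingerprint


def segment_slices():
    """Return the slice of the fingerprint occupied by each vector."""
    slices, start = {}, 0
    for name, length in SEGMENTS:
        slices[name] = slice(start, start + length)
        start += length
    return slices


# An all-zero fingerprint, as produced by the initialization in step 6.2.
fingerprint = build_fingerprint([0] * 167, [0.0] * 256, [0.0] * 256, [0.0] * 197)
```

Keeping the slices alongside the concatenated vector is convenient later, because each kernel of the multi-kernel model operates on only one of the four segments.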
Step 6: The frequency fingerprint calculation module receives the permission features, API features, smali opcode features and arm opcode features from the feature screening module, receives the AndroidManifest.xml files, smali files and arm instruction files from the sample preprocessing module, and computes a frequency fingerprint for each of the N samples in benchmark test set D.
Step 6.1: Let La be the permission list, where list member La[pa] is the name string of the pa-th of the 167 permissions in alphabetical order. Let Lb be the API list, where Lb[pb] is the name string of the pb-th of the 256 APIs in alphabetical order. Let Lc be the smali opcode list, where Lc[pc] is the name string of the pc-th of the 256 smali opcodes in alphabetical order. Let Ld be the arm opcode list, where Ld[pd] is the name string of the pd-th of the 197 arm opcodes in alphabetical order. Let variable i=1.
Step 6.2: Take the i-th sample x(i) in D and generate for x(i) a frequency fingerprint of 876 elements, each initialized to 0. The fingerprint is made up of the permission vector, API vector, smali opcode vector and arm opcode vector of x(i); the steps below refer to the pa-th element of the permission vector, the pb-th element of the API vector, the pc-th element of the smali opcode vector and the pd-th element of the arm opcode vector.
Step 6.3: Use the permission extraction method to extract the permissions requested by x(i) and obtain the permission vector of x(i), as follows:
Step 6.3.1: Scan the AndroidManifest.xml file of x(i) line by line. Let stra[qa] be the string on line qa of the file and numa the total number of lines in the file.
Step 6.3.2: Let qa=1.
Step 6.3.3: If stra[qa] contains the substring "uses-permission", let pa=1 and go to step 6.3.4; otherwise go to step 6.3.6.
Step 6.3.4: If stra[qa] contains La[pa] as a substring, x(i) has requested the permission La[pa]; set the pa-th element of the permission vector of x(i) to 1 and go to step 6.3.6. Otherwise go to step 6.3.5.
Step 6.3.5: Let pa=pa+1. If pa≤167, go to step 6.3.4; if pa>167, La has been checked in full, so go to step 6.3.6.
Step 6.3.6: Let qa=qa+1. If qa≤numa, go to step 6.3.3; if qa>numa, the AndroidManifest.xml file of x(i) has been scanned in full and the permission vector of x(i) is complete, so go to step 6.4.
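Steps 6.3.1 to 6.3.6 amount to a nested scan: for each manifest line that mentions "uses-permission", mark the first permission name (in alphabetical order) found in that line. A minimal Python sketch follows; the function name `permission_vector` and the three-entry permission list are hypothetical stand-ins for the 167-entry list La.

```python
def permission_vector(manifest_lines, permission_names):
    """Return a 0/1 vector marking which permissions the manifest requests.

    permission_names plays the role of the list La and is assumed to be
    sorted alphabetically already.
    """
    vector = [0] * len(permission_names)
    for line in manifest_lines:                      # steps 6.3.2 and 6.3.6
        if "uses-permission" not in line:            # step 6.3.3
            continue
        for pa, name in enumerate(permission_names):
            if name in line:                         # step 6.3.4
                vector[pa] = 1
                break                                # first match ends this line
    return vector


# Hypothetical three-entry permission list standing in for La.
names = sorted(["INTERNET", "READ_SMS", "SEND_SMS"])
manifest = [
    '<manifest xmlns:android="http://schemas.android.com/apk/res/android">',
    '<uses-permission android:name="android.permission.INTERNET"/>',
    '<uses-permission android:name="android.permission.SEND_SMS"/>',
    "</manifest>",
]
vector = permission_vector(manifest, names)
```

Because the match is a plain substring test that stops at the first hit, a permission whose name contains another permission's name can be attributed to the shorter one; the sketch reproduces that behavior of the described steps rather than correcting it.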
Step 6.4: Use the API statistics method to count the APIs used by x(i) and obtain the API vector of x(i), as follows:
Step 6.4.1: Scan the smali file of x(i) line by line. Let strb[qb] be the string on line qb of the smali file and numb the total number of lines in the file.
Step 6.4.2: Let qb=1. Let variable inv denote the total number of API calls in the smali file, and let inv=1.
Step 6.4.3: Let variable pb=1.
Step 6.4.4: If strb[qb] contains the substring "invoke", let inv=inv+1 and go to step 6.4.5; otherwise go to step 6.4.7.
Step 6.4.5: If strb[qb] contains Lb[pb] as a substring, x(i) calls the API named Lb[pb]; increase the pb-th element of the API vector of x(i) by 1 and go to step 6.4.7. Otherwise go to step 6.4.6.
Step 6.4.6: Let pb=pb+1. If pb≤256, go to step 6.4.5; if pb>256, Lb has been checked in full, so go to step 6.4.7.
Step 6.4.7: Let qb=qb+1. If qb≤numb, go to step 6.4.3; if qb>numb, the smali file of x(i) has been scanned in full, so go to step 6.4.8.
Step 6.4.8: Let pb=1.
Step 6.4.9: Divide the pb-th element of the API vector of x(i) by inv, turning the count into a frequency.
Step 6.4.10: Let pb=pb+1. If pb≤256, go to step 6.4.9; if pb>256, the API vector of x(i) is complete, so go to step 6.5.
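The counting pass of steps 6.4.1 to 6.4.10 can be sketched as below. Two points are assumptions rather than statements from the text: that each match adds 1 to the matching element, and that the final frequency is the count divided by inv. The function name `api_frequency_vector` and the two-entry API list are hypothetical.

```python
def api_frequency_vector(smali_lines, api_names):
    """Count API calls per name and normalize by the invoke counter.

    api_names plays the role of the list Lb (alphabetical order assumed).
    """
    counts = [0] * len(api_names)
    inv = 1                                  # step 6.4.2 starts the counter at 1
    for line in smali_lines:
        if "invoke" not in line:             # step 6.4.4
            continue
        inv += 1
        for pb, name in enumerate(api_names):
            if name in line:                 # step 6.4.5: first match wins
                counts[pb] += 1
                break
    return [count / inv for count in counts]  # steps 6.4.8 to 6.4.10


# Hypothetical two-entry API list standing in for Lb.
apis = ["getDeviceId", "sendTextMessage"]
smali = [
    "invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;",
    "const/4 v0, 0x0",
    "invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;",
    "invoke-static {v1}, Ljava/lang/String;->valueOf(I)Ljava/lang/String;",
]
frequencies = api_frequency_vector(smali, apis)
```

Starting inv at 1 means the denominator is never zero, even for a smali file with no invoke lines, at the cost of a denominator that is one larger than the true number of calls.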
Step 6.5: Use the smali opcode statistics method to count the smali opcodes used by x(i) and obtain the smali opcode vector of x(i), as follows:
Step 6.5.1: Scan the smali file of x(i) line by line. Let strc[qc] be the string on line qc of the smali file and numc the total number of lines in the file.
Step 6.5.2: Let qc=1. Let variable ops denote the total number of smali opcodes in the smali file, and let ops=1.
Step 6.5.3: Let pc=1.
Step 6.5.4: If strc[qc] contains Lc[pc] as a substring, increase the pc-th element of the smali opcode vector of x(i) by 1, let ops=ops+1 and go to step 6.5.6. Otherwise go to step 6.5.5.
Step 6.5.5: Let pc=pc+1. If pc≤256, go to step 6.5.4; if pc>256, Lc has been checked in full, so go to step 6.5.6.
Step 6.5.6: Let qc=qc+1. If qc≤numc, go to step 6.5.3; if qc>numc, the smali file of x(i) has been scanned in full, so go to step 6.5.7.
Step 6.5.7: Let pc=1.
Step 6.5.8: Divide the pc-th element of the smali opcode vector of x(i) by ops, turning the count into a frequency.
Step 6.5.9: Let pc=pc+1. If pc≤256, go to step 6.5.8; if pc>256, the smali opcode vector of x(i) is complete, so go to step 6.6.
Step 6.6: Use the arm opcode statistics method to count the arm opcodes used by x(i) and obtain the arm opcode vector of x(i), as follows:
Step 6.6.1: Scan the arm instruction file of x(i) line by line. Let strd[qd] be the string on line qd of the file and numd the total number of lines in the file.
Step 6.6.2: Let qd=1. Let variable opa denote the total number of arm opcodes used in the arm instruction file, and let opa=1. If qd≤numd, go to step 6.6.3; if qd>numd, the arm instruction file of x(i) is empty, so go to step 6.7.
Step 6.6.3: Let pd=1.
Step 6.6.4: If strd[qd] contains the character ">", strd[qd] contains an arm instruction; let opa=opa+1 and go to step 6.6.5. Otherwise go to step 6.6.7.
Step 6.6.5: If strd[qd] contains Ld[pd] as a substring, increase the pd-th element of the arm opcode vector of x(i) by 1 and go to step 6.6.7. Otherwise go to step 6.6.6.
Step 6.6.6: Let pd=pd+1. If pd≤197, go to step 6.6.5; if pd>197, Ld has been checked in full, so go to step 6.6.7.
Step 6.6.7: Let qd=qd+1. If qd≤numd, go to step 6.6.3; if qd>numd, the arm instruction file of x(i) has been scanned in full, so go to step 6.6.8.
Step 6.6.8: Let pd=1.
Step 6.6.9: Divide the pd-th element of the arm opcode vector of x(i) by opa, turning the count into a frequency.
Step 6.6.10: Let pd=pd+1. If pd≤197, go to step 6.6.9; if pd>197, the arm opcode vector of x(i) is complete, so go to step 6.7.
Step 6.7: Let i=i+1. If i≤N, go to step 6.2; if i>N, frequency fingerprints have been computed for all N samples in D. Send the frequency fingerprints to the detection module and go to Step 7.
Step 7: The detection module receives the frequency fingerprints from the frequency fingerprint generation module and trains a multi-kernel support vector machine model, which becomes a classifier suitable for classifying the software under test. The multi-kernel support vector machine model is a classification model built on the support vector machine model; it uses several kernel functions to map feature-space vectors from a low-dimensional space to a high-dimensional space in order to strengthen classification. For benchmark test set D, the feature space is the set of frequency fingerprints of the N samples in D.
Let kperm, kapi, ksmali and karm denote the kernel functions applied to the permission vector, API vector, smali opcode vector and arm opcode vector within a frequency fingerprint. Let β be the weight vector (βperm, βapi, βsmali, βarm), whose elements are the weights of kperm, kapi, ksmali and karm respectively. Let T be the set {perm, api, smali, arm}, that is, the subscripts of the four kernel functions, introduced so that formula (4) can be written compactly. The multi-kernel support vector machine model Y is given by formula (4).
α(i) is a Lagrange multiplier, and {α(1), α(2), ..., α(i), ..., α(N)} form the vector α. sgn(A) is the sign function of its argument A: sgn(A)=1 when A>0, sgn(A)=0 when A=0, and sgn(A)=-1 when A<0. α and β are obtained by solving formula (5).
Formula (5) is subject to the constraints of formulas (6) to (9):
0≤α(i)≤C (7)
∑t∈T βt=1 (8)
βt≥0, t∈T (9)
where C is the penalty coefficient, C≥0, which sets the size of the penalty for misclassification.
b is a scalar. Once α and β have been found, b is obtained from formula (10), which is evaluated at a support vector sample point.
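Formulas (4), (5), (6) and (10) are referenced here but their bodies do not appear in this text. As a hedged sketch only, a standard multiple-kernel SVM formulation that is consistent with the definitions above would read as follows. The symbols $x_t^{(i)}$ (the type-$t$ vector of sample $x^{(i)}$), $x_t$ (the type-$t$ vector of the sample being classified) and $x^{(s)}$ (a support vector), as well as the exact form of every expression, are assumptions and may differ from the patent's own formulas.

```latex
% (4) decision function: weighted sum of the four kernels
Y(x) = \operatorname{sgn}\Big( \sum_{i=1}^{N} \alpha^{(i)} y^{(i)}
       \sum_{t \in T} \beta_t \, k_t\big(x_t^{(i)}, x_t\big) + b \Big)

% (5) dual objective L(\alpha, \beta), maximized over \alpha and minimized over \beta
L(\alpha, \beta) = \sum_{i=1}^{N} \alpha^{(i)}
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)}
    \sum_{t \in T} \beta_t \, k_t\big(x_t^{(i)}, x_t^{(j)}\big)

% (6) equality constraint linking the multipliers
\sum_{i=1}^{N} \alpha^{(i)} y^{(i)} = 0

% (10) bias, evaluated at a support vector x^{(s)} with 0 < \alpha^{(s)} < C
b = y^{(s)} - \sum_{i=1}^{N} \alpha^{(i)} y^{(i)}
    \sum_{t \in T} \beta_t \, k_t\big(x_t^{(i)}, x_t^{(s)}\big)
```

Under this reading, constraint (6) is what lets one multiplier of a pair be written in terms of the other, which is the substitution that steps 7.2.2.1 and 7.2.2.2 rely on.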
The multi-kernel support vector machine model is trained as follows:
Step 7.1: Compute the kernel matrices from the frequency fingerprints of the samples in D received from the frequency fingerprint generation module. Let Kt be a kernel matrix, t∈T, standing for the four kernel matrices Kperm, Kapi, Ksmali and Karm. Kt has N rows and N columns; its element in row i, column j is the value of kernel kt on the type-t vectors of samples x(i) and x(j). A degree-3 polynomial kernel is chosen. Kt is computed as follows:
Step 7.1.1: Let i=1.
Step 7.1.2: Let j=1.
Step 7.1.3: Compute the element in row i, column j of Kt by formula (11), which applies the degree-3 polynomial kernel to the inner product of the type-t vectors of x(i) and x(j).
Step 7.1.4: Let j=j+1. If j≤N, go to step 7.1.3; if j>N, go to step 7.1.5.
Step 7.1.5: Let i=i+1. If i≤N, go to step 7.1.2; if i>N, Kt is complete, so go to step 7.2.
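The double loop of steps 7.1.1 to 7.1.5 fills an N-by-N matrix one element at a time. A minimal sketch follows, assuming a degree-3 polynomial kernel of the common form (⟨u, v⟩ + c)³ with c = 1; the offset c and the function names are assumptions, since the body of formula (11) is not present in this text.

```python
def poly3_kernel(u, v, c=1.0):
    """Degree-3 polynomial kernel (<u, v> + c) ** 3; the offset c is an assumption."""
    inner = sum(a * b for a, b in zip(u, v))
    return (inner + c) ** 3


def kernel_matrix(vectors, kernel=poly3_kernel):
    """Build the N x N matrix K with K[i][j] = kernel(vectors[i], vectors[j])."""
    n = len(vectors)
    return [[kernel(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


# Three toy 2-element vectors standing in for one segment (for example the
# permission vectors) of three fingerprints.
K = kernel_matrix([[1, 0], [0, 1], [1, 1]])
```

One such matrix would be built per segment type in T, giving Kperm, Kapi, Ksmali and Karm. Since the kernel is symmetric, only the upper triangle needs to be computed in practice.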
Step 7.2: Optimize the parameters α and β as follows:
Step 7.2.1: Initialize every element of the vector α to 0 and every element of the vector β to 1/4.
Step 7.2.2: Using formula (5), in ascending order of the superscripts r and s, hold (α(1), α(2), ..., α(r-1), α(r+1), ..., α(s-1), α(s+1), ..., α(N)) and the vector β fixed, and select a pair α(r), α(s) to optimize α, as follows:
Step 7.2.2.1: Using the constraint of formula (6), reduce formula (5) to a quadratic function g(α(r)) of the single variable α(r). Differentiate g(α(r)), set the derivative equal to 0, and solve for α(r).
Step 7.2.2.2: Use the constraint of formula (6) to obtain α(s).
Step 7.2.2.3: Update α(r) and α(s) to obtain the optimized α, named α*.
Step 7.2.3: Holding α* fixed, optimize β as follows:
Step 7.2.3.1: Compute the partial derivatives of formula (5) with respect to β, set them equal to 0, and find the solution that satisfies the constraints of formulas (8) and (9). This gives the optimized values of βperm, βapi, βsmali and βarm.
Step 7.2.3.2: Assemble the four optimized values into the optimized β, named β*.
Step 7.2.4: Check whether α and β satisfy the optimization termination conditions of formulas (12) to (14):
L(α*,β*)-L(α,β)≤ε (14)
When formula (14) holds, optimizing α and β changes the value of the function in formula (5) by no more than the threshold ε, 0<ε≤0.1. The optimized α and β then meet the requirement and training of the multi-kernel support vector machine model is finished, so go to step 7.3. Otherwise go to step 7.2.2.
Step 7.3: Compute b by formula (10). The multi-kernel support vector machine model defined by formula (4) is now trained and optimized and serves as the classifier.
Step 8: Use the Android malware detection system based on frequency fingerprint extraction to examine software under test that the Google official or third-party Android app market server receives from users, and decide whether it is malware, as follows:
Step 8.1: The sample preprocessing module preprocesses the software under test. Treat the software under test as sample x(a) and apply the sample preprocessing method of step 3.3 to x(a) to obtain the AndroidManifest.xml file, smali file and arm instruction file of x(a), which are output to the frequency fingerprint calculation module.
Step 8.2: The frequency fingerprint calculation module computes the frequency fingerprint of x(a), as follows:
Step 8.2.1: Use the permission extraction method of step 6.3 to extract the permissions requested by x(a) and obtain the permission vector of x(a).
Step 8.2.2: Use the API statistics method of step 6.4 to count the APIs used by x(a) and obtain the API vector of x(a).
Step 8.2.3: Use the smali opcode statistics method of step 6.5 to count the smali opcodes used by x(a) and obtain the smali opcode vector of x(a).
Step 8.2.4: Use the arm opcode statistics method of step 6.6 to count the arm opcodes used by x(a) and obtain the arm opcode vector of x(a).
Step 8.2.5: Concatenate the four computed vectors into the frequency fingerprint of x(a).
Step 8.3: Input the frequency fingerprint of x(a) into the detection module (by now the optimized classifier suitable for detection), which computes the output value F by formula (4). F equals +1 or -1: +1 means the software under test is malware, and -1 means it is benign. This achieves the purpose of deciding whether the software under test is malware.
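The classification in step 8.3 can be illustrated with a small decision function. This sketch assumes the standard multiple-kernel form sgn(Σ α(i) y(i) Σ βt kt(·,·) + b) and a degree-3 polynomial kernel with offset 1; neither formula (4) nor formula (11) is reproduced in this text, so both forms, all function names and all numeric values below are hypothetical.

```python
def poly3_kernel(u, v, c=1.0):
    """Degree-3 polynomial kernel (<u, v> + c) ** 3; the offset c is an assumption."""
    return (sum(a * b for a, b in zip(u, v)) + c) ** 3


def classify(alphas, labels, betas, train_parts, query_parts, b, kernel=poly3_kernel):
    """Return +1 (malware), -1 (benign) or 0 for a fingerprint split into segments.

    train_parts[i] and query_parts map each segment name in T to its vector.
    """
    score = b
    for alpha, label, parts in zip(alphas, labels, train_parts):
        if alpha == 0:                       # only support vectors contribute
            continue
        combined = sum(betas[t] * kernel(parts[t], query_parts[t]) for t in betas)
        score += alpha * label * combined
    return (score > 0) - (score < 0)         # sign function with sgn(0) = 0


# Toy model with two training samples and short vectors (all values made up).
betas = {"perm": 0.25, "api": 0.25, "smali": 0.25, "arm": 0.25}
train = [
    {"perm": [1, 1], "api": [0.6], "smali": [0.2], "arm": [0.1]},   # label +1
    {"perm": [0, 0], "api": [0.0], "smali": [0.1], "arm": [0.0]},   # label -1
]
alphas, labels, bias = [1.0, 1.0], [1, -1], -2.0

suspicious = {"perm": [1, 1], "api": [0.5], "smali": [0.2], "arm": [0.1]}
harmless = {"perm": [0, 0], "api": [0.0], "smali": [0.1], "arm": [0.0]}
result_suspicious = classify(alphas, labels, betas, train, suspicious, bias)
result_harmless = classify(alphas, labels, betas, train, harmless, bias)
```

In a trained model, alphas, betas and bias would come from step 7 rather than being set by hand as they are here.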
Compared with other techniques, the present invention has the following advantages:
First, high accuracy. The invention combines permission, API, smali opcode and arm opcode features into a frequency fingerprint, which accurately expresses the attributes of an Android application and is suitable as its identity. The multi-kernel support vector machine trained on frequency fingerprints serves as a classifier that effectively integrates information from each component of the application and produces accurate detection results.
Second, high efficiency, in two respects. Frequency fingerprint generation is efficient: the invention scans the AndroidManifest.xml file, smali file and arm instruction file and counts the frequencies of permissions, APIs, smali opcodes and arm opcodes, which can be done in linear time. Training of the classification model is also efficient: compared with the large number of parameters in a neural network model, a multi-kernel support vector machine has few parameters, so parameter optimization requires less computation and training efficiency is markedly higher.
Description of drawings
Figure 1 is a structural diagram of the Android malware detection system based on frequency fingerprint extraction.
Figure 2 is the overall flowchart of the present invention.
Detailed description
The present invention is described in detail below with reference to the drawings.
The technical solution of the present invention, shown in Figure 2, includes the following steps:
Step 1: Build the Android malware detection system based on frequency fingerprint extraction. The system is installed on a Google official or third-party Android app market server. Its overall structure, shown in Figure 1, consists of a sample preprocessing module, a frequency fingerprint generation module and a detection module.
The sample preprocessing module is connected to the frequency fingerprint generation module. It receives samples from the benchmark test set built by developers and samples under test submitted by ordinary users, preprocesses them to produce three types of file (AndroidManifest.xml, smali file and arm instruction file), and outputs these to the frequency fingerprint generation module.
The frequency fingerprint generation module is connected to the sample preprocessing module and the detection module. It receives the AndroidManifest.xml, smali and arm instruction files from the sample preprocessing module, performs feature screening and frequency fingerprint calculation, and outputs the resulting frequency fingerprints to the detection module. It consists of a feature screening module and a frequency fingerprint calculation module. The feature screening module is connected to the sample preprocessing module and the frequency fingerprint calculation module; it receives the AndroidManifest.xml, smali and arm instruction files from the sample preprocessing module, screens these three kinds of file for features to obtain the permission, API, smali opcode and arm opcode features, and sends those features to the frequency fingerprint calculation module. The frequency fingerprint calculation module is connected to the sample preprocessing module, the feature screening module and the detection module; it receives the permission, API, smali opcode and arm opcode features from the feature screening module and the AndroidManifest.xml, smali and arm instruction files from the sample preprocessing module, computes the frequency fingerprints, and sends them to the detection module.
The detection module is connected to the frequency fingerprint generation module. It is a multi-kernel support vector machine model that receives the frequency fingerprints of benchmark test set D and the frequency fingerprint of the software under test from the frequency fingerprint generation module. It is trained and optimized on the frequency fingerprints of D to become a classifier suitable for examining the software under test, then classifies the software under test by its frequency fingerprint and outputs a verdict on whether it is malware.
In Figure 1, the solid arrows from the sample preprocessing module to the frequency fingerprint generation module and the detection module show how the system processes the samples in benchmark test set D. The dashed arrows from the sample preprocessing module to the frequency fingerprint generation module and the detection module show how it processes a sample under test (as Step 8 shows, the software under test does not pass through the feature screening module).
Step 2: Build benchmark test set D as follows:
Step 2.1: Obtain N1 Android malware samples from the open-source Drebin, Genome and AMD datasets as malicious samples. N1 is a positive integer and N1=2000.
Step 2.2: Obtain benign software by crawling the Google Play and Apkpure app stores, and filter it with local antivirus software and the VirusTotal online scanning site to form N2 benign samples. N2 is a positive integer and N2=2000.
Step 2.3: Label the malicious and benign samples to form benchmark test set D. N is the total number of samples in D, N=N1+N2. Define x(i) as the i-th sample in D and y(i) as the label of x(i): y(i)=1 means x(i) is a malicious sample and y(i)=-1 means x(i) is a benign sample, 1≤i≤N.
Step 2.4: Store D in storage that both the sample preprocessing module and the frequency fingerprint generation module can read (for example, the storage of the Google official or third-party Android app market server on which the detection system is installed).
Step 3: The sample preprocessing module preprocesses the N samples in D to obtain N AndroidManifest.xml files, N smali files and N arm instruction files.
Step 3.1: Let variable i=1.
Step 3.2: Take the i-th sample x(i) from D.
Step 3.3: Use the sample preprocessing method to preprocess x(i) and obtain the AndroidManifest.xml file, smali file and arm instruction file of x(i), as follows:
Step 3.3.1: Use the decompression tool Gzip to decompress x(i) and extract the AndroidManifest.xml, classes.dex and .so runtime library files from x(i).
Step 3.3.2: Use AXMLPrinter2 version 2.0, a decoding tool dedicated to AndroidManifest.xml files, to convert the AndroidManifest.xml file from binary form to text form.
Step 3.3.3: Use the dex decompilation tool baksmali version 2.4.0 to decompile classes.dex into smali files. If several smali files are produced, merge them into one smali file and go to step 3.3.4; if only one smali file is produced, go directly to step 3.3.4.
Step 3.3.4: Use the arm toolchain gcc-arm-none-eabi version 9-2019-q4-major to disassemble the .so runtime library files into arm instruction files in text form. If several arm instruction files are produced, merge them into one arm instruction file and go to step 3.4; if no arm instruction file is produced, create an empty arm instruction file and go to step 3.4.
Step 3.4: Let i=i+1. If i≤N, go to step 3.2. If i>N, the N samples have produced N AndroidManifest.xml files, N smali files and N arm instruction files; send these files to the feature screening module and go to Step 4.
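The extraction in step 3.3.1 can be sketched with Python's standard `zipfile` module. This swaps in `zipfile` for the Gzip tool named in the text, on the assumption that each sample is a standard APK, which is a ZIP archive. The function name `select_apk_members` is hypothetical, and the binary-XML decoding, baksmali and disassembly stages of steps 3.3.2 to 3.3.4 are not covered.

```python
import io
import zipfile


def select_apk_members(apk):
    """List the archive members that step 3.3.1 extracts from an APK.

    apk may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(apk) as archive:
        names = archive.namelist()
    return {
        "manifest": [n for n in names if n == "AndroidManifest.xml"],
        "dex": [n for n in names if n == "classes.dex"],
        "so": [n for n in names if n.endswith(".so")],
    }


# Build a tiny in-memory archive with APK-like member names for demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("AndroidManifest.xml", b"binary xml placeholder")
    demo.writestr("classes.dex", b"dex placeholder")
    demo.writestr("lib/armeabi-v7a/libnative.so", b"elf placeholder")
    demo.writestr("res/drawable/icon.png", b"png placeholder")
buffer.seek(0)
members = select_apk_members(buffer)
```

An application with no native code yields an empty "so" list, which corresponds to the case in step 3.3.4 where an empty arm instruction file is created.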
第四步,特征筛选模块对从样本预处理模块收到的D的N个样本对应的N个AndroidManifest.xml文件、N个smali文件和N个arm指令文件进行特征筛选,得到适合对D进行分类的权限特征、API特征、smali操作码特征和arm操作码特征。In the fourth step, the feature screening module performs feature screening on the N AndroidManifest.xml files, N smali files and N arm command files corresponding to the N samples of D received from the sample preprocessing module, and obtains a feature suitable for classifying D. Permission features, API features, smali opcode features, and arm opcode features.
Step 4.1: Select the 167 android.manifest.permission system permissions defined in the Android developer documentation (https://developer.android.com/reference/android/Manifest.permission) and use them as features, called permission features.
Step 4.2: Select 256 APIs from the APIs in the PScout list (https://security.csl.toronto.edu/pscout/?mdocs-file=67), as follows:
Step 4.2.1: Create a list L_api and add all 32437 APIs in the PScout list to it. The v-th API is denoted L_api[v], 1≤v≤32437.
Step 4.2.2: Create a two-dimensional array Z_api with 32437 rows and N columns. The element Z_api[v][i] in row v and column i is either 1 or 0: 1 means that the v-th API of L_api appears in the i-th sample of D, and 0 means that it does not.
Step 4.2.3: Initialize all elements of Z_api to 0 and initialize the variable i=1.
Step 4.2.4: Scan the smali file of the i-th sample of D line by line to find the APIs of L_api that appear in the i-th sample, and assign the elements of column i of Z_api accordingly. Let str[u] denote the string on line u of the smali file and U the total number of lines in the smali file, 1≤u≤U.
Step 4.2.4.1: Initialize u=1.
Step 4.2.4.2: If str[u] is an API string, go to step 4.2.4.2.1; otherwise, go to step 4.2.4.3.
Step 4.2.4.2.1: Initialize the variable v=1.
Step 4.2.4.2.2: If str[u] contains the substring L_api[v], assign Z_api[v][i]=1 and go to step 4.2.4.3; otherwise, go to step 4.2.4.2.3.
Step 4.2.4.2.3: Let v=v+1. If v≤32437, go to step 4.2.4.2.2; if v>32437, go to step 4.2.4.3.
Step 4.2.4.3: Let u=u+1. If u≤U, go to step 4.2.4.2; if u>U, go to step 4.2.5.
Step 4.2.5: Let i=i+1. If i≤N, go to step 4.2.4; if i>N, the assignment of the two-dimensional array Z_api is complete; go to step 4.2.6.
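The scan in steps 4.2.1 to 4.2.5 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `build_occurrence_matrix` is hypothetical, indices are 0-based rather than 1-based, and the test for "is an API string" is assumed to be the presence of `invoke` (the source only defines that test later, in step 6.4.4).

```python
from typing import List, Sequence


def build_occurrence_matrix(api_list: Sequence[str],
                            samples: Sequence[Sequence[str]]) -> List[List[int]]:
    """Return Z where Z[v][i] is 1 if api_list[v] occurs in sample i, else 0.

    Each sample is given as the list of lines of its smali file.
    """
    z = [[0] * len(samples) for _ in api_list]
    for i, smali_lines in enumerate(samples):
        for line in smali_lines:
            # Assumption: a line is an "API string" when it contains "invoke".
            if "invoke" not in line:
                continue
            # As in step 4.2.4.2.2, stop at the first list entry found in the line.
            for v, api in enumerate(api_list):
                if api in line:
                    z[v][i] = 1
                    break
    return z
```

Each row of the result is the occurrence pattern of one API across all N samples, which is the input the information-gain step needs.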
Step 4.2.6: Compute the information gain IG of each API in the list L_api with respect to the benchmark test set D. The information gain of the v-th API with respect to D is denoted IG(D|L_api[v]).
Step 4.2.6.1: Let v=1.
Step 4.2.6.2: Let i=1. Let the first variable M_11=0, the second variable M_12=0, the third variable M_21=0, and the fourth variable M_22=0.
Step 4.2.6.3: If Z_api[v][i]=1 and y^(i)=1, let M_11=M_11+1; if Z_api[v][i]=1 and y^(i)=0, let M_12=M_12+1; if Z_api[v][i]=0 and y^(i)=1, let M_21=M_21+1; if Z_api[v][i]=0 and y^(i)=0, let M_22=M_22+1.
Step 4.2.6.4: Let i=i+1. If i≤N, go to step 4.2.6.3; if i>N, go to step 4.2.6.5.
Step 4.2.6.5: Compute IG(D|L_api[v]) as:
IG(D|L_api[v]) = H(D) - H(D|L_api[v]) (1)
where H(D) is the empirical entropy of the benchmark test set D, computed by formula (2),
and H(D|L_api[v]) is the empirical conditional entropy of D given the v-th API of the list L_api, computed by formula (3).
Step 4.2.6.6: Let v=v+1. If v≤32437, go to step 4.2.6.2; if v>32437, the information gain of every API in L_api with respect to D has been computed. Sort the APIs in L_api by IG(D|L_api[v]) in descending order, take the first 256 APIs after sorting as the API features, and go to step 4.3.
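The information-gain ranking of step 4.2.6 can be illustrated with the sketch below. The source's own expressions for H(D) and H(D|L_api[v]) are not reproduced in this text, so the code assumes the standard definitions of empirical entropy and empirical conditional entropy over the four counts M_11, M_12, M_21, M_22. The function names are hypothetical and labels are taken to be 1 or 0 as in step 4.2.6.3.

```python
import math
from typing import List, Sequence


def entropy(counts: Sequence[int]) -> float:
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(z_row: Sequence[int], labels: Sequence[int]) -> float:
    """IG of one API: z_row[i] is its occurrence in sample i, labels[i] is y(i)."""
    m11 = m12 = m21 = m22 = 0
    for z, y in zip(z_row, labels):
        if z == 1 and y == 1:
            m11 += 1
        elif z == 1 and y == 0:
            m12 += 1
        elif z == 0 and y == 1:
            m21 += 1
        else:
            m22 += 1
    n = m11 + m12 + m21 + m22
    if n == 0:
        return 0.0
    h_d = entropy([m11 + m21, m12 + m22])               # assumed form of H(D)
    h_cond = ((m11 + m12) / n) * entropy([m11, m12]) \
        + ((m21 + m22) / n) * entropy([m21, m22])       # assumed form of H(D|API)
    return h_d - h_cond


def top_k_apis(api_list: Sequence[str], z: Sequence[Sequence[int]],
               labels: Sequence[int], k: int = 256) -> List[str]:
    """Sort APIs by information gain, descending, and keep the first k."""
    gains = [information_gain(row, labels) for row in z]
    order = sorted(range(len(api_list)), key=lambda v: gains[v], reverse=True)
    return [api_list[v] for v in order[:k]]
```

An API that appears in exactly the malicious samples gets the maximum gain of 1 bit on a balanced set, while an API that is equally common in both classes gets a gain of 0.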
Step 4.3: The Android Dalvik virtual machine predefines smali opcodes that are 8 binary bits long (https://developer.android.com/reference/dalvik/bytecode/Opcodes.html). Including the reserved, undefined values, there are at most 256 of them. These 256 smali opcodes are used as features, called smali opcode features.
Step 4.4: Based on the arm instruction quick reference manual (http://infocenter.arm.com/help/topic/com.arm.doc.qrc0001mc/QRC0001_UAL.pdf), the feature screening module selects the 197 arm instruction opcodes listed in that manual as features, called arm opcode features.
Step 4.5: Send the permission features, API features, smali opcode features, and arm opcode features to the frequency fingerprint calculation module.
Step 5: Determine the frequency fingerprint format.
Arrange the 167 permission features, 256 API features, 256 smali opcode features, and 197 arm opcode features each in alphabetical order to form four vectors, called the permission vector, API vector, smali opcode vector, and arm opcode vector of the Android software. The four vectors are concatenated end to end into a single vector of length 876 (167+256+256+197), which serves as the frequency fingerprint of a sample.
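The fingerprint layout can be made concrete as index ranges. The sketch below is illustrative only: the names `FEATURE_SIZES` and `fingerprint_slices` are hypothetical, and indices are 0-based.

```python
# Sizes of the four sub-vectors, in the order in which they are concatenated.
FEATURE_SIZES = {"perm": 167, "api": 256, "smali": 256, "arm": 197}


def fingerprint_slices(sizes=FEATURE_SIZES):
    """Return ({name: slice}, total_length) for the concatenated fingerprint."""
    slices = {}
    start = 0
    for name, size in sizes.items():
        slices[name] = slice(start, start + size)
        start += size
    return slices, start


SLICES, FINGERPRINT_LENGTH = fingerprint_slices()
```

With these slices, `fingerprint[SLICES["api"]]` selects the API vector from an 876-element fingerprint without copying index arithmetic around the code.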
Step 6: The frequency fingerprint calculation module receives the permission features, API features, smali opcode features, and arm opcode features from the feature screening module, receives the AndroidManifest.xml file, smali file, and arm instruction file from the sample preprocessing module, and computes the frequency fingerprints of the N samples in the benchmark test set D.
Step 6.1: Let L_a be the permission list, where the member L_a[pa] is the name string of the pa-th of the 167 permissions in alphabetical order. Let L_b be the API list, where L_b[pb] is the name string of the pb-th of the 256 APIs in alphabetical order. Let L_c be the smali opcode list, where L_c[pc] is the name string of the pc-th of the 256 smali opcodes in alphabetical order. Let L_d be the arm opcode list, where L_d[pd] is the name string of the pd-th of the 197 arm opcodes in alphabetical order. Let the variable i=1.
Step 6.2: Take the i-th sample x^(i) of D and generate for it a frequency fingerprint of 876 elements, each initialized to 0. Within this fingerprint, the elements of the permission vector are indexed by pa, those of the API vector by pb, those of the smali opcode vector by pc, and those of the arm opcode vector by pd.
Step 6.3: Use the permission extraction method to extract the permissions requested by x^(i) and obtain the permission vector of x^(i), as follows:
Step 6.3.1: Scan the AndroidManifest.xml file of x^(i) line by line. Let stra[qa] denote the string on line qa of the AndroidManifest.xml file and numa the total number of lines in the file.
Step 6.3.2: Let qa=1.
Step 6.3.3: If stra[qa] contains the substring "uses-permission", let pa=1 and go to step 6.3.4; otherwise, go to step 6.3.6.
Step 6.3.4: If stra[qa] contains the substring L_a[pa], then x^(i) has requested the permission L_a[pa]; record this in the pa-th element of the permission vector of x^(i) and go to step 6.3.6. If stra[qa] does not contain the substring L_a[pa], go to step 6.3.5.
Step 6.3.5: Let pa=pa+1. If pa≤167, go to step 6.3.4; if pa>167, one full pass over L_a has been completed; go to step 6.3.6.
Step 6.3.6: Let qa=qa+1. If qa≤numa, go to step 6.3.3; if qa>numa, the AndroidManifest.xml file of x^(i) has been fully scanned and the permission vector of x^(i) is complete; go to step 6.4.
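The permission extraction of step 6.3 can be sketched as below. The exact assignment made in step 6.3.4 is not legible in this text, so the code assumes that a requested permission is recorded as the value 1; the function name `permission_vector` is hypothetical and indices are 0-based.

```python
from typing import List, Sequence


def permission_vector(manifest_lines: Sequence[str],
                      permissions: Sequence[str]) -> List[int]:
    """Mark which of the (alphabetically sorted) permissions the manifest requests."""
    vec = [0] * len(permissions)
    for line in manifest_lines:
        if "uses-permission" not in line:
            continue
        # Scan the permission list and stop at the first name found in the line.
        for pa, name in enumerate(permissions):
            if name in line:
                vec[pa] = 1  # assumption: a binary flag per requested permission
                break
    return vec
```

Because matching is by substring and stops at the first hit, a permission whose name is a prefix of another (for example `android.permission.BLUETOOTH` and `android.permission.BLUETOOTH_ADMIN`) would be attributed to the shorter name. Comparing against the full quoted attribute value would avoid that, at the cost of departing from the literal steps.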
Step 6.4: Use the API statistics method to count the APIs used by x^(i) and obtain the API vector of x^(i), as follows:
Step 6.4.1: Scan the smali file of x^(i) line by line. Let strb[qb] denote the string on line qb of the smali file and numb the total number of lines in the smali file.
Step 6.4.2: Let qb=1. Use the variable inv to denote the total number of APIs in the smali file, and let inv=1.
Step 6.4.3: Let the variable pb=1.
Step 6.4.4: If strb[qb] contains the substring "invoke", let inv=inv+1 and go to step 6.4.5; otherwise, go to step 6.4.7.
Step 6.4.5: If strb[qb] contains the substring L_b[pb], then x^(i) calls the API named L_b[pb]; update the pb-th element of the API vector of x^(i) to count this call and go to step 6.4.7. If strb[qb] does not contain the substring L_b[pb], go to step 6.4.6.
Step 6.4.6: Let pb=pb+1. If pb≤256, go to step 6.4.5; if pb>256, one full pass over L_b has been completed; go to step 6.4.7.
Step 6.4.7: Let qb=qb+1. If qb≤numb, go to step 6.4.3; if qb>numb, the smali file of x^(i) has been fully scanned; go to step 6.4.8.
Step 6.4.8: Let pb=1.
Step 6.4.9: Convert the pb-th element of the API vector of x^(i) into a frequency using inv.
Step 6.4.10: Let pb=pb+1. If pb≤256, go to step 6.4.9; if pb>256, the API vector of x^(i) is complete; go to step 6.5.
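The API statistics method of step 6.4 can be sketched as follows. The update in step 6.4.5 and the normalization in step 6.4.9 are not legible in this text; the sketch assumes the natural reading for a frequency fingerprint, namely that each match adds 1 to a counter and each counter is finally divided by inv. The function name is hypothetical.

```python
from typing import List, Sequence


def api_frequency_vector(smali_lines: Sequence[str],
                         api_features: Sequence[str]) -> List[float]:
    """Relative call frequency of each selected API in one sample's smali code."""
    counts = [0] * len(api_features)
    inv = 1  # step 6.4.2 starts the invoke counter at 1, which also avoids division by zero
    for line in smali_lines:
        if "invoke" not in line:
            continue
        inv += 1
        for pb, api in enumerate(api_features):
            if api in line:
                counts[pb] += 1  # assumption: count one call per matching line
                break
    # Assumption: the frequency is the count divided by inv.
    return [c / inv for c in counts]
```

Normalizing by the number of invoke lines makes the vector comparable between small and large applications, which raw counts would not be.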
Step 6.5: Use the smali opcode statistics method to count the smali opcodes used by x^(i) and obtain the smali opcode vector of x^(i), as follows:
Step 6.5.1: Scan the smali file of x^(i) line by line. Let strc[qc] denote the string on line qc of the smali file and numc the total number of lines in the smali file.
Step 6.5.2: Let qc=1. Use the variable ops to denote the total number of smali opcodes in the smali file, and let ops=1.
Step 6.5.3: Let pc=1.
Step 6.5.4: If strc[qc] contains the substring L_c[pc], update the pc-th element of the smali opcode vector of x^(i) to count this occurrence, let ops=ops+1, and go to step 6.5.6. If strc[qc] does not contain the substring L_c[pc], go to step 6.5.5.
Step 6.5.5: Let pc=pc+1. If pc≤256, go to step 6.5.4; if pc>256, one full pass over L_c has been completed; go to step 6.5.6.
Step 6.5.6: Let qc=qc+1. If qc≤numc, go to step 6.5.3; if qc>numc, the smali file of x^(i) has been fully scanned; go to step 6.5.7.
Step 6.5.7: Let pc=1.
Step 6.5.8: Convert the pc-th element of the smali opcode vector of x^(i) into a frequency using ops.
Step 6.5.9: Let pc=pc+1. If pc≤256, go to step 6.5.8; if pc>256, the smali opcode vector of x^(i) is complete; go to step 6.6.
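The smali opcode statistics of step 6.5 differ from the API statistics in one respect: the total ops is incremented only when a line matches a listed opcode, not on every candidate line. A minimal sketch, under the same assumptions as before (each match adds 1, and the final value is the count divided by ops), with a hypothetical function name:

```python
from typing import List, Sequence


def smali_opcode_frequency_vector(smali_lines: Sequence[str],
                                  opcode_features: Sequence[str]) -> List[float]:
    """Relative frequency of each smali opcode in one sample's smali code."""
    counts = [0] * len(opcode_features)
    ops = 1  # step 6.5.2 starts the opcode counter at 1
    for line in smali_lines:
        for pc, opcode in enumerate(opcode_features):
            if opcode in line:
                counts[pc] += 1  # assumption: count one occurrence per matching line
                ops += 1         # the total grows only when a line matched
                break
    # Assumption: the frequency is the count divided by ops.
    return [c / ops for c in counts]
```

As with permissions, first-hit substring matching over an alphabetically sorted list means that a short mnemonic can absorb lines that carry a longer one sharing its prefix; tokenizing each line before comparison would be one way to avoid that.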
Step 6.6: Use the arm opcode statistics method to count the arm opcodes used by x^(i) and obtain the arm opcode vector of x^(i), as follows:
Step 6.6.1: Scan the arm file of x^(i) line by line. Let strd[qd] denote the string on line qd of the arm file and numd the total number of lines in the arm file.
Step 6.6.2: Let qd=1. Use the variable opa to denote the total number of arm opcodes used in the arm file, and let opa=1. If qd≤numd, go to step 6.6.3; if qd>numd, the arm file of x^(i) is empty; go to step 6.7.
Step 6.6.3: Let pd=1.
Step 6.6.4: If strd[qd] contains the character ">", then strd[qd] contains an arm instruction; let opa=opa+1 and go to step 6.6.5. If strd[qd] does not contain the character ">", go to step 6.6.7.
Step 6.6.5: If strd[qd] contains the substring L_d[pd], update the pd-th element of the arm opcode vector of x^(i) to count this occurrence and go to step 6.6.7. If strd[qd] does not contain the substring L_d[pd], go to step 6.6.6.
Step 6.6.6: Let pd=pd+1. If pd≤197, go to step 6.6.5; if pd>197, one full pass over L_d has been completed; go to step 6.6.7.
Step 6.6.7: Let qd=qd+1. If qd≤numd, go to step 6.6.3; if qd>numd, the arm file of x^(i) has been fully scanned; go to step 6.6.8.
Step 6.6.8: Let pd=1.
Step 6.6.9: Convert the pd-th element of the arm opcode vector of x^(i) into a frequency using opa.
Step 6.6.10: Let pd=pd+1. If pd≤197, go to step 6.6.9; if pd>197, the arm opcode vector of x^(i) is complete; go to step 6.7.
Step 6.7: Let i=i+1. If i≤N, go to step 6.2; if i>N, frequency fingerprints have been computed for all N samples in D. Send the frequency fingerprints to the detection module and go to Step 7.
Step 7: The detection module receives the frequency fingerprints from the frequency fingerprint generation module and trains a multi-kernel support vector machine model, which becomes a classifier suitable for classifying the software to be detected. Let k_perm, k_api, k_smali, and k_arm denote the kernel functions used for the permission vector, API vector, smali opcode vector, and arm opcode vector within the frequency fingerprint. Let β be the weight vector (β_perm, β_api, β_smali, β_arm), whose elements β_perm, β_api, β_smali, and β_arm are the weights of k_perm, k_api, k_smali, and k_arm respectively. Let T be the set {perm, api, smali, arm}. The multi-kernel support vector machine model Y is given by formula (4).
α^(i) is a Lagrange multiplier, and {α^(1), α^(2), ..., α^(i), ..., α^(N)} form the vector α. sgn(A) is the sign function of its argument A: when A>0, sgn(A)=1; when A=0, sgn(A)=0; when A<0, sgn(A)=-1. α and β are obtained by solving formula (5).
Formula (5) is subject to the constraints of formulas (6) to (9):
0≤α^(i)≤C (7)
∑_{t∈T} β_t = 1 (8)
β_t≥0, t∈T (9)
where C is the penalty coefficient, C≥0, typically set to C=100; it expresses how heavily misclassification is penalized.
b is a scalar that is obtained from formula (10) once α and β have been found.
The sample that appears in formula (10) is a support-vector sample point.
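Formulas (4), (5), (6), and (10) are not reproduced in this text. For orientation only, the block below writes out a standard multiple-kernel SVM formulation that is consistent with the symbols defined above and with constraints (7) to (9). It should not be read as the patent's exact wording. The notation x_t^{(i)} for the type-t sub-vector of the fingerprint of sample i, the use of labels y^{(i)} in {+1, -1}, and the choice of a support vector s with 0 < α^{(s)} < C are assumptions.

```latex
% Decision function, assumed form of formula (4); x_t is the type-t
% sub-vector of the fingerprint of the sample being classified.
F(x) = \operatorname{sgn}\Big( \sum_{i=1}^{N} \alpha^{(i)} y^{(i)}
       \sum_{t \in T} \beta_t \, k_t\big(x_t^{(i)}, x_t\big) + b \Big)

% Dual objective, assumed form of formula (5).
L(\alpha, \beta) = \sum_{i=1}^{N} \alpha^{(i)}
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
    \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)}
    \sum_{t \in T} \beta_t \, k_t\big(x_t^{(i)}, x_t^{(j)}\big)

% Equality constraint, assumed form of formula (6).
\sum_{i=1}^{N} \alpha^{(i)} y^{(i)} = 0

% Bias from a support vector s, assumed form of formula (10).
b = y^{(s)} - \sum_{i=1}^{N} \alpha^{(i)} y^{(i)}
    \sum_{t \in T} \beta_t \, k_t\big(x_t^{(i)}, x_t^{(s)}\big),
    \qquad 0 < \alpha^{(s)} < C
```

Under this reading, the weights β_t turn the four per-feature kernels into one combined kernel, so that for fixed β the problem reduces to an ordinary single-kernel SVM dual in α.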
The multi-kernel support vector machine model is trained as follows:
Step 7.1: Compute the kernel matrices from the frequency fingerprints of the samples in D received from the frequency fingerprint generation module. Let K_t be a kernel matrix, t∈T, standing for the four kernel matrices K_perm, K_api, K_smali, and K_arm. K_t has N rows and N columns, and its element in row i, column j is the value of the kernel function k_t for the type-t vectors of x^(i) and x^(j). A degree-3 polynomial kernel function is selected, and K_t is computed as follows:
Step 7.1.1: Let i=1.
Step 7.1.2: Let j=1.
Step 7.1.3: Compute the element of K_t in row i, column j with the degree-3 polynomial kernel (formula (11)),
where the inner product is taken between the type-t vectors of x^(i) and x^(j).
Step 7.1.4: Let j=j+1. If j≤N, go to step 7.1.3; if j>N, go to step 7.1.5.
Step 7.1.5: Let i=i+1. If i≤N, go to step 7.1.2; if i>N, K_t is complete; go to step 7.2.
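The kernel-matrix loop of step 7.1 can be sketched in plain Python. The source states only that a degree-3 polynomial kernel is used; the specific form (⟨u, v⟩ + c)^3 with c = 1 is an assumption, as are the function names. The sketch also fills only the upper triangle and mirrors it, which gives the same matrix because the kernel is symmetric.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def poly3_kernel(u: Vector, v: Vector, c: float = 1.0) -> float:
    """Degree-3 polynomial kernel (<u, v> + c)^3; the offset c is an assumption."""
    dot = sum(a * b for a, b in zip(u, v))
    return (dot + c) ** 3


def kernel_matrix(vectors: Sequence[Vector],
                  kernel: Callable[[Vector, Vector], float] = poly3_kernel
                  ) -> List[List[float]]:
    """N x N matrix K with K[i][j] = kernel(vectors[i], vectors[j])."""
    n = len(vectors)
    k = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            k[i][j] = k[j][i] = kernel(vectors[i], vectors[j])
    return k
```

Calling `kernel_matrix` once per feature type, on the permission, API, smali opcode, and arm opcode sub-vectors in turn, yields the four matrices K_perm, K_api, K_smali, and K_arm.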
Step 7.2: Optimize the parameters α and β, as follows:
Step 7.2.1: Initialize every element of the vector α to 0 and every element of the vector β to 1/4.
Step 7.2.2: Using formula (5), select a pair α^(r), α^(s) in ascending order of the superscripts r and s and optimize α, holding the remaining elements of α and the vector β fixed. The optimization proceeds as follows:
Step 7.2.2.1: Using the constraint of formula (6), turn formula (5) into a quadratic function g(α^(r)) of the single variable α^(r). Differentiate g(α^(r)), set the derivative equal to 0, and solve for α^(r).
Step 7.2.2.2: Use the constraint of formula (6) to obtain α^(s).
Step 7.2.2.3: Update α^(r) and α^(s) to obtain the optimized α, named α*.
Step 7.2.3: Holding α* fixed, optimize β as follows:
Step 7.2.3.1: Compute the partial derivative of formula (5) with respect to β, set it equal to 0, and find the solution that satisfies the constraints of formulas (8) and (9). This gives the optimized values of β_perm, β_api, β_smali, and β_arm.
Step 7.2.3.2: Concatenate these optimized values into the optimized β, named β*.
Step 7.2.4: Determine whether α and β satisfy the optimization termination conditions of formulas (12) to (14):
L(α*,β*) - L(α,β) ≤ ε (14)
When formula (14) is satisfied, the optimization of α and β changes the value of the function in formula (5) by no more than the threshold ε, with ε=0.01. The optimized α and β then meet the requirements and the multi-kernel support vector machine model is trained; go to step 7.3. Otherwise, go to step 7.2.2.
Step 7.3: Compute the value of b from formula (10). The training and optimization of the multi-kernel support vector machine model defined by formula (4) is then complete, and the model becomes the classifier.
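The control flow of step 7.2 is an alternating optimization: update α with β fixed, update β with α fixed, and stop when the objective stops improving. The skeleton below shows only that outer loop. The update rules themselves are passed in as functions because the formulas they depend on are not reproduced in this text, and only the termination test of formula (14) is checked. All names, and the `max_iter` safeguard, are hypothetical.

```python
from typing import Callable, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def alternate_optimize(objective: Callable[[A, B], float],
                       update_alpha: Callable[[A, B], A],
                       update_beta: Callable[[A, B], B],
                       alpha: A, beta: B,
                       eps: float = 0.01,
                       max_iter: int = 1000) -> Tuple[A, B]:
    """Alternate alpha and beta updates until the objective gain is at most eps."""
    for _ in range(max_iter):
        new_alpha = update_alpha(alpha, beta)     # step 7.2.2, beta held fixed
        new_beta = update_beta(new_alpha, beta)   # step 7.2.3, alpha* held fixed
        gain = objective(new_alpha, new_beta) - objective(alpha, beta)
        if gain <= eps:                           # formula (14)
            return new_alpha, new_beta
        alpha, beta = new_alpha, new_beta
    return alpha, beta
```

In a real implementation `update_alpha` would perform the pairwise update of steps 7.2.2.1 to 7.2.2.3 and `update_beta` would solve for the kernel weights on the simplex defined by formulas (8) and (9).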
Step 8: Use the Android malware detection system based on frequency fingerprint extraction to examine the software to be detected and determine whether it is malware, as follows:
Step 8.1: The sample preprocessing module preprocesses the software to be detected. Treat the software to be detected as a sample x^(a) and preprocess x^(a) with the sample preprocessing method described in step 3.3 to obtain the AndroidManifest.xml file, smali file, and arm instruction file of x^(a), which are output to the frequency fingerprint calculation module.
Step 8.2: Compute the frequency fingerprint of x^(a):
Step 8.2.1: Use the permission extraction method described in step 6.3 to extract the permissions requested by x^(a) and obtain the permission vector of x^(a).
Step 8.2.2: Use the API statistics method described in step 6.4 to count the APIs used by x^(a) and obtain the API vector of x^(a).
Step 8.2.3: Use the smali opcode statistics method described in step 6.5 to count the smali opcodes used by x^(a) and obtain the smali opcode vector of x^(a).
Step 8.2.4: Use the arm opcode statistics method described in step 6.6 to count the arm opcodes used by x^(a) and obtain the arm opcode vector of x^(a).
Step 8.2.5: Once the four vectors have been computed, concatenate them into the frequency fingerprint of x^(a).
Step 8.3: Input the frequency fingerprint of x^(a) into the detection module, which computes and outputs the value F from formula (4). F equals +1 or -1: +1 means that the software to be detected is malware, and -1 means that it is benign. This achieves the goal of determining whether the software to be detected is malware.
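The classification in step 8.3 can be sketched as a weighted sum of per-feature kernels over the training samples. Formula (4) is not reproduced in this text, so the code assumes the standard multiple-kernel decision function with training labels in {+1, -1} and a degree-3 polynomial kernel of the form (⟨u, v⟩ + 1)^3. The function names and the dictionary-based representation of a fingerprint are hypothetical.

```python
from typing import Dict, Sequence

Parts = Dict[str, Sequence[float]]  # feature type -> sub-vector of a fingerprint


def poly3_kernel(u: Sequence[float], v: Sequence[float], c: float = 1.0) -> float:
    """Degree-3 polynomial kernel (<u, v> + c)^3; the offset c is an assumption."""
    return (sum(a * b for a, b in zip(u, v)) + c) ** 3


def classify(query: Parts, train: Sequence[Parts], alpha: Sequence[float],
             labels: Sequence[int], beta: Dict[str, float], b: float) -> int:
    """Return +1 (malware), -1 (benign), or 0 when the score is exactly zero."""
    score = b
    for parts_i, a_i, y_i in zip(train, alpha, labels):
        if a_i == 0:
            continue  # only support vectors contribute to the sum
        combined = sum(beta[t] * poly3_kernel(parts_i[t], query[t]) for t in beta)
        score += a_i * y_i * combined
    return 1 if score > 0 else (-1 if score < 0 else 0)
```

A toy call with two feature types and two training samples:

```python
beta = {"perm": 0.5, "api": 0.5}
train = [{"perm": [1.0], "api": [0.0]}, {"perm": [0.0], "api": [1.0]}]
print(classify({"perm": [1.0], "api": [0.0]}, train, [1.0, 1.0], [1, -1], beta, 0.0))
```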
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010237052.6A CN111460452B (en) | 2020-03-30 | 2020-03-30 | An Android malware detection method based on frequency fingerprint extraction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111460452A (en) | 2020-07-28 |
CN111460452B (en) | 2022-09-09 |
Family
ID=71683415
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010237052.6A Active CN111460452B (en) | 2020-03-30 | 2020-03-30 | An Android malware detection method based on frequency fingerprint extraction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111460452B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN111460452B (en) | 2022-09-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||