CN111460452A - Android malicious software detection method based on frequency fingerprint extraction - Google Patents


Info

Publication number
CN111460452A
Authority
CN
China
Prior art keywords
api
equal
smali
arm
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010237052.6A
Other languages
Chinese (zh)
Other versions
CN111460452B (en
Inventor
吴庆
刘波
洪学恕
马行空
胡乃天
陆潼
刘鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202010237052.6A priority Critical patent/CN111460452B/en
Publication of CN111460452A publication Critical patent/CN111460452A/en
Application granted granted Critical
Publication of CN111460452B publication Critical patent/CN111460452B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines


Abstract

The invention discloses an Android malware detection method based on frequency fingerprint extraction, with the aim of detecting malware accurately. The technical solution is to build an Android malware detection system based on frequency fingerprint extraction, consisting of a sample preprocessing module, a frequency fingerprint generation module, and a detection module; to collect malicious and benign software as samples and construct a benchmark test set D; to decompress the samples in D to obtain the AndroidManifest.xml, classes.dex, and .so library files; to extract permission, API, smali opcode, and arm opcode features; and to record whether each feature appears and how often, forming four feature vectors of different types that are concatenated end to end into a frequency fingerprint. The detection module is trained and optimized on the frequency fingerprints of the samples in D to become a classifier, which then examines a sample under test and outputs whether it is malware. The invention effectively integrates information from every component of an Android application, and detection is both accurate and fast.

Figure 202010237052

Description

An Android malware detection method based on frequency fingerprint extraction

Technical field

The invention relates to the field of Android malware detection, and in particular to a method for detecting Android malware using an extracted frequency fingerprint.

Background

In recent years, with the continued development and spread of Internet and mobile communication technology, mobile terminals such as smartphones have brought great convenience to daily life and have become indispensable communication tools. Among the many mobile operating systems, Android is popular with users for its openness, rich third-party application ecosystem, friendly interface, and good user experience, and it holds a large share of the global smart mobile device market. The number of Android applications has grown rapidly as well: as of February 2020, Google Play hosted 2.86 million applications, and the number is still growing.

Besides Google Play, the official Android application market, there are many third-party application markets. These markets vary widely in quality, lack unified and effective management, and have weak release-review mechanisms, so bad actors can publish Android applications at will. Malicious applications therefore inevitably find their way into such markets and, once downloaded, pose a serious threat to users' information security. Worse, the volume of software in these markets is huge and growing fast; with immature security mechanisms and detection methods, malware persists in them for long periods and is hard to find and remove, which seriously threatens the healthy development of the Android ecosystem.

Typical Android malware detection techniques fall into two types: static detection and dynamic detection. Static detection mainly uses disassembly and decompilation, or control-flow and data-flow analysis of the smali intermediate code, to detect malicious code. Its advantage is high code coverage; its disadvantage is that it cannot handle code obfuscation, encryption, or dynamically loaded malicious code. Dynamic analysis monitors an application's variables at run time, traces its behavior paths, and collects the logs it produces. Its advantage is that it overcomes the obfuscation and encryption problems that static methods face; its disadvantages are low code coverage and the fact that some malicious programs detect an emulator and then crash or change their behavior. In practice, when large numbers of samples must be examined, most methods prefer static detection for its faster detection speed and higher code coverage.

M. Ganesh et al. extracted the permissions listed in the Android manifest as features for detecting malicious applications. They arranged the permissions into a 12×12 array and fed it to a convolutional neural network for training, which could then determine whether an application is malicious. M. Amin et al. extracted opcode sequences from bytecode files as features. They assembled the opcodes of an application into one long sequence, treated it as ordered text, and trained a BiLSTM neural network to assess maliciousness. R. Nix et al. studied detection based on Android API (Application Programming Interface) call sequences. They encoded each API call as a bit vector, then split and combined the vectors into an n×m matrix used as input to a convolutional neural network, and used the trained classifier to judge whether an application is malicious.

These methods have achieved some success in Android malware detection, but two main problems remain. First, feature extraction pays too little attention to correlating multiple kinds of features. Most existing methods extract a single type of feature to characterize an application's behavior rather than analyzing several types together, so the resulting feature representation is narrow and detection accuracy suffers. Second, the neural network models being trained are complex and require extensive parameter tuning, so training is inefficient and obtaining a well-trained model takes a great deal of time.

Therefore, given the large volume of Android malware, how to detect it accurately and efficiently is a problem that deserves close attention.

Summary of the invention

The technical problem addressed by the invention is to generate, for Android software, a frequency fingerprint that can uniquely identify the software, and to train and optimize a multi-kernel support vector machine model on that fingerprint, so that Android malware is detected accurately and detection speed is effectively improved.

The technical solution of the invention is as follows. Build an Android malware detection system based on frequency fingerprint extraction, consisting of a sample preprocessing module, a frequency fingerprint generation module, and a detection module. Collect malicious and benign Android software as samples and construct a benchmark test set. Decompress the samples in the set to obtain the AndroidManifest.xml, classes.dex, and .so library files; extract permission, API, smali opcode, and arm opcode features; and record whether each of these four kinds of features appears and how often. This yields four feature vectors of different types, which are concatenated end to end into one long vector that serves as the frequency fingerprint of the Android software. Using the frequency fingerprints of the many samples in the benchmark test set, the detection module (a multi-kernel support vector machine model) is trained and optimized into a classifier, which examines a sample under test and outputs whether it is malware.

The invention comprises the following steps:

Step 1: Build the Android malware detection system based on frequency fingerprint extraction. The system is installed on the server of the official Google market or of a third-party Android application market, and consists of a sample preprocessing module, a frequency fingerprint generation module, and a detection module.

The sample preprocessing module is connected to the frequency fingerprint generation module. It receives samples from the benchmark test set built by developers and samples to be checked submitted by ordinary users, preprocesses them to produce three types of files (AndroidManifest.xml, smali files, and arm instruction files), and outputs these files to the frequency fingerprint generation module.

The frequency fingerprint generation module is connected to the sample preprocessing module and the detection module. It receives the AndroidManifest.xml, smali, and arm instruction files from the sample preprocessing module, performs feature screening and frequency fingerprint calculation (a frequency fingerprint is a vector that can serve as the identity of an Android application), and outputs the resulting frequency fingerprint to the detection module. The frequency fingerprint generation module consists of a feature screening module and a frequency fingerprint calculation module. The feature screening module is connected to the sample preprocessing module and the frequency fingerprint calculation module; it receives the AndroidManifest.xml, smali, and arm instruction files from the sample preprocessing module, screens these three kinds of files to obtain the permission, API, smali opcode, and arm opcode features, and sends those features to the frequency fingerprint calculation module. The frequency fingerprint calculation module is connected to the sample preprocessing module, the feature screening module, and the detection module; it receives the permission, API, smali opcode, and arm opcode features from the feature screening module and the AndroidManifest.xml, smali, and arm instruction files from the sample preprocessing module, calculates the frequency fingerprint, and sends it to the detection module.

The detection module is connected to the frequency fingerprint generation module. It is a multi-kernel support vector machine model. It receives the frequency fingerprints of the benchmark test set D and of the software to be checked from the frequency fingerprint generation module, is trained and optimized on the frequency fingerprints of D to become a classifier suited to the software to be checked, and then classifies that software according to its frequency fingerprint to decide whether it is malware.

Step 2: Construct the benchmark test set D, as follows:

Step 2.1: Obtain N1 Android malware samples from the open-source Drebin, Genome, and AMD datasets as malicious samples, where N1 is a positive integer and N1 > 1000.

Step 2.2: Obtain benign software by crawling the Google Play and Apkpure application stores, then filter it with local antivirus software and the VirusTotal online scanning site to form N2 benign samples, where N2 is a positive integer and N2 > 1000.

Step 2.3: Label the malicious and benign samples to form the benchmark test set D. N is the total number of samples in D, N = N1 + N2. Let x(i) be the i-th sample in D and y(i) the label of x(i); y(i) = 1 means x(i) is a malicious sample and y(i) = -1 means x(i) is a benign sample, 1 ≤ i ≤ N.

Step 2.4: Store D in storage that both the sample preprocessing module and the frequency fingerprint generation module can read.

Step 3: The sample preprocessing module preprocesses the N samples in D to obtain N AndroidManifest.xml files, N smali files, and N arm instruction files.

Step 3.1: Let the variable i = 1.

Step 3.2: Take the i-th sample x(i) from D.

Step 3.3: Preprocess x(i) with the sample preprocessing method to obtain the AndroidManifest.xml file, smali file, and arm instruction file of x(i), as follows:

Step 3.3.1: Use a decompression tool (for example Gzip or 7zip) to decompress x(i) and extract the AndroidManifest.xml, classes.dex, and .so runtime library files it contains.
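Because an APK is an ordinary ZIP archive, step 3.3.1 can also be sketched with Python's standard `zipfile` module in place of an external tool. The function name `extract_apk_parts` and the returned dictionary layout are illustrative assumptions, not part of the patent; the sketch also assumes native libraries sit under `lib/`, which is the usual APK layout.

```python
import io
import zipfile


def extract_apk_parts(apk_bytes: bytes) -> dict:
    """Return the manifest, classes.dex and .so entries of an APK (a ZIP archive)."""
    parts = {"manifest": None, "dex": None, "so": {}}
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        for name in apk.namelist():
            if name == "AndroidManifest.xml":
                parts["manifest"] = apk.read(name)  # still in binary XML form
            elif name == "classes.dex":
                parts["dex"] = apk.read(name)
            elif name.startswith("lib/") and name.endswith(".so"):
                # Assumption: native libraries are stored under lib/<abi>/.
                parts["so"][name] = apk.read(name)
    return parts
```

A sample with no native code simply yields an empty `so` dictionary, which corresponds to the empty arm instruction file case handled in step 3.3.4.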

Step 3.3.2: Use AXMLPrinter2, a decoding tool dedicated to AndroidManifest.xml files (download: https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/android4me/AXMLPrinter2.jar, version 2.0 or later), to convert the AndroidManifest.xml file from binary form to text form.

Step 3.3.3: Use baksmali, a disassembler for the dex file format (https://bitbucket.org/JesusFreke/smali/downloads/baksmali-2.4.0.jar, version 2.4.0 or later), to convert classes.dex into smali files. If several smali files are produced, merge them into a single smali file and go to step 3.3.4; if only one smali file is produced, go directly to step 3.3.4.

Step 3.3.4: Use the arm disassembly toolchain gcc-arm-none-eabi (https://developer.arm.com/-/media/Files/downloads/gnu-rm/9-2019q4/gcc-arm-none-eabi-9-2019-q4-major-x86_64-linux.tar.bz2, version 9-2019-q4-major or later) to disassemble the .so runtime library files into arm instruction files in text form. If several arm instruction files are produced, merge them into a single arm instruction file and go to step 3.4; if no arm instruction file is produced, create an empty arm instruction file and go to step 3.4.
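Steps 3.3.3 and 3.3.4 both end by merging several text files into one, and step 3.3.4 falls back to an empty file when nothing was produced. A minimal sketch of that merge is shown here; the helper name `merge_text_files` and the choice to concatenate in sorted path order are assumptions, since the patent does not specify an order.

```python
from pathlib import Path


def merge_text_files(paths, merged_path):
    """Concatenate text files (in sorted path order) into merged_path.

    An empty `paths` list produces an empty file, matching the
    "no arm instruction file" case of step 3.3.4.
    """
    merged = Path(merged_path)
    with merged.open("w", encoding="utf-8") as out:
        for p in sorted(Path(p) for p in paths):
            text = p.read_text(encoding="utf-8", errors="replace")
            out.write(text)
            # Keep a line break between files so later line scans stay valid.
            if text and not text.endswith("\n"):
                out.write("\n")
    return merged
```

Keeping line boundaries intact matters because the later feature steps scan the merged file line by line.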

Step 3.4: Let i = i + 1. If i ≤ N, go to step 3.2. If i > N, the N samples have produced N AndroidManifest.xml files, N smali files, and N arm instruction files; send these files to the feature screening module and go to Step 4.

Step 4: The feature screening module screens the N AndroidManifest.xml files, N smali files, and N arm instruction files received from the sample preprocessing module for the N samples of D, and obtains the permission features, API features, smali opcode features, and arm opcode features suitable for classifying D.

Step 4.1: Select the 167 android.Manifest.permission system permissions defined in the Android developer documentation (https://developer.android.com/reference/android/Manifest.permission) and use them as features, called permission features.

Step 4.2: Select 256 APIs from those in the pscout list (https://security.csl.toronto.edu/pscout/?mdocs-file=67), as follows:

Step 4.2.1: Create a list L_api and add all 32437 APIs in the pscout list to it. The v-th API is written L_api[v], 1 ≤ v ≤ 32437.

Step 4.2.2: Create a two-dimensional array Z_api with 32437 rows and N columns. The element Z_api[v][i] in row v and column i is restricted to 1 or 0: 1 means the v-th API of L_api appears in the i-th sample of D, and 0 means it does not.

Step 4.2.3: Initialize all elements of Z_api to 0 and initialize the variable i = 1.

Step 4.2.4: Scan the smali file of the i-th sample of D line by line to find the APIs from L_api that appear in the sample, and assign the elements of column i of Z_api. Let str[u] be the string on line u of the smali file and U the total number of lines, 1 ≤ u ≤ U. The procedure is:

Step 4.2.4.1: Initialize u = 1.

Step 4.2.4.2: If str[u] is an API string, go to step 4.2.4.2.1; otherwise go to step 4.2.4.3.

Step 4.2.4.2.1: Initialize the variable v = 1.

Step 4.2.4.2.2: If str[u] contains L_api[v] as a substring, set Z_api[v][i] = 1 and go to step 4.2.4.3; otherwise go to step 4.2.4.2.3.

Step 4.2.4.2.3: Let v = v + 1. If v ≤ 32437, go to step 4.2.4.2.2; if v > 32437, go to step 4.2.4.3.

Step 4.2.4.3: Let u = u + 1. If u ≤ U, go to step 4.2.4.2; if u > U, the smali file of the i-th sample has been fully scanned, so go to step 4.2.5.

Step 4.2.5: Let i = i + 1. If i ≤ N, go to step 4.2.4; if i > N, the assignment of Z_api is complete, so go to step 4.2.6.
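The nested loops of steps 4.2.2 to 4.2.5 can be sketched compactly as follows. The text does not define what makes a line an "API string", so the sketch assumes it is a smali `invoke-` instruction; that test, the function name `api_presence_matrix`, and the 0-based indices are all illustrative assumptions.

```python
def api_presence_matrix(api_list, smali_texts):
    """Build Z where Z[v][i] = 1 if api_list[v] occurs in sample i, else 0.

    api_list    -- candidate API name strings (L_api in the text)
    smali_texts -- one merged smali text per sample of D
    """
    z = [[0] * len(smali_texts) for _ in api_list]
    for i, text in enumerate(smali_texts):
        for line in text.splitlines():
            # Assumption: an "API string" is a smali invoke instruction.
            if not line.lstrip().startswith("invoke-"):
                continue
            for v, api in enumerate(api_list):
                if api in line:
                    z[v][i] = 1
                    break  # step 4.2.4.2.2: after the first match, move to the next line
    return z
```

With 32437 candidate APIs, a linear substring scan per line is slow; extracting the callee name from each invoke line and looking it up in a dictionary would be a natural optimization, at the cost of departing from the literal procedure.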

Step 4.2.6: Compute the information gain IG of each API in the list L_api with respect to the benchmark test set D. The information gain of the v-th API with respect to D is written IG(D|L_api[v]).

Step 4.2.6.1: Let v = 1.

Step 4.2.6.2: Let i = 1. Set the first variable M11 = 0, the second variable M12 = 0, the third variable M21 = 0, and the fourth variable M22 = 0.

Step 4.2.6.3: If Z_api[v][i] = 1 and y(i) = 1, let M11 = M11 + 1; if Z_api[v][i] = 1 and y(i) = -1, let M12 = M12 + 1; if Z_api[v][i] = 0 and y(i) = 1, let M21 = M21 + 1; if Z_api[v][i] = 0 and y(i) = -1, let M22 = M22 + 1. (The benign label is written here as -1 to match the definition in step 2.3; the source text of this step writes it as 0.)

Step 4.2.6.4: Let i = i + 1. If i ≤ N, go to step 4.2.6.3; if i > N, go to step 4.2.6.5.

Step 4.2.6.5: Compute IG(D|L_api[v]) as:

IG(D|L_api[v]) = H(D) - H(D|L_api[v])    (1)

where H(D) is the empirical entropy of the benchmark test set D, computed as:

[Formula image BDA0002431352400000061 in the original publication]

H(D|L_api[v]) is the empirical conditional entropy of D given the v-th API of the list L_api, computed as:

[Formula image BDA0002431352400000062 in the original publication]
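The two formulas for H(D) and H(D|L_api[v]) survive here only as image references. For orientation, the standard textbook definitions of empirical entropy and empirical conditional entropy, written with the counters M11, M12, M21, M22 from step 4.2.6.3, are given below. This is an assumption about what the images contain, not a transcription of them, and the logarithm base (2 here) is likewise assumed.

```latex
% Assumed standard definitions; 0 \log_2 0 is taken to be 0.
\begin{aligned}
H(D) &= -\frac{M_{11}+M_{21}}{N}\log_2\frac{M_{11}+M_{21}}{N}
        -\frac{M_{12}+M_{22}}{N}\log_2\frac{M_{12}+M_{22}}{N},\\[4pt]
H\bigl(D \mid L_{api}[v]\bigr) &=
   \frac{M_{11}+M_{12}}{N}\,H(D_1) + \frac{M_{21}+M_{22}}{N}\,H(D_0),\\[4pt]
H(D_1) &= -\frac{M_{11}}{M_{11}+M_{12}}\log_2\frac{M_{11}}{M_{11}+M_{12}}
          -\frac{M_{12}}{M_{11}+M_{12}}\log_2\frac{M_{12}}{M_{11}+M_{12}},\\[4pt]
H(D_0) &= -\frac{M_{21}}{M_{21}+M_{22}}\log_2\frac{M_{21}}{M_{21}+M_{22}}
          -\frac{M_{22}}{M_{21}+M_{22}}\log_2\frac{M_{22}}{M_{21}+M_{22}}.
\end{aligned}
```

Here D_1 is the subset of samples in which the API appears, D_0 the subset in which it does not, and N = M11 + M12 + M21 + M22.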

Step 4.2.6.6: Let v = v + 1. If v ≤ 32437, go to step 4.2.6.2. If v > 32437, the information gain of every API in L_api with respect to D has been computed; sort the APIs in L_api by IG(D|L_api[v]) in descending order, take the first 256 APIs after sorting as the API features, and go to step 4.3.
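The ranking in step 4.2.6 can be sketched as below, using the standard definition of information gain over the four counters M11, M12, M21, M22. The function names are illustrative, and the base-2 logarithm is an assumption because the entropy formulas are not legible in this copy of the text.

```python
import math


def entropy(counts):
    """Empirical entropy (base 2, assumed) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(m11, m12, m21, m22):
    """IG = H(D) - H(D|A) from the 2x2 counts of step 4.2.6.3."""
    n = m11 + m12 + m21 + m22
    if n == 0:
        return 0.0
    h_d = entropy([m11 + m21, m12 + m22])            # malicious vs benign
    h_cond = ((m11 + m12) / n) * entropy([m11, m12]) \
        + ((m21 + m22) / n) * entropy([m21, m22])    # API present vs absent
    return h_d - h_cond


def top_k_apis(api_list, z, labels, k=256):
    """Rank APIs by information gain and keep the k best (labels are 1 or -1)."""
    scored = []
    for v, api in enumerate(api_list):
        m11 = m12 = m21 = m22 = 0
        for present, y in zip(z[v], labels):
            if present and y == 1:
                m11 += 1
            elif present:
                m12 += 1
            elif y == 1:
                m21 += 1
            else:
                m22 += 1
        scored.append((information_gain(m11, m12, m21, m22), api))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [api for _, api in scored[:k]]
```

An API that appears in every malicious sample and in no benign one reaches the maximum gain H(D), while an API that is equally common in both classes scores zero and falls to the bottom of the ranking.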

Step 4.3: The Android Dalvik virtual machine predefines smali opcodes that are 8 binary bits long (https://developer.android.com/reference/dalvik/bytecode/Opcodes.html). Including reserved, undefined values, there are at most 256 of them. These 256 smali opcodes are used as features, called smali opcode features.

Step 4.4: Following the arm instruction quick reference card (http://infocenter.arm.com/help/topic/com.arm.doc.qrc0001mc/QRC0001_UAL.pdf), the feature screening module selects the 197 arm instruction opcodes listed there as features, called arm opcode features.

Step 4.5: Send the permission features, API features, smali opcode features, and arm opcode features to the frequency fingerprint calculation module.

Step 5: Determine the frequency fingerprint format.

Arrange the 167 permission features, 256 API features, 256 smali opcode features, and 197 arm opcode features each in alphabetical order to form four vectors, called the permission vector, API vector, smali opcode vector, and arm opcode vector of an Android application.

The permission vector of an Android application consists of 167 integers, each either 1 or 0. A value of 1 at position pa means the pa-th of the 167 selected permissions is requested by the application; a value of 0 means it is not. pa is an integer, 1 ≤ pa ≤ 167.

The API vector of an Android application consists of 256 decimals. The decimal at position pb gives the frequency with which the pb-th of the 256 selected APIs appears in the application. pb is an integer, 1 ≤ pb ≤ 256.

The smali opcode vector of an Android application consists of 256 decimals. The decimal at position pc gives the frequency with which the pc-th of the 256 smali opcodes appears in the application. pc is an integer, 1 ≤ pc ≤ 256.

The arm opcode vector of an Android application consists of 197 decimals. The decimal at position pd gives the frequency with which the pd-th of the 197 selected arm opcodes appears in the application. pd is an integer, 1 ≤ pd ≤ 197.

The four vectors are concatenated end to end to form a vector of length 876, which serves as the identity of the sample and is called the frequency fingerprint. The 167 integers and 709 decimals it contains are all called elements of the frequency fingerprint.
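The fingerprint layout described in this step (167 + 256 + 256 + 197 = 876 elements) can be sketched as a simple concatenation with length checks. The names `SIZES` and `build_fingerprint` are illustrative assumptions, and the sketch takes the four vectors as already computed, since the frequency calculation itself is described in later steps.

```python
# Segment sizes from Step 5, in concatenation order.
SIZES = {"permission": 167, "api": 256, "smali": 256, "arm": 197}


def build_fingerprint(permission_vec, api_vec, smali_vec, arm_vec):
    """Concatenate the four feature vectors into one 876-element fingerprint."""
    parts = {
        "permission": permission_vec,
        "api": api_vec,
        "smali": smali_vec,
        "arm": arm_vec,
    }
    for name, vec in parts.items():
        if len(vec) != SIZES[name]:
            raise ValueError(f"{name} vector must have {SIZES[name]} elements")
    if any(p not in (0, 1) for p in permission_vec):
        raise ValueError("permission vector elements must be 0 or 1")
    # Permissions stay as 0/1 integers; the other three segments hold frequencies.
    return (list(permission_vec)
            + [float(x) for x in api_vec]
            + [float(x) for x in smali_vec]
            + [float(x) for x in arm_vec])
```

Because every sample yields a vector with the same fixed layout, the fingerprints of all samples in D can be stacked directly into a training matrix for the detection module.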

Step 6: The frequency fingerprint calculation module receives the permission features, API features, smali opcode features, and arm opcode features from the feature screening module, and the AndroidManifest.xml, smali, and arm instruction files from the sample preprocessing module, and computes frequency fingerprints for the N samples in the benchmark test set D.

Step 6.1: Let L_a be the permission list, whose member L_a[pa] is the name string of the pa-th of the 167 permissions in alphabetical order. Let L_b be the API list, whose member L_b[pb] is the name string of the pb-th of the 256 APIs in alphabetical order. Let L_c be the smali opcode list, whose member L_c[pc] is the name string of the pc-th of the 256 smali opcodes in alphabetical order. Let L_d be the arm opcode list, whose member L_d[pd] is the name string of the pd-th of the 197 arm opcodes in alphabetical order. Let the variable i = 1.

Step 6.2, take the i-th sample x(i) from D and generate a frequency fingerprint for x(i) containing 876 elements, each initialized to 0. Within this fingerprint, the permission vector and its pa-th element, the API vector and its pb-th element, the smali opcode vector and its pc-th element, and the arm opcode vector and its pd-th element are each given their own symbol. [The symbols for the fingerprint, its four vectors and their elements are formula images in the original.]

Step 6.3, use the permission extraction method to extract the permissions requested by x(i) and obtain the permission vector of x(i). The method is:

Step 6.3.1, scan the AndroidManifest.xml file of x(i) line by line. Denote the string on line qa of the file by stra[qa] and the total number of lines in the file by numa.

Step 6.3.2, let qa=1.

Step 6.3.3, if stra[qa] contains the substring “uses-permission”, let pa=1 and go to step 6.3.4; otherwise go to step 6.3.6.

Step 6.3.4, if stra[qa] contains the substring La[pa], x(i) has requested the permission La[pa]; mark the pa-th element of the permission vector accordingly [the assignment is a formula image in the original] and go to step 6.3.6. If stra[qa] does not contain the substring La[pa], go to step 6.3.5.

Step 6.3.5, let pa=pa+1. If pa≤167, go to step 6.3.4; if pa>167, La has been checked once in full, go to step 6.3.6.

Step 6.3.6, let qa=qa+1. If qa≤numa, go to step 6.3.3; if qa>numa, the AndroidManifest.xml file of x(i) has been fully scanned and the permission vector of x(i) is complete, go to step 6.4.
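The scan in steps 6.3.1 to 6.3.6 can be sketched in a few lines. This is an illustration under stated assumptions: the function name `permission_vector` is hypothetical, and because the assignment in step 6.3.4 is an image in the original, setting the element to 1 is an assumption based on the permission vector consisting of integers initialized to 0.

```python
def permission_vector(manifest_lines, permission_names):
    """Return one integer per permission name: 1 if the manifest requests it, else 0."""
    vector = [0] * len(permission_names)
    for line in manifest_lines:
        if "uses-permission" not in line:
            continue  # step 6.3.3: only uses-permission lines are inspected
        for pa, name in enumerate(permission_names):
            if name in line:
                vector[pa] = 1  # assumption: the elided assignment sets the element to 1
                break           # step 6.3.4: the first match ends the work on this line
    return vector
```

With the 167 names of list La passed in alphabetical order, the result has the layout of the permission vector described in the text.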

Step 6.4, use the API statistics method to count the APIs used by x(i) and obtain the API vector of x(i). The method is:

Step 6.4.1, scan the smali file of x(i) line by line. Denote the string on line qb of the smali file by strb[qb] and the total number of lines in the file by numb.

Step 6.4.2, let qb=1. Use the variable inv for the total number of APIs in the smali file, and let inv=1.

Step 6.4.3, let the variable pb=1.

Step 6.4.4, if strb[qb] contains the substring “invoke”, let inv=inv+1 and go to step 6.4.5; otherwise go to step 6.4.7.

Step 6.4.5, if strb[qb] contains the substring Lb[pb], x(i) calls the API named Lb[pb]; update the pb-th element of the API vector [the assignment is a formula image in the original] and go to step 6.4.7. If strb[qb] does not contain the substring Lb[pb], go to step 6.4.6.

Step 6.4.6, let pb=pb+1. If pb≤256, go to step 6.4.5; if pb>256, Lb has been checked once in full, go to step 6.4.7.

Step 6.4.7, let qb=qb+1. If qb≤numb, go to step 6.4.3; if qb>numb, the smali file of x(i) has been fully scanned, go to step 6.4.8.

Step 6.4.8, let pb=1.

Step 6.4.9, assign the pb-th element of the API vector its final value [the assignment is a formula image in the original].

Step 6.4.10, let pb=pb+1. If pb≤256, go to step 6.4.9; if pb>256, the API vector of x(i) is complete, go to step 6.5.
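A compact sketch of the API statistics method follows. The function name `api_frequency_vector` is hypothetical. The two assignments in steps 6.4.5 and 6.4.9 are images in the original, so the sketch assumes the natural reading: a match increments a counter, and the final pass divides each counter by inv to obtain a frequency.

```python
def api_frequency_vector(smali_lines, api_names):
    """Return, for each API name, its share of the invoke lines in the smali text."""
    counts = [0] * len(api_names)
    inv = 1  # step 6.4.2 starts the counter at 1, which also avoids division by zero
    for line in smali_lines:
        if "invoke" not in line:
            continue  # step 6.4.4: only invoke lines count as API uses
        inv += 1
        for pb, name in enumerate(api_names):
            if name in line:
                counts[pb] += 1  # assumption: the elided step 6.4.5 increments a count
                break            # the first matching API ends the work on this line
    # assumption: the elided step 6.4.9 normalizes each count by inv
    return [count / inv for count in counts]
```

Because the inner loop stops at the first match, a line is attributed to at most one API, and the whole scan is linear in the number of lines for a fixed list of 256 names.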

Step 6.5, use the smali opcode statistics method to count the smali opcodes used by x(i) and obtain the smali opcode vector of x(i). The method is:

Step 6.5.1, scan the smali file of x(i) line by line. Denote the string on line qc of the smali file by strc[qc] and the total number of lines in the file by numc.

Step 6.5.2, let qc=1. Use the variable ops for the total number of smali opcodes in the smali file, and let ops=1.

Step 6.5.3, let pc=1.

Step 6.5.4, if strc[qc] contains the substring Lc[pc], update the pc-th element of the smali opcode vector [the assignment is a formula image in the original], let ops=ops+1, and go to step 6.5.6. If strc[qc] does not contain the substring Lc[pc], go to step 6.5.5.

Step 6.5.5, let pc=pc+1. If pc≤256, go to step 6.5.4; if pc>256, Lc has been checked once in full, go to step 6.5.6.

Step 6.5.6, let qc=qc+1. If qc≤numc, go to step 6.5.3; if qc>numc, the smali file of x(i) has been fully scanned, go to step 6.5.7.

Step 6.5.7, let pc=1.

Step 6.5.8, assign the pc-th element of the smali opcode vector its final value [the assignment is a formula image in the original].

Step 6.5.9, let pc=pc+1. If pc≤256, go to step 6.5.8; if pc>256, the smali opcode vector of x(i) is complete, go to step 6.6.
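The smali opcode method differs from the API method in one respect that is easy to miss: there is no line filter, and the total ops grows only when a listed opcode is found. The sketch below makes that explicit. The function name is hypothetical, and the increment and the final division by ops are assumptions, since both assignments are images in the original.

```python
def smali_opcode_frequency_vector(smali_lines, opcode_names):
    """Return, for each listed opcode, its share of the matched opcode lines."""
    counts = [0] * len(opcode_names)
    ops = 1  # step 6.5.2 starts the total at 1
    for line in smali_lines:
        for pc, name in enumerate(opcode_names):
            if name in line:
                counts[pc] += 1  # assumption: the elided assignment increments a count
                ops += 1         # step 6.5.4: the total grows only on a match
                break
    # assumption: the elided step 6.5.8 normalizes each count by ops
    return [count / ops for count in counts]
```

Matching is by substring, as in the text, so the order of the opcode list decides which name wins when one opcode name is a prefix of another.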

Step 6.6, use the arm opcode statistics method to count the arm opcodes used by x(i) and obtain the arm opcode vector of x(i). The method is:

Step 6.6.1, scan the arm instruction file of x(i) line by line. Denote the string on line qd of the file by strd[qd] and the total number of lines in the file by numd.

Step 6.6.2, let qd=1. Use the variable opa for the total number of arm opcodes used in the arm instruction file, and let opa=1. If qd≤numd, go to step 6.6.3; if qd>numd, the arm instruction file of x(i) is empty, go to step 6.7.

Step 6.6.3, let pd=1.

Step 6.6.4, if strd[qd] contains the character “>”, strd[qd] contains an arm instruction; let opa=opa+1 and go to step 6.6.5. Otherwise go to step 6.6.7.

Step 6.6.5, if strd[qd] contains the substring Ld[pd], update the pd-th element of the arm opcode vector [the assignment is a formula image in the original] and go to step 6.6.7. If strd[qd] does not contain the substring Ld[pd], go to step 6.6.6.

Step 6.6.6, let pd=pd+1. If pd≤197, go to step 6.6.5; if pd>197, Ld has been checked once in full, go to step 6.6.7.

Step 6.6.7, let qd=qd+1. If qd≤numd, go to step 6.6.3; if qd>numd, the arm instruction file of x(i) has been fully scanned, go to step 6.6.8.

Step 6.6.8, let pd=1.

Step 6.6.9, assign the pd-th element of the arm opcode vector its final value [the assignment is a formula image in the original].

Step 6.6.10, let pd=pd+1. If pd≤197, go to step 6.6.9; if pd>197, the arm opcode vector of x(i) is complete, go to step 6.7.

Step 6.7, let i=i+1. If i≤N, go to step 6.2; if i>N, frequency fingerprints have been generated for all N samples in D; send the frequency fingerprints to the detection module and go to Step 7.

Step 7, the detection module receives the frequency fingerprints from the frequency fingerprint generation module and trains a multi-kernel support vector machine model, which becomes a classifier suitable for classifying the software to be detected. A multi-kernel support vector machine is a classification model based on the support vector machine that uses several kernel functions to map feature-space vectors from a low-dimensional space to a high-dimensional one, which enhances its classification ability. For the benchmark test set D, the feature space is the set of frequency fingerprints of the N samples in D. Let kperm, kapi, ksmali and karm denote the kernel functions applied to the permission vector, API vector, smali opcode vector and arm opcode vector of a frequency fingerprint, respectively. Let β be the weight vector (βperm, βapi, βsmali, βarm), whose elements are the weights of kperm, kapi, ksmali and karm, respectively. Let T be the set {perm, api, smali, arm}, whose members are the subscripts of the four kernel functions and are introduced to express formula (4). The multi-kernel support vector machine model Y can then be expressed as:

[Formula (4): formula image in the original]

α(i) is a Lagrange multiplier, and {α(1), α(2), ..., α(i), ..., α(N)} form the vector α. sgn(A) is the sign function of its argument A: sgn(A)=1 when A>0, sgn(A)=0 when A=0, and sgn(A)=-1 when A<0. α and β are obtained by solving formula (5):

[Formula (5): formula image in the original]

The constraints of formula (5) are formulas (6) to (9):

[Formula (6): formula image in the original]

0≤α(i)≤C (7)

∑_{t∈T} βt = 1 (8)

βt≥0, t∈T (9)

where C is the penalty coefficient, C≥0, which expresses how heavily misclassification is penalized.

b is a scalar. Once α and β have been found, b is given by the following formula:

[Formula (10): formula image in the original]

where the sample appearing in formula (10) [symbol given as a formula image in the original] is a support vector sample point.
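To make the roles of α, β, b and the four kernels concrete, the sketch below evaluates a decision function of the kind described. Formula (4) itself is an image in the original, so this assumes the standard multiple-kernel SVM form, sgn(Σ_i α(i) y(i) Σ_{t∈T} βt kt(·,·) + b); the function names and the dictionary layout of a sample are hypothetical.

```python
def sign(a):
    """sgn as defined in the text: 1, 0 or -1."""
    return (a > 0) - (a < 0)


def decide(x, train, labels, alpha, beta, kernels, b):
    """Classify x with a weighted sum of per-group kernels.

    x and every element of train map a group name in T to that group's vector;
    beta and kernels map the same group names to a weight and a kernel function.
    """
    total = b
    for xi, yi, ai in zip(train, labels, alpha):
        # Weighted combination of the four group kernels for this training sample.
        combined = sum(beta[t] * kernels[t](xi[t], x[t]) for t in beta)
        total += ai * yi * combined
    return sign(total)
```

Only training samples with a nonzero α(i) contribute to the sum, which is why the trained model depends on its support vectors alone.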

The multi-kernel support vector machine model is trained as follows:

Step 7.1, compute the kernel matrices from the frequency fingerprints of the samples in D received from the frequency fingerprint generation module. Let Kt, t∈T, be a kernel matrix, so that there are four kernel matrices Kperm, Kapi, Ksmali and Karm. Kt has N rows and N columns; its element in row i and column j is given a symbol that is a formula image in the original. A polynomial kernel function of degree 3 is selected, and Kt is computed as follows:

Step 7.1.1, let i=1.

Step 7.1.2, let j=1.

Step 7.1.3, compute the element of Kt in row i and column j:

[Kernel formula: formula image in the original]

where the bracketed term in the formula denotes the inner product of its two vector arguments [the term and its arguments are formula images in the original].

Step 7.1.4, if j≤N, let j=j+1 and go to step 7.1.3; if j>N, go to step 7.1.5.

Step 7.1.5, if i≤N, let i=i+1 and go to step 7.1.2; if i>N, Kt is complete, go to step 7.2.
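The kernel matrix construction can be sketched as follows for one feature group t. The text fixes only that the kernel is a polynomial of degree 3 built on an inner product; the exact formula is an image in the original, so the form (u·v + offset)³ and the default offset of 1 are assumptions, and the function names are hypothetical.

```python
def poly3_kernel(u, v, offset=1.0):
    """Degree-3 polynomial kernel on the inner product (offset value is an assumption)."""
    inner = sum(p * q for p, q in zip(u, v))
    return (inner + offset) ** 3


def kernel_matrix(vectors):
    """Build the N x N matrix K with K[i][j] = poly3_kernel(vectors[i], vectors[j])."""
    n = len(vectors)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # The kernel is symmetric, so each pair is computed once and mirrored.
            K[i][j] = K[j][i] = poly3_kernel(vectors[i], vectors[j])
    return K
```

Calling `kernel_matrix` once per group, on the permission, API, smali opcode and arm opcode vectors of all N samples, yields the four matrices. Exploiting symmetry halves the work compared with the full double loop of steps 7.1.1 to 7.1.5 and gives the same matrix.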

Step 7.2, optimize the parameters α and β as follows:

Step 7.2.1, initialize every element of the vector α to 0 and every element of the vector β to 1/4.

Step 7.2.2, using formula (5), proceed in ascending order of the superscripts r and s: hold (α(1), α(2), ..., α(r-1), α(r+1), ..., α(s-1), α(s+1), ..., α(N)) and the vector β fixed, and select a pair α(r), α(s) with which to optimize α. The optimization method is:

Step 7.2.2.1, use the constraint of formula (6) to turn formula (5) into a quadratic function g(α(r)) of the single variable α(r); differentiate g(α(r)), set the derivative equal to 0, and solve for α(r).

Step 7.2.2.2, use the constraint of formula (6) to obtain α(s).

Step 7.2.2.3, update α(r) and α(s) to obtain the optimized α, named α*.

Step 7.2.3, holding α* fixed, optimize β as follows:

Step 7.2.3.1, compute the partial derivatives of formula (5) with respect to β, set them equal to 0, and find the solution that satisfies the constraints of formulas (8) and (9). This gives the optimized values of βperm, βapi, βsmali and βarm [their symbols are formula images in the original].

Step 7.2.3.2, concatenate these optimized values into the optimized β, named β*.

Step 7.2.4, check whether α and β satisfy the optimization termination conditions of formulas (12) to (14):

[Formula (12): formula image in the original]

[Formula (13): formula image in the original]

L(α*,β*)-L(α,β)≤ε (14)

When formula (14) is satisfied, optimizing α and β changes the value of the function in formula (5) by no more than the threshold ε, 0<ε≤0.1, which indicates that the optimized α and β meet the requirements and that training of the multi-kernel support vector machine model is complete; go to step 7.3. Otherwise, go to step 7.2.2.

Step 7.3, compute the value of b from formula (10). The multi-kernel support vector machine model defined by formula (4) is now trained and optimized and becomes the classifier.

Step 8, use the Android malware detection system based on frequency fingerprint extraction to examine software to be detected that the official Google or a third-party Android application market server has received from users, and determine whether it is malware. The method is:

Step 8.1, the sample preprocessing module preprocesses the software to be detected. Treat the software to be detected as sample x(a), apply the sample preprocessing method of step 3.3 to x(a) to obtain its AndroidManifest.xml file, smali file and arm instruction file, and output them to the frequency fingerprint calculation module.

Step 8.2, the frequency fingerprint calculation module computes the frequency fingerprint of x(a) [the symbols for the fingerprint and its four vectors are formula images in the original]. The method is:

Step 8.2.1, use the permission extraction method of step 6.3 to extract the permissions requested by x(a) and obtain the permission vector of x(a).

Step 8.2.2, use the API statistics method of step 6.4 to count the APIs used by x(a) and obtain the API vector of x(a).

Step 8.2.3, use the smali opcode statistics method of step 6.5 to count the smali opcodes used by x(a) and obtain the smali opcode vector of x(a).

Step 8.2.4, use the arm opcode statistics method of step 6.6 to count the arm opcodes used by x(a) and obtain the arm opcode vector of x(a).

Step 8.2.5, once the four vectors have been computed, concatenate them into the frequency fingerprint of x(a).

Step 8.3, input the frequency fingerprint of x(a) to the detection module (by now the optimized classifier suitable for detection), which computes the output value F by formula (4). F equals +1 or -1: +1 means the software to be detected is malware, and -1 means it is benign. This achieves the purpose of determining whether the software to be detected is malware.

Compared with other techniques, the present invention has the following advantages:

First, high accuracy. The invention combines permission, API, smali opcode and arm opcode characteristics to generate a frequency fingerprint that accurately expresses the attributes of an Android application and is suitable as its identity. The multi-kernel support vector machine trained on frequency fingerprints serves as a classifier that effectively integrates information from each component of the Android application and achieves accurate detection results.

Second, high efficiency, in two respects. Frequency fingerprint generation is efficient: the invention scans the AndroidManifest.xml, smali and arm instruction files and counts the frequencies of permissions, APIs, smali opcodes and arm opcodes, which can be completed in linear time. Training of the classification model is also efficient: compared with the large number of parameters in a neural network model, the multi-kernel support vector machine model has few parameters, so optimizing them requires less computation and training efficiency is significantly improved.

Description of drawings

Figure 1 is a structural diagram of the Android malware detection system based on frequency fingerprint extraction.

Figure 2 is the overall flow chart of the present invention.

Detailed description

The present invention is described in detail below with reference to the accompanying drawings.

The technical solution of the present invention, shown in Figure 2, includes the following steps:

Step 1, build the Android malware detection system based on frequency fingerprint extraction. The system is installed on the official Google or a third-party Android application market server. Its overall structure, shown in Figure 1, consists of a sample preprocessing module, a frequency fingerprint generation module and a detection module.

The sample preprocessing module is connected to the frequency fingerprint generation module. It receives samples from the benchmark test set built by developers and samples to be detected submitted by ordinary users, preprocesses them to produce three types of file (AndroidManifest.xml, smali files and arm instruction files), and outputs these to the frequency fingerprint generation module.

The frequency fingerprint generation module is connected to the sample preprocessing module and the detection module. It receives the AndroidManifest.xml, smali and arm instruction files from the sample preprocessing module, performs feature screening and frequency fingerprint calculation, and outputs the resulting frequency fingerprints to the detection module. It consists of a feature screening module and a frequency fingerprint calculation module. The feature screening module is connected to the sample preprocessing module and the frequency fingerprint calculation module; it receives the AndroidManifest.xml, smali and arm instruction files from the sample preprocessing module, performs feature screening on these three types of file to obtain the permission, API, smali opcode and arm opcode features, and sends these features to the frequency fingerprint calculation module. The frequency fingerprint calculation module is connected to the sample preprocessing module, the feature screening module and the detection module; it receives the permission, API, smali opcode and arm opcode features from the feature screening module and the AndroidManifest.xml, smali and arm instruction files from the sample preprocessing module, computes the frequency fingerprints, and sends them to the detection module.

The detection module is connected to the frequency fingerprint generation module. It is a multi-kernel support vector machine model that receives from the frequency fingerprint generation module the frequency fingerprints of the benchmark test set D and of the software to be detected. It is trained and optimized on the frequency fingerprints of D to become a classifier suitable for examining the software to be detected, and then classifies the software to be detected according to its frequency fingerprint, producing a verdict on whether that software is malware.

In Figure 1, the solid arrows from the sample preprocessing module to the frequency fingerprint generation module and the detection module show the flow by which the system processes the samples in the benchmark test set D. The dashed arrows between the same modules show the flow for a sample to be detected (as Step 8 shows, software to be detected does not need feature screening by the feature screening module).

Step 2, construct the benchmark test set D as follows:

Step 2.1, obtain N1 Android malware applications from the open-source Drebin, Genome and AMD datasets as malicious samples, where N1 is a positive integer and N1=2000.

Step 2.2, obtain benign software by crawling the Google Play and Apkpure app stores, and filter it with local antivirus software and the VirusTotal online scanning website to form N2 benign samples, where N2 is a positive integer and N2=2000.

Step 2.3, label the malicious and benign samples to form the benchmark test set D. N is the total number of samples in D, N=N1+N2. Define x(i) as the i-th sample in D and y(i) as the label of x(i): y(i)=1 indicates that x(i) is a malicious sample and y(i)=-1 indicates that x(i) is a benign sample, 1≤i≤N.

Step 2.4, store D in storage that both the sample preprocessing module and the frequency fingerprint generation module can read (for example, the storage of the official Google or third-party Android application market server on which the Android malware detection system based on frequency fingerprint extraction is installed).

Step 3, the sample preprocessing module preprocesses the N samples in D to obtain N AndroidManifest.xml files, N smali files and N arm instruction files.

Step 3.1, let the variable i=1.

Step 3.2, take the i-th sample x(i) from D.

Step 3.3, use the sample preprocessing method to preprocess x(i) and obtain its AndroidManifest.xml file, smali file and arm instruction file. The method is:

Step 3.3.1, use the decompression tool Gzip to decompress x(i) and extract the AndroidManifest.xml, classes.dex and .so runtime library files from x(i).

Step 3.3.2, use AXMLPrinter2 version 2.0, a decompilation tool dedicated to AndroidManifest.xml files, to convert the AndroidManifest.xml file from binary form to text form.

Step 3.3.3, use baksmali version 2.4.0, a decompilation tool for the dex file format, to decompile classes.dex into smali files. If several smali files are produced, merge them into a single smali file and go to step 3.3.4; if only one smali file is produced, go directly to step 3.3.4.

Step 3.3.4, use the arm disassembly tool gcc-arm-none-eabi version 9-2019-q4-major to disassemble the .so runtime library files into arm instruction files in text form. If several arm instruction files are produced, merge them into a single arm instruction file and go to step 3.4; if no arm instruction file is produced, create an empty arm instruction file and go to step 3.4.

Step 3.4, let i=i+1. If i≤N, go to step 3.2; if i>N, the N samples have yielded N AndroidManifest.xml files, N smali files and N arm instruction files; send these files to the feature screening module and go to Step 4.
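Steps 3.3.3 and 3.3.4 both end by merging whatever text files the external tools produced into one file per sample. The sketch below shows only that merging step, under stated assumptions: the function name `merge_text_files`, the glob patterns and the directory layout are hypothetical, and the decompilation tools themselves are not invoked.

```python
from pathlib import Path


def merge_text_files(source_dir, pattern, merged_path):
    """Concatenate every file under source_dir that matches pattern into merged_path.

    Returns the number of files merged. When nothing matches, an empty file is
    still created, mirroring the empty arm instruction file of step 3.3.4.
    """
    merged = Path(merged_path)
    # Collect the inputs before opening the output, and never merge the output into itself.
    files = sorted(
        p for p in Path(source_dir).rglob(pattern)
        if p.is_file() and p.resolve() != merged.resolve()
    )
    with merged.open("w", encoding="utf-8") as out:
        for path in files:
            text = path.read_text(encoding="utf-8", errors="replace")
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")  # keep the line-by-line scans of Step 6 well defined
    return len(files)
```

For example, `merge_text_files("out/smali", "*.smali", "sample.smali")` would produce the single smali file that the later line-by-line scans expect; the paths in this call are placeholders.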

第四步,特征筛选模块对从样本预处理模块收到的D的N个样本对应的N个AndroidManifest.xml文件、N个smali文件和N个arm指令文件进行特征筛选,得到适合对D进行分类的权限特征、API特征、smali操作码特征和arm操作码特征。In the fourth step, the feature screening module performs feature screening on the N AndroidManifest.xml files, N smali files and N arm command files corresponding to the N samples of D received from the sample preprocessing module, and obtains a feature suitable for classifying D. Permission features, API features, smali opcode features, and arm opcode features.

4.1步,选择安卓开发者文档(https://developer.android.com/reference/android/Manifest.permission)中定义的167种android.manifest.permission系统权限,将这167种权限作为特征,称为权限特征。Step 4.1, select the 167 android.manifest.permission system permissions defined in the Android developer documentation (https://developer.android.com/reference/android/Manifest.permission), and use these 167 permissions as features, called Permission features.

4.2步,从pscout列表(https://security.csl.toronto.edu/pscout/?mdocs-file=67)的API中,选择出256个API,方法是:Step 4.2, select 256 APIs from the APIs in the pscout list (https://security.csl.toronto.edu/pscout/?mdocs-file=67) by:

Step 4.2.1: Create a list L_api and add all 32437 APIs in the pscout list to it. The v-th API is denoted L_api[v], 1 ≤ v ≤ 32437.

Step 4.2.2: Create a two-dimensional array Z_api with 32437 rows and N columns. Each element Z_api[v][i] (row v, column i) is either 1 or 0: 1 means that the v-th API of L_api appears in the i-th sample of D, and 0 means that it does not.

Step 4.2.3: Initialize all elements of Z_api to 0 and initialize the variable i = 1.

Step 4.2.4: Scan the smali file of the i-th sample of D line by line to find the APIs of L_api that appear in the i-th sample, and assign the elements of column i of Z_api accordingly. Denote the u-th line of the smali file as the string str[u] and the total number of lines of the smali file as U, 1 ≤ u ≤ U.

Step 4.2.4.1: Initialize u = 1.

Step 4.2.4.2: If str[u] is an API string, go to step 4.2.4.2.1; otherwise, go to step 4.2.4.3.

Step 4.2.4.2.1: Initialize the variable v = 1.

Step 4.2.4.2.2: If str[u] contains L_api[v] as a substring, set Z_api[v][i] = 1 and go to step 4.2.4.3; otherwise, go to step 4.2.4.2.3.

Step 4.2.4.2.3: Let v = v + 1. If v ≤ 32437, go to step 4.2.4.2.2; if v > 32437, go to step 4.2.4.3.

Step 4.2.4.3: Let u = u + 1. If u ≤ U, go to step 4.2.4.2; if u > U, go to step 4.2.5.

Step 4.2.5: Let i = i + 1. If i ≤ N, go to step 4.2.4; if i > N, the assignment of the two-dimensional array Z_api is complete; go to step 4.2.6.
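As an illustration, steps 4.2.1 to 4.2.5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names are hypothetical, each smali file is assumed to be available as a list of text lines, and, because the text does not define what makes a line an "API string", the sketch assumes it is any line containing "invoke" (the marker used in step 6.4.4). Indices are 0-based here, whereas the text counts from 1.

```python
from typing import List, Sequence


def is_api_line(line: str) -> bool:
    # Assumption: a smali line is an "API string" when it holds an invoke instruction.
    return "invoke" in line


def build_presence_matrix(
    api_list: Sequence[str], smali_files: Sequence[Sequence[str]]
) -> List[List[int]]:
    """Return Z with Z[v][i] = 1 if API v appears in the smali file of sample i."""
    z = [[0] * len(smali_files) for _ in api_list]   # step 4.2.3: all zeros
    for i, lines in enumerate(smali_files):          # loop over samples (4.2.4 / 4.2.5)
        for line in lines:                           # loop over lines (4.2.4.3)
            if not is_api_line(line):
                continue
            for v, api in enumerate(api_list):       # loop over APIs (4.2.4.2.3)
                if api in line:                      # substring test (4.2.4.2.2)
                    z[v][i] = 1
                    break                            # move on to the next line
    return z
```

The `break` mirrors the jump to step 4.2.4.3 after the first matching API, so at most one API is credited per line.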

Step 4.2.6: Compute the information gain IG of each API in the list L_api with respect to the benchmark test set D. The information gain of the v-th API with respect to D is denoted IG(D | L_api[v]).

Step 4.2.6.1: Let v = 1.

Step 4.2.6.2: Let i = 1. Set the first variable M_11 = 0, the second variable M_12 = 0, the third variable M_21 = 0, and the fourth variable M_22 = 0.

Step 4.2.6.3: If Z_api[v][i] = 1 and y^(i) = 1, let M_11 = M_11 + 1; if Z_api[v][i] = 1 and y^(i) = 0, let M_12 = M_12 + 1; if Z_api[v][i] = 0 and y^(i) = 1, let M_21 = M_21 + 1; if Z_api[v][i] = 0 and y^(i) = 0, let M_22 = M_22 + 1.

Step 4.2.6.4: Let i = i + 1. If i ≤ N, go to step 4.2.6.3; if i > N, go to step 4.2.6.5.

Step 4.2.6.5: Compute IG(D | L_api[v]) as:

IG(D | L_api[v]) = H(D) - H(D | L_api[v])    (1)

where H(D) is the empirical entropy of the benchmark test set D, computed as:

[Formula (2): image BDA0002431352400000181]

H(D | L_api[v]) is the empirical conditional entropy of D given the v-th API of the list L_api:

[Formula (3): image BDA0002431352400000182]

Step 4.2.6.6: Let v = v + 1. If v ≤ 32437, go to step 4.2.6.2. If v > 32437, the information gain of every API in L_api has been computed: sort the APIs in L_api by IG(D | L_api[v]) in descending order, take the first 256 APIs after sorting as the API features, and go to step 4.3.
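Formulas (2) and (3) are available only as images. The Python sketch below therefore assumes the standard definitions of empirical entropy and empirical conditional entropy (base-2 logarithm), applied to the four counters M_11, M_12, M_21, and M_22 of step 4.2.6.3. The function names are hypothetical, and a label of 1 is taken to mark a malicious sample while any other label value is treated as benign.

```python
import math
from typing import List, Sequence


def entropy(counts: Sequence[int]) -> float:
    """Base-2 entropy of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(row: Sequence[int], labels: Sequence[int]) -> float:
    """IG(D | API) from one row of Z_api and the sample labels (steps 4.2.6.2-4.2.6.5)."""
    m11 = m12 = m21 = m22 = 0
    for present, y in zip(row, labels):
        if present and y == 1:
            m11 += 1          # API present, malicious
        elif present:
            m12 += 1          # API present, benign
        elif y == 1:
            m21 += 1          # API absent, malicious
        else:
            m22 += 1          # API absent, benign
    n = m11 + m12 + m21 + m22
    if n == 0:
        return 0.0
    h_d = entropy([m11 + m21, m12 + m22])                    # assumed form of H(D)
    h_cond = ((m11 + m12) / n) * entropy([m11, m12]) \
        + ((m21 + m22) / n) * entropy([m21, m22])            # assumed form of H(D | API)
    return h_d - h_cond


def top_k_apis(api_list: Sequence[str], z: Sequence[Sequence[int]],
               labels: Sequence[int], k: int = 256) -> List[str]:
    """Sort APIs by information gain, descending, and keep the first k (step 4.2.6.6)."""
    gains = [information_gain(z[v], labels) for v in range(len(api_list))]
    order = sorted(range(len(api_list)), key=lambda v: gains[v], reverse=True)
    return [api_list[v] for v in order[:k]]
```

An API that appears in exactly the malicious samples yields the maximum gain H(D), while an API that is independent of the label yields a gain of 0.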

Step 4.3: The Android Dalvik virtual machine predefines smali opcodes that are 8 binary bits long (https://developer.android.com/reference/dalvik/bytecode/Opcodes.html), so there are at most 256 opcode types, including reserved and undefined ones. Use these 256 smali opcodes as features, called smali opcode features.

Step 4.4: According to the arm instruction quick reference manual (http://infocenter.arm.com/help/topic/com.arm.doc.qrc0001mc/QRC0001_UAL.pdf), the feature screening module selects the 197 arm instruction opcodes listed in the manual as features, called arm opcode features.

Step 4.5: Send the permission features, API features, smali opcode features, and arm opcode features to the frequency fingerprint calculation module.

Fifth step: Determine the frequency fingerprint format.

Sort the 167 permission features, 256 API features, 256 smali opcode features, and 197 arm opcode features alphabetically within each group to form four vectors, called the permission vector, API vector, smali opcode vector, and arm opcode vector of the Android software. The four vectors are concatenated end to end to form a vector of length 876, which serves as the frequency fingerprint of a sample.
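The layout of the fingerprint can be made concrete with a small Python sketch. The segment names perm, api, smali, and arm follow the set T used in the seventh step; the helper segment_slices and the 0-based offsets are assumptions introduced for illustration.

```python
from itertools import accumulate
from typing import Dict

# Segment sizes in concatenation order, taken from steps 4.1 to 4.4.
SEGMENT_SIZES: Dict[str, int] = {"perm": 167, "api": 256, "smali": 256, "arm": 197}


def segment_slices(sizes: Dict[str, int]) -> Dict[str, slice]:
    """Map each segment name to the slice it occupies in the concatenated vector."""
    ends = list(accumulate(sizes.values()))
    starts = [0] + ends[:-1]
    return {name: slice(s, e) for name, s, e in zip(sizes, starts, ends)}


SLICES = segment_slices(SEGMENT_SIZES)
FINGERPRINT_LENGTH = sum(SEGMENT_SIZES.values())
```

With these sizes the permission vector occupies positions 0 to 166, the API vector 167 to 422, the smali opcode vector 423 to 678, and the arm opcode vector 679 to 875.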

Sixth step: The frequency fingerprint calculation module receives the permission features, API features, smali opcode features, and arm opcode features from the feature screening module and the AndroidManifest.xml files, smali files, and arm instruction files from the sample preprocessing module, and computes a frequency fingerprint for each of the N samples in the benchmark test set D.

Step 6.1: Let L_a be the permission list, whose member L_a[pa] is the name string of the pa-th of the 167 permissions in alphabetical order. Let L_b be the API list, whose member L_b[pb] is the name string of the pb-th of the 256 APIs in alphabetical order. Let L_c be the smali opcode list, whose member L_c[pc] is the name string of the pc-th of the 256 smali opcodes in alphabetical order. Let L_d be the arm opcode list, whose member L_d[pd] is the name string of the pd-th of the 197 arm opcodes in alphabetical order. Let the variable i = 1.

Step 6.2: Take the i-th sample x^(i) in D and generate for it a frequency fingerprint [image BDA0002431352400000191] containing 876 elements, each initialized to 0. Within the fingerprint [image BDA00024313524000001911], denote the permission vector by [image BDA0002431352400000192] and its pa-th element by [image BDA0002431352400000193]; the API vector by [image BDA0002431352400000194] and its pb-th element by [image BDA0002431352400000195]; the smali opcode vector by [image BDA0002431352400000196] and its pc-th element by [image BDA0002431352400000197]; and the arm opcode vector by [image BDA0002431352400000198] and its pd-th element by [image BDA0002431352400000199].

Step 6.3: Use the permission extraction method to extract the permissions requested by x^(i) and obtain the permission vector [image BDA00024313524000001910] of x^(i), as follows:

Step 6.3.1: Scan the AndroidManifest.xml file of x^(i) line by line. Denote the qa-th line of the file as the string stra[qa] and the total number of lines as numa.

Step 6.3.2: Let qa = 1.

Step 6.3.3: If stra[qa] contains the substring "uses-permission", let pa = 1 and go to step 6.3.4; otherwise, go to step 6.3.6.

Step 6.3.4: If stra[qa] contains L_a[pa] as a substring, x^(i) has requested the permission L_a[pa]; set [image BDA0002431352400000201] and go to step 6.3.6. Otherwise, go to step 6.3.5.

Step 6.3.5: Let pa = pa + 1. If pa ≤ 167, go to step 6.3.4; if pa > 167, one pass over L_a is complete; go to step 6.3.6.

Step 6.3.6: Let qa = qa + 1. If qa ≤ numa, go to step 6.3.3; if qa > numa, the AndroidManifest.xml file of x^(i) has been fully scanned and the computation of [image BDA0002431352400000202] is complete; go to step 6.4.
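The permission extraction of step 6.3 can be sketched in Python as follows. The assignment in step 6.3.4 is available only as an image, so the sketch assumes that the matching element is set to 1 (a binary flag); the function name and the list-of-lines input are likewise assumptions.

```python
from typing import List, Sequence


def permission_vector(manifest_lines: Sequence[str],
                      permissions: Sequence[str]) -> List[int]:
    """Flag each permission of the (alphabetically sorted) list that the manifest requests."""
    vec = [0] * len(permissions)
    for line in manifest_lines:                     # steps 6.3.2 / 6.3.6
        if "uses-permission" not in line:           # step 6.3.3
            continue
        for pa, name in enumerate(permissions):     # steps 6.3.4 / 6.3.5
            if name in line:
                vec[pa] = 1                         # assumed form of the image formula
                break                               # first match only, then next line
    return vec
```

Because the text specifies substring matching that stops at the first hit, a permission name that is a prefix of another (for example android.permission.BLUETOOTH and android.permission.BLUETOOTH_ADMIN) would be credited to the shorter name in this sketch.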

Step 6.4: Use the API statistics method to count the APIs used by x^(i) and obtain the API vector [image BDA0002431352400000203] of x^(i), as follows:

Step 6.4.1: Scan the smali file of x^(i) line by line. Denote the qb-th line of the smali file as the string strb[qb] and the total number of lines as numb.

Step 6.4.2: Let qb = 1. Use the variable inv for the total number of APIs in the smali file and let inv = 1.

Step 6.4.3: Let the variable pb = 1.

Step 6.4.4: If strb[qb] contains the substring "invoke", let inv = inv + 1 and go to step 6.4.5; otherwise, go to step 6.4.7.

Step 6.4.5: If strb[qb] contains L_b[pb] as a substring, x^(i) calls the API named L_b[pb]; set [image BDA0002431352400000204] and go to step 6.4.7. Otherwise, go to step 6.4.6.

Step 6.4.6: Let pb = pb + 1. If pb ≤ 256, go to step 6.4.5; if pb > 256, one pass over L_b is complete; go to step 6.4.7.

Step 6.4.7: Let qb = qb + 1. If qb ≤ numb, go to step 6.4.3; if qb > numb, the smali file of x^(i) has been fully scanned; go to step 6.4.8.

Step 6.4.8: Let pb = 1.

Step 6.4.9: Set [image BDA0002431352400000205].

Step 6.4.10: Let pb = pb + 1. If pb ≤ 256, go to step 6.4.9; if pb > 256, the computation of [image BDA0002431352400000206] is complete; go to step 6.5.
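The assignments in steps 6.4.5 and 6.4.9 are available only as images. The Python sketch below assumes, as the term "frequency fingerprint" suggests, that step 6.4.5 increments a per-API counter and that step 6.4.9 divides each counter by inv; both are assumptions, as are the function name and the list-of-lines input. The counter inv starts at 1, as in step 6.4.2.

```python
from typing import List, Sequence


def api_frequency_vector(smali_lines: Sequence[str],
                         api_names: Sequence[str]) -> List[float]:
    """Relative call frequency of each selected API in one smali file."""
    counts = [0] * len(api_names)
    inv = 1                                         # step 6.4.2: total starts at 1
    for line in smali_lines:                        # steps 6.4.3 / 6.4.7
        if "invoke" not in line:                    # step 6.4.4
            continue
        inv += 1
        for pb, name in enumerate(api_names):       # steps 6.4.5 / 6.4.6
            if name in line:
                counts[pb] += 1                     # assumed form of the image formula
                break
    return [c / inv for c in counts]                # steps 6.4.8 to 6.4.10 (assumed)
```

Steps 6.5 and 6.6 follow the same pattern with ops and opa as totals; step 6.6 additionally tests for the ">" character instead of "invoke" and skips empty arm files.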

Step 6.5: Use the smali opcode statistics method to count the smali opcodes used by x^(i) and obtain the smali opcode vector [image BDA0002431352400000211] of x^(i), as follows:

Step 6.5.1: Scan the smali file of x^(i) line by line. Denote the qc-th line of the smali file as the string strc[qc] and the total number of lines as numc.

Step 6.5.2: Let qc = 1. Use the variable ops for the total number of smali opcodes in the smali file and let ops = 1.

Step 6.5.3: Let pc = 1.

Step 6.5.4: If strc[qc] contains L_c[pc] as a substring, set [images BDA0002431352400000212, BDA0002431352400000213], let ops = ops + 1, and go to step 6.5.6. Otherwise, go to step 6.5.5.

Step 6.5.5: Let pc = pc + 1. If pc ≤ 256, go to step 6.5.4; if pc > 256, one pass over L_c is complete; go to step 6.5.6.

Step 6.5.6: Let qc = qc + 1. If qc ≤ numc, go to step 6.5.3; if qc > numc, the smali file of x^(i) has been fully scanned; go to step 6.5.7.

Step 6.5.7: Let pc = 1.

Step 6.5.8: Set [image BDA0002431352400000214].

Step 6.5.9: Let pc = pc + 1. If pc ≤ 256, go to step 6.5.8; if pc > 256, the computation of [image BDA0002431352400000215] is complete; go to step 6.6.

Step 6.6: Use the arm opcode statistics method to count the arm opcodes used by x^(i) and obtain the arm opcode vector [image BDA0002431352400000216] of x^(i), as follows:

Step 6.6.1: Scan the arm file of x^(i) line by line. Denote the qd-th line of the arm file as the string strd[qd] and the total number of lines as numd.

Step 6.6.2: Let qd = 1. Use the variable opa for the total number of arm opcodes used in the arm file and let opa = 1. If qd ≤ numd, go to step 6.6.3; if qd > numd, the arm file of x^(i) is empty; go to step 6.7.

Step 6.6.3: Let pd = 1.

Step 6.6.4: If strd[qd] contains the character ">", it contains an arm instruction; let opa = opa + 1 and go to step 6.6.5. Otherwise, go to step 6.6.7.

Step 6.6.5: If strd[qd] contains L_d[pd] as a substring, set [images BDA0002431352400000221, BDA0002431352400000222] and go to step 6.6.7. Otherwise, go to step 6.6.6.

Step 6.6.6: Let pd = pd + 1. If pd ≤ 197, go to step 6.6.5; if pd > 197, one pass over L_d is complete; go to step 6.6.7.

Step 6.6.7: Let qd = qd + 1. If qd ≤ numd, go to step 6.6.3; if qd > numd, the arm file of x^(i) has been fully scanned; go to step 6.6.8.

Step 6.6.8: Let pd = 1.

Step 6.6.9: Set [image BDA0002431352400000223].

Step 6.6.10: Let pd = pd + 1. If pd ≤ 197, go to step 6.6.9; if pd > 197, the computation of [image BDA0002431352400000224] is complete; go to step 6.7.

Step 6.7: Let i = i + 1. If i ≤ N, go to step 6.2; if i > N, frequency fingerprints have been computed for all N samples in D; send the frequency fingerprints to the detection module and go to the seventh step.

Seventh step: The detection module receives the frequency fingerprints from the frequency fingerprint generation module and trains the multiple-kernel support vector machine model so that it becomes a classifier suitable for classifying the software to be detected. Let k_perm, k_api, k_smali, and k_arm denote the kernel functions used for the permission vector, API vector, smali opcode vector, and arm opcode vector within the frequency fingerprint. β is the weight vector (β_perm, β_api, β_smali, β_arm), whose elements are the weights of k_perm, k_api, k_smali, and k_arm respectively. Let T be the set {perm, api, smali, arm}. The multiple-kernel support vector machine model Y can be expressed as:

[Formula (4): image BDA0002431352400000225]

α^(i) is a Lagrange multiplier, and {α^(1), α^(2), ..., α^(i), ..., α^(N)} form the vector α. sgn(A) is the sign function of its argument A: sgn(A) = 1 when A > 0, sgn(A) = 0 when A = 0, and sgn(A) = -1 when A < 0. α and β are obtained by solving formula (5):

[Formula (5): image BDA0002431352400000226]

The constraints of formula (5) are formulas (6) to (9):

[Formula (6): image BDA0002431352400000231]

0 ≤ α^(i) ≤ C    (7)

Σ_{t∈T} β_t = 1    (8)

β_t ≥ 0, t ∈ T    (9)

where C is the penalty coefficient, C ≥ 0, typically set to C = 100, which determines the size of the penalty for misclassification.
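Formulas (4) to (6) are available only as images. For orientation, a conventional multiple-kernel support vector machine built from the quantities defined above is usually written as shown below. This is a textbook form assumed from the surrounding definitions, not a transcription of the patent's formulas; the symbol x_t^{(i)}, standing for the part of the fingerprint of sample i that belongs to feature type t, is introduced here for illustration only.

```latex
% Assumed decision function, cf. formula (4)
F(x) = \operatorname{sgn}\!\Big( \sum_{i=1}^{N} \alpha^{(i)} y^{(i)}
       \sum_{t \in T} \beta_t \, k_t\big(x_t^{(i)}, x_t\big) + b \Big)

% Assumed dual objective, cf. formula (5)
L(\alpha, \beta) = \sum_{i=1}^{N} \alpha^{(i)}
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
    \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)}
    \sum_{t \in T} \beta_t \, k_t\big(x_t^{(i)}, x_t^{(j)}\big)

% Assumed equality constraint, cf. formula (6)
\sum_{i=1}^{N} \alpha^{(i)} y^{(i)} = 0
```

Under this assumed form, fixing all multipliers except a pair α^(r), α^(s) and eliminating α^(s) through the equality constraint leaves a quadratic in α^(r), which matches the description of step 7.2.2.1.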

b is a scalar. After α and β have been obtained, it is given by the following formula:

[Formula (10): image BDA0002431352400000232]

where [image BDA0002431352400000233] is a support vector sample point.

The multiple-kernel support vector machine model is trained as follows:

Step 7.1: Compute the kernel matrices from the frequency fingerprints of the samples in D received from the frequency fingerprint generation module. Let K_t, t ∈ T, be a kernel matrix, standing for the four kernel matrices K_perm, K_api, K_smali, and K_arm. K_t has N rows and N columns, and its element in row i and column j is [image BDA0002431352400000234]. A polynomial kernel function of degree 3 is chosen, and K_t is computed as follows:

Step 7.1.1: Let i = 1.

Step 7.1.2: Let j = 1.

Step 7.1.3: Compute [image BDA0002431352400000235]:

[Formula (11): image BDA0002431352400000236]

where [image BDA0002431352400000237] denotes the inner product of [image BDA0002431352400000238] and [image BDA0002431352400000239].

Step 7.1.4: Let j = j + 1. If j ≤ N, go to step 7.1.3; if j > N, go to step 7.1.5.

Step 7.1.5: Let i = i + 1. If i ≤ N, go to step 7.1.2; if i > N, the computation of K_t is complete; go to step 7.2.
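Step 7.1 can be sketched in Python as follows. Formula (11) is available only as an image, so the exact kernel is not known here; the sketch assumes the common degree-3 polynomial form (⟨u, v⟩ + c)^3 with c = 1. The function names and the offset c are assumptions, and the loops are 0-based.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def poly3_kernel(u: Vector, v: Vector, c: float = 1.0) -> float:
    """Degree-3 polynomial kernel; the offset c is an assumption."""
    dot = sum(a * b for a, b in zip(u, v))   # inner product of the two segments
    return (dot + c) ** 3


def kernel_matrix(segments: Sequence[Vector],
                  kernel: Callable[[Vector, Vector], float] = poly3_kernel
                  ) -> List[List[float]]:
    """N x N matrix K_t for one feature type t, given that type's segment per sample."""
    n = len(segments)
    return [[kernel(segments[i], segments[j]) for j in range(n)] for i in range(n)]
```

One such matrix would be built per feature type in T, each from the corresponding segment of the 876-element fingerprints.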

Step 7.2: Optimize the parameters α and β, as follows:

Step 7.2.1: Initialize every element of the vector α to 0 and every element of the vector β to 1/4.

Step 7.2.2: Using formula (5), select a pair α^(r), α^(s) in ascending order of the superscripts r and s and optimize α, treating [image BDA00024313524000002310] and the vector β as fixed. The optimization proceeds as follows:

Step 7.2.2.1: Using the constraint of formula (6), rewrite formula (5) as a quadratic function g(α^(r)) of the single variable α^(r); differentiate g(α^(r)), set the derivative to 0, and solve for α^(r).

Step 7.2.2.2: Use the constraint of formula (6) to obtain α^(s).

Step 7.2.2.3: Update α^(r) and α^(s) to obtain the optimized α, denoted α*.

Step 7.2.3: With α* fixed, optimize β, as follows:

Step 7.2.3.1: Compute the partial derivatives of formula (5) with respect to β, set them to 0, and find the solution that satisfies the constraints of formulas (8) and (9). This gives the optimized values of β_perm, β_api, β_smali, and β_arm, denoted [image BDA0002431352400000241].

Step 7.2.3.2: Concatenate [image BDA0002431352400000242] into the optimized β, denoted β*.

Step 7.2.4: Check whether α and β satisfy the optimization termination conditions of formulas (12) to (14):

[Formula (12): image BDA0002431352400000243]

[Formula (13): image BDA0002431352400000244]

L(α*, β*) - L(α, β) ≤ ε    (14)

When formula (14) is satisfied, optimizing α and β changes the value of the function in formula (5) by less than the threshold ε, where ε = 0.01. The optimized α and β then meet the requirements and training of the multiple-kernel support vector machine model is complete; go to step 7.3. Otherwise, go to step 7.2.2.

Step 7.3: Compute the value of b from formula (10). Training and optimization of the multiple-kernel support vector machine model defined by formula (4) is then complete, and the model becomes the classifier.

Eighth step: Use the Android malware detection system based on frequency fingerprint extraction to examine the software to be detected and determine whether it is malware, as follows:

Step 8.1: The sample preprocessing module preprocesses the software to be detected. Treat the software to be detected as a sample x^(a), preprocess x^(a) with the sample preprocessing method of step 3.3 to obtain its AndroidManifest.xml file, smali file, and arm instruction file, and output them to the frequency fingerprint calculation module.

Step 8.2: Compute the frequency fingerprint [image BDA0002431352400000245] of x^(a):

Step 8.2.1: Use the permission extraction method of step 6.3 to extract the permissions requested by x^(a) and obtain the permission vector [image BDA0002431352400000246] of x^(a).

Step 8.2.2: Use the API statistics method of step 6.4 to count the APIs used by x^(a) and obtain the API vector [image BDA0002431352400000251] of x^(a).

Step 8.2.3: Use the smali opcode statistics method of step 6.5 to count the smali opcodes used by x^(a) and obtain the smali opcode vector [image BDA0002431352400000252] of x^(a).

Step 8.2.4: Use the arm opcode statistics method of step 6.6 to count the arm opcodes used by x^(a) and obtain the arm opcode vector [image BDA0002431352400000253] of x^(a).

Step 8.2.5: Once [image BDA0002431352400000254] have been computed, concatenate them into [image BDA0002431352400000255].

Step 8.3: Input [image BDA0002431352400000256] into the detection module, which computes the output F from formula (4). F equals +1 or -1: +1 means that the software to be detected is malware and -1 means that it is benign, which determines whether the software to be detected is malware.
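The classification in step 8.3 can be sketched in Python as follows. Formula (4) is available only as an image, so the sketch assumes the conventional multiple-kernel decision function: a weighted sum of per-type kernels over the training samples, plus b, passed through sgn. All names (classify, poly3, the dictionary layout keyed by feature type) and the kernel offset are assumptions for illustration.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def poly3(u: Vector, v: Vector) -> float:
    # Assumed degree-3 polynomial kernel with offset 1.
    return (sum(a * b for a, b in zip(u, v)) + 1.0) ** 3


def sign(a: float) -> int:
    """sgn as defined in the seventh step: 1, 0, or -1."""
    return (a > 0) - (a < 0)


def classify(x: Dict[str, Vector],
             train: Dict[str, List[Vector]],
             labels: Sequence[int],
             alpha: Sequence[float],
             beta: Dict[str, float],
             b: float,
             kernels: Dict[str, Kernel]) -> int:
    """Return +1 (malware) or -1 (benign) for the fingerprint segments in x."""
    total = b
    for i, (a_i, y_i) in enumerate(zip(alpha, labels)):
        if a_i == 0:
            continue                      # only support vectors contribute
        combined = sum(beta[t] * kernels[t](train[t][i], x[t]) for t in beta)
        total += a_i * y_i * combined
    return sign(total)
```

Here x and train hold one vector per feature type, so the 876-element fingerprint would first be split into its four segments.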

Claims (10)

1. An android malicious software detection method based on frequency fingerprint extraction is characterized by comprising the following steps:
the method comprises the steps that firstly, an android malicious software detection system based on frequency fingerprint extraction is constructed, the android malicious software detection system based on frequency fingerprint extraction is installed in a Google official or third-party android application software market server and consists of a sample preprocessing module, a frequency fingerprint generation module and a detection module;
the sample preprocessing module is connected with the frequency fingerprint generating module, receives a sample of a reference test set and a sample to be detected, preprocesses the sample, generates three types of files, namely, an android manifest.xml file, a smali file and an arm instruction file, and outputs the three types of files to the frequency fingerprint generating module;
the frequency fingerprint generation module is connected with the sample preprocessing module and the detection module, receives the android manifest, the smal file and the arm instruction file from the sample preprocessing module, performs feature screening and frequency fingerprint calculation, generates a frequency fingerprint and outputs the frequency fingerprint to the detection module;
the frequency fingerprint generation module consists of a characteristic screening module and a frequency fingerprint calculation module; the characteristic screening module is connected with the sample preprocessing module and the frequency fingerprint computing module, receives android manifest, smal files and arm instruction files from the sample preprocessing module, performs characteristic screening on the three files to obtain authority, API, smal operation codes and arm operation code characteristics, and sends the authority, API, smal operation codes and arm operation code characteristics to the frequency fingerprint computing module; the frequency fingerprint calculation module is connected with the sample preprocessing module, the feature screening module and the detection module, receives the authority, the API, the smali operation code and the arm operation code features from the feature screening module, receives the android manifest.xml, the smali file and the arm instruction file from the sample preprocessing module, calculates to generate a frequency fingerprint, and sends the frequency fingerprint to the detection module;
the detection module is connected with the frequency fingerprint generation module, is a multi-core support vector machine model, receives the frequency fingerprints of the reference test set D and the frequency fingerprints of the software to be detected from the frequency fingerprint generation module, performs training optimization by using the frequency fingerprints of the reference test set D to form a classifier suitable for detecting the software to be detected, and then performs detection classification on the software to be detected according to the frequency fingerprints of the software to be detected to obtain a judgment result of whether the software to be detected is malicious software;
secondly, constructing a benchmark test set D, wherein the method comprises the following steps:
2.1 step of adding N1Individual android malware as malicious samples, N1Is a positive integer and N1>1000;
2.2 step (b), adding N2Benign software as a benign sample, N2Is a positive integer and N2>1000;
And 2.3, adding labels to the malicious samples and the benign samples to form a benchmark test set D, wherein N is the total number of the samples in D, and N is equal to N1+N2(ii) a Definition of x(i)Is the ith sample in D, y(i)Is x(i)Label of (a), y(i)Equal to 1 denotes x(i)As a malicious sample, y(i)Equal to-1 denotes x(i)I is more than or equal to 1 and less than or equal to N;
2.4 storing D in a memory which can be read by both the preprocessing module and the frequency fingerprint generation module;
thirdly, preprocessing the N samples in the D by a sample preprocessing module to obtain N android Manifest xml files, N smali files and N arm instruction files, wherein the method comprises the following steps:
step 3.1, enabling the variable i to be 1;
3.2 step, take the ith sample x from D(i)
3.3 step, using sample pretreatment method to x(i)Carrying out pretreatment to obtain x(i)Xml file, smali file and arm instruction file, the method is as follows:
3.3.1 Steps, using decompression tool vs. x(i)Decompress and extract x(i)Xml, classes, dex and so runtime files in (1);
3.3.2, using an android manifest. xml file special decompilation tool AXM L Printer2 to decompilate the android manifest. xml file from a binary form into a text form;
3.3.3, inversely compiling classes into a smali file by using a dex file format inverse compiling tool bakmali, if a plurality of smali files are generated, combining the plurality of smali files into one smali file, and turning to 3.3.4 steps; if only 1 smali file is generated, directly rotating to 3.3.4 steps;
3.3.4, reversely compiling the so runtime file into an arm instruction file in a text form by using an arm instruction disassembling tool gcc-arm-none-eabi, if a plurality of arm instruction files are generated, combining the plurality of arm instruction files into one arm instruction file, and turning to the 3.4 step; if the arm instruction file is not generated, establishing an empty arm instruction file, and turning to the step 3.4;
3.4, changing i to i +1, and if i is less than or equal to N, turning to 3.2; if i is larger than N, the N samples generate corresponding N android Manifest xml files, N smali files and N arm instruction files, the N android Manifest xml files, the N smali files and the N arm instruction files corresponding to the N samples of D are sent to a feature screening module, and the fourth step is carried out;
fourthly, the feature screening module performs feature screening on the N AndroidManifest.xml files, N smali files, and N arm instruction files, corresponding to the N samples of D, received from the sample preprocessing module, to obtain the permission features, API features, smali opcode features, and arm opcode features suitable for classifying D, as follows:
4.1, select the 167 Android system permissions defined in the Android developer documentation and use these 167 permissions as features, namely the permission features;
4.2, select 256 APIs from the APIs in the PScout list, as follows:
4.2.1, build a list Lapi and add all 32437 APIs in the PScout list to Lapi; the v-th API is denoted Lapi[v], 1 ≤ v ≤ 32437;
4.2.2, build a two-dimensional array Zapi of 32437 rows and N columns; the element in row v, column i, Zapi[v][i], takes the value 1 or 0, where 1 means that the v-th API of Lapi appears in the i-th sample of D and 0 means that it does not;
4.2.3, initialize all elements of Zapi to 0 and initialize the variable i = 1;
4.2.4, scan the smali file of the i-th sample of D line by line to find the APIs that appear in the i-th sample and belong to Lapi, and assign the elements of the i-th column of Zapi; denote the u-th line string of the smali file as str[u] and the total number of lines of the smali file as U, 1 ≤ u ≤ U;
4.2.5, let i = i + 1; if i ≤ N, go to step 4.2.4; if i > N, the two-dimensional array Zapi is complete; go to step 4.2.6;
4.2.6, calculate the information gain IG of each API in the list Lapi with respect to the benchmark test set D, the information gain of the v-th API with respect to D being denoted IG(D|Lapi[v]); sort the APIs in Lapi by IG(D|Lapi[v]) in descending order and take the top 256 APIs as the API features;
4.3, use the 256 smali opcodes, each 8 binary bits long, predefined by the Android Dalvik virtual machine as the smali opcode features;
4.4, select the 197 arm instruction opcodes listed in the arm instruction quick reference manual as the arm opcode features;
4.5, send the permission features, API features, smali opcode features, and arm opcode features to the frequency fingerprint calculation module;
fifthly, determine the frequency fingerprint format, as follows:
arrange the 167 permission features, 256 API features, 256 smali opcode features, and 197 arm opcode features each in alphabetical order to form four vectors, called the permission vector, API vector, smali opcode vector, and arm opcode vector of the Android software;
the permission vector of the Android software consists of 167 integers, each taking the value 1 or 0; a value of 1 at position pa indicates that the Android software requests the pa-th of the 167 selected permissions, and a value of 0 at position pa indicates that it does not; pa is an integer, 1 ≤ pa ≤ 167;
the API vector of the Android software consists of 256 decimal values; the value at position pb is the frequency with which the pb-th of the 256 selected APIs occurs in the Android software; pb is an integer, 1 ≤ pb ≤ 256;
the smali opcode vector of the Android software consists of 256 decimal values; the value at position pc is the frequency with which the pc-th of the 256 selected smali opcodes occurs in the Android software; pc is an integer, 1 ≤ pc ≤ 256;
the arm opcode vector of the Android software consists of 197 decimal values; the value at position pd is the frequency with which the pd-th of the 197 selected arm opcodes occurs in the Android software; pd is an integer, 1 ≤ pd ≤ 197;
concatenate the four vectors end to end to form a vector of length 876 as the frequency fingerprint; the 167 integers and 709 decimal values contained in the frequency fingerprint are all called elements of the frequency fingerprint;
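The layout of the frequency fingerprint can be illustrated with a short sketch. The segment lengths come from the text; the helper names `build_fingerprint` and `split_fingerprint` are hypothetical, and positions are 0-based in the code whereas the text counts from 1.

```python
N_PERM, N_API, N_SMALI, N_ARM = 167, 256, 256, 197
FINGERPRINT_LEN = N_PERM + N_API + N_SMALI + N_ARM  # 876 elements in total


def build_fingerprint(perm, api, smali, arm):
    """Concatenate the four vectors end to end into one fingerprint."""
    lengths = (len(perm), len(api), len(smali), len(arm))
    if lengths != (N_PERM, N_API, N_SMALI, N_ARM):
        raise ValueError("unexpected vector lengths: %r" % (lengths,))
    return list(perm) + list(api) + list(smali) + list(arm)


def split_fingerprint(fp):
    """Recover the four sub-vectors, e.g. to feed one kernel per segment."""
    a = N_PERM
    b = a + N_API
    c = b + N_SMALI
    return fp[:a], fp[a:b], fp[b:c], fp[c:]
```

Keeping the segment boundaries in one place matters because the seventh step applies a separate kernel function to each of the four segments.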
sixthly, the frequency fingerprint calculation module receives the permission features, API features, smali opcode features, and arm opcode features from the feature screening module, receives the AndroidManifest.xml files, smali files, and arm instruction files from the sample preprocessing module, and calculates the frequency fingerprints of the N samples in the benchmark test set D, as follows:
6.1, let La be the permission list, whose member La[pa] is the name string of the pa-th of the 167 permissions in alphabetical order; let Lb be the API list, whose member Lb[pb] is the name string of the pb-th of the 256 APIs in alphabetical order; let Lc be the smali opcode list, whose member Lc[pc] is the name string of the pc-th of the 256 smali opcodes in alphabetical order; let Ld be the arm opcode list, whose member Ld[pd] is the name string of the pd-th of the 197 arm opcodes in alphabetical order; let the variable i = 1;
6.2, take the i-th sample x(i) in D and generate a frequency fingerprint [Figure FDA0002431352390000041] for x(i), containing 876 elements, each initialized to 0; within [Figure FDA0002431352390000042], denote the permission vector as [Figure FDA0002431352390000043] and its pa-th element as [Figure FDA0002431352390000044]; denote the API vector as [Figure FDA0002431352390000045] and its pb-th element as [Figure FDA0002431352390000046]; denote the smali opcode vector as [Figure FDA0002431352390000047] and its pc-th element as [Figure FDA0002431352390000048]; denote the arm opcode vector as [Figure FDA0002431352390000049] and its pd-th element as [Figure FDA00024313523900000410];
6.3, use the permission extraction method to extract the permissions requested by x(i) and obtain the permission vector of x(i), [Figure FDA00024313523900000411], as follows:
6.3.1, scan the AndroidManifest.xml file of x(i) line by line; denote the qa-th line string of the file as stra[qa] and the total number of lines of the file as numa;
6.3.2, let qa = 1;
6.3.3, if stra[qa] contains the substring "uses-permission", let pa = 1 and go to step 6.3.4; if stra[qa] does not contain the substring "uses-permission", go to step 6.3.6;
6.3.4, if stra[qa] contains the substring La[pa], x(i) requests the permission La[pa]; make the assignment shown in [Figure FDA0002431352390000051] and go to step 6.3.6; if stra[qa] does not contain the substring La[pa], go to step 6.3.5;
6.3.5, let pa = pa + 1; if pa ≤ 167, go to step 6.3.4; if pa > 167, one full pass over La is complete; go to step 6.3.6;
6.3.6, let qa = qa + 1; if qa ≤ numa, go to step 6.3.3; if qa > numa, the AndroidManifest.xml file of x(i) has been fully scanned and the calculation of [Figure FDA0002431352390000052] is complete; go to step 6.4;
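The line scan of step 6.3 can be sketched as below. The function name `permission_vector` is hypothetical, indices are 0-based, and the assignment hidden in the formula image of step 6.3.4 is assumed to set the matching element to 1, which is what the definition of the permission vector in the fifth step implies.

```python
def permission_vector(manifest_lines, permission_names):
    """Return a 0/1 vector over permission_names (assumed alphabetical).

    Only lines containing "uses-permission" are inspected, and the scan of
    a line stops at the first permission name found in it (step 6.3.4).
    """
    vector = [0] * len(permission_names)
    for line in manifest_lines:
        if "uses-permission" not in line:
            continue
        for pa, name in enumerate(permission_names):
            if name in line:
                vector[pa] = 1
                break  # move on to the next manifest line
    return vector
```

Because matching is by substring and stops at the first hit, a permission whose name is a prefix of another (for example `android.permission.BLUETOOTH` and `android.permission.BLUETOOTH_ADMIN`) shadows the longer name under this literal reading; an implementation may want to compare the full attribute value instead.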
6.4, use the API statistical method to count the APIs used by x(i) and obtain the API vector of x(i), [Figure FDA0002431352390000053], as follows:
6.4.1, scan the smali file corresponding to x(i) line by line; denote the qb-th line string of the smali file as strb[qb] and the total number of lines of the smali file as numb;
6.4.2, let qb = 1; use the variable inv to represent the total number of API calls in the smali file, and let inv = 1;
6.4.3, let pb = 1;
6.4.4, if strb[qb] contains the substring "invoke", let inv = inv + 1 and go to step 6.4.5; if strb[qb] does not contain the substring "invoke", go to step 6.4.7;
6.4.5, if strb[qb] contains the substring Lb[pb], x(i) calls the API named Lb[pb]; make the assignment shown in [Figure FDA0002431352390000054] and go to step 6.4.7; if strb[qb] does not contain the substring Lb[pb], go to step 6.4.6;
6.4.6, let pb = pb + 1; if pb ≤ 256, go to step 6.4.5; if pb > 256, one full pass over Lb is complete; go to step 6.4.7;
6.4.7, let qb = qb + 1; if qb ≤ numb, go to step 6.4.3; if qb > numb, the smali file corresponding to x(i) has been fully scanned; go to step 6.4.8;
6.4.8, let pb = 1;
6.4.9, make the assignment shown in [Figure FDA0002431352390000055];
6.4.10, let pb = pb + 1; if pb ≤ 256, go to step 6.4.9; if pb > 256, the calculation of [Figure FDA0002431352390000056] is complete; go to step 6.5;
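A compact sketch of the API statistical method of step 6.4 follows. The two assignments in the method are given only as formula images, so the code assumes that a match increments the element by one (step 6.4.5) and that the final pass divides each count by inv (step 6.4.9); the function name `api_frequency_vector` and the parameter `initial_total` are hypothetical.

```python
def api_frequency_vector(smali_lines, api_names, initial_total=1):
    """Frequency of each API in api_names among the "invoke" lines.

    initial_total follows the text, which initializes inv to 1.
    """
    counts = [0] * len(api_names)
    inv = initial_total
    for line in smali_lines:
        if "invoke" not in line:
            continue  # step 6.4.4: not an API call line
        inv += 1
        for pb, name in enumerate(api_names):
            if name in line:
                counts[pb] += 1  # assumed content of the formula image
                break            # first match ends the scan of this line
    return [c / inv for c in counts]  # assumed normalization by inv
```

The smali opcode method of step 6.5 has the same shape without the "invoke" filter, and the arm opcode method of step 6.6 filters on the ">" character instead.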
6.5, use the smali opcode statistical method to count the smali opcodes used by x(i) and obtain the smali opcode vector of x(i), [Figure FDA0002431352390000057], as follows:
6.5.1, scan the smali file corresponding to x(i) line by line; denote the qc-th line string of the smali file as strc[qc] and the total number of lines of the smali file as numc;
6.5.2, let qc = 1; use the variable ops to represent the total number of smali opcodes in the smali file, and let ops = 1;
6.5.3, let pc = 1;
6.5.4, if strc[qc] contains the substring Lc[pc], make the assignment shown in [Figure FDA0002431352390000061], let ops = ops + 1, and go to step 6.5.6; if strc[qc] does not contain the substring Lc[pc], go to step 6.5.5;
6.5.5, let pc = pc + 1; if pc ≤ 256, go to step 6.5.4; if pc > 256, one full pass over Lc is complete; go to step 6.5.6;
6.5.6, let qc = qc + 1; if qc ≤ numc, go to step 6.5.3; if qc > numc, the smali file corresponding to x(i) has been fully scanned; go to step 6.5.7;
6.5.7, let pc = 1;
6.5.8, make the assignment shown in [Figure FDA0002431352390000062];
6.5.9, let pc = pc + 1; if pc ≤ 256, go to step 6.5.8; if pc > 256, the calculation of [Figure FDA0002431352390000063] is complete; go to step 6.6;
6.6, use the arm opcode statistical method to count the arm opcodes used by x(i) and obtain the arm opcode vector of x(i), [Figure FDA0002431352390000064], as follows:
6.6.1, scan the arm file corresponding to x(i) line by line; denote the qd-th line string of the arm file as strd[qd] and the total number of lines of the arm file as numd;
6.6.2, let qd = 1; use the variable opa to represent the total number of arm opcodes used in the arm file, and let opa = 1; if qd ≤ numd, go to step 6.6.3; if qd > numd, the arm file corresponding to x(i) is an empty file; go to step 6.7;
6.6.3, let pd = 1;
6.6.4, if strd[qd] contains the character ">", strd[qd] contains an arm instruction; let opa = opa + 1 and go to step 6.6.5; if strd[qd] does not contain the character ">", go to step 6.6.7;
6.6.5, if strd[qd] contains the substring Ld[pd], make the assignment shown in [Figure FDA0002431352390000065] and go to step 6.6.7; if strd[qd] does not contain the substring Ld[pd], go to step 6.6.6;
6.6.6, let pd = pd + 1; if pd ≤ 197, go to step 6.6.5; if pd > 197, one full pass over Ld is complete; go to step 6.6.7;
6.6.7, let qd = qd + 1; if qd ≤ numd, go to step 6.6.3; if qd > numd, the arm file corresponding to x(i) has been fully scanned; go to step 6.6.8;
6.6.8, let pd = 1;
6.6.9, make the assignment shown in [Figure FDA0002431352390000071];
6.6.10, let pd = pd + 1; if pd ≤ 197, go to step 6.6.9; if pd > 197, the calculation of [Figure FDA0002431352390000072] is complete; go to step 6.7;
6.7, let i = i + 1; if i ≤ N, go to step 6.2; if i > N, the frequency fingerprints of all N samples in D have been calculated; send the frequency fingerprints to the detection module and go to the seventh step;
seventhly, the detection module receives the frequency fingerprints from the frequency fingerprint generation module and trains the multi-kernel support vector machine model so that it becomes a classifier suitable for classifying the software to be detected, as follows: for the benchmark test set D, the feature space of the multi-kernel support vector machine model is the set of frequency fingerprints of the N samples in D; let kperm, kapi, ksmali, and karm denote the kernel functions used for the permission vector, the API vector, the smali opcode vector, and the arm opcode vector within the frequency fingerprint, respectively; β is the weight vector, written (βperm, βapi, βsmali, βarm), whose elements βperm, βapi, βsmali, and βarm are the weights of kperm, kapi, ksmali, and karm, respectively; let T be the set {perm, api, smali, arm}; the multi-kernel support vector machine model Y is expressed as:
Figure FDA0002431352390000073
α(i) is a Lagrange multiplier, and {α(1), α(2), ..., α(i), ..., α(N)} forms the vector α; sgn(a) is the sign function of the parameter a: sgn(a) = 1 when a > 0, sgn(a) = 0 when a = 0, and sgn(a) = −1 when a < 0; α and β are obtained by solving formula (5):
Figure FDA0002431352390000074
the constraint conditions of formula (5) are formulas (6) to (9):
Figure FDA0002431352390000075
0 ≤ α(i) ≤ C (7)
Σ_{t∈T} βt = 1 (8)
βt ≥ 0, t ∈ T (9)
where C is the penalty coefficient, C ≥ 0, which expresses how heavily misclassification is penalized;
b is a scalar, computed from the obtained α and β by the following formula:
Figure FDA0002431352390000081
where [Figure FDA0002431352390000082] is a support vector sample point;
the method for training the multi-kernel support vector machine model is as follows:
7.1, calculate the kernel matrices from the frequency fingerprints of the samples in D received from the frequency fingerprint generation module; let Kt be a kernel matrix, t ∈ T, standing for the four kernel matrices Kperm, Kapi, Ksmali, and Karm; Kt has N rows and N columns, and its element in row i, column j is [Figure FDA0002431352390000083]; a polynomial kernel of degree 3 is selected, and Kt is calculated as follows:
7.1.1, let i = 1;
7.1.2, let j = 1;
7.1.3, calculate
Figure FDA0002431352390000084
Figure FDA0002431352390000085
where [Figure FDA0002431352390000086] denotes the inner product of [Figure FDA0002431352390000087] and [Figure FDA0002431352390000088];
7.1.4, let j = j + 1; if j ≤ N, go to step 7.1.3; if j > N, go to step 7.1.5;
7.1.5, let i = i + 1; if i ≤ N, go to step 7.1.2; if i > N, the calculation of Kt is complete; go to step 7.2;
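The kernel matrix computation of step 7.1 can be sketched as follows. The exact kernel expression is contained in a formula image, so the common degree-3 polynomial form (⟨x, z⟩ + c)^3 with c = 1 is an assumption, as are the names `poly3_kernel` and `kernel_matrix`; the sketch also fills the matrix symmetrically rather than visiting every (i, j) pair.

```python
def poly3_kernel(x, z, c=1.0):
    """Degree-3 polynomial kernel; the offset c is an assumed parameter."""
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + c) ** 3


def kernel_matrix(vectors, kernel=poly3_kernel):
    """N x N matrix of pairwise kernel values for one fingerprint segment."""
    n = len(vectors)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # A kernel is symmetric, so each pair is evaluated once.
            K[i][j] = K[j][i] = kernel(vectors[i], vectors[j])
    return K
```

One such matrix would be computed per segment, that is, from the permission, API, smali opcode, and arm opcode sub-vectors of the N fingerprints, giving Kperm, Kapi, Ksmali, and Karm.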
7.2, optimize the parameters α and β as follows:
7.2.1, initialize each element of the vector α to 0 and each element of the vector β to 1/4;
7.2.2, using formula (5), select a pair α(r), α(s) in order of increasing superscripts r and s, hold (α(1), α(2), ..., α(r-1), α(r+1), ..., α(s-1), α(s+1), ..., α(N)) and the vector β fixed, and optimize α to obtain the optimized α, named α*;
7.2.3, holding α* fixed, optimize β to obtain the optimized β, named β*;
7.2.4, determine whether α and β satisfy the optimization termination conditions of formulas (12) to (14):
Figure FDA0002431352390000089
Figure FDA00024313523900000810
L(α*,β*) - L(α,β) ≤ threshold (14)
formula (14) requires that the change in the value of the function in formula (5) produced by optimizing α and β be no greater than the threshold, which is greater than 0 and at most 0.1; if the termination conditions are satisfied, the optimized α and β meet the requirements and the training of the multi-kernel support vector machine model is complete; go to step 7.3; otherwise, go to step 7.2.2;
7.3, calculate the value of b by formula (10); the multi-kernel support vector machine model defined by formula (4) has then been trained and optimized and forms the classifier;
eighthly, the Android malware detection system based on frequency fingerprint extraction detects the software to be detected that the server of Google's official or a third-party Android application market has received from a user, and determines whether the software to be detected is malware, as follows:
8.1, the sample preprocessing module preprocesses the software to be detected: taking the software to be detected as a sample x(a), it preprocesses x(a) using the sample preprocessing method of step 3.3 to obtain the AndroidManifest.xml file, smali file, and arm instruction file of x(a), and outputs them to the frequency fingerprint calculation module;
8.2, the frequency fingerprint calculation module calculates the frequency fingerprint of x(a), [Figure FDA0002431352390000091], as follows:
8.2.1, use the permission extraction method of step 6.3 to extract the permissions requested by x(a) and obtain the permission vector of x(a), [Figure FDA0002431352390000092];
8.2.2, use the API statistical method of step 6.4 to count the APIs used by x(a) and obtain the API vector of x(a), [Figure FDA0002431352390000093];
8.2.3, use the smali opcode statistical method of step 6.5 to count the smali opcodes used by x(a) and obtain the smali opcode vector of x(a), [Figure FDA0002431352390000094];
8.2.4, use the arm opcode statistical method of step 6.6 to count the arm opcodes used by x(a) and obtain the arm opcode vector of x(a), [Figure FDA0002431352390000095];
8.2.5, after [Figure FDA0002431352390000096] have been calculated, concatenate them into the frequency fingerprint of x(a), [Figure FDA0002431352390000097];
8.3, input [Figure FDA0002431352390000098] to the detection module, which calculates the output value F according to formula (4); F equals +1 or −1, where +1 indicates that the software to be detected is malware and −1 indicates that it is benign, thereby determining whether the software to be detected is malware.
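The detection step can be illustrated with a small decision-function sketch. Formula (4) is available only as an image, so the code assumes the usual multiple-kernel SVM form sgn(Σ_i α(i) y(i) Σ_{t∈T} βt kt(x(i), x) + b) with training labels in {+1, −1}; the names `mkl_decision`, `x_parts`, and `train_parts` are hypothetical.

```python
def sgn(a):
    """Sign function: 1 for a > 0, 0 for a == 0, -1 for a < 0."""
    return (a > 0) - (a < 0)


def mkl_decision(x_parts, train_parts, labels, alpha, beta, b, kernels):
    """Classify one sample with a weighted sum of per-segment kernels.

    x_parts and each entry of train_parts map a segment name t
    (e.g. "perm", "api", "smali", "arm") to that segment's sub-vector;
    beta maps t to its weight and kernels maps t to its kernel function.
    """
    total = b
    for xi, yi, ai in zip(train_parts, labels, alpha):
        if ai == 0:
            continue  # only support vectors contribute
        combined = sum(beta[t] * kernels[t](xi[t], x_parts[t]) for t in beta)
        total += ai * yi * combined
    return sgn(total)
```

Under this assumed form, a return value of +1 corresponds to malware and −1 to benign software, matching the interpretation of F in step 8.3.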
2. The method of claim 1, wherein the malicious samples in step 2.1 are obtained from the open-source Drebin, Genome, and AMD datasets.
3. The method of claim 1, wherein the benign samples in step 2.2 are benign software crawled from the Google Play and Apkpure application stores and filtered by detection with local antivirus software and the VirusTotal online antivirus website.
4. The method as claimed in claim 1, wherein the decompression tool in step 3.3.1 is Gzip or 7-Zip.
5. The method as claimed in claim 1, wherein in the third step AXMLPrinter2 is required to be version 2.0 or later, baksmali is required to be version 2.4.0 or later, and gcc-arm-none-eabi is required to be version 9-2019-q4-major or later.
6. The method as claimed in claim 1, wherein in step 4.2.4 the smali file of the i-th sample of D is scanned line by line to find the APIs that appear in the i-th sample and belong to Lapi, and the elements of the i-th column of Zapi are assigned, as follows:
4.2.4.1, initialize u = 1;
4.2.4.2, if str[u] is an API string, go to step 4.2.4.2.1; if str[u] is not an API string, go to step 4.2.4.3;
4.2.4.2.1, initialize the variable v = 1;
4.2.4.2.2, if str[u] contains the substring Lapi[v], assign Zapi[v][i] = 1 and go to step 4.2.4.3; otherwise, go to step 4.2.4.2.3;
4.2.4.2.3, let v = v + 1; if v ≤ 32437, go to step 4.2.4.2.2; if v > 32437, go to step 4.2.4.3;
4.2.4.3, let u = u + 1; if u ≤ U, go to step 4.2.4.2; if u > U, the scan of the smali file of the i-th sample is complete.
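Filling one column of Zapi as described in claim 6 might look like the sketch below. The claim does not say how a line is recognized as an API string, so the default test (the line contains "invoke", as in step 6.4.4) is an assumption, and the names `api_presence_column` and `is_api_line` are hypothetical; indices are 0-based.

```python
def api_presence_column(smali_lines, api_list,
                        is_api_line=lambda s: "invoke" in s):
    """Return the column of Zapi for one sample as a list of 0/1 flags."""
    column = [0] * len(api_list)
    for line in smali_lines:
        if not is_api_line(line):
            continue  # step 4.2.4.2: skip lines that are not API strings
        for v, api in enumerate(api_list):
            if api in line:
                column[v] = 1
                break  # step 4.2.4.2.2: first match ends this line
    return column
```

Calling this once per sample and stacking the results column by column yields the 32437 × N array used for the information gain ranking.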
7. The method of claim 1, wherein in step 4.2.6 the information gain IG of each API in the list Lapi with respect to the benchmark test set D, the information gain of the v-th API with respect to D being denoted IG(D|Lapi[v]), is calculated as follows:
4.2.6.1, let v = 1;
4.2.6.2, let i = 1, and let the first variable M11 = 0, the second variable M12 = 0, the third variable M21 = 0, and the fourth variable M22 = 0;
4.2.6.3, if Zapi[v][i] equals 1 and y(i) equals 1, let M11 = M11 + 1; if Zapi[v][i] equals 1 and y(i) equals 0, let M12 = M12 + 1; if Zapi[v][i] equals 0 and y(i) equals 1, let M21 = M21 + 1; if Zapi[v][i] equals 0 and y(i) equals 0, let M22 = M22 + 1;
4.2.6.4, let i = i + 1; if i ≤ N, go to step 4.2.6.3; if i > N, go to step 4.2.6.5;
4.2.6.5, calculate IG(D|Lapi[v]) as:
IG(D|Lapi[v]) = H(D) - H(D|Lapi[v]) (1)
where H(D) is the empirical entropy of the benchmark test set D, calculated as:
Figure FDA0002431352390000111
H(D|Lapi[v]) is the empirical conditional entropy of D given the v-th API of the list Lapi, calculated as:
Figure FDA0002431352390000112
4.2.6.6, let v = v + 1; if v ≤ 32437, go to step 4.2.6.2; if v > 32437, the information gain with respect to D of every API in the list Lapi has been calculated.
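The information gain of formula (1) can be computed from the four counters M11, M12, M21, and M22 as sketched below. Formulas (2) and (3) are available only as images, so the standard definitions of empirical entropy and empirical conditional entropy with base-2 logarithms are assumed; the function names are hypothetical and at least one sample is assumed to exist.

```python
from math import log2


def entropy(counts):
    """Empirical entropy (in bits) of a class-count list."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(m11, m12, m21, m22):
    """IG = H(D) - H(D | API) from the 2 x 2 contingency counts.

    m11: API present, y = 1    m12: API present, y = 0
    m21: API absent,  y = 1    m22: API absent,  y = 0
    """
    n = m11 + m12 + m21 + m22
    h_d = entropy([m11 + m21, m12 + m22])
    h_cond = ((m11 + m12) / n) * entropy([m11, m12]) \
        + ((m21 + m22) / n) * entropy([m21, m22])
    return h_d - h_cond


def top_k_features(gains, k=256):
    """Indices of the k largest gains, in descending order of gain."""
    order = sorted(range(len(gains)), key=lambda v: gains[v], reverse=True)
    return order[:k]
```

An API that separates the two classes perfectly in a balanced set reaches the maximum gain of 1 bit, while an API that occurs equally often in both classes has a gain of 0 and falls to the bottom of the ranking.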
8. The Android malware detection method based on frequency fingerprint extraction as claimed in claim 1, wherein the penalty coefficient C in the seventh step is 100.
9. The Android malware detection method based on frequency fingerprint extraction as claimed in claim 1, wherein the method for optimizing α in step 7.2.2 is as follows:
7.2.2.1, using the constraint of formula (6), convert formula (5) into a univariate quadratic function g(α(r)) of α(r); differentiate g(α(r)) and set the derivative equal to 0 to obtain α(r);
7.2.2.2, solve for α(s) using the constraint of formula (6);
7.2.2.3, update α(r) and α(s) to obtain the optimized α, named α*.
10. The Android malware detection method based on frequency fingerprint extraction as claimed in claim 1, wherein the method for optimizing β in step 7.2.3 is as follows:
7.2.3.1, take the partial derivative of formula (5) with respect to β, set it equal to 0, and find the solution that satisfies the constraints of formulas (8) and (9); the optimized results for βperm, βapi, βsmali, and βarm are named, respectively,
Figure FDA0002431352390000113
Figure FDA0002431352390000114
7.2.3.2, concatenate [Figure FDA0002431352390000115] into the optimized β, named β*.
CN202010237052.6A 2020-03-30 2020-03-30 An Android malware detection method based on frequency fingerprint extraction Active CN111460452B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010237052.6A CN111460452B (en) 2020-03-30 2020-03-30 An Android malware detection method based on frequency fingerprint extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010237052.6A CN111460452B (en) 2020-03-30 2020-03-30 An Android malware detection method based on frequency fingerprint extraction

Publications (2)

Publication Number Publication Date
CN111460452A true CN111460452A (en) 2020-07-28
CN111460452B CN111460452B (en) 2022-09-09

Family

ID=71683415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010237052.6A Active CN111460452B (en) 2020-03-30 2020-03-30 An Android malware detection method based on frequency fingerprint extraction

Country Status (1)

Country Link
CN (1) CN111460452B (en)

Also Published As

Publication number Publication date
CN111460452B (en) 2022-09-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant