CN111460452A

CN111460452A - Android malicious software detection method based on frequency fingerprint extraction

Info

Publication number: CN111460452A
Application number: CN202010237052.6A
Authority: CN
Inventors: 吴庆; 刘波; 洪学恕; 马行空; 胡乃天; 陆潼; 刘鹏
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2020-03-30
Filing date: 2020-03-30
Publication date: 2020-07-28
Anticipated expiration: 2040-03-30
Also published as: CN111460452B

Abstract

The invention discloses an Android malware detection method based on frequency fingerprint extraction, with the aim of detecting malware accurately. The technical solution is to build an Android malware detection system based on frequency fingerprint extraction, consisting of a sample preprocessing module, a frequency fingerprint generation module and a detection module; to collect malicious and benign software as samples and construct a benchmark test set D; to decompress the samples in D to obtain the AndroidManifest.xml, classes.dex and so library files; to extract permission, API, smali opcode and arm opcode features and count whether each feature appears and how often; and to form four feature vectors of different types and concatenate them end to end into a frequency fingerprint. The detection module is trained and optimized on the frequency fingerprints of the samples in D to become a classifier, which then examines a sample under test and outputs whether it is malware. The invention effectively integrates information from every component of an Android application, and detection is both accurate and fast.

Description

An Android malware detection method based on frequency fingerprint extraction

Technical Field

The invention relates to the field of Android malware detection, and in particular to a method for detecting Android malware using an extracted frequency fingerprint.

Background

In recent years, with the continuing development and spread of Internet and mobile communication technology, mobile terminals, chiefly smartphones, have brought great convenience to daily life and have become indispensable communication tools. Among mobile operating systems, Android is popular with users for its openness, its wealth of third-party applications, its friendly interface and its good user experience, and it holds a large share of the global market for smart mobile devices. The number of Android applications has grown rapidly at the same time: as of February 2020, Google Play hosted 2.86 million applications, and the number is still growing.

Besides Google Play, the official Android application market, there are many third-party application markets. These markets are numerous and uneven in quality, lack unified and effective management, and have weak release review mechanisms, so bad actors can publish Android applications at will. Malicious applications inevitably find their way into such markets and, once downloaded, pose a serious threat to users' information security. Worse, the stock of software in these markets is huge and growing fast; with inadequate security mechanisms and detection methods, malware persists there for long periods and is hard to find and remove, which seriously threatens the healthy development of the Android ecosystem.

Typical Android malware detection techniques today fall into two types: static detection and dynamic detection. Static detection mainly relies on disassembly and decompilation, or on control-flow and data-flow analysis of the smali intermediate code, to detect malicious code. Its advantage is high code coverage; its disadvantage is that it cannot handle code obfuscation, encryption or dynamically loaded malicious code. Dynamic analysis monitors an application's variables at run time, traces its behavior paths and collects the logs it generates. Its advantage is that it sidesteps the obfuscation and encryption problems that static methods face; its disadvantages are low code coverage and the fact that some malicious programs detect an emulator and then crash or change their behavior. In practice, when massive numbers of malicious samples must be checked, most methods prefer static detection for its faster speed and higher code coverage.

M. Ganesh et al. extracted the permissions listed in an application's Manifest as features for detecting malicious applications: they arranged the permissions into a 12×12 array and fed it to a convolutional neural network for training, which can then determine whether an application is malicious. M. Amin et al. extracted opcode sequences from bytecode files as features: they assembled the opcodes of an application into one long sequence, treated it as ordered text, and trained a BiLSTM neural network to assess maliciousness. R. Nix et al. studied detection based on Android API (Application Programming Interface) call sequences: they encoded each API call as a bit vector, split and recombined the vectors into an n×m matrix used as input to a convolutional neural network, and used the trained classifier to determine whether an application is malicious.

These methods have achieved some success in Android malware detection, but two main problems remain. First, feature extraction makes too little use of correlation across multiple kinds of software features: most existing methods extract a single type of feature to characterize application behavior rather than analyzing several types together, so the resulting representation is narrow and detection accuracy suffers. Second, the neural network models are complex and require extensive parameter tuning, so training is inefficient and obtaining a well-trained model takes a long time.

Therefore, given the large volume of Android malware now appearing, how to detect it accurately and efficiently is an important problem.

Summary of the Invention

The technical problem to be solved by the invention is to generate, for an Android application, a frequency fingerprint that uniquely identifies it, and to train and optimize a multiple-kernel support vector machine model on such fingerprints so as to detect Android malware accurately while markedly increasing detection speed.

The technical solution of the invention is to build an Android malware detection system based on frequency fingerprint extraction, consisting of a sample preprocessing module, a frequency fingerprint generation module and a detection module, and to collect malicious and benign Android software as samples to construct a benchmark test set. The samples in the set are decompressed to obtain the AndroidManifest.xml, classes.dex and so library files; permission, API, smali opcode and arm opcode features are extracted; and for these four kinds of features, whether each appears and how often is counted. This yields four feature vectors of different types, which are concatenated end to end into one long vector that serves as the frequency fingerprint of the Android application. The frequency fingerprints of the many samples in the benchmark test set are used to train and optimize the detection module (a multiple-kernel support vector machine model) into a classifier, which examines a sample under test and outputs whether it is malware.

The invention comprises the following steps:

Step 1: build the Android malware detection system based on frequency fingerprint extraction. The system is installed on the server of Google's official Android application market or of a third-party market, and consists of a sample preprocessing module, a frequency fingerprint generation module and a detection module.

The sample preprocessing module is connected to the frequency fingerprint generation module. It receives samples from the benchmark test set built by developers and samples under test submitted by ordinary users, preprocesses them to produce three kinds of files (AndroidManifest.xml, smali files and arm instruction files), and outputs these to the frequency fingerprint generation module.

The frequency fingerprint generation module is connected to the sample preprocessing module and the detection module. It receives the AndroidManifest.xml, smali and arm instruction files from the sample preprocessing module, performs feature screening and frequency fingerprint calculation (a frequency fingerprint being a vector that can serve as the identity of an Android application), and outputs the resulting frequency fingerprint to the detection module. The frequency fingerprint generation module consists of a feature screening module and a frequency fingerprint calculation module. The feature screening module is connected to the sample preprocessing module and the frequency fingerprint calculation module; it receives the AndroidManifest.xml, smali and arm instruction files from the sample preprocessing module, screens features from these three kinds of files to obtain the permission, API, smali opcode and arm opcode features, and sends these features to the frequency fingerprint calculation module. The frequency fingerprint calculation module is connected to the sample preprocessing module, the feature screening module and the detection module; it receives the permission, API, smali opcode and arm opcode features from the feature screening module and the AndroidManifest.xml, smali and arm instruction files from the sample preprocessing module, computes the frequency fingerprint, and sends it to the detection module.

The detection module is connected to the frequency fingerprint generation module. It is a multiple-kernel support vector machine model. It receives the frequency fingerprints of the benchmark test set D and the frequency fingerprint of the software under test from the frequency fingerprint generation module, is trained and optimized on the fingerprints of D to become a classifier suited to examining the software under test, and then classifies the software under test from its frequency fingerprint to decide whether it is malware.

Step 2: construct the benchmark test set D, as follows:

Step 2.1: obtain N₁ Android malware applications from the open-source Drebin, Genome and AMD datasets as malicious samples, where N₁ is a positive integer and N₁ > 1000.

Step 2.2: obtain benign software by crawling the Google Play and APKPure application stores, and filter it using local antivirus software and the VirusTotal online scanning site to form N₂ benign samples, where N₂ is a positive integer and N₂ > 1000.

Step 2.3: label the malicious and benign samples to form the benchmark test set D. N is the total number of samples in D, N = N₁ + N₂. Let x⁽ⁱ⁾ be the i-th sample in D and y⁽ⁱ⁾ the label of x⁽ⁱ⁾; y⁽ⁱ⁾ = 1 means x⁽ⁱ⁾ is a malicious sample and y⁽ⁱ⁾ = -1 means x⁽ⁱ⁾ is a benign sample, 1 ≤ i ≤ N.

Step 2.4: store D in storage that both the preprocessing module and the frequency fingerprint generation module can read.

Step 3: the sample preprocessing module preprocesses the N samples in D to obtain N AndroidManifest.xml files, N smali files and N arm instruction files.

Step 3.1: let the variable i = 1.

Step 3.2: take the i-th sample x⁽ⁱ⁾ from D.

Step 3.3: preprocess x⁽ⁱ⁾ with the sample preprocessing method to obtain the AndroidManifest.xml file, smali file and arm instruction file of x⁽ⁱ⁾, as follows:

Step 3.3.1: decompress x⁽ⁱ⁾ with a decompression tool (for example Gzip or 7zip) and extract the AndroidManifest.xml, classes.dex and so runtime library files from x⁽ⁱ⁾.

Step 3.3.2: use AXMLPrinter2, a decompiler dedicated to AndroidManifest.xml files (download: https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/android4me/AXMLPrinter2.jar, version 2.0 or later), to decompile the AndroidManifest.xml file from binary form to text form.

Step 3.3.3: use the dex decompiler baksmali (https://bitbucket.org/JesusFreke/smali/downloads/baksmali-2.4.0.jar, version 2.4.0 or later) to decompile classes.dex into smali files. If several smali files are produced, merge them into a single smali file and go to step 3.3.4; if only one smali file is produced, go directly to step 3.3.4.

Step 3.3.4: use the arm disassembly toolchain gcc-arm-none-eabi (https://developer.arm.com/-/media/Files/downloads/gnu-rm/9-2019q4/gcc-arm-none-eabi-9-2019-q4-major-x86_64-linux.tar.bz2, version 9-2019-q4-major or later) to disassemble the so runtime library files into a text-form arm instruction file. If several arm instruction files are produced, merge them into a single arm instruction file and go to step 3.4; if no arm instruction file is produced, create an empty one and go to step 3.4.

Step 3.4: let i = i + 1. If i ≤ N, go to step 3.2. If i > N, the N samples have yielded N AndroidManifest.xml files, N smali files and N arm instruction files; send these files to the feature screening module and go to Step 4.
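Because an APK is a ZIP archive, the unpacking in step 3.3.1 can be sketched with Python's standard `zipfile` module. This is a minimal illustration, not the tooling the method prescribes: the function name `extract_components`, the returned dictionary layout and the demo archive are assumptions, and the decompilation of steps 3.3.2 to 3.3.4 (AXMLPrinter2, baksmali, the arm disassembler) is not shown.

```python
import io
import zipfile


def extract_components(apk_file):
    """Return the raw bytes of the files named in step 3.3.1.

    apk_file may be a path or a binary file-like object.
    """
    components = {"manifest": None, "dex": None, "so": {}}
    with zipfile.ZipFile(apk_file) as apk:
        for name in apk.namelist():
            if name == "AndroidManifest.xml":
                components["manifest"] = apk.read(name)
            elif name == "classes.dex":
                components["dex"] = apk.read(name)
            elif name.startswith("lib/") and name.endswith(".so"):
                # Native libraries live under lib/<abi>/ in an APK.
                components["so"][name] = apk.read(name)
    return components


def make_demo_apk():
    """Build a tiny in-memory archive with APK-like member names (demo data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("AndroidManifest.xml", b"manifest-bytes")
        archive.writestr("classes.dex", b"dex-bytes")
        archive.writestr("lib/armeabi-v7a/libdemo.so", b"so-bytes")
        archive.writestr("res/layout/main.xml", b"ignored")
    buf.seek(0)
    return buf


demo = extract_components(make_demo_apk())
print(sorted(demo["so"]))
```

A sample without native libraries simply yields an empty `so` dictionary, which corresponds to the empty arm instruction file of step 3.3.4.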

Step 4: the feature screening module screens features from the N AndroidManifest.xml files, N smali files and N arm instruction files received from the sample preprocessing module for the N samples of D, obtaining the permission features, API features, smali opcode features and arm opcode features suited to classifying D.

Step 4.1: select the 167 android.manifest.permission system permissions defined in the Android developer documentation (https://developer.android.com/reference/android/Manifest.permission) and use them as features, called permission features.

Step 4.2: select 256 APIs from those in the pscout list (https://security.csl.toronto.edu/pscout/?mdocs-file=67), as follows:

Step 4.2.1: create a list L_api and add all 32437 APIs of the pscout list to it; the v-th API is denoted L_api[v], 1 ≤ v ≤ 32437.

Step 4.2.2: create a two-dimensional array Z_api with 32437 rows and N columns. The element Z_api[v][i] in row v and column i takes the value 1 or 0: 1 means the v-th API of L_api appears in the i-th sample of D, and 0 means it does not.

Step 4.2.3: initialize all elements of Z_api to 0 and initialize the variable i = 1.

Step 4.2.4: scan the smali file of the i-th sample of D line by line to find the APIs of L_api that appear in the i-th sample, and assign the elements of column i of Z_api. Denote the u-th line of the smali file by str[u] and the total number of lines by U, 1 ≤ u ≤ U. The method is:

Step 4.2.4.1: initialize u = 1.

Step 4.2.4.2: if str[u] is an API string, go to step 4.2.4.2.1; otherwise go to step 4.2.4.3.

Step 4.2.4.2.1: initialize the variable v = 1.

Step 4.2.4.2.2: if str[u] contains L_api[v] as a substring, set Z_api[v][i] = 1 and go to step 4.2.4.3; otherwise go to step 4.2.4.2.3.

Step 4.2.4.2.3: let v = v + 1. If v ≤ 32437, go to step 4.2.4.2.2; if v > 32437, go to step 4.2.4.3.

Step 4.2.4.3: let u = u + 1. If u ≤ U, go to step 4.2.4.2; if u > U, the smali file of the i-th sample has been fully scanned, so go to step 4.2.5.

Step 4.2.5: let i = i + 1. If i ≤ N, go to step 4.2.4; if i > N, the assignment of the two-dimensional array Z_api is complete, so go to step 4.2.6.
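The scan of steps 4.2.2 to 4.2.5 can be sketched as follows. The text does not say how a line is recognized as "an API string"; this sketch assumes a line counts as one when it contains `invoke`, the test that step 6.4.4 uses later. The function name, the 0-based indices and the two-API demo list are illustrative assumptions.

```python
def build_presence_matrix(api_list, smali_texts):
    """Return Z with Z[v][i] = 1 if api_list[v] occurs in sample i, else 0."""
    z = [[0] * len(smali_texts) for _ in api_list]
    for i, text in enumerate(smali_texts):
        for line in text.splitlines():
            # Assumption: only invoke instructions are treated as API strings.
            if "invoke" not in line:
                continue
            for v, api in enumerate(api_list):
                if api in line:
                    z[v][i] = 1
                    break  # step 4.2.4.2.2: stop at the first matching API
    return z


demo_apis = [
    "Landroid/telephony/SmsManager;->sendTextMessage",
    "Landroid/location/LocationManager;->getLastKnownLocation",
]
demo_smali = [
    "invoke-virtual {v0}, Landroid/telephony/SmsManager;->sendTextMessage\nreturn-void",
    "const/4 v0, 0x0\nreturn-void",
]
z_demo = build_presence_matrix(demo_apis, demo_smali)
print(z_demo)
```

Each row of the result corresponds to one API and each column to one sample, matching the layout of Z_api.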

Step 4.2.6: compute the information gain IG of each API in the list L_api with respect to the benchmark test set D. The information gain of the v-th API with respect to D is written IG(D|L_api[v]).

Step 4.2.6.1: let v = 1.

Step 4.2.6.2: let i = 1. Let the first variable M₁₁ = 0, the second variable M₁₂ = 0, the third variable M₂₁ = 0 and the fourth variable M₂₂ = 0.

Step 4.2.6.3: if Z_api[v][i] equals 1 and y⁽ⁱ⁾ equals 1, let M₁₁ = M₁₁ + 1; if Z_api[v][i] equals 1 and y⁽ⁱ⁾ equals -1, let M₁₂ = M₁₂ + 1; if Z_api[v][i] equals 0 and y⁽ⁱ⁾ equals 1, let M₂₁ = M₂₁ + 1; if Z_api[v][i] equals 0 and y⁽ⁱ⁾ equals -1, let M₂₂ = M₂₂ + 1.

Step 4.2.6.4: let i = i + 1. If i ≤ N, go to step 4.2.6.3; if i > N, go to step 4.2.6.5.

Step 4.2.6.5: compute IG(D|L_api[v]) as:

IG(D|L_api[v]) = H(D) - H(D|L_api[v])    (1)

where H(D) is the empirical entropy of the benchmark test set D,

and H(D|L_api[v]) is the empirical conditional entropy of D given the v-th API of the list L_api.

Step 4.2.6.6: let v = v + 1. If v ≤ 32437, go to step 4.2.6.2. If v > 32437, the information gain of every API in L_api with respect to D has been computed; sort the APIs in L_api by IG(D|L_api[v]) in descending order, take the first 256 as the API features, and go to step 4.3.
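The formulas for H(D) and H(D|L_api[v]) do not survive in this text. The sketch below therefore assumes the standard definitions of empirical entropy and empirical conditional entropy with base-2 logarithms, written in terms of the four counters of step 4.2.6.3: H(D) is the entropy of the class counts (M₁₁+M₂₁, M₁₂+M₂₂), and H(D|L_api[v]) is the weighted sum of the entropies of (M₁₁, M₁₂) and (M₂₁, M₂₂). The function names and the 1/-1 label convention from step 2.3 are the only other assumptions.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of class counts; empty classes contribute 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(m11, m12, m21, m22):
    """IG = H(D) - H(D|feature), from the four counters of step 4.2.6.3."""
    n = m11 + m12 + m21 + m22
    if n == 0:
        return 0.0
    h_d = entropy([m11 + m21, m12 + m22])           # malicious vs. benign
    h_cond = ((m11 + m12) / n) * entropy([m11, m12])   # feature present
    h_cond += ((m21 + m22) / n) * entropy([m21, m22])  # feature absent
    return h_d - h_cond


def top_k_features(z, labels, k):
    """Rank rows of the presence matrix z by information gain, keep the best k."""
    scored = []
    for v, row in enumerate(z):
        m11 = m12 = m21 = m22 = 0
        for present, y in zip(row, labels):
            if present == 1 and y == 1:
                m11 += 1
            elif present == 1:
                m12 += 1
            elif y == 1:
                m21 += 1
            else:
                m22 += 1
        scored.append((information_gain(m11, m12, m21, m22), v))
    scored.sort(key=lambda item: -item[0])  # descending IG, ties keep list order
    return [v for _, v in scored[:k]]


# Feature 0 separates the classes perfectly; feature 1 carries no information.
print(top_k_features([[1, 1, 0, 0], [1, 0, 1, 0]], [1, 1, -1, -1], 1))
```

With k = 256 and the 32437-row matrix Z_api, `top_k_features` would return the row indices of the API features selected in step 4.2.6.6.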

Step 4.3: the Android Dalvik virtual machine predefines smali opcodes that are 8 bits long (https://developer.android.com/reference/dalvik/bytecode/Opcodes.html). Including reserved, undefined values, there are at most 256 of them; these 256 smali opcodes are used as features, called smali opcode features.

Step 4.4: following the arm instruction quick reference card (http://infocenter.arm.com/help/topic/com.arm.doc.qrc0001mc/QRC0001_UAL.pdf), the feature screening module selects the 197 arm instruction opcodes listed there as features, called arm opcode features.

Step 4.5: send the permission features, API features, smali opcode features and arm opcode features to the frequency fingerprint calculation module.

Step 5: determine the format of the frequency fingerprint.

Arrange the 167 permission features, 256 API features, 256 smali opcode features and 197 arm opcode features each in alphabetical order to form vectors, called the permission vector, API vector, smali opcode vector and arm opcode vector of an Android application.

The permission vector of an Android application consists of 167 integers, each equal to 1 or 0. If the integer at position pa is 1, the pa-th of the 167 selected permissions is requested by the application; if it is 0, that permission is not requested. pa is an integer, 1 ≤ pa ≤ 167.

The API vector of an Android application consists of 256 decimals; the decimal at position pb gives the frequency with which the pb-th of the 256 selected APIs appears in the application. pb is an integer, 1 ≤ pb ≤ 256.

The smali opcode vector of an Android application consists of 256 decimals; the decimal at position pc gives the frequency with which the pc-th of the 256 selected smali opcodes appears in the application. pc is an integer, 1 ≤ pc ≤ 256.

The arm opcode vector of an Android application consists of 197 decimals; the decimal at position pd gives the frequency with which the pd-th of the 197 selected arm opcodes appears in the application. pd is an integer, 1 ≤ pd ≤ 197.

The four vectors are concatenated end to end into a vector of length 876, which serves as the identity of the sample and is called the frequency fingerprint. The 167 integers and 709 decimals it contains are all called elements of the frequency fingerprint.
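The layout above fixes where each segment sits in the 876-element fingerprint. A minimal sketch of the concatenation, with a length check per segment, might look like this; the names `SEGMENTS`, `segment_slices` and `concat_fingerprint` are illustrative and not taken from the source.

```python
# Segment order and lengths as stated in Step 5.
SEGMENTS = [
    ("permission", 167),
    ("api", 256),
    ("smali_opcode", 256),
    ("arm_opcode", 197),
]


def segment_slices():
    """Map each segment name to its slice of the fingerprint; also return the total length."""
    slices = {}
    start = 0
    for name, length in SEGMENTS:
        slices[name] = slice(start, start + length)
        start += length
    return slices, start


def concat_fingerprint(permission, api, smali_opcode, arm_opcode):
    """Concatenate the four vectors end to end after checking their lengths."""
    parts = [permission, api, smali_opcode, arm_opcode]
    for (name, length), part in zip(SEGMENTS, parts):
        if len(part) != length:
            raise ValueError(f"{name} vector must have {length} elements")
    return [*permission, *api, *smali_opcode, *arm_opcode]


slices, total = segment_slices()
fingerprint = concat_fingerprint([0] * 167, [0.0] * 256, [0.0] * 256, [0.0] * 197)
print(total, slices["arm_opcode"])
```

Keeping the slices alongside the fingerprint makes it possible to recover any one of the four vectors later, for example `fingerprint[slices["api"]]`.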

Step 6: the frequency fingerprint calculation module receives the permission features, API features, smali opcode features and arm opcode features from the feature screening module and the AndroidManifest.xml, smali and arm instruction files from the sample preprocessing module, and computes frequency fingerprints for the N samples in the benchmark test set D.

Step 6.1: let L_a be the permission list, whose member L_a[pa] is the name string of the pa-th of the 167 permissions in alphabetical order; let L_b be the API list, whose member L_b[pb] is the name string of the pb-th of the 256 APIs in alphabetical order; let L_c be the smali opcode list, whose member L_c[pc] is the name string of the pc-th of the 256 smali opcodes in alphabetical order; let L_d be the arm opcode list, whose member L_d[pd] is the name string of the pd-th of the 197 arm opcodes in alphabetical order. Let the variable i = 1.

Step 6.2: take the i-th sample x⁽ⁱ⁾ from D and generate a frequency fingerprint for x⁽ⁱ⁾. The fingerprint contains 876 elements, each initialized to 0. It consists, in order, of the permission vector of x⁽ⁱ⁾, whose elements are indexed by pa; the API vector, whose elements are indexed by pb; the smali opcode vector, whose elements are indexed by pc; and the arm opcode vector, whose elements are indexed by pd.

Step 6.3: extract the permissions requested by x⁽ⁱ⁾ with the permission extraction method to obtain the permission vector of x⁽ⁱ⁾, as follows:

Step 6.3.1: scan the AndroidManifest.xml file of x⁽ⁱ⁾ line by line. Denote the qa-th line of the file by stra[qa] and the total number of lines by numa.

Step 6.3.2: let qa = 1.

Step 6.3.3: if stra[qa] contains the substring "uses-permission", let pa = 1 and go to step 6.3.4; otherwise go to step 6.3.6.

Step 6.3.4: if stra[qa] contains L_a[pa] as a substring, x⁽ⁱ⁾ has requested the permission L_a[pa]; set the pa-th element of the permission vector of x⁽ⁱ⁾ to 1 and go to step 6.3.6. If stra[qa] does not contain L_a[pa] as a substring, go to step 6.3.5.

Step 6.3.5: let pa = pa + 1. If pa ≤ 167, go to step 6.3.4; if pa > 167, one pass over L_a is complete, so go to step 6.3.6.

Step 6.3.6: let qa = qa + 1. If qa ≤ numa, go to step 6.3.3; if qa > numa, the AndroidManifest.xml file of x⁽ⁱ⁾ has been fully scanned and the permission vector of x⁽ⁱ⁾ is complete, so go to step 6.4.
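Steps 6.3.1 to 6.3.6 amount to a nested scan: for each manifest line that mentions `uses-permission`, find the first permission name it contains and set the matching element to 1. A minimal sketch, assuming the decompiled manifest keeps each `uses-permission` element and its name on one line (the demo text and the three-name list are made up for illustration, and indices are 0-based):

```python
def permission_vector(manifest_text, permission_names):
    """Return a 0/1 vector marking which listed permissions the manifest requests."""
    vec = [0] * len(permission_names)
    for line in manifest_text.splitlines():
        if "uses-permission" not in line:  # step 6.3.3
            continue
        for pa, name in enumerate(permission_names):
            if name in line:               # step 6.3.4
                vec[pa] = 1
                break                      # move on to the next line
    return vec


demo_permissions = sorted([
    "android.permission.SEND_SMS",
    "android.permission.INTERNET",
    "android.permission.READ_SMS",
])
demo_manifest = "\n".join([
    "<manifest>",
    '<uses-permission android:name="android.permission.INTERNET"/>',
    '<uses-permission android:name="android.permission.SEND_SMS"/>',
    "</manifest>",
])
demo_vector = permission_vector(demo_manifest, demo_permissions)
print(demo_vector)
```

In the method itself `permission_names` would be the 167-entry list L_a, so the result has 167 elements.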

Step 6.4: count the APIs used by x⁽ⁱ⁾ with the API statistics method to obtain the API vector of x⁽ⁱ⁾, as follows:

Step 6.4.1: scan the smali file of x⁽ⁱ⁾ line by line. Denote the qb-th line of the smali file by strb[qb] and the total number of lines by numb.

Step 6.4.2: let qb = 1. Use the variable inv to hold the total number of API calls in the smali file, and let inv = 1.

Step 6.4.3: let the variable pb = 1.

Step 6.4.4: if strb[qb] contains the substring "invoke", let inv = inv + 1 and go to step 6.4.5; otherwise go to step 6.4.7.

Step 6.4.5: if strb[qb] contains L_b[pb] as a substring, x⁽ⁱ⁾ has called the API named L_b[pb]; increase the pb-th element of the API vector of x⁽ⁱ⁾ by 1 and go to step 6.4.7. If strb[qb] does not contain L_b[pb] as a substring, go to step 6.4.6.

Step 6.4.6: let pb = pb + 1. If pb ≤ 256, go to step 6.4.5; if pb > 256, one pass over L_b is complete, so go to step 6.4.7.

Step 6.4.7: let qb = qb + 1. If qb ≤ numb, go to step 6.4.3; if qb > numb, the smali file of x⁽ⁱ⁾ has been fully scanned, so go to step 6.4.8.

Step 6.4.8: let pb = 1.

6.4.9步，令

Step 6.4.9, let

6.4.10步，令pb＝pb+1。若pb≤256，转6.4.9步；若pb＞256，说明

计算完成，转6.5步。Step 6.4.10, let pb=pb+1. If pb≤256, go to step 6.4.9; if pb>256, explain

After the calculation is completed, go to step 6.5.
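Steps 6.4.1 to 6.4.10 can be sketched in Python as follows. This is a minimal sketch, not the patent's implementation: the name `api_frequency_vector` is hypothetical, and two things are assumed because the corresponding formulas do not appear in this text, namely that step 6.4.5 increments a per-API count and that step 6.4.9 divides each count by inv to obtain a frequency.

```python
def api_frequency_vector(smali_lines, api_list):
    """Frequency of each API in api_list (L_b) among the 'invoke' lines of a smali file."""
    counts = [0] * len(api_list)
    inv = 1                                    # the text initialises inv to 1
    for line in smali_lines:                   # qb loop
        if "invoke" not in line:               # step 6.4.4: only invoke lines count
            continue
        inv += 1
        for pb, api in enumerate(api_list):    # pb loop over L_b
            if api in line:
                counts[pb] += 1                # assumed update for step 6.4.5
                break                          # after a match, move to the next line
    return [c / inv for c in counts]           # assumed normalisation for step 6.4.9
```

Because inv starts at 1, the frequencies of a file always sum to strictly less than 1, and a file without any invoke line yields an all-zero vector.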

Step 6.5: use the smali opcode statistics method to count the smali opcodes used by x⁽ⁱ⁾ and obtain the smali opcode vector of x⁽ⁱ⁾. The method is:

Step 6.5.1: scan the smali file corresponding to x⁽ⁱ⁾ line by line. Denote the string on line qc of the smali file as strc[qc] and the total number of lines in the smali file as numc.

Step 6.5.2: let qc = 1. The variable ops represents the total number of smali opcodes in the smali file; let ops = 1.

Step 6.5.3: let pc = 1.

Step 6.5.4: if strc[qc] contains a substring whose content is L_c[pc], update the pc-th element of the smali opcode vector, let ops = ops + 1, and go to step 6.5.6. If strc[qc] does not contain a substring whose content is L_c[pc], go to step 6.5.5.

Step 6.5.5: let pc = pc + 1. If pc ≤ 256, go to step 6.5.4; if pc > 256, one pass over L_c is complete, so go to step 6.5.6.

Step 6.5.6: let qc = qc + 1. If qc ≤ numc, go to step 6.5.3; if qc > numc, the smali file corresponding to x⁽ⁱ⁾ has been fully scanned, so go to step 6.5.7.

Step 6.5.7: let pc = 1.

Step 6.5.8: compute the final value of the pc-th element of the smali opcode vector.

Step 6.5.9: let pc = pc + 1. If pc ≤ 256, go to step 6.5.8; if pc > 256, the smali opcode vector of x⁽ⁱ⁾ is complete; go to step 6.6.

Step 6.6: use the arm opcode statistics method to count the arm opcodes used by x⁽ⁱ⁾ and obtain the arm opcode vector of x⁽ⁱ⁾. The method is:

Step 6.6.1: scan the arm file corresponding to x⁽ⁱ⁾ line by line. Denote the string on line qd of the arm file as strd[qd] and the total number of lines in the arm file as numd.

Step 6.6.2: let qd = 1. The variable opa represents the total number of arm opcodes used in the arm file; let opa = 1. If qd ≤ numd, go to step 6.6.3; if qd > numd, the arm file corresponding to x⁽ⁱ⁾ is empty, so go to step 6.7.

Step 6.6.3: let pd = 1.

Step 6.6.4: if strd[qd] contains the ">" character, strd[qd] contains an arm instruction; let opa = opa + 1 and go to step 6.6.5. If strd[qd] does not contain the ">" character, go to step 6.6.7.

Step 6.6.5: if strd[qd] contains a substring whose content is L_d[pd], update the pd-th element of the arm opcode vector and go to step 6.6.7. If strd[qd] does not contain a substring whose content is L_d[pd], go to step 6.6.6.

Step 6.6.6: let pd = pd + 1. If pd ≤ 197, go to step 6.6.5; if pd > 197, one pass over L_d is complete, so go to step 6.6.7.

Step 6.6.7: let qd = qd + 1. If qd ≤ numd, go to step 6.6.3; if qd > numd, the arm file corresponding to x⁽ⁱ⁾ has been fully scanned, so go to step 6.6.8.

Step 6.6.8: let pd = 1.

Step 6.6.9: compute the final value of the pd-th element of the arm opcode vector.

Step 6.6.10: let pd = pd + 1. If pd ≤ 197, go to step 6.6.9; if pd > 197, the arm opcode vector of x⁽ⁱ⁾ is complete; go to step 6.7.
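The arm opcode scan differs from the API scan in two respects: an empty arm file leaves the vector at zero, and a line is treated as an instruction only when it contains ">". A minimal Python sketch follows; the name `arm_opcode_frequency_vector` is hypothetical, the ">" test is taken from step 6.6.4 and depends on the disassembler's output format, and the count-then-divide-by-opa normalisation is an assumption because the formulas of steps 6.6.5 and 6.6.9 do not appear in this text.

```python
def arm_opcode_frequency_vector(arm_lines, opcode_list):
    """Frequency of each opcode in opcode_list (L_d) among instruction lines of an arm file."""
    if not arm_lines:                            # step 6.6.2: empty file, vector stays all zero
        return [0.0] * len(opcode_list)
    counts = [0] * len(opcode_list)
    opa = 1                                      # the text initialises opa to 1
    for line in arm_lines:                       # qd loop
        if ">" not in line:                      # step 6.6.4: instruction marker from the text
            continue
        opa += 1
        for pd, opcode in enumerate(opcode_list):    # pd loop over L_d
            if opcode in line:
                counts[pd] += 1                  # assumed update for step 6.6.5
                break
    return [c / opa for c in counts]             # assumed normalisation for step 6.6.9
```

Plain substring matching means a short opcode such as `b` would also match lines containing `bl`, so the order of entries in `opcode_list` affects which element is updated.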

Step 6.7: let i = i + 1. If i ≤ N, go to step 6.2; if i > N, frequency fingerprints have been generated for all N samples in D; send the frequency fingerprints to the detection module and go to step 7.

Step 7: the detection module receives the frequency fingerprints from the frequency fingerprint generation module and trains a multi-kernel support vector machine model, which becomes a classifier suitable for classifying the software to be detected. The multi-kernel support vector machine model is a classification model based on the support vector machine that uses several kernel functions to map feature-space vectors from a low-dimensional space to a high-dimensional space, which enhances its classification ability. For the benchmark test set D, the feature space is the set of frequency fingerprints of the N samples in D. Let k_perm, k_api, k_smali and k_arm denote the kernel functions used for the permission vector, API vector, smali opcode vector and arm opcode vector within a frequency fingerprint. β is the weight vector (β_perm, β_api, β_smali, β_arm), whose elements are the weights of k_perm, k_api, k_smali and k_arm respectively. Let T be the set {perm, api, smali, arm}; its members are the subscripts of the four kernel functions and serve as the index set in formula (4). The multi-kernel support vector machine model Y is expressed by formula (4).

α⁽ⁱ⁾ is a Lagrange multiplier, and {α⁽¹⁾, α⁽²⁾, ..., α⁽ⁱ⁾, ..., α^(N)} constitute the vector α. sgn(A) is the sign function of A: sgn(A) = 1 when A > 0, sgn(A) = 0 when A = 0, and sgn(A) = -1 when A < 0. α and β are obtained by solving formula (5).

The constraints of formula (5) are formulas (6) to (9):

0 ≤ α⁽ⁱ⁾ ≤ C (7)

∑_{t∈T} β_t = 1 (8)

β_t ≥ 0, t ∈ T (9)

where C is the penalty coefficient, C ≥ 0, which sets the size of the penalty for misclassification.

b is a scalar; once α and β have been found, it is obtained from formula (10), in which the sample point used is a support vector.
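Formulas (4) to (6) do not appear in this text. Under the standard multiple-kernel support vector machine formulation, which the surrounding definitions of α, β, T, sgn and b appear to follow, they would read roughly as below. This is an assumption rather than the patent's own wording, and the symbol x_t^{(i)} for the type-t sub-vector of the fingerprint of sample i is notation introduced here for illustration.

```latex
% Assumed notation: x_t^{(i)} is the type-t sub-vector (t in T) of the
% frequency fingerprint of sample i; x_t is the same sub-vector of the
% sample being classified.
\begin{align}
Y(x) &= \operatorname{sgn}\Big(\sum_{i=1}^{N} \alpha^{(i)} y^{(i)}
        \sum_{t \in T} \beta_t\, k_t\big(x_t^{(i)}, x_t\big) + b\Big) \tag{4}\\
L(\alpha, \beta) &= \sum_{i=1}^{N} \alpha^{(i)}
        - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
          \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)}
          \sum_{t \in T} \beta_t\, k_t\big(x_t^{(i)}, x_t^{(j)}\big) \tag{5}\\
\sum_{i=1}^{N} \alpha^{(i)} y^{(i)} &= 0 \tag{6}
\end{align}
```

Under this reading, constraint (6) is what lets step 7.2.2.1 express α^(s) in terms of α^(r) and reduce formula (5) to a function of a single variable.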

The multi-kernel support vector machine model is trained as follows:

Step 7.1: compute the kernel matrices from the frequency fingerprints of the samples in D received from the frequency fingerprint generation module. Let K_t, t ∈ T, denote a kernel matrix, so that there are four kernel matrices K_perm, K_api, K_smali and K_arm. K_t has N rows and N columns, and its element in row i, column j is the value of the kernel function k_t for samples i and j. A degree-3 polynomial kernel function is selected, and K_t is computed as follows:

Step 7.1.1: let i = 1.

Step 7.1.2: let j = 1.

Step 7.1.3: compute the element in row i, column j of K_t from the inner product of the type-t vectors of x^(i) and x^(j).

Step 7.1.4: if j ≤ N, let j = j + 1 and go to step 7.1.3; if j > N, go to step 7.1.5.

Step 7.1.5: if i ≤ N, let i = i + 1 and go to step 7.1.2; if i > N, K_t has been computed, so go to step 7.2.
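The double loop of step 7.1 can be sketched in Python as follows. The text specifies a degree-3 polynomial kernel based on an inner product but its exact formula does not appear here, so the form (⟨u, v⟩ + c)³ with c = 1 is an assumption, and the names `poly3_kernel` and `kernel_matrix` are hypothetical.

```python
def poly3_kernel(u, v, c=1.0):
    """Degree-3 polynomial kernel; the offset c is an assumed parameter."""
    dot = sum(a * b for a, b in zip(u, v))     # inner product of the two vectors
    return (dot + c) ** 3

def kernel_matrix(vectors, kernel=poly3_kernel):
    """N x N matrix K_t for one feature type t, given the N type-t vectors."""
    n = len(vectors)
    return [[kernel(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
```

Calling `kernel_matrix` once per feature type (permission, API, smali opcode, arm opcode) gives the four matrices K_perm, K_api, K_smali and K_arm. Since the kernel is symmetric, only the upper triangle needs to be computed in practice.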

Step 7.2: optimize the parameters α and β as follows:

7.2.1 Initialize every element of the vector α to 0 and every element of the vector β to 1/4.

7.2.2 Using formula (5), and taking the superscripts r and s in ascending order, select a pair α^(r), α^(s) to optimize α while holding (α⁽¹⁾, α⁽²⁾, ..., α^(r-1), α^(r+1), ..., α^(s-1), α^(s+1), ..., α^(N)) and the vector β fixed. The optimization method is:

7.2.2.1 Use the constraint of formula (6) to turn formula (5) into a quadratic function g(α^(r)) of the single variable α^(r); differentiate g(α^(r)), set the derivative equal to 0, and solve for α^(r).

7.2.2.2 Use the constraint of formula (6) to solve for α^(s).

7.2.2.3 Update α^(r) and α^(s) to obtain the optimized α, named α^*.

7.2.3 Holding α^* fixed, optimize β as follows:

7.2.3.1 Compute the partial derivative of formula (5) with respect to β, set it equal to 0, and find the solution that satisfies the constraints of formulas (8) and (9). This yields the optimized values of β_perm, β_api, β_smali and β_arm.

7.2.3.2 Concatenate the four optimized values into the optimized β, named β^*.

7.2.4 Check whether α and β satisfy the optimization termination conditions of formulas (12) to (14):

L(α^*, β^*) - L(α, β) ≤ ε (14)

When formula (14) is satisfied, optimizing α and β changes the value of the function in formula (5) by less than the threshold ε, 0 < ε ≤ 0.1, so the optimized α and β meet the requirement and training of the multi-kernel support vector machine model is complete; go to step 7.3. Otherwise, go to step 7.2.2.

Step 7.3: compute the value of b from formula (10). Training and optimization of the multi-kernel support vector machine model defined by formula (4) is then complete, and the model becomes the classifier.

Step 8: use the Android malware detection system based on frequency fingerprint extraction to examine the software that the official Google server or a third-party Android application market server receives from users, and determine whether it is malware. The method is:

Step 8.1: the sample preprocessing module preprocesses the software to be detected. Take the software to be detected as sample x^(a), preprocess x^(a) with the sample preprocessing method of step 3.3 to obtain its AndroidManifest.xml file, smali file and arm instruction file, and output them to the frequency fingerprint calculation module.

Step 8.2: the frequency fingerprint calculation module computes the frequency fingerprint of x^(a). The method is:

Step 8.2.1: use the permission extraction method of step 6.3 to extract the permissions requested by x^(a) and obtain the permission vector of x^(a).

Step 8.2.2: use the API statistics method of step 6.4 to count the APIs used by x^(a) and obtain the API vector of x^(a).

Step 8.2.3: use the smali opcode statistics method of step 6.5 to count the smali opcodes used by x^(a) and obtain the smali opcode vector of x^(a).

Step 8.2.4: use the arm opcode statistics method of step 6.6 to count the arm opcodes used by x^(a) and obtain the arm opcode vector of x^(a).

Step 8.2.5: once the four vectors have been computed, concatenate them into the frequency fingerprint of x^(a).

Step 8.3: input the frequency fingerprint of x^(a) into the detection module (by now the optimized classifier suitable for detection), which computes the output value F by formula (4). F equals +1 or -1: +1 means the software to be detected is malware and -1 means it is benign. This determines whether the software to be detected is malware.
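The classification of step 8.3 can be sketched in Python as follows. This is a minimal sketch that assumes formula (4) has the usual weighted-kernel form, F = sgn(Σᵢ α⁽ⁱ⁾ y⁽ⁱ⁾ Σₜ β_t k_t + b); the formula itself does not appear in this text, and the function name `classify` and the argument layout are hypothetical.

```python
def classify(alpha, labels, beta, kernel_values, b):
    """Sign of the weighted multi-kernel decision value.

    alpha, labels: per-training-sample multipliers and labels (+1 / -1).
    beta: dict mapping feature type t to its kernel weight.
    kernel_values: dict mapping t to a list whose i-th entry is k_t evaluated
        between training sample i and the sample to be detected.
    """
    score = b
    for i in range(len(alpha)):
        combined = sum(beta[t] * kernel_values[t][i] for t in beta)   # weighted kernel sum
        score += alpha[i] * labels[i] * combined
    return (score > 0) - (score < 0)    # sgn: +1 malware, -1 benign, 0 on the boundary
```

Only training samples with non-zero α⁽ⁱ⁾ (the support vectors) contribute to the score, so in practice the loop can be restricted to them.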

Compared with other techniques, the present invention has the following advantages:

First, high accuracy. The invention combines permission, API, smali opcode and arm opcode features to generate a frequency fingerprint that accurately expresses the attributes of Android software and is suitable as an identifier for it. The multi-kernel support vector machine trained on the frequency fingerprints, used as the classifier, effectively integrates information from each component of the Android software and produces accurate detection results.

Second, high efficiency, in two respects. Frequency fingerprint generation is efficient: the invention scans the AndroidManifest.xml, smali and arm instruction files and counts the frequencies of permissions, APIs, smali opcodes and arm opcodes, which can be done in linear time. Training of the classification model is also efficient: compared with the large number of parameters in a neural network model, the multi-kernel support vector machine model has few parameters, so parameter optimization requires less computation and training is markedly faster.

Description of drawings

Figure 1 is a structural diagram of the Android malware detection system based on frequency fingerprint extraction.

Figure 2 is the overall flow chart of the present invention.

Detailed description

The present invention is described in detail below with reference to the accompanying drawings.

As shown in Figure 2, the technical solution of the present invention includes the following steps:

Step 1: build an Android malware detection system based on frequency fingerprint extraction. The system is installed on the official Google server or a third-party Android application market server. Its overall structure is shown in Figure 1 and consists of a sample preprocessing module, a frequency fingerprint generation module and a detection module.

The frequency fingerprint generation module is connected with the sample preprocessing module and the detection module. It receives the AndroidManifest.xml, smali and arm instruction files from the sample preprocessing module, performs feature screening and frequency fingerprint calculation, generates frequency fingerprints and outputs them to the detection module. The frequency fingerprint generation module consists of a feature screening module and a frequency fingerprint calculation module. The feature screening module is connected with the sample preprocessing module and the frequency fingerprint calculation module; it receives the AndroidManifest.xml, smali and arm instruction files from the sample preprocessing module, performs feature screening on these three kinds of files to obtain the permission, API, smali opcode and arm opcode features, and sends these features to the frequency fingerprint calculation module. The frequency fingerprint calculation module is connected with the sample preprocessing module, the feature screening module and the detection module; it receives the permission, API, smali opcode and arm opcode features from the feature screening module and the AndroidManifest.xml, smali and arm instruction files from the sample preprocessing module, computes the frequency fingerprints and sends them to the detection module.

In Figure 1, the solid arrows from the sample preprocessing module to the frequency fingerprint generation module and the detection module show how the system processes the samples in the benchmark test set D. The dashed arrows between the same modules show how it processes a sample to be detected (as step 8 shows, the software to be detected does not need feature screening by the feature screening module).

Step 2.1: obtain N₁ Android malware programs from the open-source Drebin, Genome and AMD datasets as malicious samples, where N₁ is a positive integer and N₁ = 2000.

Step 2.2: obtain benign software by crawling the Google Play and Apkpure app stores, and filter it with local antivirus software and the VirusTotal online scanning website to form N₂ benign samples, where N₂ is a positive integer and N₂ = 2000.

Step 2.4: store D in a memory that both the preprocessing module and the frequency fingerprint generation module can read (for example, the memory of the official Google or third-party Android application market server on which the detection system is installed).

Step 3.1: let the variable i = 1.

Step 3.3.1: use the decompression tool Gzip to decompress x⁽ⁱ⁾ and extract the AndroidManifest.xml, classes.dex and so runtime library files from x⁽ⁱ⁾.

Step 3.3.2: use AXMLPrinter2 version 2.0, a dedicated decompilation tool for AndroidManifest.xml files, to convert the AndroidManifest.xml file from binary form to text form.

Step 3.3.3: use the dex decompilation tool baksmali version 2.4.0 to decompile classes.dex into smali files. If several smali files are produced, merge them into one smali file and go to step 3.3.4; if only one smali file is produced, go directly to step 3.3.4.

Step 3.3.4: use the arm disassembly tool gcc-arm-none-eabi version 9-2019-q4-major to disassemble the so runtime library files into a text-form arm instruction file. If several arm instruction files are produced, merge them into one arm instruction file and go to step 3.4; if no arm instruction file is produced, create an empty arm instruction file and go to step 3.4.

Step 3.4: let i = i + 1. If i ≤ N, go to step 3.2; if i > N, the N samples have produced N AndroidManifest.xml files, N smali files and N arm instruction files. Send these files for the N samples of D to the feature screening module and go to step 4.
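The extraction in step 3.3.1 can be sketched in Python as follows. An APK is a ZIP archive, so this sketch uses Python's standard `zipfile` module rather than the decompression tool named in the text; the function name `extract_apk_members` and the returned dictionary layout are hypothetical. Decompiling the extracted files (steps 3.3.2 to 3.3.4) still requires the external tools listed above.

```python
import io
import zipfile

def extract_apk_members(apk_bytes):
    """Return the raw manifest, classes.dex and .so libraries found in an APK."""
    members = {"manifest": None, "dex": None, "libs": {}}
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as archive:
        for name in archive.namelist():
            if name == "AndroidManifest.xml":
                members["manifest"] = archive.read(name)   # still in binary XML form
            elif name == "classes.dex":
                members["dex"] = archive.read(name)
            elif name.endswith(".so"):
                members["libs"][name] = archive.read(name)  # keyed by path inside the APK
    return members
```

An APK without native code yields an empty `libs` dictionary, which corresponds to the empty arm instruction file created in step 3.3.4.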

Step 4.2.4: scan the smali file of the i-th sample of D line by line to find the APIs in L_api that appear in the i-th sample, and assign values to the elements of the i-th column of Z_api. Denote the string on line u of the smali file as str[u] and the total number of lines in the smali file as U, 1 ≤ u ≤ U.

Step 4.2.4.1: initialize u = 1.

Step 4.2.4.2.3: let v = v + 1. If v ≤ 32437, go to step 4.2.4.2.2; if v > 32437, go to step 4.2.4.3.

Step 4.2.4.3: let u = u + 1. If u ≤ U, go to step 4.2.4.2; if u > U, go to step 4.2.5.

Step 4.2.5: let i = i + 1. If i ≤ N, go to step 4.2.4; if i > N, the two-dimensional array Z_api has been fully assigned, so go to step 4.2.6.

Step 4.2.6.1: let v = 1.

Step 4.2.6.3: if Z_api[v][i] equals 1 and y⁽ⁱ⁾ equals 1, let M₁₁ = M₁₁ + 1; if Z_api[v][i] equals 1 and y⁽ⁱ⁾ equals 0, let M₁₂ = M₁₂ + 1; if Z_api[v][i] equals 0 and y⁽ⁱ⁾ equals 1, let M₂₁ = M₂₁ + 1; if Z_api[v][i] equals 0 and y⁽ⁱ⁾ equals 0, let M₂₂ = M₂₂ + 1.
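The four counters of step 4.2.6.3 form a 2×2 contingency table between the presence of one API and the sample label. A minimal Python sketch follows; the function name `contingency_counts` is hypothetical, and the labels are taken as 1 (malicious) and 0 (benign) as in this step.

```python
def contingency_counts(presence_row, labels):
    """Count (M11, M12, M21, M22) for one API across all samples.

    presence_row[i] is Z_api[v][i] (1 if sample i uses API v, else 0);
    labels[i] is y(i) (1 for malicious, 0 for benign).
    """
    m11 = m12 = m21 = m22 = 0
    for present, label in zip(presence_row, labels):
        if present == 1 and label == 1:
            m11 += 1        # API present, malicious sample
        elif present == 1 and label == 0:
            m12 += 1        # API present, benign sample
        elif present == 0 and label == 1:
            m21 += 1        # API absent, malicious sample
        elif present == 0 and label == 0:
            m22 += 1        # API absent, benign sample
    return m11, m12, m21, m22
```

For example, `contingency_counts([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])` gives `(2, 1, 1, 1)`, and the four counts always sum to the number of samples.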

Step 4.4: following the ARM instruction quick reference card (http://infocenter.arm.com/help/topic/com.arm.doc.qrc0001mc/QRC0001_UAL.pdf), the feature screening module selects the 197 arm instruction opcodes listed there as features, called arm opcode features.

The 167 permission features, 256 API features, 256 smali opcode features and 197 arm opcode features are each arranged in alphabetical order to form vectors, called the permission vector, API vector, smali opcode vector and arm opcode vector of the Android software. The four vectors are joined end to end to form a vector of length 876, which serves as the frequency fingerprint of a sample.
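The end-to-end concatenation can be sketched in Python as follows. The dimensions 167, 256, 256 and 197 come from the text and sum to 876; the function name `frequency_fingerprint` and the length check are illustrative additions.

```python
PERM_DIM, API_DIM, SMALI_DIM, ARM_DIM = 167, 256, 256, 197   # 167 + 256 + 256 + 197 = 876

def frequency_fingerprint(perm, api, smali, arm):
    """Concatenate the four feature vectors into one 876-element fingerprint."""
    parts = [(perm, PERM_DIM), (api, API_DIM), (smali, SMALI_DIM), (arm, ARM_DIM)]
    for vector, expected in parts:
        if len(vector) != expected:
            raise ValueError(f"expected {expected} elements, got {len(vector)}")
    return list(perm) + list(api) + list(smali) + list(arm)
```

Keeping the order fixed (permission, API, smali opcode, arm opcode) matters because the detection module applies a separate kernel function to each of the four segments.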

Step 6.2: take the i-th sample x⁽ⁱ⁾ in D and generate the frequency fingerprint of x⁽ⁱ⁾, which contains 876 elements, each initialized to 0. Within the fingerprint, the permission vector is indexed by pa, the API vector by pb, the smali opcode vector by pc and the arm opcode vector by pd, so the steps below refer to the pa-th, pb-th, pc-th and pd-th elements of the respective vectors.

Step 6.3.2: let qa = 1.


Step 7: the detection module receives the frequency fingerprints from the frequency fingerprint generation module and trains a multi-kernel support vector machine model, which becomes a classifier suitable for classifying the software to be detected. Let k_perm, k_api, k_smali and k_arm denote the kernel functions used for the permission vector, API vector, smali opcode vector and arm opcode vector within a frequency fingerprint. β is the weight vector (β_perm, β_api, β_smali, β_arm), whose elements are the weights of k_perm, k_api, k_smali and k_arm respectively. Let T be the set {perm, api, smali, arm}. The multi-kernel support vector machine model Y is expressed by formula (4), and the constraints include:

0 ≤ α⁽ⁱ⁾ ≤ C (7)

∑_{t∈T} β_t = 1 (8)

β_t ≥ 0, t ∈ T (9)

where C is the penalty coefficient, C ≥ 0, generally set to C = 100, which sets the size of the penalty for misclassification.

The sample point used in formula (10) is a support vector.

Step 7.1: compute the kernel matrices from the frequency fingerprints of the samples in D received from the frequency fingerprint generation module. Let K_t, t ∈ T, denote a kernel matrix, so that there are four kernel matrices K_perm, K_api, K_smali and K_arm. K_t has N rows and N columns, and its element in row i, column j is the value of the kernel function k_t for samples i and j. A degree-3 polynomial kernel function is selected, and K_t is computed as follows:

Step 7.1.1: let i = 1.

Step 7.1.2: let j = 1.

Step 7.1.3: compute the element in row i, column j of K_t from the inner product of the type-t vectors of x^(i) and x^(j).

7.2.2 Using formula (5), and taking the superscripts r and s in ascending order, select a pair α^(r), α^(s) to optimize α while holding the remaining elements of α and the vector β fixed. The optimization method is:

7.2.3.2 Concatenate the optimized values of β_perm, β_api, β_smali and β_arm into the optimized β, named β^*.

L(α^*, β^*) - L(α, β) ≤ ε (14)

When formula (14) is satisfied, optimizing α and β changes the value of the function in formula (5) by less than the threshold ε, with ε = 0.01, so the optimized α and β meet the requirement and training of the multi-kernel support vector machine model is complete; go to step 7.3. Otherwise, go to step 7.2.2.

Step 8: use the Android malware detection system based on frequency fingerprint extraction to examine the software to be detected and determine whether it is malware. The method is:

Step 8.2: compute the frequency fingerprint of x^(a).

Step 8.2.5: once the four vectors have been computed, concatenate them into the frequency fingerprint of x^(a).

Step 8.3: input the frequency fingerprint of x^(a) into the detection module, which computes the output value F by formula (4). F equals +1 or -1: +1 means the software to be detected is malware and -1 means it is benign. This determines whether the software to be detected is malware.

Claims

1. An Android malware detection method based on frequency fingerprint extraction, characterized by comprising the following steps:

first, constructing an Android malware detection system based on frequency fingerprint extraction, wherein the system is installed on the official Google server or a third-party Android application market server and consists of a sample preprocessing module, a frequency fingerprint generation module and a detection module;

the sample preprocessing module is connected with the frequency fingerprint generation module, receives the samples of the benchmark test set and the sample to be detected, preprocesses them to generate three types of files, namely an AndroidManifest.xml file, a smali file and an arm instruction file, and outputs the three types of files to the frequency fingerprint generation module;

the frequency fingerprint generation module is connected with the sample preprocessing module and the detection module, receives the AndroidManifest.xml, smali and arm instruction files from the sample preprocessing module, performs feature screening and frequency fingerprint calculation, generates a frequency fingerprint and outputs the frequency fingerprint to the detection module;

the frequency fingerprint generation module consists of a feature screening module and a frequency fingerprint calculation module; the feature screening module is connected with the sample preprocessing module and the frequency fingerprint calculation module, receives the AndroidManifest.xml, smali and arm instruction files from the sample preprocessing module, performs feature screening on the three types of files to obtain the permission, API, smali opcode and arm opcode features, and sends the permission, API, smali opcode and arm opcode features to the frequency fingerprint calculation module; the frequency fingerprint calculation module is connected with the sample preprocessing module, the feature screening module and the detection module, receives the permission, API, smali opcode and arm opcode features from the feature screening module, receives the AndroidManifest.xml, smali and arm instruction files from the sample preprocessing module, computes the frequency fingerprint, and sends the frequency fingerprint to the detection module;

the detection module is connected with the frequency fingerprint generation module and is a multi-kernel support vector machine model; it receives the frequency fingerprints of the benchmark test set D and the frequency fingerprint of the software to be detected from the frequency fingerprint generation module, is trained and optimized with the frequency fingerprints of the benchmark test set D to form a classifier suitable for detecting the software to be detected, and then classifies the software to be detected according to its frequency fingerprint to obtain a judgment of whether the software to be detected is malware;

secondly, construct the benchmark test set D as follows:

step 2.1: add N₁ Android malware programs as malicious samples, where N₁ is a positive integer and N₁ > 1000;

step 2.2: add N₂ benign programs as benign samples, where N₂ is a positive integer and N₂ > 1000;

step 2.3: label the malicious samples and the benign samples to form the benchmark test set D, where N is the total number of samples in D and N = N₁ + N₂; define x⁽ⁱ⁾ as the i-th sample in D and y⁽ⁱ⁾ as the label of x⁽ⁱ⁾, where y⁽ⁱ⁾ = 1 denotes that x⁽ⁱ⁾ is a malicious sample, y⁽ⁱ⁾ = −1 denotes that x⁽ⁱ⁾ is a benign sample, and 1 ≤ i ≤ N;

step 2.4: store D in a memory that both the sample preprocessing module and the frequency fingerprint generation module can read;

thirdly, the sample preprocessing module preprocesses the N samples in D to obtain N AndroidManifest.xml files, N smali files, and N arm instruction files, as follows:

step 3.1: set i = 1;

step 3.2: take the i-th sample x⁽ⁱ⁾ from D;

step 3.3: preprocess x⁽ⁱ⁾ using the sample preprocessing method to obtain the AndroidManifest.xml file, smali file, and arm instruction file of x⁽ⁱ⁾, as follows:

step 3.3.1: decompress x⁽ⁱ⁾ with a decompression tool and extract the AndroidManifest.xml, classes.dex, and .so library files in x⁽ⁱ⁾;

step 3.3.2: use AXMLPrinter2, a decompilation tool dedicated to AndroidManifest.xml files, to decompile the AndroidManifest.xml file from binary form into text form;

step 3.3.3: use baksmali, a decompilation tool for the dex file format, to decompile classes.dex into smali files; if multiple smali files are generated, merge them into one smali file and go to step 3.3.4; if only one smali file is generated, go directly to step 3.3.4;

step 3.3.4: use the arm instruction disassembly tool gcc-arm-none-eabi to disassemble the .so library files into arm instruction files in text form; if multiple arm instruction files are generated, merge them into one arm instruction file and go to step 3.4; if no arm instruction file is generated, create an empty arm instruction file and go to step 3.4;

step 3.4: set i = i + 1; if i ≤ N, go to step 3.2; if i > N, the N samples have produced the corresponding N AndroidManifest.xml files, N smali files, and N arm instruction files; send these files to the feature screening module and go to the fourth step;
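As an illustration only, the extraction part of step 3.3.1 can be sketched in Python. The sketch relies on the fact that an APK is a ZIP archive and uses the standard `zipfile` module in place of the decompression tool named in the claims. The function name `extract_components`, the `lib/` path prefix for native libraries, and the demo archive are assumptions made for this example. Decompilation with AXMLPrinter2, baksmali, and the arm disassembler is not shown.

```python
import io
import zipfile


def extract_components(apk_bytes: bytes) -> dict:
    """Collect the raw manifest, dex files, and .so libraries from an APK."""
    parts = {"manifest": None, "dex": [], "so": []}
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        for name in apk.namelist():
            if name == "AndroidManifest.xml":
                parts["manifest"] = apk.read(name)
            elif name.startswith("classes") and name.endswith(".dex"):
                parts["dex"].append(apk.read(name))
            elif name.startswith("lib/") and name.endswith(".so"):
                # Assumption: native libraries live under lib/<abi>/.
                parts["so"].append(apk.read(name))
    return parts


def make_demo_apk() -> bytes:
    """Build a tiny in-memory archive that mimics an APK layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("AndroidManifest.xml", b"<binary manifest>")
        archive.writestr("classes.dex", b"dex-bytes")
        archive.writestr("lib/armeabi-v7a/libdemo.so", b"elf-bytes")
        archive.writestr("res/layout/main.xml", b"ignored")
    return buf.getvalue()


demo_parts = extract_components(make_demo_apk())
```

A sample without native code simply yields an empty `so` list, which corresponds to the empty arm instruction file created in step 3.3.4.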

fourthly, the feature screening module performs feature screening on the N AndroidManifest.xml files, N smali files, and N arm instruction files corresponding to the N samples of D received from the sample preprocessing module, to obtain the permission features, API features, smali opcode features, and arm opcode features suitable for classifying D, as follows:

step 4.1: select the 167 Android system permissions defined in the Android developer documentation and use them as features, namely the permission features;

step 4.2: select 256 APIs from the APIs in the PScout list, as follows:

step 4.2.1: build a list L_api and add all 32437 APIs in the PScout list to L_api; the v-th API is denoted L_api[v], 1 ≤ v ≤ 32437;

step 4.2.2: build a two-dimensional array Z_api with 32437 rows and N columns; the element Z_api[v][i] in row v, column i is 1 or 0, where 1 indicates that the v-th API in L_api appears in the i-th sample of D and 0 indicates that it does not;

step 4.2.3: initialize all elements of Z_api to 0 and set i = 1;

step 4.2.4: scan the smali file of the i-th sample of D line by line to find the APIs of L_api that appear in the i-th sample, and assign the i-th column of Z_api accordingly; denote the u-th line string of the smali file as str[u] and the total number of lines of the smali file as U, 1 ≤ u ≤ U;

step 4.2.5: set i = i + 1; if i ≤ N, go to step 4.2.4; if i > N, the two-dimensional array Z_api is complete; go to step 4.2.6;

step 4.2.6: calculate the information gain IG of each API in L_api with respect to the benchmark test set D, the information gain of the v-th API being denoted IG(D | L_api[v]); sort the APIs in L_api by information gain in descending order and take the top 256 APIs as the API features;

step 4.3: use the 256 smali opcodes, each 8 binary bits long, predefined by the Android Dalvik virtual machine as the smali opcode features;

step 4.4: select all 197 arm instruction opcodes listed in the arm instruction quick reference manual as the arm opcode features;

step 4.5: send the permission features, API features, smali opcode features, and arm opcode features to the frequency fingerprint calculation module;

fifthly, determine the frequency fingerprint format as follows:

arrange the 167 permission features, 256 API features, 256 smali opcode features, and 197 arm opcode features each in alphabetical order to form four vectors, called the permission vector, API vector, smali opcode vector, and arm opcode vector of the Android software; the permission vector consists of 167 integers, each taking the value 1 or 0; if the integer at position pa is 1, the pa-th of the 167 screened permissions is requested by the Android software; if the integer at position pa is 0, the pa-th permission is not requested; pa is an integer, 1 ≤ pa ≤ 167; the API vector consists of 256 decimals, and the decimal at position pb indicates the frequency with which the pb-th of the 256 screened APIs appears in the Android software; pb is an integer, 1 ≤ pb ≤ 256; the smali opcode vector consists of 256 decimals, and the decimal at position pc indicates the frequency with which the pc-th of the 256 screened smali opcodes appears in the Android software; pc is an integer, 1 ≤ pc ≤ 256; the arm opcode vector consists of 197 decimals, and the decimal at position pd indicates the frequency with which the pd-th of the 197 screened arm opcodes appears in the Android software; pd is an integer, 1 ≤ pd ≤ 197;

concatenate the four vectors end to end to form a vector of length 876 as the frequency fingerprint; the 167 integers and 709 decimals contained in the frequency fingerprint are all called elements of the frequency fingerprint;
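The layout described in the fifth step can be sketched as follows. The function names `build_fingerprint` and `split_fingerprint` are hypothetical and chosen for this illustration; only the lengths 167, 256, 256, and 197 and the concatenation order come from the text.

```python
PERM_LEN, API_LEN, SMALI_LEN, ARM_LEN = 167, 256, 256, 197
SEGMENT_LENGTHS = (PERM_LEN, API_LEN, SMALI_LEN, ARM_LEN)


def build_fingerprint(perm, api, smali, arm):
    """Concatenate the four feature vectors end to end (876 elements)."""
    for vec, expected in zip((perm, api, smali, arm), SEGMENT_LENGTHS):
        if len(vec) != expected:
            raise ValueError(f"expected {expected} elements, got {len(vec)}")
    return list(perm) + list(api) + list(smali) + list(arm)


def split_fingerprint(fingerprint):
    """Recover the four sub-vectors from a fingerprint of length 876."""
    parts, start = [], 0
    for length in SEGMENT_LENGTHS:
        parts.append(fingerprint[start:start + length])
        start += length
    return parts


demo_fp = build_fingerprint([1] * 167, [0.5] * 256, [0.25] * 256, [0.0] * 197)
demo_parts_fp = split_fingerprint(demo_fp)
```

Keeping the segment boundaries in one place makes it straightforward to hand each sub-vector to its own kernel later, as the seventh step requires.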

sixthly, the frequency fingerprint calculation module receives the permission features, API features, smali opcode features, and arm opcode features from the feature screening module, receives the AndroidManifest.xml files, smali files, and arm instruction files from the sample preprocessing module, and calculates frequency fingerprints for the N samples in the benchmark test set D, as follows:

step 6.1: let L_a be the permission list, whose member L_a[pa] is the name string of the pa-th of the 167 permissions in alphabetical order; let L_b be the API list, whose member L_b[pb] is the name string of the pb-th of the 256 APIs in alphabetical order; let L_c be the smali opcode list, whose member L_c[pc] is the name string of the pc-th of the 256 smali opcodes in alphabetical order; let L_d be the arm opcode list, whose member L_d[pd] is the name string of the pd-th of the 197 arm opcodes in alphabetical order; set i = 1;

step 6.2: take the i-th sample x⁽ⁱ⁾ in D and generate a frequency fingerprint for x⁽ⁱ⁾; the frequency fingerprint contains 876 elements, each initialized to 0; within the frequency fingerprint of x⁽ⁱ⁾, the elements of the permission vector are indexed by pa, the elements of the API vector are indexed by pb, the elements of the smali opcode vector are indexed by pc, and the elements of the arm opcode vector are indexed by pd;

step 6.3: extract the permissions requested by x⁽ⁱ⁾ using the permission extraction method to obtain the permission vector of x⁽ⁱ⁾, as follows:

step 6.3.1: scan the AndroidManifest.xml file of x⁽ⁱ⁾ line by line; denote the qa-th line string of the file as stra[qa] and the total number of lines of the file as numa;

step 6.3.2: set qa = 1;

step 6.3.3: if stra[qa] contains the substring "uses-permission", set pa = 1 and go to step 6.3.4; if stra[qa] does not contain the substring "uses-permission", go to step 6.3.6;

step 6.3.4: if stra[qa] contains the substring L_a[pa], x⁽ⁱ⁾ requests the permission L_a[pa]; set the pa-th element of the permission vector of x⁽ⁱ⁾ to 1 and go to step 6.3.6; if stra[qa] does not contain the substring L_a[pa], go to step 6.3.5;

step 6.3.5: set pa = pa + 1; if pa ≤ 167, go to step 6.3.4; if pa > 167, one pass over L_a is complete; go to step 6.3.6;

step 6.3.6: set qa = qa + 1; if qa ≤ numa, go to step 6.3.3; if qa > numa, the AndroidManifest.xml file of x⁽ⁱ⁾ has been fully scanned and the permission vector of x⁽ⁱ⁾ has been calculated; go to step 6.4;
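The permission extraction loop of step 6.3 can be sketched as below. The function name `permission_vector` and the three-entry permission list are placeholders for this example; the real list L_a holds 167 names. The assignment of 1 to the matching element is an assumption based on the fingerprint format of the fifth step, because the original expression in step 6.3.4 is not legible in this text.

```python
def permission_vector(manifest_lines, permission_names):
    """Return a 0/1 vector marking which listed permissions are requested."""
    vec = [0] * len(permission_names)
    for line in manifest_lines:
        if "uses-permission" not in line:
            continue  # step 6.3.3: skip lines without a permission request
        for pa, name in enumerate(permission_names):
            if name in line:
                vec[pa] = 1
                break  # step 6.3.4: first match ends the scan of this line
    return vec


demo_permissions = sorted([
    "android.permission.INTERNET",
    "android.permission.READ_SMS",
    "android.permission.SEND_SMS",
])
demo_manifest = [
    "<manifest>",
    '<uses-permission android:name="android.permission.SEND_SMS"/>',
    '<uses-permission android:name="android.permission.INTERNET"/>',
    "</manifest>",
]
demo_perm_vec = permission_vector(demo_manifest, demo_permissions)
```

Because matching is by substring and stops at the first hit, a permission whose name is a prefix of another (for example `android.permission.BLUETOOTH` and `android.permission.BLUETOOTH_ADMIN`) shadows the longer name in this sketch; an implementation may prefer exact matching on the attribute value.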

step 6.4: count the APIs used by x⁽ⁱ⁾ using the API statistical method to obtain the API vector of x⁽ⁱ⁾, as follows:

step 6.4.1: scan the smali file of x⁽ⁱ⁾ line by line; denote the qb-th line string of the smali file as strb[qb] and the total number of lines of the smali file as numb;

step 6.4.2: set qb = 1; use the variable inv to represent the total number of API calls in the smali file and set inv = 1;

step 6.4.3: set pb = 1;

step 6.4.4: if strb[qb] contains the substring "invoke", set inv = inv + 1 and go to step 6.4.5; if it does not contain the substring "invoke", go to step 6.4.7;

step 6.4.5: if strb[qb] contains the substring L_b[pb], x⁽ⁱ⁾ calls the API named L_b[pb]; increase the pb-th element of the API vector of x⁽ⁱ⁾ by 1 and go to step 6.4.7; if strb[qb] does not contain the substring L_b[pb], go to step 6.4.6;

step 6.4.6: set pb = pb + 1; if pb ≤ 256, go to step 6.4.5; if pb > 256, one pass over L_b is complete; go to step 6.4.7;

step 6.4.7: set qb = qb + 1; if qb ≤ numb, go to step 6.4.3; if qb > numb, the smali file of x⁽ⁱ⁾ has been fully scanned; go to step 6.4.8;

step 6.4.8: set pb = 1;

step 6.4.9: divide the pb-th element of the API vector of x⁽ⁱ⁾ by inv to obtain the frequency of the pb-th API;

step 6.4.10: set pb = pb + 1; if pb ≤ 256, go to step 6.4.9; if pb > 256, the API vector of x⁽ⁱ⁾ has been calculated; go to step 6.5;
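The API statistical method of step 6.4 can be sketched as follows. The expressions for the counting and normalization steps are not legible in this text, so the sketch assumes that each match increments a counter and that the counters are finally divided by inv, which starts at 1 as stated in step 6.4.2. The function name and the two API strings are hypothetical.

```python
def api_frequency_vector(smali_lines, api_names):
    """Count listed API calls per 'invoke' line and normalize by inv."""
    counts = [0.0] * len(api_names)
    inv = 1  # step 6.4.2 initializes the call counter to 1, not 0
    for line in smali_lines:
        if "invoke" not in line:
            continue
        inv += 1
        for pb, name in enumerate(api_names):
            if name in line:
                counts[pb] += 1
                break  # at most one listed API is credited per line
    return [c / inv for c in counts]


demo_apis = [
    "Landroid/telephony/SmsManager;->sendTextMessage",
    "Ljava/net/URL;->openConnection",
]
demo_smali = [
    "invoke-virtual {v0}, Landroid/telephony/SmsManager;->sendTextMessage()V",
    "const/4 v0, 0x0",
    "invoke-virtual {v1}, Landroid/telephony/SmsManager;->sendTextMessage()V",
    "invoke-static {}, Lcom/example/Foo;->bar()V",
]
demo_api_vec = api_frequency_vector(demo_smali, demo_apis)
```

Starting inv at 1 avoids a division by zero for a smali file without any invoke line, at the cost of slightly understating every frequency. The smali and arm opcode methods of steps 6.5 and 6.6 follow the same pattern with their own lists and counters.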

step 6.5: count the smali opcodes used by x⁽ⁱ⁾ using the smali opcode statistical method to obtain the smali opcode vector of x⁽ⁱ⁾, as follows:

step 6.5.1: scan the smali file of x⁽ⁱ⁾ line by line; denote the qc-th line string of the smali file as strc[qc] and the total number of lines of the smali file as numc;

step 6.5.2: set qc = 1; use the variable ops to represent the total number of smali opcodes in the smali file and set ops = 1;

step 6.5.3: set pc = 1;

step 6.5.4: if strc[qc] contains the substring L_c[pc], increase the pc-th element of the smali opcode vector of x⁽ⁱ⁾ by 1, set ops = ops + 1, and go to step 6.5.6; if strc[qc] does not contain the substring L_c[pc], go to step 6.5.5;

step 6.5.5: set pc = pc + 1; if pc ≤ 256, go to step 6.5.4; if pc > 256, one pass over L_c is complete; go to step 6.5.6;

step 6.5.6: set qc = qc + 1; if qc ≤ numc, go to step 6.5.3; if qc > numc, the smali file of x⁽ⁱ⁾ has been fully scanned; go to step 6.5.7;

step 6.5.7: set pc = 1;

step 6.5.8: divide the pc-th element of the smali opcode vector of x⁽ⁱ⁾ by ops to obtain the frequency of the pc-th smali opcode;

step 6.5.9: set pc = pc + 1; if pc ≤ 256, go to step 6.5.8; if pc > 256, the smali opcode vector of x⁽ⁱ⁾ has been calculated; go to step 6.6;

step 6.6: count the arm opcodes used by x⁽ⁱ⁾ using the arm opcode statistical method to obtain the arm opcode vector of x⁽ⁱ⁾, as follows:

step 6.6.1: scan the arm instruction file of x⁽ⁱ⁾ line by line; denote the qd-th line string of the arm instruction file as strd[qd] and the total number of lines of the arm instruction file as numd;

step 6.6.2: set qd = 1; use the variable opa to represent the total number of arm opcodes used in the arm instruction file and set opa = 1; if qd ≤ numd, go to step 6.6.3; if qd > numd, the arm instruction file of x⁽ⁱ⁾ is empty; go to step 6.7;

step 6.6.3: set pd = 1;

step 6.6.4: if strd[qd] contains the ">" character, strd[qd] contains an arm instruction; set opa = opa + 1 and go to step 6.6.5; if strd[qd] does not contain the ">" character, go to step 6.6.7;

step 6.6.5: if strd[qd] contains the substring L_d[pd], increase the pd-th element of the arm opcode vector of x⁽ⁱ⁾ by 1 and go to step 6.6.7; if strd[qd] does not contain the substring L_d[pd], go to step 6.6.6;

step 6.6.6: set pd = pd + 1; if pd ≤ 197, go to step 6.6.5; if pd > 197, one pass over L_d is complete; go to step 6.6.7;

step 6.6.7: set qd = qd + 1; if qd ≤ numd, go to step 6.6.3; if qd > numd, the arm instruction file of x⁽ⁱ⁾ has been fully scanned; go to step 6.6.8;

step 6.6.8: set pd = 1;

step 6.6.9: divide the pd-th element of the arm opcode vector of x⁽ⁱ⁾ by opa to obtain the frequency of the pd-th arm opcode;

step 6.6.10: set pd = pd + 1; if pd ≤ 197, go to step 6.6.9; if pd > 197, the arm opcode vector of x⁽ⁱ⁾ has been calculated; go to step 6.7;

step 6.7: set i = i + 1; if i ≤ N, go to step 6.2; if i > N, frequency fingerprints have been calculated for all N samples in D; send the frequency fingerprints to the detection module and go to the seventh step;

seventhly, the detection module receives the frequency fingerprints from the frequency fingerprint generation module and trains the multiple kernel support vector machine model so that it becomes a classifier suitable for classifying the software to be detected, as follows: for the benchmark test set D, the feature space of the multiple kernel support vector machine model is the set of frequency fingerprints of the N samples in D; let k_perm, k_api, k_smali, and k_arm denote the kernel functions used for the permission vector, the API vector, the smali opcode vector, and the arm opcode vector within the frequency fingerprint, respectively; β is the weight vector, expressed as (β_perm, β_api, β_smali, β_arm), whose elements β_perm, β_api, β_smali, and β_arm are the weights of k_perm, k_api, k_smali, and k_arm, respectively; let T be the set {perm, api, smali, arm}; the multiple kernel support vector machine model Y is expressed by formula (4);

in formula (4), α⁽ⁱ⁾ is a Lagrange multiplier and {α⁽¹⁾, α⁽²⁾, ..., α⁽ⁱ⁾, ..., α^(N)} form the vector α; sgn(a) is the sign function of the parameter a: sgn(a) = 1 when a > 0, sgn(a) = 0 when a = 0, and sgn(a) = −1 when a < 0; α and β are obtained by solving formula (5);

the constraints of formula (5) are formulas (6) to (9):

0 ≤ α⁽ⁱ⁾ ≤ C (7)

∑_{t∈T} β_t = 1 (8)

β_t ≥ 0, t ∈ T (9)

where C is the penalty coefficient, C ≥ 0, representing the magnitude of the penalty for misclassification;

b is a scalar calculated from the obtained α and β by formula (10), in which the sample used is a support vector sample point;

the method for training the multiple kernel support vector machine model is as follows:

step 7.1: calculate the kernel matrices from the frequency fingerprints of the samples in D received from the frequency fingerprint generation module; let K_t be a kernel matrix, t ∈ T, standing for the four kernel matrices K_perm, K_api, K_smali, and K_arm; K_t has N rows and N columns, and its element in row i, column j is the value of the kernel function k_t for the t-type vectors of x⁽ⁱ⁾ and x⁽ʲ⁾; a polynomial kernel of degree 3 is selected, and K_t is calculated as follows:

step 7.1.1: set i = 1;

step 7.1.2: set j = 1;

step 7.1.3: calculate the element in row i, column j of K_t with the degree-3 polynomial kernel, which is based on the inner product of the t-type vectors of x⁽ⁱ⁾ and x⁽ʲ⁾;

step 7.1.4: set j = j + 1; if j ≤ N, go to step 7.1.3; if j > N, go to step 7.1.5;

step 7.1.5: set i = i + 1; if i ≤ N, go to step 7.1.2; if i > N, the calculation of K_t is complete; go to step 7.2;
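The kernel matrix computation of step 7.1 can be sketched as follows. The exact kernel expression is not legible in this text; the sketch assumes the common polynomial form (⟨u, v⟩ + 1)³, so the offset `coef0 = 1.0` and the function names are assumptions. The same routine would be run once for each t in T on the corresponding sub-vectors.

```python
def poly_kernel(u, v, degree=3, coef0=1.0):
    """Polynomial kernel (<u, v> + coef0) ** degree; coef0 is an assumed offset."""
    dot = sum(a * b for a, b in zip(u, v))
    return (dot + coef0) ** degree


def kernel_matrix(vectors, degree=3, coef0=1.0):
    """Build the N x N kernel matrix K_t for one feature type."""
    n = len(vectors)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # The kernel is symmetric, so each pair is evaluated only once.
            K[i][j] = K[j][i] = poly_kernel(vectors[i], vectors[j], degree, coef0)
    return K


demo_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
demo_K = kernel_matrix(demo_vectors)
```

Exploiting symmetry roughly halves the work compared with the full double loop of steps 7.1.1 to 7.1.5 while producing the same matrix.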

step 7.2: optimize the parameters α and β as follows:

step 7.2.1: initialize each element of the vector α to 0 and each element of the vector β to 1/4;

step 7.2.2: using formula (5), in increasing order of the superscripts r and s, hold (α⁽¹⁾, α⁽²⁾, ..., α^(r-1), α^(r+1), ..., α^(s-1), α^(s+1), ..., α^(N)) and the vector β fixed, select a pair α^(r), α^(s), and optimize α to obtain the optimized α, named α*;

step 7.2.3: holding α* fixed, optimize β to obtain the optimized β, named β*;

step 7.2.4: judge whether α and β satisfy the optimization termination conditions of formulas (12) to (14):

L(α*, β*) − L(α, β) ≤ ε (14)

when formula (14) is satisfied, optimizing α and β changes the function value of formula (5) by no more than the threshold ε, 0 < ε ≤ 0.1; this indicates that the optimized α and β meet the requirements and that the multiple kernel support vector machine model has been trained, so go to step 7.3; otherwise go to step 7.2.2;

step 7.3: calculate the value of b by formula (10); the multiple kernel support vector machine model defined by formula (4) has then been trained and optimized into a classifier;

eighthly, use the Android malware detection system based on frequency fingerprint extraction to detect the software to be detected that the Google official or third-party Android application market server receives from a user, and determine whether it is malware, as follows:

step 8.1: the sample preprocessing module preprocesses the software to be detected; taking the software to be detected as sample x^(a), preprocess x^(a) using the sample preprocessing method of step 3.3 to obtain the AndroidManifest.xml file, smali file, and arm instruction file of x^(a), and output them to the frequency fingerprint calculation module;

step 8.2: the frequency fingerprint calculation module calculates the frequency fingerprint of x^(a), as follows:

step 8.2.1: extract the permissions requested by x^(a) using the permission extraction method of step 6.3 to obtain the permission vector of x^(a);

step 8.2.2: count the APIs used by x^(a) using the API statistical method of step 6.4 to obtain the API vector of x^(a);

step 8.2.3: count the smali opcodes used by x^(a) using the smali opcode statistical method of step 6.5 to obtain the smali opcode vector of x^(a);

step 8.2.4: count the arm opcodes used by x^(a) using the arm opcode statistical method of step 6.6 to obtain the arm opcode vector of x^(a);

step 8.2.5: concatenate the four calculated vectors into the frequency fingerprint of x^(a);

step 8.3: input the frequency fingerprint of x^(a) into the detection module and calculate the output value F according to formula (4), where F equals +1 or −1; +1 indicates that the software to be detected is malware and −1 indicates that it is benign software, thereby determining whether the software to be detected is malware.
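The classification of step 8.3 can be illustrated with a short sketch. Formula (4) is not legible in this text, so the sketch assumes the standard multiple kernel decision function sgn(Σᵢ α⁽ⁱ⁾ y⁽ⁱ⁾ Σ_t β_t k_t(·, ·) + b). The function names, the dictionary layout `kernel_values[t][i]` (kernel of type t between training sample i and the sample to be detected), and all demo numbers are assumptions.

```python
def sgn(a):
    """Sign function as defined in the seventh step: 1, 0, or -1."""
    return 1 if a > 0 else (-1 if a < 0 else 0)


def mkl_decision(kernel_values, alpha, labels, beta, b):
    """Evaluate the assumed multiple kernel SVM decision function."""
    score = b
    for i, (a_i, y_i) in enumerate(zip(alpha, labels)):
        if a_i == 0:
            continue  # only support vectors contribute
        combined = sum(beta[t] * kernel_values[t][i] for t in beta)
        score += a_i * y_i * combined
    return sgn(score)


demo_beta = {"perm": 0.25, "api": 0.25, "smali": 0.25, "arm": 0.25}
demo_kernel_values = {t: [8.0, 1.0] for t in demo_beta}
demo_decision = mkl_decision(demo_kernel_values, [1.0, 1.0], [1, -1], demo_beta, 0.0)
```

In the demo the sample to be detected is far more similar to the malicious training sample than to the benign one, so the weighted sum is positive and the result is +1 (malware).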

2. The method of claim 1, wherein the malicious samples in step 2.1 are obtained from the open-source Drebin, Genome, and AMD datasets.

3. The method of claim 1, wherein the benign samples in step 2.2 are benign software crawled from the Google Play and APKPure application stores and filtered through detection by local antivirus software and the VirusTotal online antivirus website.

4. The method as claimed in claim 1, wherein the decompression tool in step 3.3.1 is Gzip or 7-Zip.

5. The method as claimed in claim 1, wherein in the third step AXMLPrinter2 is version 2.0 or later, baksmali is version 2.4.0 or later, and gcc-arm-none-eabi is version 9-2019-q4-major or later.

6. The method as claimed in claim 1, wherein the method in step 4.2.4 for scanning the smali file of the i-th sample of D line by line to find the APIs of L_api that appear in the i-th sample, and for assigning the i-th column of Z_api, is as follows:

step 4.2.4.1: initialize u = 1;

step 4.2.4.2: if str[u] is an API string, go to step 4.2.4.2.1; if str[u] is not an API string, go to step 4.2.4.3;

step 4.2.4.2.1: initialize the variable v = 1;

step 4.2.4.2.2: if str[u] contains the substring L_api[v], assign Z_api[v][i] = 1 and go to step 4.2.4.3; otherwise go to step 4.2.4.2.3;

step 4.2.4.2.3: set v = v + 1; if v ≤ 32437, go to step 4.2.4.2.2; if v > 32437, go to step 4.2.4.3;

step 4.2.4.3: set u = u + 1; if u ≤ U, go to step 4.2.4.2; if u > U, the smali file of the i-th sample has been fully scanned and the method ends.

7. The method of claim 1, wherein the method in step 4.2.6 for calculating the information gain IG of each API in the list L_api with respect to the benchmark test set D, the information gain of the v-th API being denoted IG(D | L_api[v]), is as follows:

step 4.2.6.1: set v = 1;

step 4.2.6.2: set i = 1, the first variable M₁₁ = 0, the second variable M₁₂ = 0, the third variable M₂₁ = 0, and the fourth variable M₂₂ = 0;

step 4.2.6.3: if Z_api[v][i] = 1 and y⁽ⁱ⁾ = 1, set M₁₁ = M₁₁ + 1; if Z_api[v][i] = 1 and y⁽ⁱ⁾ = −1, set M₁₂ = M₁₂ + 1; if Z_api[v][i] = 0 and y⁽ⁱ⁾ = 1, set M₂₁ = M₂₁ + 1; if Z_api[v][i] = 0 and y⁽ⁱ⁾ = −1, set M₂₂ = M₂₂ + 1;

step 4.2.6.4: set i = i + 1; if i ≤ N, go to step 4.2.6.3; if i > N, go to step 4.2.6.5;

step 4.2.6.5: calculate IG(D | L_api[v]) as follows:

IG(D | L_api[v]) = H(D) − H(D | L_api[v]) (1)

where H(D) is the empirical entropy of the benchmark test set D, calculated as:

H(D) = −(N₁/N)·log₂(N₁/N) − (N₂/N)·log₂(N₂/N) (2)

and H(D | L_api[v]) is the empirical conditional entropy of D given the v-th API of the list L_api, calculated as:

H(D | L_api[v]) = −(M₁₁/N)·log₂(M₁₁/(M₁₁+M₁₂)) − (M₁₂/N)·log₂(M₁₂/(M₁₁+M₁₂)) − (M₂₁/N)·log₂(M₂₁/(M₂₁+M₂₂)) − (M₂₂/N)·log₂(M₂₂/(M₂₁+M₂₂)) (3)

where a term whose count M is 0 is taken as 0;

step 4.2.6.6: set v = v + 1; if v ≤ 32437, go to step 4.2.6.2; if v > 32437, the information gain with respect to D has been calculated for all APIs in the list L_api.
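The information gain ranking can be sketched as follows. The entropy formulas are not fully legible in this text, so the sketch assumes the usual empirical entropy with base-2 logarithms and labels of +1 (malicious) and −1 (benign) as defined in step 2.3. The names `entropy`, `information_gain`, `top_k_apis`, and the three-row demo matrix are hypothetical; a non-empty sample set is assumed.

```python
import math


def entropy(counts):
    """Empirical entropy (base 2) of a class-count list; empty groups give 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(z_row, labels):
    """IG of one API given its 0/1 presence row and the +1/-1 labels."""
    m11 = m12 = m21 = m22 = 0
    for z, y in zip(z_row, labels):
        if z == 1 and y == 1:
            m11 += 1
        elif z == 1:
            m12 += 1
        elif y == 1:
            m21 += 1
        else:
            m22 += 1
    n = len(labels)
    h_d = entropy([m11 + m21, m12 + m22])
    h_cond = ((m11 + m12) / n) * entropy([m11, m12]) \
        + ((m21 + m22) / n) * entropy([m21, m22])
    return h_d - h_cond


def top_k_apis(Z, labels, api_list, k):
    """Return the k APIs with the largest information gain."""
    order = sorted(range(len(api_list)),
                   key=lambda v: information_gain(Z[v], labels), reverse=True)
    return [api_list[v] for v in order[:k]]


demo_labels = [1, 1, -1, -1]
demo_Z = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 0]]
demo_top = top_k_apis(demo_Z, demo_labels, ["api_a", "api_b", "api_c"], 2)
```

In the demo, `api_a` separates the classes perfectly and `api_b` carries no information, so the ranking keeps `api_a` and `api_c`. With the 32437-row array Z_api, `k` would be 256.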

8. The android malware detection method based on frequency fingerprint extraction as claimed in claim 1, wherein in the seventh step the penalty coefficient C is 100.

9. The android malware detection method based on frequency fingerprint extraction as claimed in claim 1, wherein the method for optimizing α in step 7.2.2 is as follows:

step 7.2.2.1: using the constraint of formula (6), reduce formula (5) to a quadratic function g(α^(r)) of the single variable α^(r); differentiate g(α^(r)) and set the derivative equal to 0 to solve for α^(r);

step 7.2.2.2: solve for α^(s) using the constraint of formula (6);

step 7.2.2.3: update α^(r) and α^(s) to obtain the optimized α, named α*.
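One way to picture the pairwise update of claim 9 is the standard sequential minimal optimization (SMO) step, sketched below. This is a swapped-in textbook technique, not a transcription of formulas (5) and (6), which are not legible in this text. It assumes that formula (6) is the usual equality constraint Σᵢ α⁽ⁱ⁾ y⁽ⁱ⁾ = 0 and that `K` is the combined kernel matrix Σ_t β_t K_t with β held fixed. The function name and the demo values are hypothetical.

```python
def smo_pair_update(K, y, alpha, b, r, s, C):
    """One analytic update of (alpha[r], alpha[s]) that keeps sum(alpha*y) fixed."""
    n = len(y)

    def f(i):
        # Current decision value for training sample i.
        return sum(alpha[j] * y[j] * K[j][i] for j in range(n)) + b

    err_r, err_s = f(r) - y[r], f(s) - y[s]
    eta = K[r][r] + K[s][s] - 2.0 * K[r][s]
    if r == s or eta <= 0:
        return list(alpha)  # no progress possible for this pair
    # Box bounds for alpha[r] that keep both multipliers inside [0, C].
    if y[r] != y[s]:
        low, high = max(0.0, alpha[r] - alpha[s]), min(C, C + alpha[r] - alpha[s])
    else:
        low, high = max(0.0, alpha[r] + alpha[s] - C), min(C, alpha[r] + alpha[s])
    # Unconstrained optimum of the one-variable quadratic, then clipping.
    new_r = min(high, max(low, alpha[r] + y[r] * (err_s - err_r) / eta))
    # alpha[s] follows from the equality constraint.
    new_s = alpha[s] + y[r] * y[s] * (alpha[r] - new_r)
    updated = list(alpha)
    updated[r], updated[s] = new_r, new_s
    return updated


demo_alpha = smo_pair_update([[1.0, 0.0], [0.0, 1.0]], [1, -1], [0.0, 0.0],
                             0.0, 0, 1, 100.0)
```

For the two-sample demo the update reaches the optimum of the dual problem in a single step, and the equality constraint still holds afterwards.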

10. The android malware detection method based on frequency fingerprint extraction as claimed in claim 1, wherein the method for optimizing β in step 7.2.3 is as follows:

step 7.2.3.1: take the partial derivative of formula (5) with respect to β, set the result equal to 0, and solve for the solution that satisfies the constraints of formulas (8) and (9), that is, the optimized values of β_perm, β_api, β_smali, and β_arm;

step 7.2.3.2: concatenate the optimized values of β_perm, β_api, β_smali, and β_arm into the optimized β, named β*.