CN111460452A - Android malicious software detection method based on frequency fingerprint extraction - Google Patents

Info

Publication number
CN111460452A
Authority
CN
China
Prior art keywords
equal
api
smali
arm
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010237052.6A
Other languages
Chinese (zh)
Other versions
CN111460452B (en)
Inventor
吴庆
刘波
洪学恕
马行空
胡乃天
陆潼
刘鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202010237052.6A priority Critical patent/CN111460452B/en
Publication of CN111460452A publication Critical patent/CN111460452A/en
Application granted granted Critical
Publication of CN111460452B publication Critical patent/CN111460452B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Hardware Design (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Virology (AREA)
  • General Health & Medical Sciences (AREA)
  • Stored Programmes (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses an Android malware detection method based on frequency fingerprint extraction, which aims to detect malware accurately. The method comprises the following steps: constructing an Android malware detection system based on frequency fingerprint extraction, composed of a sample preprocessing module, a frequency fingerprint generation module and a detection module; collecting malicious and benign software as samples and constructing a benchmark test set D; decompressing the samples in D to obtain the AndroidManifest.xml, classes.dex and .so library files; extracting permission, API, smali opcode and arm opcode features; and recording whether each feature appears and how frequently, to form four feature vectors of different types that are concatenated end to end into a frequency fingerprint. The detection module is then trained and optimized on the frequency fingerprints of the samples in D to form a classifier, which detects a sample under test and outputs whether it is malware. The method effectively integrates information from every component of Android software, and detection is accurate and fast.

Description

Android malicious software detection method based on frequency fingerprint extraction
Technical Field
The invention relates to the field of android malicious software detection, in particular to a method for detecting android malicious software by using extracted frequency fingerprints.
Background
In recent years, with the development and spread of internet and mobile communication technology, mobile terminals such as smartphones have brought great convenience to people's lives and have become indispensable communication tools. Among the many mobile operating systems, Android is popular with users because of its openness, rich third-party application software, friendly interface and good user experience, and it holds a large share of the global market for smart mobile devices. The number of Android applications is also growing rapidly: by February 2020 the number of applications in Google Play had reached about 2.86 million, and it is still increasing.
Besides Google Play, the official Android application market, there are many third-party application markets. These markets vary in quality, are numerous, lack unified and effective management, and have weak release review mechanisms, so malicious actors can publish Android applications almost at will. Malicious applications inevitably end up mixed into these markets and pose serious risks to users' information security once downloaded. Worse, the software inventory of the various markets is huge and growing quickly; with immature security mechanisms and detection methods, malware can persist in the markets for a long time and is hard to discover and remove, which seriously threatens the healthy development of the Android ecosystem.
Typical Android malware detection techniques today fall into two types: static detection and dynamic detection. Static detection mainly uses disassembly and decompilation, or control-flow and data-flow analysis of the smali intermediate code, to detect malicious code. Its advantage is high code coverage; its drawback is that it cannot cope with code obfuscation, encryption, or dynamically loaded malicious code. Dynamic analysis monitors the variables of an application at run time, traces the application's behavior paths, and collects the logs generated while the system runs. Its advantage is that it overcomes the obfuscation and encryption problems of static methods; its drawbacks are low code coverage and the fact that some malicious programs refuse to run, crash, or change their behavior when executed in an emulator. In practice, when a large number of malicious samples must be examined, most methods prefer static detection in order to obtain faster detection and higher code coverage.
Ganesh et al. extract the permissions listed in the Manifest of Android software as features for detecting malicious applications: they arrange the permissions into a 12 × 12 array and feed it into a convolutional neural network model that is trained to detect whether the software is malicious. M. Amin et al. extract opcode sequences from bytecode files as features: they extract the opcodes of the software into one long sequence, treat it as ordered text, and analyze the maliciousness of the software by training a BiLSTM neural network model. R. Nix et al. study malware detection using Android API (Application Programming Interface) call sequences: they encode each API call as a bit vector, split and combine the vectors into an n × m matrix that serves as input to a convolutional neural network model, and finally use the trained classifier to judge whether the software is malicious.
These detection methods have achieved some success in Android malware detection, but they have two main problems. First, feature extraction gives too little consideration to the correlation among different kinds of software features. Most existing methods extract a single type of feature to characterize the behavior of Android software rather than analyzing the software with several types of features in cooperation, so the extracted features represent only one aspect of the software and the accuracy of the detection result is limited. Second, the trained neural network models are complex and require a large amount of parameter tuning and optimization, so they are inefficient and a well-trained model takes a long time to obtain.
Therefore, given the large volume of Android malware, how to detect it accurately and efficiently is a problem of considerable importance.
Disclosure of Invention
The technical problem to be solved by the invention is, for Android malware, to generate frequency fingerprints that can uniquely identify the software, to train and optimize a multi-kernel support vector machine model on these fingerprints, and thereby to detect Android malware accurately while effectively improving detection speed.
The technical scheme of the invention is as follows. An Android malware detection system based on frequency fingerprint extraction, composed of a sample preprocessing module, a frequency fingerprint generation module and a detection module, is constructed; malicious and benign Android software is collected as samples, and a benchmark test set is constructed. The samples in the set are decompressed to obtain the AndroidManifest.xml, classes.dex and .so library files, and permission, API, smali opcode and arm opcode features are extracted. Whether each of the four kinds of features appears, and how frequently, is recorded to form four feature vectors of different types, which are concatenated end to end into one long vector that serves as the frequency fingerprint of the Android software. The frequency fingerprints of the samples in the benchmark test set are used to train and optimize the detection module (a multi-kernel support vector machine model) into a classifier, which detects a sample under test and outputs whether it is malware.
The invention comprises the following steps:
First, construct an Android malware detection system based on frequency fingerprint extraction. The system is installed on the server of the official Google application market or of a third-party Android application market and consists of a sample preprocessing module, a frequency fingerprint generation module and a detection module.
The sample preprocessing module is connected with the frequency fingerprint generation module. It receives samples from the benchmark test set constructed by developers and samples to be detected submitted by ordinary users, preprocesses the samples to generate three types of files (the AndroidManifest.xml file, the smali file and the arm instruction file), and sends these files to the frequency fingerprint generation module.
The frequency fingerprint generation module is connected with the sample preprocessing module and the detection module. It receives the AndroidManifest.xml files, smali files and arm instruction files from the sample preprocessing module, performs feature screening and frequency fingerprint calculation (a frequency fingerprint is a vector that can serve as the identity of a piece of Android software), and outputs the resulting frequency fingerprints to the detection module. The frequency fingerprint generation module consists of a feature screening module and a frequency fingerprint calculation module. The feature screening module is connected with the sample preprocessing module and the frequency fingerprint calculation module; it receives the AndroidManifest.xml files, smali files and arm instruction files from the sample preprocessing module, screens features from the three kinds of files to obtain the permission, API, smali opcode and arm opcode features, and sends these features to the frequency fingerprint calculation module. The frequency fingerprint calculation module is connected with the sample preprocessing module, the feature screening module and the detection module; it receives the permission, API, smali opcode and arm opcode features from the feature screening module, receives the AndroidManifest.xml files, smali files and arm instruction files from the sample preprocessing module, calculates the frequency fingerprints, and sends them to the detection module.
The detection module is connected with the frequency fingerprint generation module and is a multi-kernel support vector machine model. It receives the frequency fingerprints of the benchmark test set D and the frequency fingerprint of the software to be detected from the frequency fingerprint generation module, is trained and optimized on the frequency fingerprints of D to form a classifier suitable for the software to be detected, and then classifies the software to be detected according to its frequency fingerprint to determine whether it is malware.
Second, construct the benchmark test set D, as follows:
Step 2.1, obtain $N_1$ pieces of Android malware from the open-source Drebin, Genome and AMD data sets as malicious samples, where $N_1$ is a positive integer and $N_1 > 1000$.
Step 2.2, obtain benign software by crawling the Google Play and Apkpure application stores, and filter it with local antivirus software and the VirusTotal online antivirus website to form $N_2$ benign samples, where $N_2$ is a positive integer and $N_2 > 1000$.
Step 2.3, add labels to the malicious and benign samples to form the benchmark test set D, where N is the total number of samples in D and $N = N_1 + N_2$. Define $x^{(i)}$ as the i-th sample in D and $y^{(i)}$ as the label of $x^{(i)}$: $y^{(i)} = 1$ denotes that $x^{(i)}$ is a malicious sample and $y^{(i)} = -1$ denotes that $x^{(i)}$ is a benign sample, $1 \le i \le N$.
Step 2.4, store D in a memory readable by both the sample preprocessing module and the frequency fingerprint generation module.
Third, the sample preprocessing module preprocesses the N samples in D to obtain N AndroidManifest.xml files, N smali files and N arm instruction files.
Step 3.1, let the variable i = 1.
Step 3.2, take the i-th sample $x^{(i)}$ from D.
Step 3.3, preprocess $x^{(i)}$ with the sample preprocessing method to obtain the AndroidManifest.xml file, smali file and arm instruction file of $x^{(i)}$, as follows:
Step 3.3.1, use a decompression tool (e.g., Gzip or 7zip) to decompress $x^{(i)}$ and extract the AndroidManifest.xml, classes.dex and .so runtime library files from $x^{(i)}$.
Step 3.3.2, use AXMLPrinter2, a decompilation tool specific to AndroidManifest.xml files (download address: https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/android4me/AXMLPrinter2.jar, version 2.0 or above), to convert the AndroidManifest.xml file from binary form to text form.
Step 3.3.3, use the dex decompilation tool baksmali (https://bitbucket.org/JesusFreke/smali/downloads/baksmali-2.4.0.jar, version 2.4.0 or above) to decompile classes.dex into smali files. If several smali files are generated, merge them into one smali file and go to step 3.3.4; if only one smali file is generated, go directly to step 3.3.4.
Step 3.3.4, use the arm disassembly tool gcc-arm-none-eabi (https://developer.arm.com/-/media/Files/downloads/gnu-rm/9-2019q4/gcc-arm-none-eabi-9-2019-q4-major-x86_64-linux.tar.bz2, version 9-2019-q4-major or above) to disassemble the .so runtime library files into arm instruction files in text form. If several arm instruction files are generated, merge them into one arm instruction file and go to step 3.4; if no arm instruction file is generated, create an empty arm instruction file and go to step 3.4.
Step 3.4, let i = i + 1. If i ≤ N, go to step 3.2; if i > N, the N samples of D have produced N corresponding AndroidManifest.xml files, N smali files and N arm instruction files; send these files to the feature screening module and go to the fourth step.
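As an illustration only, the decompression in step 3.3.1 can be sketched in Python. The sketch relies on the fact that an APK is an ordinary zip archive and uses the standard `zipfile` module in place of Gzip or 7zip; the function name `extract_apk_members` and the layout of the returned dictionary are hypothetical, and the decompilation in steps 3.3.2 to 3.3.4 is left to the external tools named above.

```python
import zipfile
from pathlib import Path


def extract_apk_members(apk_path, out_dir):
    """Extract the manifest, dex files and .so libraries from one APK.

    Hypothetical helper: returns a dict that maps a file kind to the list
    of extracted paths, so that later steps can hand them to the tools.
    """
    out_dir = Path(out_dir)
    extracted = {"manifest": [], "dex": [], "so": []}
    with zipfile.ZipFile(apk_path) as apk:
        for name in apk.namelist():
            if name == "AndroidManifest.xml":
                kind = "manifest"
            elif name.endswith(".dex"):
                kind = "dex"
            elif name.endswith(".so"):
                kind = "so"
            else:
                continue  # resources, signatures, etc. are not needed here
            # ZipFile.extract returns the path of the file it created
            extracted[kind].append(Path(apk.extract(name, out_dir)))
    return extracted
```

A sample with no native code simply yields an empty `"so"` list, which corresponds to the case in step 3.3.4 where an empty arm instruction file is created.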
Fourth, the feature screening module performs feature screening on the N AndroidManifest.xml files, N smali files and N arm instruction files that correspond to the N samples of D and were received from the sample preprocessing module, to obtain permission features, API features, smali opcode features and arm opcode features suitable for classifying D.
Step 4.1, select the 167 Android system permissions defined in the Android developer documentation (https://developer.android.com/reference/android/Manifest.permission) and take these 167 permissions as features, called permission features.
Step 4.2, select 256 APIs from the APIs of the PScout list (https://security.csl.toronto.edu/pscout/) as API features, as follows:
Step 4.2.1, build a list $L_{api}$ and add all 32437 APIs in the PScout list to $L_{api}$; denote the v-th API as $L_{api}[v]$, $1 \le v \le 32437$.
Step 4.2.2, build a two-dimensional array $Z_{api}$ with 32437 rows and N columns. The element in row v and column i, $Z_{api}[v][i]$, is 1 or 0: 1 means that the v-th API of $L_{api}$ appears in the i-th sample of D, and 0 means that it does not.
Step 4.2.3, initialize all elements of $Z_{api}$ to 0 and let the variable i = 1.
Step 4.2.4, scan the smali file of the i-th sample of D line by line to find the APIs of $L_{api}$ that appear in the sample, and assign the elements of the i-th column of $Z_{api}$. Denote the u-th line of the smali file as str[u] and the total number of lines of the smali file as U, $1 \le u \le U$. The method is:
Step 4.2.4.1, let u = 1.
Step 4.2.4.2, if str[u] is an API string, go to step 4.2.4.2.1; if str[u] is not an API string, go to step 4.2.4.3.
Step 4.2.4.2.1, let the variable v = 1.
Step 4.2.4.2.2, if str[u] contains $L_{api}[v]$ as a substring, let $Z_{api}[v][i] = 1$ and go to step 4.2.4.3; otherwise go to step 4.2.4.2.3.
Step 4.2.4.2.3, let v = v + 1. If v ≤ 32437, go to step 4.2.4.2.2; if v > 32437, go to step 4.2.4.3.
Step 4.2.4.3, let u = u + 1. If u ≤ U, go to step 4.2.4.2; if u > U, the smali file of the i-th sample has been fully scanned; go to step 4.2.5.
Step 4.2.5, let i = i + 1. If i ≤ N, go to step 4.2.4; if i > N, the two-dimensional array $Z_{api}$ is complete; go to step 4.2.6.
Step 4.2.6, calculate the information gain (IG) of each API in the list $L_{api}$ with respect to the benchmark test set D. The information gain of the v-th API with respect to D is denoted $IG(D|L_{api}[v])$.
Step 4.2.6.1, let v = 1.
Step 4.2.6.2, let i = 1. Let the first variable $M_{11} = 0$, the second variable $M_{12} = 0$, the third variable $M_{21} = 0$ and the fourth variable $M_{22} = 0$.
Step 4.2.6.3, if $Z_{api}[v][i] = 1$ and $y^{(i)} = 1$, let $M_{11} = M_{11} + 1$; if $Z_{api}[v][i] = 1$ and $y^{(i)} = -1$, let $M_{12} = M_{12} + 1$; if $Z_{api}[v][i] = 0$ and $y^{(i)} = 1$, let $M_{21} = M_{21} + 1$; if $Z_{api}[v][i] = 0$ and $y^{(i)} = -1$, let $M_{22} = M_{22} + 1$.
Step 4.2.6.4, let i = i + 1. If i ≤ N, go to step 4.2.6.3; if i > N, go to step 4.2.6.5.
Step 4.2.6.5, compute $IG(D|L_{api}[v])$ as follows:
$$IG(D|L_{api}[v]) = H(D) - H(D|L_{api}[v]) \tag{1}$$
where $H(D)$ is the empirical entropy of the benchmark test set D, calculated as:
$$H(D) = -\frac{N_1}{N}\log_2\frac{N_1}{N} - \frac{N_2}{N}\log_2\frac{N_2}{N} \tag{2}$$
$H(D|L_{api}[v])$ is the empirical conditional entropy of D given the v-th API of the list $L_{api}$, calculated as:
$$H(D|L_{api}[v]) = -\sum_{a=1}^{2}\frac{M_{a1}+M_{a2}}{N}\sum_{b=1}^{2}\frac{M_{ab}}{M_{a1}+M_{a2}}\log_2\frac{M_{ab}}{M_{a1}+M_{a2}} \tag{3}$$
where any term with $M_{ab} = 0$ is taken to be 0.
Step 4.2.6.6, let v = v + 1. If v ≤ 32437, go to step 4.2.6.2; if v > 32437, the information gain of every API in $L_{api}$ with respect to D has been calculated: sort the APIs in $L_{api}$ by $IG(D|L_{api}[v])$ in descending order, take the top 256 APIs as the API features, and go to step 4.3.
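The screening in steps 4.2.6.1 to 4.2.6.6 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes labels of 1 (malicious) and -1 (benign) as defined in step 2.3, a base-2 logarithm, and rows of presence flags shaped like the rows of the array built in steps 4.2.2 to 4.2.5; the function names are hypothetical.

```python
import math


def entropy(counts):
    """Empirical entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(present, labels):
    """IG of one API: present[i] is 0/1, labels[i] is 1 or -1."""
    n = len(labels)
    pairs = list(zip(present, labels))
    m11 = sum(1 for z, y in pairs if z == 1 and y == 1)   # present, malicious
    m12 = sum(1 for z, y in pairs if z == 1 and y == -1)  # present, benign
    m21 = sum(1 for z, y in pairs if z == 0 and y == 1)   # absent, malicious
    m22 = sum(1 for z, y in pairs if z == 0 and y == -1)  # absent, benign
    h_d = entropy([m11 + m21, m12 + m22])
    h_cond = ((m11 + m12) / n) * entropy([m11, m12]) \
        + ((m21 + m22) / n) * entropy([m21, m22])
    return h_d - h_cond


def top_k_features(z, labels, k):
    """Indices of the k rows of z with the largest information gain."""
    gains = [information_gain(row, labels) for row in z]
    order = sorted(range(len(z)), key=lambda v: gains[v], reverse=True)
    return order[:k]
```

An API that appears in every malicious sample and in no benign sample of a balanced set gets the maximum gain of 1 bit, whereas an API that is equally common in both classes gets a gain of 0 and is ranked last.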
Step 4.3, the Android Dalvik virtual machine predefines smali opcodes with a length of 8 binary bits (https://developer.android.com/reference/dalvik/bytecode/Opcodes.html), so that there are at most 256 opcodes, including reserved and undefined ones. These 256 smali opcodes are taken as features, called smali opcode features.
Step 4.4, according to the arm instruction quick reference manual (http://infocenter.arm.com/help/topic/com.arm.doc.QRC0001mc/QRC0001_UAL.pdf), the feature screening module selects all 197 arm instruction opcodes listed in the manual as features, called arm opcode features.
Step 4.5, send the permission features, API features, smali opcode features and arm opcode features to the frequency fingerprint calculation module.
Fifth, determine the format of the frequency fingerprint.
The 167 permission features, 256 API features, 256 smali opcode features and 197 arm opcode features are each arranged in alphabetical order to form vectors, called respectively the permission vector, API vector, smali opcode vector and arm opcode vector of the Android software.
The permission vector of a piece of Android software consists of 167 integers, each of which is 1 or 0. If the integer at position pa is 1, the software requests the pa-th of the 167 screened permissions; if the integer at position pa is 0, the software does not request that permission. pa is an integer and 1 ≤ pa ≤ 167.
The API vector of a piece of Android software consists of 256 decimals; the decimal at position pb gives the frequency with which the pb-th of the 256 screened APIs occurs in the software. pb is an integer and 1 ≤ pb ≤ 256.
The smali opcode vector of a piece of Android software consists of 256 decimals; the decimal at position pc gives the frequency with which the pc-th of the 256 screened smali opcodes occurs in the software. pc is an integer and 1 ≤ pc ≤ 256.
The arm opcode vector of a piece of Android software consists of 197 decimals; the decimal at position pd gives the frequency with which the pd-th of the 197 screened arm opcodes occurs in the software. pd is an integer and 1 ≤ pd ≤ 197.
The four vectors are concatenated end to end to form a vector of length 876 (167 + 256 + 256 + 197), which serves as the identity of the sample and is called the frequency fingerprint. The 167 integers and 709 decimals contained in a frequency fingerprint are called the elements of the frequency fingerprint.
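The layout described above can be sketched in a few lines of Python. The function name `make_fingerprint` and the length check are illustrative assumptions; only the four lengths and the concatenation order come from the text.

```python
# Lengths of the four component vectors, as stated in the fifth step.
N_PERM, N_API, N_SMALI, N_ARM = 167, 256, 256, 197


def make_fingerprint(perm_vec, api_vec, smali_vec, arm_vec):
    """Concatenate the four vectors end to end into one fingerprint."""
    parts = [(perm_vec, N_PERM), (api_vec, N_API),
             (smali_vec, N_SMALI), (arm_vec, N_ARM)]
    fingerprint = []
    for vec, size in parts:
        if len(vec) != size:
            raise ValueError(f"expected {size} elements, got {len(vec)}")
        fingerprint.extend(vec)
    return fingerprint  # 167 + 256 + 256 + 197 = 876 elements
```

With zero-based indexing, elements 0 to 166 hold the permission flags, 167 to 422 the API frequencies, 423 to 678 the smali opcode frequencies, and 679 to 875 the arm opcode frequencies.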
Sixth, the frequency fingerprint calculation module receives the permission features, API features, smali opcode features and arm opcode features from the feature screening module, receives the AndroidManifest.xml files, smali files and arm instruction files from the sample preprocessing module, and calculates the frequency fingerprints of the N samples in D, as follows:
Step 6.1, let $L_a$ be the permission list, whose member $L_a[pa]$ is the name string of the pa-th of the 167 permissions in alphabetical order; let $L_b$ be the API list, whose member $L_b[pb]$ is the name string of the pb-th of the 256 APIs in alphabetical order; let $L_c$ be the smali opcode list, whose member $L_c[pc]$ is the name string of the pc-th of the 256 smali opcodes in alphabetical order; and let $L_d$ be the arm opcode list, whose member $L_d[pd]$ is the name string of the pd-th of the 197 arm opcodes in alphabetical order. Let the variable i = 1.
Step 6.2, take the i-th sample $x^{(i)}$ in D and generate the frequency fingerprint $F^{(i)}$ of $x^{(i)}$. $F^{(i)}$ contains 876 elements, each initialized to 0. Denote the permission vector in $F^{(i)}$ as $F^{(i)}_{perm}$ and its pa-th element as $F^{(i)}_{perm}[pa]$; the API vector as $F^{(i)}_{api}$ and its pb-th element as $F^{(i)}_{api}[pb]$; the smali opcode vector as $F^{(i)}_{smali}$ and its pc-th element as $F^{(i)}_{smali}[pc]$; and the arm opcode vector as $F^{(i)}_{arm}$ and its pd-th element as $F^{(i)}_{arm}[pd]$.
Step 6.3, use the permission extraction method to extract the permissions requested by $x^{(i)}$ and obtain the permission vector $F^{(i)}_{perm}$ of $x^{(i)}$, as follows:
Step 6.3.1, scan the AndroidManifest.xml file of $x^{(i)}$ line by line. Denote the qa-th line of the file as stra[qa] and the total number of lines of the AndroidManifest.xml file as numa.
Step 6.3.2, let qa = 1.
Step 6.3.3, if stra[qa] contains the substring "uses-permission", let pa = 1 and go to step 6.3.4; if stra[qa] does not contain the substring "uses-permission", go to step 6.3.6.
Step 6.3.4, if stra[qa] contains $L_a[pa]$ as a substring, $x^{(i)}$ requests the permission $L_a[pa]$: let $F^{(i)}_{perm}[pa] = 1$ and go to step 6.3.6; if stra[qa] does not contain $L_a[pa]$, go to step 6.3.5.
Step 6.3.5, let pa = pa + 1. If pa ≤ 167, go to step 6.3.4; if pa > 167, one pass over $L_a$ is complete; go to step 6.3.6.
Step 6.3.6, let qa = qa + 1. If qa ≤ numa, go to step 6.3.3; if qa > numa, the AndroidManifest.xml file of $x^{(i)}$ has been fully scanned and $F^{(i)}_{perm}$ has been calculated; go to step 6.4.
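The scan in step 6.3 might look like the following sketch. It assumes the manifest has already been converted to text and split into lines, and that `permissions` is the alphabetically sorted list of permission name strings; the function name is hypothetical.

```python
def permission_vector(manifest_lines, permissions):
    """Return a 0/1 list marking which listed permissions are requested."""
    vec = [0] * len(permissions)
    for line in manifest_lines:
        # Step 6.3.3: only lines that declare a permission are examined.
        if "uses-permission" not in line:
            continue
        # Steps 6.3.4 and 6.3.5: walk the list until the first match.
        for pa, name in enumerate(permissions):
            if name in line:
                vec[pa] = 1
                break  # move on to the next manifest line
    return vec
```

Because the procedure stops at the first substring match, a permission whose name is a prefix of another (for example `BLUETOOTH` and `BLUETOOTH_ADMIN`) would be credited to the shorter name when the list is in alphabetical order. An implementation may want to match the quoted attribute value exactly; that refinement is not part of the procedure described here.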
Step 6.4, use the API statistical method to count the APIs used by $x^{(i)}$ and obtain the API vector $F^{(i)}_{api}$ of $x^{(i)}$, as follows:
Step 6.4.1, scan the smali file corresponding to $x^{(i)}$ line by line. Denote the qb-th line of the smali file as strb[qb] and the total number of lines of the smali file as numb.
Step 6.4.2, let qb = 1; use the variable inv to represent the total number of API invocations in the smali file, and let inv = 1.
Step 6.4.3, let the variable pb = 1.
Step 6.4.4, if strb[qb] contains the substring "invoke", let inv = inv + 1 and go to step 6.4.5; if it does not contain the substring "invoke", go to step 6.4.7.
Step 6.4.5, if strb[qb] contains $L_b[pb]$ as a substring, $x^{(i)}$ calls the API named $L_b[pb]$: let $F^{(i)}_{api}[pb] = F^{(i)}_{api}[pb] + 1$ and go to step 6.4.7; if strb[qb] does not contain $L_b[pb]$, go to step 6.4.6.
Step 6.4.6, let pb = pb + 1. If pb ≤ 256, go to step 6.4.5; if pb > 256, one pass over $L_b$ is complete; go to step 6.4.7.
Step 6.4.7, let qb = qb + 1. If qb ≤ numb, go to step 6.4.3; if qb > numb, the smali file corresponding to $x^{(i)}$ has been fully scanned; go to step 6.4.8.
Step 6.4.8, let pb = 1.
Step 6.4.9, let $F^{(i)}_{api}[pb] = F^{(i)}_{api}[pb] / inv$.
Step 6.4.10, let pb = pb + 1. If pb ≤ 256, go to step 6.4.9; if pb > 256, $F^{(i)}_{api}$ has been calculated; go to step 6.5.
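A compact sketch of step 6.4 follows, under the assumption that each matching invocation increments a count that is later divided by the invocation total `inv`. The function name is hypothetical, and `apis` stands for the alphabetically sorted list of screened API name strings.

```python
def api_frequency_vector(smali_lines, apis):
    """Frequency of each screened API among the invoke lines of a file."""
    counts = [0] * len(apis)
    inv = 1  # step 6.4.2 starts the invocation counter at 1
    for line in smali_lines:
        # Step 6.4.4: only lines containing "invoke" are API calls.
        if "invoke" not in line:
            continue
        inv += 1
        # Steps 6.4.5 and 6.4.6: credit the first API found in the line.
        for pb, name in enumerate(apis):
            if name in line:
                counts[pb] += 1
                break
    # Steps 6.4.8 to 6.4.10: turn the counts into frequencies.
    return [c / inv for c in counts]
```

Starting `inv` at 1 means that a smali file without any invocation yields an all-zero vector instead of a division by zero; it also means that the frequencies are slightly smaller than the exact ratios.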
Step 6.5, use the smali opcode statistical method to count the smali opcodes used by $x^{(i)}$ and obtain the smali opcode vector $F^{(i)}_{smali}$ of $x^{(i)}$, as follows:
Step 6.5.1, scan the smali file corresponding to $x^{(i)}$ line by line. Denote the qc-th line of the smali file as strc[qc] and the total number of lines of the smali file as numc.
Step 6.5.2, let qc = 1; use the variable ops to represent the total number of smali opcodes in the smali file, and let ops = 1.
Step 6.5.3, let pc = 1.
Step 6.5.4, if strc[qc] contains $L_c[pc]$ as a substring, let $F^{(i)}_{smali}[pc] = F^{(i)}_{smali}[pc] + 1$ and ops = ops + 1, and go to step 6.5.6; if strc[qc] does not contain $L_c[pc]$, go to step 6.5.5.
Step 6.5.5, let pc = pc + 1. If pc ≤ 256, go to step 6.5.4; if pc > 256, one pass over $L_c$ is complete; go to step 6.5.6.
Step 6.5.6, let qc = qc + 1. If qc ≤ numc, go to step 6.5.3; if qc > numc, the smali file corresponding to $x^{(i)}$ has been fully scanned; go to step 6.5.7.
Step 6.5.7, let pc = 1.
Step 6.5.8, let $F^{(i)}_{smali}[pc] = F^{(i)}_{smali}[pc] / ops$.
Step 6.5.9, let pc = pc + 1. If pc ≤ 256, go to step 6.5.8; if pc > 256, $F^{(i)}_{smali}$ has been calculated; go to step 6.6.
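Step 6.5 differs from the API count in that every line is examined and the total `ops` grows only when an opcode is matched. The following sketch makes that difference explicit; the function name is hypothetical, and `opcodes` stands for the alphabetically sorted list of smali opcode names.

```python
def smali_opcode_vector(smali_lines, opcodes):
    """Frequency of each listed smali opcode among the matched lines."""
    counts = [0] * len(opcodes)
    ops = 1  # step 6.5.2 starts the opcode counter at 1
    for line in smali_lines:
        # Steps 6.5.4 and 6.5.5: credit the first opcode found in the line.
        for pc, name in enumerate(opcodes):
            if name in line:
                counts[pc] += 1
                ops += 1
                break
    # Steps 6.5.7 to 6.5.9: normalise the counts by the opcode total.
    return [c / ops for c in counts]
```

Since `ops` starts at 1, the frequencies of a file with m matched lines sum to m / (m + 1) rather than to 1.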
Step 6.6: use the arm opcode statistical method to count the arm opcodes used by x(i) and obtain the arm opcode vector of x(i),
Figure BDA0002431352400000105
The method is as follows:
Step 6.6.1: scan the arm file corresponding to x(i) line by line; denote the qd-th line of the arm file as the string strd[qd], and denote the total number of lines of the arm file as numd.
Step 6.6.2: let qd = 1; use a variable opa to represent the total number of arm opcodes used in the arm file, and let opa = 1. If qd ≤ numd, go to step 6.6.3; if qd > numd, the arm file corresponding to x(i) is an empty file; go to step 6.7.
Step 6.6.3: let pd = 1.
Step 6.6.4: if strd[qd] contains the ">" character, strd[qd] contains an arm instruction; let opa = opa + 1 and go to step 6.6.5. If strd[qd] does not contain the ">" character, go to step 6.6.7.
Step 6.6.5: if strd[qd] contains a substring whose content is Ld[pd], let
Figure BDA0002431352400000106
Figure BDA0002431352400000107
and go to step 6.6.7; if strd[qd] does not contain a substring whose content is Ld[pd], go to step 6.6.6.
Step 6.6.6: let pd = pd + 1. If pd ≤ 197, go to step 6.6.5; if pd > 197, one pass over Ld is complete; go to step 6.6.7.
Step 6.6.7: let qd = qd + 1. If qd ≤ numd, go to step 6.6.3; if qd > numd, the arm file corresponding to x(i) has been fully scanned; go to step 6.6.8.
Step 6.6.8: let pd = 1.
Step 6.6.9: let
Figure BDA0002431352400000111
Step 6.6.10: let pd = pd + 1. If pd ≤ 197, go to step 6.6.9; if pd > 197,
Figure BDA0002431352400000112
has been computed; go to step 6.7.
Step 6.7: let i = i + 1. If i ≤ N, go to step 6.2; if i > N, frequency fingerprints have been computed for all N samples in D; send the frequency fingerprints to the detection module and go to the seventh step.
Seventh step: the detection module receives the frequency fingerprints from the frequency fingerprint generation module and trains the multi-kernel support vector machine model so that it becomes a classifier suitable for classifying the software to be detected. The multi-kernel support vector machine model is a classification model that builds on the support vector machine and uses several kernel functions to map feature-space vectors from a low dimension to a high dimension, which enhances classification ability. For the benchmark test set D, the feature space is the set of frequency fingerprints of the N samples in D. Let kperm, kapi, ksmali, karm denote the kernel functions used for the permission vector, the API vector, the smali opcode vector and the arm opcode vector of the frequency fingerprint, respectively. β is the weight vector, β = (βperm, βapi, βsmali, βarm), whose elements βperm, βapi, βsmali, βarm are the weights of kperm, kapi, ksmali, karm, respectively. Let T be the set {perm, api, smali, arm} (perm, api, smali, arm are the subscripts of kperm, kapi, ksmali, karm, and T is used to write equation (4) compactly). The multi-kernel support vector machine model Y can be expressed as:
Figure BDA0002431352400000113
α(i) is a Lagrange multiplier, and {α(1), α(2), ..., α(i), ..., α(N)} forms the vector α. sgn(a) is a step function of the argument a: sgn(a) = 1 when a > 0, sgn(a) = 0 when a = 0, and sgn(a) = -1 when a < 0. α and β are obtained by solving equation (5):
Figure BDA0002431352400000114
the constraint conditions of formula (5) are formula (6) to formula (9):
Figure BDA0002431352400000115
0≤α(i)≤C (7)
Σ_{t∈T} βt = 1 (8)
βt≥0,t∈T (9)
where C is the penalty coefficient, C ≥ 0, which sets the size of the penalty for misclassification.
b is a scalar; once α and β have been obtained, b is given by the following equation:
Figure BDA0002431352400000121
wherein,
Figure BDA0002431352400000122
are support vector sample points.
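For orientation, a multiple-kernel support vector machine with the constraints (7) to (9) is commonly written as shown below. This is the textbook form and is only an assumption about the shape of equations (4), (5), (6) and (10); the patent's own expressions may differ in sign or scaling conventions. Here k_t is applied to sub-vector t of each frequency fingerprint.

```latex
% Assumed textbook form of a multiple-kernel SVM; not taken from the patent figures.
\begin{align*}
Y(x) &= \operatorname{sgn}\Big(\sum_{i=1}^{N} \alpha^{(i)} y^{(i)} \sum_{t \in T} \beta_t\, k_t\big(x^{(i)}, x\big) + b\Big) \\
L(\alpha, \beta) &= \sum_{i=1}^{N} \alpha^{(i)} - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \sum_{t \in T} \beta_t\, k_t\big(x^{(i)}, x^{(j)}\big) \\
\sum_{i=1}^{N} \alpha^{(i)} y^{(i)} &= 0 \\
b &= y^{(j)} - \sum_{i=1}^{N} \alpha^{(i)} y^{(i)} \sum_{t \in T} \beta_t\, k_t\big(x^{(i)}, x^{(j)}\big), \quad 0 < \alpha^{(j)} < C
\end{align*}
```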
The method for training the multi-kernel support vector machine model is as follows:
Step 7.1: compute the kernel matrices from the frequency fingerprints of the samples in D received from the frequency fingerprint generation module. Let Kt, t ∈ T, denote a kernel matrix, standing for the four kernel matrices Kperm, Kapi, Ksmali and Karm. Kt has N rows and N columns, and its element in row i, column j is
Figure BDA0002431352400000123
A polynomial kernel function of degree 3 is selected. Kt is computed as follows:
Step 7.1.1: let i = 1.
Step 7.1.2: let j = 1.
Step 7.1.3: compute
Figure BDA0002431352400000124
Figure BDA0002431352400000125
where
Figure BDA0002431352400000126
denotes the inner product of
Figure BDA0002431352400000127
and
Figure BDA0002431352400000128
Step 7.1.4: let j = j + 1. If j ≤ N, go to step 7.1.3; if j > N, go to step 7.1.5.
Step 7.1.5: let i = i + 1. If i ≤ N, go to step 7.1.2; if i > N, the computation of Kt is finished; go to step 7.2.
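The double loop of step 7.1 can be sketched as follows. The text fixes only the kernel type (polynomial, degree 3); the exact kernel expression is in a figure, so the common form (<u, v> + 1)^3 used here, including the offset `coef0`, is an assumption. The function name and toy vectors are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel_matrix(vectors, degree=3, coef0=1.0):
    """Return the N x N kernel matrix for one sub-vector type t.

    Assumption: K[i][j] = (<x_i, x_j> + coef0) ** degree.
    """
    n = len(vectors)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):          # rows, as in steps 7.1.1 and 7.1.5
        for j in range(n):      # columns, as in steps 7.1.2 and 7.1.4
            K[i][j] = (dot(vectors[i], vectors[j]) + coef0) ** degree
    return K


# Toy permission sub-vectors for three samples (hypothetical values).
K_perm = polynomial_kernel_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

In the method, one such matrix would be computed for each of the four sub-vectors (perm, api, smali, arm).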
Step 7.2: optimize the parameters α and β as follows:
Step 7.2.1: initialize every element of the vector α to 0 and every element of the vector β to 1/4.
Step 7.2.2: using equation (5), select a pair α(r), α(s) in order of increasing superscripts r, s and optimize α, holding (α(1), α(2), ..., α(r-1), α(r+1), ..., α(s-1), α(s+1), ..., α(N)) and the vector β fixed. The optimization method is as follows:
Step 7.2.2.1: using the constraint of equation (6), equation (5) becomes a univariate quadratic function g(α(r)) of α(r); differentiate g(α(r)), set the derivative equal to 0, and solve for α(r).
Step 7.2.2.2: solve for α(s) using the constraint of equation (6).
Step 7.2.2.3: update α(r) and α(s) to obtain the optimized α, named α*.
Step 7.2.3: holding α* fixed, optimize β as follows:
Step 7.2.3.1: take the partial derivatives of equation (5) with respect to β, set them equal to 0, and find the solution that satisfies the constraints of equations (8) and (9); the optimized values of βperm, βapi, βsmali, βarm are named, respectively,
Figure BDA0002431352400000131
Step 7.2.3.2: concatenate
Figure BDA0002431352400000132
into the optimized β, named β*.
Step 7.2.4: judge whether α and β satisfy the optimization termination conditions of equations (12) to (14):
Figure BDA0002431352400000133
Figure BDA0002431352400000134
L(α*, β*) - L(α, β) ≤ ε (14)
Equation (14) states that optimizing the parameters α and β changes the value of the function in equation (5) by less than the threshold ε, where 0 < ε ≤ 0.1. If the termination conditions are satisfied, the optimized α and β meet the requirement and the multi-kernel support vector machine model has been trained; go to step 7.3. Otherwise, go to step 7.2.2.
Step 7.3: compute the value of b using equation (10). Training and optimization of the multi-kernel support vector machine model defined by equation (4) is then complete, and the model forms the classifier.
Eighth step: use the Android malware detection system based on frequency fingerprint extraction to examine the software to be detected that the Google official or third-party Android application market server receives from users, and judge whether it is malware, as follows:
Step 8.1: the sample preprocessing module preprocesses the software to be detected. Taking the software to be detected as sample x(a), preprocess x(a) with the sample preprocessing method of step 3.3 to obtain the AndroidManifest.xml file, the smali file and the arm instruction file of x(a), and output them to the frequency fingerprint calculation module.
Step 8.2: the frequency fingerprint calculation module computes the frequency fingerprint of x(a),
Figure BDA0002431352400000137
The method is as follows:
Step 8.2.1: use the permission extraction method of step 6.3 to extract the permissions requested by x(a) and obtain the permission vector of x(a),
Figure BDA0002431352400000135
Step 8.2.2: use the API statistical method of step 6.4 to count the APIs used by x(a) and obtain the API vector of x(a),
Figure BDA0002431352400000136
Step 8.2.3: use the smali opcode statistical method of step 6.5 to count the smali opcodes used by x(a) and obtain the smali opcode vector of x(a),
Figure BDA0002431352400000141
Step 8.2.4: use the arm opcode statistical method of step 6.6 to count the arm opcodes used by x(a) and obtain the arm opcode vector of x(a),
Figure BDA0002431352400000142
Step 8.2.5: once
Figure BDA0002431352400000143
have been computed, concatenate them into the frequency fingerprint of x(a),
Figure BDA0002431352400000144
Step 8.3: input
Figure BDA0002431352400000145
into the detection module (by now an optimized classifier suitable for detection) and compute the value of the output F by equation (4). F equals +1 or -1, where +1 indicates that the software to be detected is malware and -1 indicates benign software, which settles whether the software to be detected is malware.
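Step 8.3 can be sketched as below, assuming that equation (4) has the usual multiple-kernel SVM form: the sign of a bias plus a weighted sum of per-component kernels. The kernel form (<u, v> + 1)^3, the function names and the toy values of α, β and b are illustrative assumptions and are not trained parameters.

```python
def sign(a):
    """Step function as defined in the text: 1, 0 or -1."""
    return 1 if a > 0 else (-1 if a < 0 else 0)


def poly_kernel(u, v, degree=3, coef0=1.0):
    """Assumed degree-3 polynomial kernel (<u, v> + coef0) ** degree."""
    return (sum(a * b for a, b in zip(u, v)) + coef0) ** degree


def classify(fingerprint, train_fingerprints, labels, alpha, beta, b):
    """Return +1 (malware) or -1 (benign) for one fingerprint.

    Each fingerprint is a dict mapping 'perm', 'api', 'smali', 'arm'
    to the corresponding sub-vector.
    """
    score = b
    for x_i, y_i, a_i in zip(train_fingerprints, labels, alpha):
        if a_i == 0:
            continue  # non-support vectors contribute nothing
        combined = sum(beta[t] * poly_kernel(x_i[t], fingerprint[t]) for t in beta)
        score += a_i * y_i * combined
    return sign(score)


# Toy training set with one-element sub-vectors (hypothetical values).
train = [
    {"perm": [1.0], "api": [1.0], "smali": [1.0], "arm": [1.0]},
    {"perm": [0.0], "api": [0.0], "smali": [0.0], "arm": [0.0]},
]
labels = [1, -1]
alpha = [0.5, 0.5]
beta = {"perm": 0.25, "api": 0.25, "smali": 0.25, "arm": 0.25}
b = -1.0
result = classify(train[0], train, labels, alpha, beta, b)
```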
Compared with other technologies, the invention has the following advantages:
First, high accuracy. The invention generates the frequency fingerprint by combining features of the permissions used, the APIs, the smali opcodes and the arm opcodes. The fingerprint accurately expresses the attribute characteristics of Android software and is suitable as an identity mark for Android software. The multi-kernel support vector machine trained on the frequency fingerprints serves as the classifier; it effectively integrates information from all components of the Android software and achieves an accurate detection result.
Second, high efficiency. The efficiency of the invention shows in two respects. First, frequency fingerprint generation is efficient: scanning the AndroidManifest.xml file, the smali file and the arm instruction file and counting the frequencies of the permissions, APIs, smali opcodes and arm opcodes can be completed in linear time. Second, training of the classification model is efficient: compared with the large number of parameters of a neural network model, the multi-kernel support vector machine model has fewer parameters, so the amount of computation during parameter optimization is low and training efficiency is markedly improved.
Drawings
FIG. 1 is a block diagram of an android malware detection system based on frequency fingerprint extraction.
Fig. 2 is a general flow diagram of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
The technical scheme of the invention is shown in figure 2, and comprises the following steps:
First step: construct the Android malware detection system based on frequency fingerprint extraction. The system is installed on a Google official or third-party Android application market server. Its overall structure is shown in FIG. 1; it consists of a sample preprocessing module, a frequency fingerprint generation module and a detection module.
The sample preprocessing module is connected to the frequency fingerprint generation module. It receives samples from the benchmark test set constructed by developers and samples to be detected submitted by ordinary users, preprocesses them, generates three types of files (the AndroidManifest.xml file, the smali file and the arm instruction file), and outputs these files to the frequency fingerprint generation module.
The frequency fingerprint generation module is connected to the sample preprocessing module and the detection module. It receives the AndroidManifest.xml file, the smali file and the arm instruction file from the sample preprocessing module, performs feature screening and frequency fingerprint calculation, generates the frequency fingerprint and outputs it to the detection module. The frequency fingerprint generation module consists of a feature screening module and a frequency fingerprint calculation module. The feature screening module is connected to the sample preprocessing module and the frequency fingerprint calculation module; it receives the AndroidManifest.xml files, smali files and arm instruction files from the sample preprocessing module, performs feature screening on the three kinds of files to obtain the permission, API, smali opcode and arm opcode features, and sends these features to the frequency fingerprint calculation module. The frequency fingerprint calculation module is connected to the sample preprocessing module, the feature screening module and the detection module; it receives the permission, API, smali opcode and arm opcode features from the feature screening module, receives the AndroidManifest.xml file, the smali file and the arm instruction file from the sample preprocessing module, computes the frequency fingerprint, and sends it to the detection module.
The detection module is connected to the frequency fingerprint generation module and is a multi-kernel support vector machine model. It receives the frequency fingerprints of the benchmark test set D and the frequency fingerprint of the software to be detected from the frequency fingerprint generation module, is trained and optimized on the frequency fingerprints of D to form a classifier suitable for detecting the software to be detected, and then classifies the software to be detected according to its frequency fingerprint to decide whether it is malware.
In fig. 1, the solid arrows from the sample preprocessing module to the frequency fingerprint generation module and the detection module are the flow of processing the samples in the benchmark test set D by the android malware detection system based on frequency fingerprint extraction, and the dotted arrows from the sample preprocessing module to the frequency fingerprint generation module and the detection module are the flow of processing the samples to be detected (as can be seen from the eighth step, the software to be detected does not need the feature screening module to perform feature screening).
Second step: construct the benchmark test set D, as follows:
Step 2.1: obtain N1 Android malware applications from the open-source Drebin, Genome and AMD data sets as malicious samples, where N1 is a positive integer and N1 = 2000.
Step 2.2: obtain benign software by crawling the Google Play and Apkpure application stores, and filter it by scanning with local antivirus software and the VirusTotal online antivirus website, forming N2 benign samples, where N2 is a positive integer and N2 = 2000.
Step 2.3: add labels to the malicious samples and the benign samples to form the benchmark test set D, where N is the total number of samples in D and N = N1 + N2. Define x(i) as the i-th sample in D and y(i) as the label of x(i); y(i) = 1 denotes that x(i) is a malicious sample and y(i) = -1 denotes that x(i) is a benign sample, 1 ≤ i ≤ N.
Step 2.4: store D on storage that both the sample preprocessing module and the frequency fingerprint generation module can read (for example, the storage of the Google official or third-party Android application market server on which the Android malware detection system based on frequency fingerprint extraction is installed).
Third step: use the sample preprocessing module to preprocess the N samples in D, obtaining N AndroidManifest.xml files, N smali files and N arm instruction files.
Step 3.1: let the variable i = 1.
Step 3.2: take the i-th sample x(i) from D.
Step 3.3: preprocess x(i) with the sample preprocessing method to obtain the AndroidManifest.xml file, the smali file and the arm instruction file of x(i), as follows:
Step 3.3.1: use the decompression tool Gzip to decompress x(i) and extract the AndroidManifest.xml, classes.dex and .so runtime library files in x(i).
Step 3.3.2: use AXMLPrinter2 version 2.0, a decompilation tool specific to AndroidManifest.xml files, to decompile the AndroidManifest.xml file from binary form into text form.
Step 3.3.3: use the dex decompilation tool baksmali version 2.4.0 to decompile classes.dex into smali files. If multiple smali files are generated, merge them into one smali file and go to step 3.3.4; if only one smali file is generated, go directly to step 3.3.4.
Step 3.3.4: use the arm disassembly tool gcc-arm-none-eabi version 9-2019-q4-major to disassemble the .so runtime library files into arm instruction files in text form. If multiple arm instruction files are generated, merge them into one arm instruction file and go to step 3.4; if no arm instruction file is generated, create a new empty arm instruction file and go to step 3.4.
Step 3.4: let i = i + 1. If i ≤ N, go to step 3.2; if i > N, the N samples have produced N corresponding AndroidManifest.xml files, N smali files and N arm instruction files; send these files for the N samples of D to the feature screening module and go to the fourth step.
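The extraction in step 3.3.1 can be sketched as follows. An APK is a ZIP archive, so this sketch swaps in Python's standard `zipfile` module for the decompression tool named in the text; that substitution, the function names and the toy archive are assumptions. The later decompilation steps (AXMLPrinter2, baksmali, the arm disassembler) are external tools and are not reproduced here.

```python
import io
import zipfile


def extract_apk_components(apk_bytes):
    """Return (manifest, dex, so_files) read from an APK given as bytes.

    manifest and dex are bytes or None; so_files maps each native
    library path under lib/ to its bytes.
    """
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        names = apk.namelist()
        manifest = apk.read("AndroidManifest.xml") if "AndroidManifest.xml" in names else None
        dex = apk.read("classes.dex") if "classes.dex" in names else None
        so_files = {
            n: apk.read(n)
            for n in names
            if n.startswith("lib/") and n.endswith(".so")
        }
    return manifest, dex, so_files


def build_toy_apk():
    """Build a tiny in-memory archive that mimics an APK layout (placeholder contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("AndroidManifest.xml", b"<binary manifest>")
        z.writestr("classes.dex", b"dex\n035")
        z.writestr("lib/armeabi-v7a/libnative.so", b"\x7fELF")
    return buf.getvalue()


manifest, dex, so_files = extract_apk_components(build_toy_apk())
```

An APK without native libraries yields an empty `so_files` dictionary, which corresponds to the empty arm instruction file case of step 3.3.4.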
Fourth step: the feature screening module performs feature screening on the N AndroidManifest.xml files, N smali files and N arm instruction files corresponding to the N samples of D received from the sample preprocessing module, obtaining the permission features, API features, smali opcode features and arm opcode features suitable for classifying D.
Step 4.1: select the 167 Android system permissions defined in the Android developer documentation (https://developer.android.com/reference/android/Manifest.permission) and take these 167 permissions as features, called the permission features.
Step 4.2: select 256 APIs from the APIs in the PScout list (https://security.csl.toronto.edu/pscout/), as follows:
Step 4.2.1: build a list Lapi and add all 32437 APIs in the PScout list to Lapi; the v-th API is denoted Lapi[v], 1 ≤ v ≤ 32437.
Step 4.2.2: build a two-dimensional array Zapi with 32437 rows and N columns. The element Zapi[v][i] in row v, column i is defined as 1 or 0: 1 indicates that the v-th API of Lapi appears in the i-th sample of D, and 0 indicates that it does not.
Step 4.2.3: initialize all elements of Zapi to 0 and initialize the variable i = 1.
Step 4.2.4: scan the smali file of the i-th sample of D line by line to find the APIs belonging to Lapi that appear in the i-th sample, and assign the i-th column of Zapi accordingly. Denote the u-th line of the smali file as the string str[u] and the total number of lines of the smali file as U, 1 ≤ u ≤ U.
Step 4.2.4.1: initialize u = 1.
Step 4.2.4.2: if str[u] is an API string, go to step 4.2.4.2.1; if str[u] is not an API string, go to step 4.2.4.3.
Step 4.2.4.2.1: initialize the variable v = 1.
Step 4.2.4.2.2: if str[u] contains a substring whose content is Lapi[v], assign Zapi[v][i] = 1 and go to step 4.2.4.3; otherwise, go to step 4.2.4.2.3.
Step 4.2.4.2.3: let v = v + 1. If v ≤ 32437, go to step 4.2.4.2.2; if v > 32437, go to step 4.2.4.3.
Step 4.2.4.3: let u = u + 1. If u ≤ U, go to step 4.2.4.2; if u > U, go to step 4.2.5.
Step 4.2.5: let i = i + 1. If i ≤ N, go to step 4.2.4; if i > N, the assignment of the two-dimensional array Zapi is complete; go to step 4.2.6.
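Steps 4.2.2 to 4.2.5 can be sketched as below. The text does not say how a line is recognized as an "API string"; this sketch assumes it is a line containing "invoke", by analogy with step 6.4.4. The function name and the two sample API names are illustrative only.

```python
def build_presence_matrix(api_list, smali_files):
    """Return Z where Z[v][i] is 1 if api_list[v] occurs in sample i, else 0."""
    Z = [[0] * len(smali_files) for _ in api_list]
    for i, smali_text in enumerate(smali_files):
        for line in smali_text.splitlines():
            if "invoke" not in line:  # assumption: API strings are invoke lines
                continue
            for v, api in enumerate(api_list):
                if api in line:
                    Z[v][i] = 1
                    break  # step 4.2.4.2.2: the first match ends the line
    return Z


# Hypothetical two-entry API list and two toy smali files.
apis = [
    "Landroid/telephony/SmsManager;->sendTextMessage",
    "Landroid/location/LocationManager;->getLastKnownLocation",
]
samples = [
    "invoke-virtual {v0}, Landroid/telephony/SmsManager;->sendTextMessage()V\nreturn-void",
    "const/4 v0, 0x0\nreturn-void",
]
Z_api = build_presence_matrix(apis, samples)
```

With 32437 candidate APIs the inner loop is a linear scan per invoke line, as in the steps; a dictionary keyed by the called method would be faster, but that would be a different matching rule.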
Step 4.2.6: compute the information gain IG of each API in the list Lapi with respect to the benchmark test set D. The information gain of the v-th API with respect to D is denoted IG(D|Lapi[v]).
Step 4.2.6.1: let v = 1.
Step 4.2.6.2: let i = 1. Let the first variable M11 = 0, the second variable M12 = 0, the third variable M21 = 0 and the fourth variable M22 = 0.
Step 4.2.6.3: if Zapi[v][i] = 1 and y(i) = 1, let M11 = M11 + 1; if Zapi[v][i] = 1 and y(i) = -1, let M12 = M12 + 1; if Zapi[v][i] = 0 and y(i) = 1, let M21 = M21 + 1; if Zapi[v][i] = 0 and y(i) = -1, let M22 = M22 + 1.
Step 4.2.6.4: let i = i + 1. If i ≤ N, go to step 4.2.6.3; if i > N, go to step 4.2.6.5.
Step 4.2.6.5: compute IG(D|Lapi[v]) as follows:
IG(D|Lapi[v]) = H(D) - H(D|Lapi[v]) (1)
where H(D) is the empirical entropy of the benchmark test set D, computed as follows:
Figure BDA0002431352400000181
H(D|Lapi[v]) is the empirical conditional entropy of D given the v-th API of the list Lapi; H(D|Lapi[v]) is:
Figure BDA0002431352400000182
Step 4.2.6.6: let v = v + 1. If v ≤ 32437, go to step 4.2.6.2; if v > 32437, the information gain with respect to D of every API in the list Lapi has been computed; sort the APIs in Lapi by IG(D|Lapi[v]) from largest to smallest, take the top 256 APIs as the API features, and go to step 4.3.
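The information-gain ranking of step 4.2.6 can be sketched as below from the four counts M11, M12, M21 and M22. Equations (2) and (3) are figures in the source, so the standard empirical entropy and conditional entropy with base-2 logarithms are assumed here; the function names are hypothetical, and labels are taken to be +1 (malicious) and -1 (benign) as defined in step 2.3.

```python
import math


def entropy(counts):
    """Empirical entropy (base 2, an assumption) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(m11, m12, m21, m22):
    """IG = H(D) - H(D | API).

    m11: API present, malicious   m12: API present, benign
    m21: API absent, malicious    m22: API absent, benign
    """
    n = m11 + m12 + m21 + m22
    h_d = entropy([m11 + m21, m12 + m22])
    h_cond = ((m11 + m12) / n) * entropy([m11, m12]) \
        + ((m21 + m22) / n) * entropy([m21, m22])
    return h_d - h_cond


def top_k_apis(api_list, Z, labels, k):
    """Rank APIs by information gain (descending) and keep the first k."""
    gains = []
    for v, api in enumerate(api_list):
        m11 = m12 = m21 = m22 = 0
        for z, y in zip(Z[v], labels):
            if z == 1 and y == 1:
                m11 += 1
            elif z == 1:
                m12 += 1
            elif y == 1:
                m21 += 1
            else:
                m22 += 1
        gains.append((information_gain(m11, m12, m21, m22), api))
    gains.sort(key=lambda g: g[0], reverse=True)
    return [api for _, api in gains[:k]]


# Toy data: API "a" separates the classes perfectly, API "b" not at all.
selected = top_k_apis(["a", "b"], [[1, 1, 0, 0], [1, 0, 1, 0]], [1, 1, -1, -1], 1)
```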
Step 4.3: the Android Dalvik virtual machine predefines the length of a smali opcode as 8 binary bits (https://developer.android.com/reference/dalvik/bytecode/Opcodes.html), so there are at most 256 opcodes, including reserved, undefined ones; take these 256 smali opcodes as features, called the smali opcode features.
Step 4.4: following the arm instruction quick reference card (http://infocenter.arm.com/help/topic/com.arm.doc.QRC0001mc/QRC0001_UAL.pdf), the feature screening module selects all 197 arm instruction opcodes listed there as features, called the arm opcode features.
Step 4.5: send the permission features, API features, smali opcode features and arm opcode features to the frequency fingerprint calculation module.
Fifth step: determine the frequency fingerprint format.
Arrange the 167 permission features, 256 API features, 256 smali opcode features and 197 arm opcode features each in alphabetical order to form vectors, called respectively the permission vector, API vector, smali opcode vector and arm opcode vector of the Android software. The four vectors are joined end to end to form a vector of length 876, which serves as the frequency fingerprint of the sample.
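The fingerprint layout described above (167 + 256 + 256 + 197 = 876 elements) can be sketched as follows. The constant and function names are hypothetical; only the four lengths and the concatenation order come from the text.

```python
PERM_LEN, API_LEN, SMALI_LEN, ARM_LEN = 167, 256, 256, 197
FINGERPRINT_LEN = PERM_LEN + API_LEN + SMALI_LEN + ARM_LEN  # 876


def make_fingerprint(perm_vec, api_vec, smali_vec, arm_vec):
    """Concatenate the four vectors end to end after checking their lengths."""
    parts = [
        (perm_vec, PERM_LEN),
        (api_vec, API_LEN),
        (smali_vec, SMALI_LEN),
        (arm_vec, ARM_LEN),
    ]
    for vec, expected in parts:
        if len(vec) != expected:
            raise ValueError(f"expected length {expected}, got {len(vec)}")
    return list(perm_vec) + list(api_vec) + list(smali_vec) + list(arm_vec)


def split_fingerprint(fp):
    """Recover the four sub-vectors from an 876-element fingerprint."""
    a = PERM_LEN
    b = a + API_LEN
    c = b + SMALI_LEN
    return fp[:a], fp[a:b], fp[b:c], fp[c:]


fp = make_fingerprint([0] * 167, [0.0] * 256, [0.0] * 256, [0.0] * 197)
```

Keeping the split function next to the constructor makes it straightforward to feed each sub-vector to its own kernel later on.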
Sixth step: the frequency fingerprint calculation module receives the permission features, API features, smali opcode features and arm opcode features from the feature screening module, receives the AndroidManifest.xml files, smali files and arm instruction files from the sample preprocessing module, and computes the frequency fingerprints of the samples in D, as follows:
Step 6.1: let La be the permission list, whose member La[pa] is the name string of the pa-th of the 167 permissions in alphabetical order; let Lb be the API list, whose member Lb[pb] is the name string of the pb-th of the 256 APIs in alphabetical order; let Lc be the smali opcode list, whose member Lc[pc] is the name string of the pc-th of the 256 smali opcodes in alphabetical order; let Ld be the arm opcode list, whose member Ld[pd] is the name string of the pd-th of the 197 arm opcodes in alphabetical order. Let the variable i = 1.
Step 6.2: take the i-th sample x(i) in D and generate the frequency fingerprint of x(i),
Figure BDA0002431352400000191
which contains 876 elements, each initialized to 0. In
Figure BDA00024313524000001911
the permission vector is denoted
Figure BDA0002431352400000192
and its pa-th element is denoted
Figure BDA0002431352400000193
the API vector is denoted
Figure BDA0002431352400000194
and its pb-th element is denoted
Figure BDA0002431352400000195
the smali opcode vector is denoted
Figure BDA0002431352400000196
and its pc-th element is denoted
Figure BDA0002431352400000197
the arm opcode vector is denoted
Figure BDA0002431352400000198
and its pd-th element is denoted
Figure BDA0002431352400000199
Step 6.3: use the permission extraction method to extract the permissions requested by x(i) and obtain the permission vector of x(i),
Figure BDA00024313524000001910
The method is as follows:
Step 6.3.1: scan the AndroidManifest.xml file of x(i) line by line; denote the qa-th line of the file as the string stra[qa], and denote the total number of lines of the AndroidManifest.xml file as numa.
Step 6.3.2: let qa = 1.
Step 6.3.3: if stra[qa] contains a substring whose content is "uses-permission", let pa = 1 and go to step 6.3.4; if stra[qa] does not contain a substring whose content is "uses-permission", go to step 6.3.6.
Step 6.3.4: if stra[qa] contains a substring whose content is La[pa], x(i) requests the permission La[pa]; let
Figure BDA0002431352400000201
and go to step 6.3.6; if stra[qa] does not contain a substring whose content is La[pa], go to step 6.3.5.
Step 6.3.5: let pa = pa + 1. If pa ≤ 167, go to step 6.3.4; if pa > 167, one pass over La is complete; go to step 6.3.6.
Step 6.3.6: let qa = qa + 1. If qa ≤ numa, go to step 6.3.3; if qa > numa, the AndroidManifest.xml file of x(i) has been fully scanned and
Figure BDA0002431352400000202
has been computed; go to step 6.4.
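The permission scan of step 6.3 can be sketched as below. The assignment made in step 6.3.4 is a figure in the source; this sketch assumes the matching element is set to 1. The function name and the three-entry permission list are illustrative only.

```python
def permission_vector(manifest_text, permission_list):
    """Return a 0/1 vector marking which listed permissions the manifest requests."""
    vec = [0] * len(permission_list)
    for line in manifest_text.splitlines():
        if "uses-permission" not in line:      # step 6.3.3
            continue
        for pa, name in enumerate(permission_list):
            if name in line:                   # step 6.3.4
                vec[pa] = 1                    # assumed assignment
                break                          # move on to the next line
    return vec


# Hypothetical alphabetical permission list and decompiled manifest text.
perms = ["ACCESS_FINE_LOCATION", "INTERNET", "SEND_SMS"]
manifest = "\n".join([
    "<manifest>",
    '  <uses-permission android:name="android.permission.INTERNET">',
    '  <uses-permission android:name="android.permission.SEND_SMS">',
    "  <application>",
    "</manifest>",
])
perm_vec = permission_vector(manifest, perms)
```

Because matching is by substring and stops at the first hit, a permission whose name is a prefix of another (for example BLUETOOTH and BLUETOOTH_ADMIN) is matched first in alphabetical order; the sketch follows the steps literally and does not correct for this.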
Step 6.4: use the API statistical method to count the APIs used by x(i) and obtain the API vector of x(i),
Figure BDA0002431352400000203
The method is as follows:
Step 6.4.1: scan the smali file corresponding to x(i) line by line; denote the qb-th line of the smali file as the string strb[qb], and denote the total number of lines of the smali file as numb.
Step 6.4.2: let qb = 1; use a variable inv to represent the total number of APIs used in the smali file, and let inv = 1.
Step 6.4.3: let the variable pb = 1.
Step 6.4.4: if strb[qb] contains a substring whose content is "invoke", let inv = inv + 1 and go to step 6.4.5; if it does not contain the substring "invoke", go to step 6.4.7.
Step 6.4.5: if strb[qb] contains a substring whose content is Lb[pb], x(i) calls the API named Lb[pb]; let
Figure BDA0002431352400000204
and go to step 6.4.7; if strb[qb] does not contain a substring whose content is Lb[pb], go to step 6.4.6.
Step 6.4.6: let pb = pb + 1. If pb ≤ 256, go to step 6.4.5; if pb > 256, one pass over Lb is complete; go to step 6.4.7.
Step 6.4.7: let qb = qb + 1. If qb ≤ numb, go to step 6.4.3; if qb > numb, the smali file corresponding to x(i) has been fully scanned; go to step 6.4.8.
Step 6.4.8: let pb = 1.
Step 6.4.9: let
Figure BDA0002431352400000205
Step 6.4.10: let pb = pb + 1. If pb ≤ 256, go to step 6.4.9; if pb > 256,
Figure BDA0002431352400000206
has been computed; go to step 6.5.
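The API counting of step 6.4 can be sketched as below. The update in step 6.4.5 and the normalization in step 6.4.9 are figures in the source; this sketch assumes that each match increments a counter and that each counter is finally divided by the number of invoke lines. Unlike the text, the sketch starts the invoke counter at 0 and guards against division by zero; both choices, along with the function name, are assumptions.

```python
def api_frequency_vector(smali_text, api_list):
    """Return, for each listed API, its share of all invoke lines."""
    counts = [0] * len(api_list)
    invoke_total = 0
    for line in smali_text.splitlines():
        if "invoke" not in line:               # step 6.4.4
            continue
        invoke_total += 1
        for pb, name in enumerate(api_list):
            if name in line:                   # step 6.4.5
                counts[pb] += 1
                break
    if invoke_total == 0:
        return [0.0] * len(api_list)
    return [c / invoke_total for c in counts]  # assumed normalization


# Hypothetical alphabetical API list and toy smali text.
api_names = [
    "Landroid/location/LocationManager;->getLastKnownLocation",
    "Landroid/telephony/SmsManager;->sendTextMessage",
]
smali = "\n".join([
    "invoke-virtual {v0}, Landroid/telephony/SmsManager;->sendTextMessage()V",
    "invoke-virtual {v1}, Landroid/telephony/SmsManager;->sendTextMessage()V",
    "invoke-static {}, Ljava/lang/System;->currentTimeMillis()J",
    "invoke-virtual {v2}, Landroid/location/LocationManager;->getLastKnownLocation()V",
    "return-void",
])
api_vec = api_frequency_vector(smali, api_names)
```

The same scan-and-normalize pattern applies to the smali opcode and arm opcode vectors of steps 6.5 and 6.6, with a different line filter and a different name list.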
Step 6.5: use the smali opcode statistical method to count the smali opcodes used by x(i) and obtain the smali opcode vector of x(i),
Figure BDA0002431352400000211
The method is as follows:
Step 6.5.1: scan the smali file corresponding to x(i) line by line; denote the qc-th line of the smali file as the string strc[qc], and denote the total number of lines of the smali file as numc.
Step 6.5.2: let qc = 1; use a variable ops to represent the total number of smali opcodes in the smali file, and let ops = 1.
Step 6.5.3: let pc = 1.
Step 6.5.4: if strc[qc] contains a substring whose content is Lc[pc], let
Figure BDA0002431352400000212
Figure BDA0002431352400000213
let ops = ops + 1, and go to step 6.5.6; if strc[qc] does not contain a substring whose content is Lc[pc], go to step 6.5.5.
Step 6.5.5: let pc = pc + 1. If pc ≤ 256, go to step 6.5.4; if pc > 256, one pass over Lc is complete; go to step 6.5.6.
Step 6.5.6: let qc = qc + 1. If qc ≤ numc, go to step 6.5.3; if qc > numc, the smali file corresponding to x(i) has been fully scanned; go to step 6.5.7.
Step 6.5.7: let pc = 1.
Step 6.5.8: let
Figure BDA0002431352400000214
Step 6.5.9: let pc = pc + 1. If pc ≤ 256, go to step 6.5.8; if pc > 256,
Figure BDA0002431352400000215
has been computed; go to step 6.6.
Step 6.6: use the arm opcode statistical method to count the arm opcodes used by x(i) and obtain the arm opcode vector of x(i),
Figure BDA0002431352400000216
The method is as follows:
Step 6.6.1: scan the arm file corresponding to x(i) line by line; denote the qd-th line of the arm file as the string strd[qd], and denote the total number of lines of the arm file as numd.
Step 6.6.2: let qd = 1; use a variable opa to represent the total number of arm opcodes used in the arm file, and let opa = 1. If qd ≤ numd, go to step 6.6.3; if qd > numd, the arm file corresponding to x(i) is an empty file; go to step 6.7.
Step 6.6.3: let pd = 1.
Step 6.6.4: if strd[qd] contains the ">" character, strd[qd] contains an arm instruction; let opa = opa + 1 and go to step 6.6.5. If strd[qd] does not contain the ">" character, go to step 6.6.7.
Step 6.6.5: if strd[qd] contains a substring whose content is Ld[pd], let
Figure BDA0002431352400000221
Figure BDA0002431352400000222
and go to step 6.6.7; if strd[qd] does not contain a substring whose content is Ld[pd], go to step 6.6.6.
Step 6.6.6: let pd = pd + 1. If pd ≤ 197, go to step 6.6.5; if pd > 197, one pass over Ld is complete; go to step 6.6.7.
Step 6.6.7: let qd = qd + 1. If qd ≤ numd, go to step 6.6.3; if qd > numd, the arm file corresponding to x(i) has been fully scanned; go to step 6.6.8.
Step 6.6.8: let pd = 1.
Step 6.6.9: let
Figure BDA0002431352400000223
Step 6.6.10: let pd = pd + 1. If pd ≤ 197, go to step 6.6.9; if pd > 197,
Figure BDA0002431352400000224
has been computed; go to step 6.7.
Step 6.7: let i = i + 1. If i ≤ N, go to step 6.2; if i > N, frequency fingerprints have been computed for all N samples in D; send the frequency fingerprints to the detection module and go to the seventh step.
Seventh step: the detection module receives the frequency fingerprints from the frequency fingerprint generation module and trains the multi-kernel support vector machine model so that it becomes a classifier suitable for classifying the software to be detected. Let kperm, kapi, ksmali, karm denote the kernel functions used for the permission vector, the API vector, the smali opcode vector and the arm opcode vector of the frequency fingerprint, respectively. β is the weight vector, β = (βperm, βapi, βsmali, βarm), whose elements βperm, βapi, βsmali, βarm are the weights of kperm, kapi, ksmali, karm, respectively. Let T be the set {perm, api, smali, arm}. The multi-kernel support vector machine model Y can be expressed as:
Figure BDA0002431352400000225
α(i) is a Lagrange multiplier, and {α(1), α(2), ..., α(i), ..., α(N)} forms the vector α. sgn(a) is a step function of the argument a: sgn(a) = 1 when a > 0, sgn(a) = 0 when a = 0, and sgn(a) = -1 when a < 0. α and β are obtained by solving equation (5):
Figure BDA0002431352400000226
the constraint conditions of formula (5) are formula (6) to formula (9):
Figure BDA0002431352400000231
0≤α(i)≤C (7)
Σ_{t∈T} βt = 1 (8)
βt≥0,t∈T (9)
where C is the penalty coefficient, C ≥ 0, typically C = 100, which sets the size of the penalty for misclassification.
b is a scalar; once α and β have been obtained, b is given by the following equation:
Figure BDA0002431352400000232
wherein,
Figure BDA0002431352400000233
are support vector sample points.
The method for training the multi-kernel support vector machine model is as follows:
Step 7.1: compute the kernel matrices from the frequency fingerprints of the samples in D received from the frequency fingerprint generation module. Let Kt, t ∈ T, denote a kernel matrix, standing for the four kernel matrices Kperm, Kapi, Ksmali and Karm. Kt has N rows and N columns, and its element in row i, column j is
Figure BDA0002431352400000234
Selecting a polynomial kernel of degree 3, KtThe calculation method comprises the following steps:
and 7.1.1, changing i to 1.
And 7.1.2, changing j to 1.
7.1.3 step of calculating
Figure BDA0002431352400000235
Figure BDA0002431352400000236
Figure BDA0002431352400000237
To represent
Figure BDA0002431352400000238
And
Figure BDA0002431352400000239
the inner product of (d).
7.1.4, if j is less than or equal to N, making j equal to j +1, and turning to 7.1.3; if j is greater than N, go to step 7.1.5.
7.1.5, if i is less than or equal to N, making i equal to i +1, and turning to 7.1.2; if i > N, KtAnd 7.2, after the calculation is finished, turning to the step.
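The double loop of step 7.1 can be sketched in Python as follows. This is a minimal sketch, not the patented formula: the exact degree-3 polynomial kernel is given only as a figure in this text, so the common form $(\langle u, v \rangle + c)^3$ with offset c = 1 is assumed here, and the names `poly3_kernel` and `kernel_matrix` are hypothetical.

```python
def poly3_kernel(u, v, c=1.0):
    """Degree-3 polynomial kernel (<u, v> + c) ** 3; the offset c is an assumption."""
    inner = sum(a * b for a, b in zip(u, v))
    return (inner + c) ** 3


def kernel_matrix(vectors, kernel=poly3_kernel):
    """N x N matrix whose (i, j) entry is kernel(vectors[i], vectors[j])."""
    n = len(vectors)
    return [[kernel(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
```

Calling `kernel_matrix` once per feature type (permission, API, smali opcode and arm opcode vectors) would yield the four matrices that the training step consumes.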
Step 7.2: optimize the parameters $\alpha$ and $\beta$ as follows:
Step 7.2.1: initialize each element of the vector $\alpha$ to 0 and each element of the vector $\beta$ to 1/4.
Step 7.2.2: using equation (5), select a pair $\alpha^{(r)}$, $\alpha^{(s)}$ in descending order of the superscripts r, s, and optimize $\alpha$ while holding the remaining elements of $\alpha$ and the vector $\beta$ fixed. The optimization method is as follows:
Step 7.2.2.1: using the constraint of equation (6), equation (5) becomes a univariate quadratic function $g(\alpha^{(r)})$ of $\alpha^{(r)}$; differentiate $g(\alpha^{(r)})$ and set the derivative equal to 0 to solve for $\alpha^{(r)}$.
Step 7.2.2.2: solve for $\alpha^{(s)}$ using the constraint of equation (6).
Step 7.2.2.3: update $\alpha^{(r)}$ and $\alpha^{(s)}$ to obtain the optimized $\alpha$, named $\alpha^*$.
Step 7.2.3: holding $\alpha^*$ fixed, optimize $\beta$ as follows:
Step 7.2.3.1: take the partial derivatives of equation (5) with respect to $\beta$, set them equal to 0, and find the solution that satisfies the constraints of equations (8) and (9); the optimized values of $\beta_{perm}$, $\beta_{api}$, $\beta_{smali}$ and $\beta_{arm}$ are named $\beta^*_{perm}$, $\beta^*_{api}$, $\beta^*_{smali}$ and $\beta^*_{arm}$, respectively.
Step 7.2.3.2: concatenate $\beta^*_{perm}$, $\beta^*_{api}$, $\beta^*_{smali}$ and $\beta^*_{arm}$ into the optimized $\beta$, named $\beta^*$.
Step 7.2.4: determine whether $\alpha$ and $\beta$ satisfy the optimization termination conditions of equations (12) to (14):
[Equation (12), rendered as an image in the original]
[Equation (13), rendered as an image in the original]
$L(\alpha^*, \beta^*) - L(\alpha, \beta) \le \varepsilon$ (14)
Equation (14) states that optimizing $\alpha$ and $\beta$ changed the function value of equation (5) by less than the threshold $\varepsilon$, where $\varepsilon$ is set to 0.01. If the termination conditions are satisfied, the optimized $\alpha$ and $\beta$ meet the requirements and the multi-kernel support vector machine model is trained; go to step 7.3. Otherwise, go to step 7.2.2.
Step 7.3: calculate the value of b using equation (10). Training and optimization of the multi-kernel support vector machine model defined by equation (4) are then complete, and the model forms the classifier.
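The termination test of step 7.2.4 compares the objective value before and after one round of optimization. The Python sketch below shows how such a check could be wired up. It is a sketch under stated assumptions: equation (5) is taken to be the standard SVM dual objective evaluated on the β-weighted sum of the kernel matrices, which this text does not confirm, and the function names are hypothetical.

```python
def combined_kernel(kernels, beta):
    """Weighted sum of the per-feature kernel matrices: sum over t of beta[t] * K_t."""
    n = len(next(iter(kernels.values())))
    return [[sum(beta[t] * kernels[t][i][j] for t in kernels) for j in range(n)]
            for i in range(n)]


def dual_objective(alpha, y, K):
    """Assumed form of equation (5): sum(alpha) - 1/2 * sum_ij a_i a_j y_i y_j K_ij."""
    n = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * K[i][j]
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad


def change_below_threshold(l_new, l_old, eps=0.01):
    """Condition (14) as written: L(alpha*, beta*) - L(alpha, beta) <= eps."""
    return l_new - l_old <= eps
```

The sketch covers only condition (14); conditions (12) and (13) are not recoverable from this text and are therefore left out.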
Eighth step: the Android malware detection system based on frequency fingerprint extraction detects the software to be detected and determines whether it is malware, as follows:
Step 8.1: the sample preprocessing module preprocesses the software to be detected. Taking the software to be detected as sample $x^{(a)}$, preprocess $x^{(a)}$ with the sample preprocessing method of step 3.3 to obtain the AndroidManifest.xml file, the smali file and the arm instruction file of $x^{(a)}$, and output them to the frequency fingerprint calculation module.
Step 8.2: compute the frequency fingerprint of $x^{(a)}$:
Step 8.2.1: extract the permissions requested by $x^{(a)}$ with the permission extraction method of step 6.3 to obtain the permission vector of $x^{(a)}$.
Step 8.2.2: count the APIs used by $x^{(a)}$ with the API statistics method of step 6.4 to obtain the API vector of $x^{(a)}$.
Step 8.2.3: count the smali opcodes used by $x^{(a)}$ with the smali opcode statistics method of step 6.5 to obtain the smali opcode vector of $x^{(a)}$.
Step 8.2.4: count the arm opcodes used by $x^{(a)}$ with the arm opcode statistics method of step 6.6 to obtain the arm opcode vector of $x^{(a)}$.
Step 8.2.5: once the four vectors have been calculated, concatenate them into the frequency fingerprint of $x^{(a)}$.
Step 8.3: input the frequency fingerprint of $x^{(a)}$ to the detection module, which calculates and outputs the value of F according to equation (4). F equals +1 or −1: +1 indicates that the software to be detected is malware and −1 indicates that it is benign software, which achieves the purpose of determining whether the software to be detected is malware.
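The detection of step 8.3 amounts to evaluating a weighted multi-kernel decision function on the new fingerprint. A minimal Python sketch follows. The layout is assumed for illustration: each fingerprint is a dict from feature type to vector, `kernels` maps each feature type to a kernel function, and the decision function is the textbook multiple-kernel form rather than a transcription of equation (4). All names are hypothetical.

```python
def dot(u, v):
    """Plain inner product, used here only as a stand-in kernel."""
    return sum(a * b for a, b in zip(u, v))


def sgn(a):
    """Returns 1 for a > 0, 0 for a == 0 and -1 for a < 0."""
    return (a > 0) - (a < 0)


def classify(train_fps, labels, alpha, beta, b, kernels, x):
    """Sign of sum_i alpha_i * y_i * sum_t beta_t * k_t(train_i[t], x[t]) + b."""
    score = b
    for fp, y_i, a_i in zip(train_fps, labels, alpha):
        k = sum(beta[t] * kernels[t](fp[t], x[t]) for t in beta)
        score += a_i * y_i * k
    return sgn(score)
```

In practice only the support vectors (samples with a non-zero multiplier) contribute to the sum, so the loop could skip every sample whose `alpha` entry is 0.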

Claims (10)

1. An android malicious software detection method based on frequency fingerprint extraction is characterized by comprising the following steps:
first step: construct an Android malware detection system based on frequency fingerprint extraction; the system is installed on a server of the official Google application market or of a third-party Android application market, and consists of a sample preprocessing module, a frequency fingerprint generation module and a detection module;
the sample preprocessing module is connected to the frequency fingerprint generation module; it receives the samples of the benchmark test set and the sample to be detected, preprocesses them to generate three types of files, namely an AndroidManifest.xml file, a smali file and an arm instruction file, and outputs the three types of files to the frequency fingerprint generation module;
the frequency fingerprint generation module is connected to the sample preprocessing module and the detection module; it receives the AndroidManifest.xml file, the smali file and the arm instruction file from the sample preprocessing module, performs feature screening and frequency fingerprint calculation, generates a frequency fingerprint and outputs the frequency fingerprint to the detection module;
the frequency fingerprint generation module consists of a feature screening module and a frequency fingerprint calculation module; the feature screening module is connected to the sample preprocessing module and the frequency fingerprint calculation module; it receives the AndroidManifest.xml file, the smali file and the arm instruction file from the sample preprocessing module, performs feature screening on the three files to obtain the permission, API, smali opcode and arm opcode features, and sends these features to the frequency fingerprint calculation module; the frequency fingerprint calculation module is connected to the sample preprocessing module, the feature screening module and the detection module; it receives the permission, API, smali opcode and arm opcode features from the feature screening module, receives the AndroidManifest.xml file, the smali file and the arm instruction file from the sample preprocessing module, calculates the frequency fingerprint, and sends the frequency fingerprint to the detection module;
the detection module is connected to the frequency fingerprint generation module and is a multi-kernel support vector machine model; it receives the frequency fingerprints of the benchmark test set D and the frequency fingerprint of the software to be detected from the frequency fingerprint generation module, is trained and optimized with the frequency fingerprints of the benchmark test set D to form a classifier suitable for detecting the software to be detected, and then classifies the software to be detected according to its frequency fingerprint to obtain a judgment of whether the software to be detected is malware;
second step: construct the benchmark test set D, as follows:
step 2.1: take $N_1$ Android malware programs as malicious samples, where $N_1$ is a positive integer and $N_1 > 1000$;
step 2.2: take $N_2$ benign programs as benign samples, where $N_2$ is a positive integer and $N_2 > 1000$;
step 2.3: label the malicious samples and the benign samples to form the benchmark test set D; let N be the total number of samples in D, $N = N_1 + N_2$; define $x^{(i)}$ as the i-th sample in D and $y^{(i)}$ as the label of $x^{(i)}$, where $y^{(i)} = 1$ indicates that $x^{(i)}$ is a malicious sample and $y^{(i)} = -1$ indicates that $x^{(i)}$ is a benign sample, 1 ≤ i ≤ N;
step 2.4: store D in a memory that can be read by both the sample preprocessing module and the frequency fingerprint generation module;
third step: the sample preprocessing module preprocesses the N samples in D to obtain N AndroidManifest.xml files, N smali files and N arm instruction files, as follows:
step 3.1: let the variable i = 1;
step 3.2: take the i-th sample $x^{(i)}$ from D;
step 3.3: preprocess $x^{(i)}$ with the sample preprocessing method to obtain the AndroidManifest.xml file, the smali file and the arm instruction file of $x^{(i)}$, as follows:
step 3.3.1: decompress $x^{(i)}$ with a decompression tool and extract the AndroidManifest.xml file, the classes.dex file and the .so runtime library files in $x^{(i)}$;
step 3.3.2: use AXMLPrinter2, a decompilation tool dedicated to AndroidManifest.xml files, to decompile the AndroidManifest.xml file from binary form into text form;
step 3.3.3: use the dex decompilation tool baksmali to decompile classes.dex into smali files; if several smali files are generated, merge them into one smali file and go to step 3.3.4; if only one smali file is generated, go directly to step 3.3.4;
step 3.3.4: use the arm disassembly tool gcc-arm-none-eabi to disassemble the .so runtime library files into arm instruction files in text form; if several arm instruction files are generated, merge them into one arm instruction file and go to step 3.4; if no arm instruction file is generated, create an empty arm instruction file and go to step 3.4;
step 3.4: let i = i + 1; if i ≤ N, go to step 3.2; if i > N, the N samples have produced the corresponding N AndroidManifest.xml files, N smali files and N arm instruction files; send these files to the feature screening module and go to the fourth step;
fourth step: the feature screening module performs feature screening on the N AndroidManifest.xml files, N smali files and N arm instruction files, received from the sample preprocessing module and corresponding to the N samples of D, to obtain the permission features, API features, smali opcode features and arm opcode features suitable for classifying D, as follows:
step 4.1: select the 167 Android system permissions defined in the Android developer documentation and take these 167 permissions as the permission features;
step 4.2: select 256 APIs from the APIs in the PScout list, as follows:
step 4.2.1: build a list $L_{api}$ and add all 32437 APIs of the PScout list to $L_{api}$; the v-th API is denoted $L_{api}[v]$, 1 ≤ v ≤ 32437;
step 4.2.2: build a two-dimensional array $Z_{api}$ of 32437 rows and N columns, whose element $Z_{api}[v][i]$ in row v, column i is 1 or 0; 1 indicates that the v-th API of $L_{api}$ is present in the i-th sample of D, and 0 indicates that it is not;
step 4.2.3: initialize all elements of $Z_{api}$ to 0 and initialize the variable i = 1;
step 4.2.4: scan the smali file of the i-th sample of D line by line to find the APIs of $L_{api}$ that appear in the i-th sample, and assign the elements of the i-th column of $Z_{api}$ accordingly; denote the u-th line string of the smali file by str[u] and the total number of lines of the smali file by U, 1 ≤ u ≤ U;
step 4.2.5: let i = i + 1; if i ≤ N, go to step 4.2.4; if i > N, the assignment of the two-dimensional array $Z_{api}$ is complete; go to step 4.2.6;
step 4.2.6: calculate the information gain IG of each API in the list $L_{api}$ with respect to the benchmark test set D, the information gain of the v-th API with respect to D being denoted $IG(D \mid L_{api}[v])$; sort the APIs in $L_{api}$ by information gain in descending order and take the top 256 APIs as the API features;
step 4.3: take the 256 smali opcodes, each 8 binary bits long, predefined by the Android Dalvik virtual machine as the smali opcode features;
step 4.4: select all 197 arm instruction opcodes listed in the arm instruction quick reference manual as the arm opcode features;
step 4.5: send the permission features, the API features, the smali opcode features and the arm opcode features to the frequency fingerprint calculation module;
fifth step: determine the frequency fingerprint format, as follows:
arrange the 167 permission features, the 256 API features, the 256 smali opcode features and the 197 arm opcode features in alphabetical order to form four vectors, called the permission vector, the API vector, the smali opcode vector and the arm opcode vector of the Android software; the permission vector consists of 167 integers, each taking the value 1 or 0; a value of 1 at position pa indicates that the Android software requests the pa-th of the 167 selected permissions, and a value of 0 at position pa indicates that it does not; pa is an integer, 1 ≤ pa ≤ 167; the API vector consists of 256 decimal numbers, and the decimal number at position pb indicates the frequency with which the pb-th of the 256 selected APIs occurs in the Android software; pb is an integer, 1 ≤ pb ≤ 256; the smali opcode vector consists of 256 decimal numbers, and the decimal number at position pc indicates the frequency with which the pc-th of the 256 selected smali opcodes occurs in the Android software; pc is an integer, 1 ≤ pc ≤ 256; the arm opcode vector consists of 197 decimal numbers, and the decimal number at position pd indicates the frequency with which the pd-th of the 197 selected arm opcodes occurs in the Android software; pd is an integer, 1 ≤ pd ≤ 197;
concatenate the four vectors end to end to form a vector of length 876 as the frequency fingerprint; the 167 integers and 709 decimal numbers contained in the frequency fingerprint are all called elements of the frequency fingerprint;
sixth step: the frequency fingerprint calculation module receives the permission features, the API features, the smali opcode features and the arm opcode features from the feature screening module, receives the AndroidManifest.xml files, the smali files and the arm instruction files from the sample preprocessing module, and calculates frequency fingerprints for the N samples in the benchmark test set D, as follows:
step 6.1: let $L_a$ be the permission list, whose member $L_a[pa]$ is the name string of the pa-th of the 167 permissions in alphabetical order; let $L_b$ be the API list, whose member $L_b[pb]$ is the name string of the pb-th of the 256 APIs in alphabetical order; let $L_c$ be the smali opcode list, whose member $L_c[pc]$ is the name string of the pc-th of the 256 smali opcodes in alphabetical order; let $L_d$ be the arm opcode list, whose member $L_d[pd]$ is the name string of the pd-th of the 197 arm opcodes in alphabetical order; let the variable i = 1;
step 6.2: take the i-th sample $x^{(i)}$ in D and generate the frequency fingerprint of $x^{(i)}$, which contains 876 elements, each initialized to 0; within this frequency fingerprint, the permission vector, the API vector, the smali opcode vector and the arm opcode vector, together with their pa-th, pb-th, pc-th and pd-th elements respectively, are each given their own symbol [symbols rendered as images in the original];
step 6.3: extract the permissions requested by $x^{(i)}$ with the permission extraction method to obtain the permission vector of $x^{(i)}$, as follows:
step 6.3.1: scan the AndroidManifest.xml file of $x^{(i)}$ line by line; denote the qa-th line string of the file by stra[qa] and the total number of lines of the file by numa;
step 6.3.2: let qa = 1;
step 6.3.3: if stra[qa] contains the substring "uses-permission", let pa = 1 and go to step 6.3.4; if stra[qa] does not contain the substring "uses-permission", go to step 6.3.6;
step 6.3.4: if stra[qa] contains a substring equal to $L_a[pa]$, then $x^{(i)}$ requests the permission $L_a[pa]$; set the pa-th element of the permission vector of $x^{(i)}$ to 1 and go to step 6.3.6; if stra[qa] does not contain a substring equal to $L_a[pa]$, go to step 6.3.5;
step 6.3.5: let pa = pa + 1; if pa ≤ 167, go to step 6.3.4; if pa > 167, one pass over $L_a$ is complete; go to step 6.3.6;
step 6.3.6: let qa = qa + 1; if qa ≤ numa, go to step 6.3.3; if qa > numa, the AndroidManifest.xml file of $x^{(i)}$ has been scanned and the permission vector of $x^{(i)}$ has been calculated; go to step 6.4;
step 6.4: count the APIs used by $x^{(i)}$ with the API statistics method to obtain the API vector of $x^{(i)}$, as follows:
step 6.4.1: scan the smali file of $x^{(i)}$ line by line; denote the qb-th line string of the smali file by strb[qb] and the total number of lines of the smali file by numb;
step 6.4.2: let qb = 1; use the variable inv to represent the total number of API calls in the smali file, and let inv = 1;
step 6.4.3: let the variable pb = 1;
step 6.4.4: if strb[qb] contains the substring "invoke", let inv = inv + 1 and go to step 6.4.5; if it does not contain the substring "invoke", go to step 6.4.7;
step 6.4.5: if strb[qb] contains a substring equal to $L_b[pb]$, then $x^{(i)}$ calls the API named $L_b[pb]$; update the pb-th element of the API vector of $x^{(i)}$ to record this occurrence [update formula rendered as an image in the original] and go to step 6.4.7; if strb[qb] does not contain a substring equal to $L_b[pb]$, go to step 6.4.6;
step 6.4.6: let pb = pb + 1; if pb ≤ 256, go to step 6.4.5; if pb > 256, one pass over $L_b$ is complete; go to step 6.4.7;
step 6.4.7: let qb = qb + 1; if qb ≤ numb, go to step 6.4.3; if qb > numb, the smali file of $x^{(i)}$ has been scanned; go to step 6.4.8;
step 6.4.8: let pb = 1;
step 6.4.9: convert the pb-th element of the API vector of $x^{(i)}$ into a frequency using the total inv [formula rendered as an image in the original];
step 6.4.10: let pb = pb + 1; if pb ≤ 256, go to step 6.4.9; if pb > 256, the API vector of $x^{(i)}$ has been calculated; go to step 6.5;
step 6.5: count the smali opcodes used by $x^{(i)}$ with the smali opcode statistics method to obtain the smali opcode vector of $x^{(i)}$, as follows:
step 6.5.1: scan the smali file of $x^{(i)}$ line by line; denote the qc-th line string of the smali file by strc[qc] and the total number of lines of the smali file by numc;
step 6.5.2: let qc = 1; use the variable ops to represent the total number of smali opcodes in the smali file, and let ops = 1;
step 6.5.3: let pc = 1;
step 6.5.4: if strc[qc] contains a substring equal to $L_c[pc]$, update the pc-th element of the smali opcode vector of $x^{(i)}$ to record this occurrence [update formula rendered as an image in the original], let ops = ops + 1 and go to step 6.5.6; if strc[qc] does not contain a substring equal to $L_c[pc]$, go to step 6.5.5;
step 6.5.5: let pc = pc + 1; if pc ≤ 256, go to step 6.5.4; if pc > 256, one pass over $L_c$ is complete; go to step 6.5.6;
step 6.5.6: let qc = qc + 1; if qc ≤ numc, go to step 6.5.3; if qc > numc, the smali file of $x^{(i)}$ has been scanned; go to step 6.5.7;
step 6.5.7: let pc = 1;
step 6.5.8: convert the pc-th element of the smali opcode vector of $x^{(i)}$ into a frequency using the total ops [formula rendered as an image in the original];
step 6.5.9: let pc = pc + 1; if pc ≤ 256, go to step 6.5.8; if pc > 256, the smali opcode vector of $x^{(i)}$ has been calculated; go to step 6.6;
step 6.6: count the arm opcodes used by $x^{(i)}$ with the arm opcode statistics method to obtain the arm opcode vector of $x^{(i)}$, as follows:
step 6.6.1: scan the arm instruction file of $x^{(i)}$ line by line; denote the qd-th line string of the arm instruction file by strd[qd] and the total number of lines of the arm instruction file by numd;
step 6.6.2: let qd = 1; use the variable opa to represent the total number of arm opcodes used in the arm instruction file, and let opa = 1; if qd ≤ numd, go to step 6.6.3; if qd > numd, the arm instruction file of $x^{(i)}$ is an empty file; go to step 6.7;
step 6.6.3: let pd = 1;
step 6.6.4: if strd[qd] contains the ">" character, strd[qd] contains an arm instruction; let opa = opa + 1 and go to step 6.6.5; if strd[qd] does not contain the ">" character, go to step 6.6.7;
step 6.6.5: if strd[qd] contains a substring equal to $L_d[pd]$, update the pd-th element of the arm opcode vector of $x^{(i)}$ to record this occurrence [update formula rendered as an image in the original] and go to step 6.6.7; if strd[qd] does not contain a substring equal to $L_d[pd]$, go to step 6.6.6;
step 6.6.6: let pd = pd + 1; if pd ≤ 197, go to step 6.6.5; if pd > 197, one pass over $L_d$ is complete; go to step 6.6.7;
step 6.6.7: let qd = qd + 1; if qd ≤ numd, go to step 6.6.3; if qd > numd, the arm instruction file of $x^{(i)}$ has been scanned; go to step 6.6.8;
step 6.6.8: let pd = 1;
step 6.6.9: convert the pd-th element of the arm opcode vector of $x^{(i)}$ into a frequency using the total opa [formula rendered as an image in the original];
step 6.6.10: let pd = pd + 1; if pd ≤ 197, go to step 6.6.9; if pd > 197, the arm opcode vector of $x^{(i)}$ has been calculated; go to step 6.7;
step 6.7: let i = i + 1; if i ≤ N, go to step 6.2; if i > N, frequency fingerprints have been generated for all N samples in D; send the frequency fingerprints to the detection module and go to the seventh step;
seventh step: the detection module receives the frequency fingerprints from the frequency fingerprint generation module and trains the multi-kernel support vector machine model, which thereby becomes a classifier suitable for classifying the software to be detected, as follows: for the benchmark test set D, the feature space of the multi-kernel support vector machine model is the set of frequency fingerprints of the N samples in D; let $k_{perm}$, $k_{api}$, $k_{smali}$ and $k_{arm}$ denote the kernel functions used for the permission vector, the API vector, the smali opcode vector and the arm opcode vector in the frequency fingerprint, respectively; $\beta$ is the weight vector, $\beta = (\beta_{perm}, \beta_{api}, \beta_{smali}, \beta_{arm})$, whose elements $\beta_{perm}$, $\beta_{api}$, $\beta_{smali}$ and $\beta_{arm}$ are the weights of $k_{perm}$, $k_{api}$, $k_{smali}$ and $k_{arm}$, respectively; let $T$ be the set {perm, api, smali, arm}; the multi-kernel support vector machine model Y is expressed as:
[Equation (4), rendered as an image in the original]
where $\alpha^{(i)}$ is a Lagrange multiplier and $\{\alpha^{(1)}, \alpha^{(2)}, \ldots, \alpha^{(i)}, \ldots, \alpha^{(N)}\}$ form the vector $\alpha$; sgn(a) is a step function of the parameter a: sgn(a) = 1 when a > 0, sgn(a) = 0 when a = 0, and sgn(a) = −1 when a < 0; $\alpha$ and $\beta$ are obtained by solving equation (5):
[Equation (5), rendered as an image in the original]
the constraints of equation (5) are equations (6) to (9):
[Equation (6), rendered as an image in the original]
$0 \le \alpha^{(i)} \le C$ (7)
$\sum_{t \in T} \beta_t = 1$ (8)
$\beta_t \ge 0,\ t \in T$ (9)
where C is the penalty coefficient, C ≥ 0, which indicates the size of the penalty for misclassification;
b is a scalar, computed from the obtained $\alpha$ and $\beta$ by the following equation:
[Equation (10), rendered as an image in the original]
where the sample appearing in equation (10) [symbol rendered as an image in the original] is a support vector sample point;
the multi-kernel support vector machine model is trained as follows:
step 7.1: compute the kernel matrices from the frequency fingerprints of the samples in D received from the frequency fingerprint generation module; let $K_t$, $t \in T$, denote a kernel matrix, standing for the four kernel matrices $K_{perm}$, $K_{api}$, $K_{smali}$ and $K_{arm}$; $K_t$ has N rows and N columns, and its element in row i, column j is the value of the kernel $k_t$ for the type-t vectors of samples $x^{(i)}$ and $x^{(j)}$; a polynomial kernel of degree 3 is selected, and $K_t$ is calculated as follows:
step 7.1.1: let i = 1;
step 7.1.2: let j = 1;
step 7.1.3: calculate the element in row i, column j of $K_t$ with the degree-3 polynomial kernel [equation rendered as an image in the original], which is built on the inner product of the type-t vectors of $x^{(i)}$ and $x^{(j)}$;
step 7.1.4: let j = j + 1; if j ≤ N, go to step 7.1.3; if j > N, go to step 7.1.5;
step 7.1.5: let i = i + 1; if i ≤ N, go to step 7.1.2; if i > N, $K_t$ has been calculated; go to step 7.2;
step 7.2: optimize the parameters $\alpha$ and $\beta$ as follows:
step 7.2.1: initialize each element of the vector $\alpha$ to 0 and each element of the vector $\beta$ to 1/4;
step 7.2.2: using equation (5), select a pair $\alpha^{(r)}$, $\alpha^{(s)}$ in increasing order of the superscripts r, s and, holding $(\alpha^{(1)}, \alpha^{(2)}, \ldots, \alpha^{(r-1)}, \alpha^{(r+1)}, \ldots, \alpha^{(s-1)}, \alpha^{(s+1)}, \ldots, \alpha^{(N)})$ and the vector $\beta$ fixed, optimize $\alpha$ to obtain the optimized $\alpha$, named $\alpha^*$;
step 7.2.3: holding $\alpha^*$ fixed, optimize $\beta$ to obtain the optimized $\beta$, named $\beta^*$;
step 7.2.4: determine whether $\alpha$ and $\beta$ satisfy the optimization termination conditions of equations (12) to (14):
[Equation (12), rendered as an image in the original]
[Equation (13), rendered as an image in the original]
$L(\alpha^*, \beta^*) - L(\alpha, \beta) \le \varepsilon$ (14)
equation (14) states that optimizing $\alpha$ and $\beta$ changed the function value of equation (5) by less than the threshold $\varepsilon$, where $0 < \varepsilon \le 0.1$; if the termination conditions are satisfied, the optimized $\alpha$ and $\beta$ meet the requirements and the multi-kernel support vector machine model is trained; go to step 7.3; otherwise, go to step 7.2.2;
step 7.3: calculate the value of b using equation (10); training and optimization of the multi-kernel support vector machine model defined by equation (4) are then complete, and the model forms the classifier;
eighth step: the Android malware detection system based on frequency fingerprint extraction detects the software to be detected that the official Google or third-party Android application market server has received from a user, and determines whether it is malware, as follows:
step 8.1: the sample preprocessing module preprocesses the software to be detected; taking the software to be detected as sample $x^{(a)}$, preprocess $x^{(a)}$ with the sample preprocessing method of step 3.3 to obtain the AndroidManifest.xml file, the smali file and the arm instruction file of $x^{(a)}$, and output them to the frequency fingerprint calculation module;
step 8.2: the frequency fingerprint calculation module computes the frequency fingerprint of $x^{(a)}$, as follows:
step 8.2.1: extract the permissions requested by $x^{(a)}$ with the permission extraction method of step 6.3 to obtain the permission vector of $x^{(a)}$;
step 8.2.2: count the APIs used by $x^{(a)}$ with the API statistics method of step 6.4 to obtain the API vector of $x^{(a)}$;
step 8.2.3: count the smali opcodes used by $x^{(a)}$ with the smali opcode statistics method of step 6.5 to obtain the smali opcode vector of $x^{(a)}$;
step 8.2.4: count the arm opcodes used by $x^{(a)}$ with the arm opcode statistics method of step 6.6 to obtain the arm opcode vector of $x^{(a)}$;
step 8.2.5: once the four vectors have been calculated, concatenate them into the frequency fingerprint of $x^{(a)}$;
step 8.3: input the frequency fingerprint of $x^{(a)}$ to the detection module, which calculates and outputs the value of F according to equation (4); F equals +1 or −1, where +1 indicates that the software to be detected is malware and −1 indicates that it is benign software, which achieves the purpose of determining whether the software to be detected is malware.
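The permission extraction of step 6.3 and the end-to-end concatenation of step 8.2.5 in claim 1 can be sketched in Python as follows. This is a minimal sketch for illustration: the function names and the list-based data layout are hypothetical, and the permission names in any call are supplied by the caller rather than taken from the patent.

```python
def permission_vector(manifest_lines, permission_names):
    """0/1 vector: entry pa is 1 if a uses-permission line names permission pa."""
    vec = [0] * len(permission_names)
    for line in manifest_lines:
        if "uses-permission" not in line:
            continue  # only lines declaring a requested permission are inspected
        for pa, name in enumerate(permission_names):
            if name in line:  # plain substring match; the first matching name wins
                vec[pa] = 1
                break
    return vec


def concat_fingerprint(perm, api, smali, arm):
    """Join the four vectors end to end (167 + 256 + 256 + 197 = 876 elements)."""
    return list(perm) + list(api) + list(smali) + list(arm)
```

The same scan-and-match loop, with a counter in place of the 0/1 flag and a final normalisation, would cover the API, smali opcode and arm opcode vectors.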
2. The Android malware detection method based on frequency fingerprint extraction as claimed in claim 1, wherein the malicious samples in step 2.1 are obtained from the open-source Drebin, Genome and AMD datasets.
3. The Android malware detection method based on frequency fingerprint extraction as claimed in claim 1, wherein the benign samples in step 2.2 are benign software crawled from the Google Play and APKPure application stores and retained after detection and filtering by local antivirus software and the VirusTotal online antivirus website.
4. The Android malware detection method based on frequency fingerprint extraction as claimed in claim 1, wherein the decompression tool in step 3.3.1 is Gzip or 7zip.
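For illustration, the extraction of step 3.3.1 can also be sketched with Python's standard `zipfile` module, since an APK is a ZIP archive. This swaps in a different tool from the Gzip or 7zip named in claim 4, and the function name `extract_apk_inputs` is hypothetical. The sketch only pulls out the raw files; decompiling them with AXMLPrinter2, baksmali or an arm disassembler is outside its scope.

```python
import io
import zipfile


def extract_apk_inputs(apk_bytes):
    """Return (manifest bytes, classes.dex bytes, {name: .so bytes}) from an APK."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        names = apk.namelist()
        manifest = apk.read("AndroidManifest.xml") if "AndroidManifest.xml" in names else None
        dex = apk.read("classes.dex") if "classes.dex" in names else None
        # Native libraries normally sit under lib/<abi>/ and end in .so
        libs = {n: apk.read(n) for n in names if n.endswith(".so")}
    return manifest, dex, libs
```

An empty `libs` dictionary corresponds to the case in step 3.3.4 where no arm instruction file is produced and an empty one is created instead.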
5. The Android malware detection method based on frequency fingerprint extraction as claimed in claim 1, wherein in the third step AXMLPrinter2 is required to be version 2.0 or later, baksmali is required to be version 2.4.0 or later, and gcc-arm-none-eabi is required to be version 9-2019-q4-major or later.
6. The Android malware detection method based on frequency fingerprint extraction as claimed in claim 1, wherein in step 4.2.4 the smali file of the i-th sample of D is scanned line by line to find the APIs of $L_{api}$ that appear in the i-th sample, and the elements of the i-th column of $Z_{api}$ are assigned, as follows:
step 4.2.4.1: initialize u = 1;
step 4.2.4.2: if str[u] is an API string, go to step 4.2.4.2.1; if str[u] is not an API string, go to step 4.2.4.3;
step 4.2.4.2.1: initialize the variable v = 1;
step 4.2.4.2.2: if str[u] contains a substring equal to $L_{api}[v]$, assign $Z_{api}[v][i] = 1$ and go to step 4.2.4.3; otherwise, go to step 4.2.4.2.3;
step 4.2.4.2.3: let v = v + 1; if v ≤ 32437, go to step 4.2.4.2.2; if v > 32437, go to step 4.2.4.3;
step 4.2.4.3: let u = u + 1; if u ≤ U, go to step 4.2.4.2; if u > U, the smali file of the i-th sample has been scanned and the procedure ends.
7. The method as claimed in claim 1, wherein in step 4.2.6 the information gain IG of each API in the list $L_{api}$ with respect to the benchmark test set D is calculated, the information gain of the v-th API with respect to D being denoted $IG(D \mid L_{api}[v])$, as follows:
Step 4.2.6.1: let v = 1;
Step 4.2.6.2: let i = 1, and initialize the first variable $M_{11} = 0$, the second variable $M_{12} = 0$, the third variable $M_{21} = 0$ and the fourth variable $M_{22} = 0$;
Step 4.2.6.3: if $Z_{api}[v][i] = 1$ and $y^{(i)} = 1$, let $M_{11} = M_{11} + 1$; if $Z_{api}[v][i] = 1$ and $y^{(i)} = 0$, let $M_{12} = M_{12} + 1$; if $Z_{api}[v][i] = 0$ and $y^{(i)} = 1$, let $M_{21} = M_{21} + 1$; if $Z_{api}[v][i] = 0$ and $y^{(i)} = 0$, let $M_{22} = M_{22} + 1$;
Step 4.2.6.4: let i = i + 1; if i ≤ N, go to step 4.2.6.3; if i > N, go to step 4.2.6.5;
Step 4.2.6.5: calculate $IG(D \mid L_{api}[v])$ as follows:
$$IG(D \mid L_{api}[v]) = H(D) - H(D \mid L_{api}[v]) \tag{1}$$
where H(D) is the empirical entropy of the benchmark test set D, calculated as:
Figure FDA0002431352390000111
$H(D \mid L_{api}[v])$ is the empirical conditional entropy of D given the v-th API of the list $L_{api}$, calculated as:
Figure FDA0002431352390000112
Step 4.2.6.6: let v = v + 1; if v ≤ 32437, go to step 4.2.6.2; if v > 32437, the information gain with respect to D of every API in the list $L_{api}$ has been calculated, and the procedure ends.
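The counting loop and formula (1) of claim 7 can be sketched in Python as follows. The two entropy formulas appear only as images in this text, so the sketch assumes the standard definitions of empirical entropy and empirical conditional entropy with base-2 logarithms, computed from the four counts $M_{11}$, $M_{12}$, $M_{21}$ and $M_{22}$. The function names are hypothetical.

```python
import math


def entropy(counts):
    # Empirical entropy of a class distribution given raw counts
    # (assumed standard definition, log base 2; empty classes contribute 0).
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(z_row, labels):
    """IG(D | L_api[v]) for one API, from its row Z_api[v] and the labels y."""
    m11 = m12 = m21 = m22 = 0
    for z, y in zip(z_row, labels):           # steps 4.2.6.2 to 4.2.6.4
        if z == 1 and y == 1:
            m11 += 1
        elif z == 1 and y == 0:
            m12 += 1
        elif z == 0 and y == 1:
            m21 += 1
        else:
            m22 += 1
    n = m11 + m12 + m21 + m22
    if n == 0:
        return 0.0
    h_d = entropy([m11 + m21, m12 + m22])     # H(D): entropy of the labels
    h_cond = ((m11 + m12) / n) * entropy([m11, m12]) \
        + ((m21 + m22) / n) * entropy([m21, m22])   # H(D | L_api[v])
    return h_d - h_cond                       # formula (1)


# An API present in exactly the samples labelled 1 is maximally informative.
print(information_gain([1, 1, 0, 0], [1, 1, 0, 0]))
# An API spread evenly across both classes carries no information.
print(information_gain([1, 0, 1, 0], [1, 1, 0, 0]))
```

Calling `information_gain` once per row of $Z_{api}$ corresponds to the outer loop over v in steps 4.2.6.1 and 4.2.6.6.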
8. The Android malware detection method based on frequency fingerprint extraction as claimed in claim 1, wherein the penalty coefficient C in the seventh step is 100.
9. The Android malware detection method based on frequency fingerprint extraction as claimed in claim 1, wherein the method for optimizing α in step 7.2.2 is as follows:
Step 7.2.2.1: using the constraint of formula (6), reduce formula (5) to a univariate quadratic function $g(\alpha^{(r)})$ of $\alpha^{(r)}$; differentiate $g(\alpha^{(r)})$ and set the derivative equal to 0 to obtain $\alpha^{(r)}$;
Step 7.2.2.2: solve for $\alpha^{(s)}$ using the constraint of formula (6);
Step 7.2.2.3: update α with the new $\alpha^{(r)}$ and $\alpha^{(s)}$ to obtain the optimized α, named $\alpha^{*}$.
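Formulas (5) and (6) are not reproduced in this part of the text. As a hedged illustration only, the Python sketch below assumes that (5) is the standard soft-margin SVM dual objective and that (6) is the usual constraint $\sum_i \alpha^{(i)} y^{(i)} = 0$ with $0 \le \alpha^{(i)} \le C$ and labels in {+1, -1}. Under those assumptions, steps 7.2.2.1 to 7.2.2.3 correspond to one pairwise update in the style of sequential minimal optimization. The function names and the precomputed kernel matrix `K` are hypothetical.

```python
def decision_value(alpha, y, K, b, i):
    # f(x_i) = sum_j alpha_j * y_j * K[j][i] + b
    return sum(alpha[j] * y[j] * K[j][i] for j in range(len(alpha))) + b


def optimize_pair(alpha, y, K, b, r, s, C=100.0):
    """One pairwise update of (alpha[r], alpha[s]); returns a new alpha list."""
    e_r = decision_value(alpha, y, K, b, r) - y[r]
    e_s = decision_value(alpha, y, K, b, s) - y[s]
    # Curvature of the univariate quadratic g(alpha_r) along the constraint line.
    eta = K[r][r] + K[s][s] - 2.0 * K[r][s]
    if eta <= 0:
        return list(alpha)                    # degenerate pair: leave alpha unchanged
    # Box bounds on alpha_r implied by 0 <= alpha <= C and the equality constraint.
    if y[r] != y[s]:
        low = max(0.0, alpha[r] - alpha[s])
        high = min(C, C + alpha[r] - alpha[s])
    else:
        low = max(0.0, alpha[r] + alpha[s] - C)
        high = min(C, alpha[r] + alpha[s])
    # Step 7.2.2.1: stationary point of g, clipped to the feasible interval.
    a_r = alpha[r] + y[r] * (e_s - e_r) / eta
    a_r = min(high, max(low, a_r))
    # Step 7.2.2.2: alpha_s follows from the equality constraint.
    a_s = alpha[s] + y[r] * y[s] * (alpha[r] - a_r)
    # Step 7.2.2.3: write both values back into alpha.
    updated = list(alpha)
    updated[r], updated[s] = a_r, a_s
    return updated


# Two 1-D points x = +1 (label +1) and x = -1 (label -1) with a linear kernel.
K = [[1.0, -1.0], [-1.0, 1.0]]
new_alpha = optimize_pair([0.0, 0.0], [1, -1], K, 0.0, r=0, s=1)
print(new_alpha)
```

The default `C=100.0` mirrors the penalty coefficient stated in claim 8; the choice of the pair (r, s) and the stopping rule are outside the scope of this sketch.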
10. The Android malware detection method based on frequency fingerprint extraction as claimed in claim 1, wherein the method for optimizing β in step 7.2.3 is as follows:
Step 7.2.3.1: take the partial derivative of formula (5) with respect to β, set the result equal to 0, and find the solution that satisfies the constraints of formula (8) and formula (9), namely $\beta_{perm}$, $\beta_{api}$, $\beta_{smali}$ and $\beta_{arm}$; the optimized results are named, respectively,
Figure FDA0002431352390000113
Figure FDA0002431352390000114
Step 7.2.3.2: concatenate
Figure FDA0002431352390000115
into the optimized β, named $\beta^{*}$.
CN202010237052.6A 2020-03-30 2020-03-30 Android malicious software detection method based on frequency fingerprint extraction Active CN111460452B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010237052.6A CN111460452B (en) 2020-03-30 2020-03-30 Android malicious software detection method based on frequency fingerprint extraction

Publications (2)

Publication Number Publication Date
CN111460452A (en) 2020-07-28
CN111460452B (en) 2022-09-09

Family

ID=71683415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010237052.6A Active CN111460452B (en) 2020-03-30 2020-03-30 Android malicious software detection method based on frequency fingerprint extraction

Country Status (1)

Country Link
CN (1) CN111460452B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107180192A (en) * 2017-05-09 2017-09-19 北京理工大学 Android malicious application detection method and system based on multi-feature fusion
CN109165510A (en) * 2018-09-04 2019-01-08 中国民航大学 Android malicious application detection method based on binary channels convolutional neural networks
CN109271788A (en) * 2018-08-23 2019-01-25 北京理工大学 A kind of Android malware detection method based on deep learning
CN109753800A (en) * 2019-01-02 2019-05-14 重庆邮电大学 Merge the Android malicious application detection method and system of frequent item set and random forests algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
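Li Chuangfeng et al.: "Android malicious application detection algorithm based on CNN and the naive Bayes method", Journal of Information Security Research *
Miao Bo et al.: "Android malicious code detection system based on random forest", Information Technology and Informatization *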

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001376B (en) * 2020-10-29 2021-02-26 深圳开源互联网安全技术有限公司 Fingerprint identification method, device, equipment and storage medium based on open source component
CN112632538A (en) * 2020-12-25 2021-04-09 北京工业大学 Android malicious software detection method and system based on mixed features
CN114091028A (en) * 2022-01-19 2022-02-25 南京明博互联网安全创新研究院有限公司 Android application information leakage detection method based on data flow
CN114091028B (en) * 2022-01-19 2022-04-19 南京明博互联网安全创新研究院有限公司 Android application information leakage detection method based on data flow

Also Published As

Publication number Publication date
CN111460452B (en) 2022-09-09

Similar Documents

Publication Publication Date Title
EP3284029B1 (en) Recurrent neural networks for malware analysis
Kolosnjaji et al. Empowering convolutional networks for malware classification and analysis
CN107908963B (en) Method for automatically detecting core characteristics of malicious codes
Bavishi et al. Context2Name: A deep learning-based approach to infer natural variable names from usage contexts
Singh et al. Detection of malicious software by analyzing the behavioral artifacts using machine learning algorithms
CN110348214B (en) Method and system for detecting malicious codes
CN111460452B (en) Android malicious software detection method based on frequency fingerprint extraction
CN109784056B (en) Malicious software detection method based on deep learning
CN111639337B (en) Unknown malicious code detection method and system for massive Windows software
Gao et al. Android malware detection via graphlet sampling
CN106503558A (en) A kind of Android malicious code detecting methods that is analyzed based on community structure
Karbab et al. Petadroid: adaptive android malware detection using deep learning
Kakisim et al. Sequential opcode embedding-based malware detection method
Zhang et al. Exploring function call graph vectorization and file statistical features in malicious PE file classification
Benoit et al. Binary level toolchain provenance identification with graph neural networks
CN108985052A (en) A kind of rogue program recognition methods, device and storage medium
CN116149669A (en) Binary file-based software component analysis method, binary file-based software component analysis device and binary file-based medium
Naeem et al. Digital forensics for malware classification: An approach for binary code to pixel vector transition
CN111241497A (en) Open source code tracing detection method based on software multiplexing feature learning
CN113971283A (en) Malicious application program detection method and device based on features
De La Rosa et al. Efficient characterization and classification of malware using deep learning
CN114707151B (en) Zombie software detection method based on API call and network behavior
Lee et al. Toward machine learning based analyses on compressed firmware
Zhao et al. Malware homology identification based on a gene perspective
Jha et al. Deepmal4j: Java malware detection employing deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant