CN113378167A - Malicious software detection method based on improved naive Bayes algorithm and gated loop unit mixing - Google Patents

Publication number
CN113378167A
Authority
CN
China
Prior art keywords
algorithm
file
naive bayes
feature
authority
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110737511.1A
Other languages
Chinese (zh)
Inventor
杨明极
赵艺博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202110737511.1A priority Critical patent/CN113378167A/en
Publication of CN113378167A publication Critical patent/CN113378167A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/53Decompilation; Disassembly
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Virology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit (GRU), belonging to the field of software detection. Traditional Android defense mechanisms struggle to cope with the rapid growth in the number and variety of malware. In the method, apktool is used to decompile the software sample files to be detected, yielding the decompiled resource files of each application. Feature sets are extracted from the decompiled resource files, the extracted feature sets are sorted by usage count in ascending order, the high-frequency features are selected and merged into a single feature set, and that feature set is quantized. Features that vary over time are processed with a gated recurrent unit in order to detect dynamic features. The invention can effectively detect malware that uses obfuscation techniques and improves detection accuracy.

Description

Malicious software detection method based on improved naive Bayes algorithm and gated loop unit mixing
Technical Field
The invention relates to a malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit (GRU).
Background
In recent years, the rapid development of the mobile internet has made the smartphone a basic information device for the general public. Smartphones are now used for photography, map navigation, instant messaging, online payment, online shopping, entertainment, online learning, and more. As a result, a great deal of personal information is stored on them: private data such as personal photos and call records; accounts such as online banking and social media accounts; and device information such as location and phone number. Because a smartphone is connected to the network in real time, a user's personal information is easily leaked and exploited by malicious applications, which makes the smartphone a potential security risk.
Android and iOS are the dominant smartphone operating systems. Traditional Android defense mechanisms struggle to cope with the rapid growth in the number and variety of malware, and the Android platform is vulnerable to new and unknown malware. Android smartphones are widely used, but because the underlying system source code is fully open, the security review mechanisms of third-party application stores are not standardized, and users' awareness of information security is weak, malware is rampant on the Android platform and the number of infected users grows daily. This poses a serious threat to users' privacy and property, and the rise of mobile payment in recent years makes a secure mobile payment environment all the more important. Because the Android system is open source, Android malware detection has long been a hot topic in network security research. Obfuscation is now widely used in Android malware, and malware that uses it is poorly detected. To protect the Android operating system, an effective Android malware detection system is needed to improve the increasingly severe security situation facing the Android system and its users.
Disclosure of Invention
Because the Android system is open source, Android malware detection has long been a hot topic in network security research. Obfuscation is now widely used in Android malware, and malware that uses it is poorly detected. To address this problem, the invention provides a malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit (GRU).
The malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit is implemented by the following steps:
Step one: decompile the software sample files to be detected with apktool to obtain the decompiled resource files of each application, which comprise the AndroidManifest.xml manifest file and the smali bytecode files;
Step two: extract feature sets from the decompiled resource files, namely a Permission set, an Intent set, and a sensitive API set; sort the extracted feature sets by usage count in ascending order, select the high-frequency features, merge them into a single feature set, and quantize that feature set; wherein
the Permission set and the Intent set are obtained from the AndroidManifest.xml file;
the sensitive API set is obtained from the smali files: the classes.dex file is decompiled and analyzed with the baksmali tool to obtain the API interfaces that are called (for example, chmod is a sensitive API for changing user permissions), and the sensitive API features are obtained from this analysis.
Step three: process the features that vary over time with a gated recurrent unit in order to detect dynamic features.
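The manifest part of step two can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes apktool has already decoded the manifest into text XML, and the function names `extract_manifest_features` and `to_binary_vector`, the sample manifest, and the 0/1 encoding used for "quantizing" are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Attribute key for android:name once the android namespace is expanded.
ANDROID_NS = "http://schemas.android.com/apk/res/android"
NAME_ATTR = f"{{{ANDROID_NS}}}name"


def extract_manifest_features(manifest_xml):
    """Return (permissions, intent_actions) from a decoded AndroidManifest.xml string."""
    root = ET.fromstring(manifest_xml)
    permissions = sorted(
        {el.get(NAME_ATTR) for el in root.iter("uses-permission") if el.get(NAME_ATTR)}
    )
    intents = sorted(
        {el.get(NAME_ATTR) for el in root.iter("action") if el.get(NAME_ATTR)}
    )
    return permissions, intents


def to_binary_vector(present, vocabulary):
    """Quantize a feature set as 0/1 flags over a fixed vocabulary (assumed encoding)."""
    present = set(present)
    return [1 if name in present else 0 for name in vocabulary]


# Hypothetical manifest used only to exercise the functions above.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.demo">
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-permission android:name="android.permission.READ_CONTACTS"/>
  <application>
    <receiver android:name=".BootReceiver">
      <intent-filter>
        <action android:name="android.intent.action.BOOT_COMPLETED"/>
      </intent-filter>
    </receiver>
  </application>
</manifest>"""

permissions, intents = extract_manifest_features(SAMPLE_MANIFEST)
vector = to_binary_vector(
    permissions,
    ["android.permission.INTERNET", "android.permission.SEND_SMS"],
)
```

In practice the vocabulary passed to `to_binary_vector` would be the merged high-frequency feature set selected in step two.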
Preferably, extracting the feature set from the decompiled resource files in step two specifically comprises: processing the static features with a method that combines an association rule mining algorithm with the TF-IDF algorithm, removing strongly correlated features, extracting the static features, and weighting the naive Bayes algorithm, thereby reducing the feature dimension and removing redundant features; the static features include requested permissions, components, intents, and sensitive APIs.
Preferably, processing the time-varying features with a gated recurrent unit to detect dynamic features in step three specifically comprises: installing the Xposed framework on an emulator or a mobile device, obtaining root permission, and extracting the dynamic features with the automated test tool MonkeyRunner together with the dynamic analysis tool Inspeckage;
the dynamic features are the runtime behaviors of the Android application, including file read and write operations, call requests, SMS requests, data encryption and decryption operations, network data input and output, and reading of private information; these behaviors express the behavioral intent of the application.
Preferably, the method that combines the association rule mining algorithm with the TF-IDF algorithm to process static features comprises the following specific steps:
(1) constructing a feature weighting algorithm based on the TF-IDF algorithm;
The aim is to find text items that discriminate well between categories: if an item appears frequently in one document and rarely in the other files of the file set, the item is more important, that is, it is an item with good category discrimination. The TF-IDF weight is calculated as follows:
$$\mathrm{Weight} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i \tag{1-1}$$
Therefore, the TF-IDF value of the i-th text item is the product of its TF value and its IDF value. The term frequency TF is given by formula (1-2), where $N_{i,j}$ is the number of times the i-th text item appears in the j-th document and $\sum_k N_{k,j}$ is the total number of occurrences of all words in the j-th document;
$$\mathrm{TF}_{i,j} = \frac{N_{i,j}}{\sum_k N_{k,j}} \tag{1-2}$$
The inverse document frequency IDF is given by formula (1-3), where $D$ is the total number of files in the file set and $D_i$ is the number of files in the file set in which the i-th text item appears; 1 is added to $D_i$ so that the denominator cannot be 0,
$$\mathrm{IDF}_i = \log\frac{D}{D_i + 1} \tag{1-3}$$
Each sensitive API is treated as a text item, the set of all sensitive APIs called by a sample is treated as a file, and the whole software sample library is treated as the file set; the weight of each sensitive API feature is then calculated. The higher the weight, the more commonly the sensitive API is called in malware, which yields a weighted sensitive API call feature vector. Sensitive APIs with larger weights are kept and those with smaller weights are removed, thereby reducing the feature dimension of the permissions, sensitive APIs, and so on. The result is a feature vector carrying weight information, and these feature weights serve as an important basis for the subsequent improvement of the naive Bayes algorithm;
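The weighting described in step (1) can be sketched as follows. This is a minimal sketch under stated assumptions: the natural logarithm is used because the text does not fix a base, the function names `tfidf_weights` and `keep_top_k` are hypothetical, and the API names other than `chmod` are illustrative placeholders rather than a list taken from the patent.

```python
import math
from collections import Counter


def tfidf_weights(samples):
    """samples: one list of sensitive API names per APK (repeats allowed).

    Returns one dict per sample mapping API name -> TF-IDF weight, using
    TF = count / total calls in the sample and IDF = log(D / (D_i + 1)).
    """
    total_docs = len(samples)
    doc_freq = Counter()
    for calls in samples:
        doc_freq.update(set(calls))  # count each API once per sample
    weights = []
    for calls in samples:
        counts = Counter(calls)
        total_calls = sum(counts.values())
        weights.append(
            {
                api: (n / total_calls) * math.log(total_docs / (doc_freq[api] + 1))
                for api, n in counts.items()
            }
        )
    return weights


def keep_top_k(weight_map, k):
    """Dimension reduction: keep the k APIs with the largest weights."""
    return sorted(weight_map, key=weight_map.get, reverse=True)[:k]


# Toy sample library (hypothetical data).
samples = [
    ["chmod", "chmod", "sendTextMessage"],
    ["getDeviceId"],
    ["getDeviceId", "sendTextMessage"],
    ["getDeviceId"],
]
weights = tfidf_weights(samples)
top = keep_top_k(weights[0], 1)
```

Note that with the "+1" in the denominator, an API that occurs in every sample receives a slightly negative weight, so it is always ranked below rarer APIs.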
(2) selecting features based on association rules:
A level-wise iterative search is used, that is, the (k+1)-permission sets are explored from the k-permission sets;
first, the permission test set is scanned, the number of occurrences of each permission is counted, and the permissions that meet the minimum support are collected to obtain the set of frequent 1-permission sets, denoted Q1;
then Q1 is used to find the set Q2 of frequent 2-permission sets, and so on, until no further frequent k-permission sets can be found;
association rules are obtained with the Apriori algorithm, and the strength of an association rule is measured by its support and confidence, which are expressed as:
Support:
$$\mathrm{support}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N} \tag{1-4}$$
Confidence:
$$\mathrm{confidence}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)} \tag{1-5}$$
where X and Y are subsets of the permission set, σ(X) is the number of APKs that contain the permission subset X, and N is the total number of APKs in the data set. Support measures the proportion of the whole data set in which the subset occurs, and confidence measures the proportion of the samples containing X that also contain Y. If the association rules that meet the set support and confidence include both X → Y and Y → X, only one of the two permission subsets is retained, which greatly weakens the correlation among the permissions;
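The level-wise search and the redundancy rule above can be illustrated with a compact Apriori-style sketch. It is written for readability rather than speed, the thresholds and permission sets are made-up values, and the tie-break in `redundant_permissions` (dropping the lexicographically larger name) is an arbitrary assumption, since the text only says that one of the two subsets is kept.

```python
from itertools import combinations


def support_count(itemset, transactions):
    """Number of APK permission sets that contain every permission in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: frequent k-sets are used to build candidate (k+1)-sets."""
    n = len(transactions)
    items = sorted({p for t in transactions for p in t})
    level = [
        frozenset([p])
        for p in items
        if support_count(frozenset([p]), transactions) / n >= min_support
    ]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support_count(s, transactions) / n
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        level = [
            c
            for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k))
            and support_count(c, transactions) / n >= min_support
        ]
        k += 1
    return frequent


def association_rules(frequent, min_confidence):
    """Return (X, Y, support, confidence) for every rule X -> Y above the threshold."""
    out = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for x in combinations(sorted(itemset), r):
                x = frozenset(x)
                conf = sup / frequent[x]
                if conf >= min_confidence:
                    out.append((x, itemset - x, sup, conf))
    return out


def redundant_permissions(rule_list):
    """If X -> Y and Y -> X both hold for single permissions, mark one of them for removal."""
    pairs = {(x, y) for x, y, _, _ in rule_list if len(x) == 1 and len(y) == 1}
    drop = set()
    for x, y in pairs:
        if (y, x) in pairs:
            drop.add(max(next(iter(x)), next(iter(y))))
    return drop


# Hypothetical permission sets of four APKs.
transactions = [
    {"READ_CONTACTS", "WRITE_CONTACTS", "INTERNET"},
    {"READ_CONTACTS", "WRITE_CONTACTS"},
    {"INTERNET"},
    {"READ_CONTACTS", "WRITE_CONTACTS", "SEND_SMS"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
rule_list = association_rules(frequent, min_confidence=0.8)
dropped = redundant_permissions(rule_list)
```

Here READ_CONTACTS and WRITE_CONTACTS imply each other, so one of the pair is flagged as redundant.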
(3) the traditional naive Bayes algorithm
Definition: suppose a sample has n attribute features and the vector X denotes the attribute set formed by these n attributes, so that the attribute information of the sample is represented by the vector X = (X1, X2, X3, …, Xn); the samples are divided into the classes C1, C2, …, Cm, which together form a complete set of events, and each Ci is one class. The conditional probability of each attribute given a class Ci, namely P(X1|Ci), P(X2|Ci), …, P(Xn|Ci), is computed on the training sample set. Finally, when a sample X is classified, its posterior probability for each class, namely P(C1|X), P(C2|X), …, P(Cm|X), is computed, and the class Ci with the highest posterior probability is taken as the final class of the sample;
according to this definition, the posterior probability in the naive Bayes algorithm is:
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)} \tag{1-6}$$
where P(X) is the same constant for every class, so the maximum posterior probability is determined by the maximum of P(Ci)P(X|Ci) alone; P(Ci) is the prior probability of class Ci in the training samples, and P(X|Ci) is the conditional probability of the attribute vector X given class Ci. Because the naive Bayes algorithm assumes that the attributes are mutually independent:
$$P(X \mid C_i) = \prod_{k=1}^{n} P(X_k \mid C_i) \tag{1-7}$$
In summary, the class to which the sample X belongs is the one that maximizes the posterior probability, defined as:
$$C_{MAP} = \arg\max_{C_i} P(C_i) \prod_{k=1}^{n} P(X_k \mid C_i) \tag{1-8}$$
where $C_{MAP}$ is the final classification decision of the naive Bayes algorithm according to the maximum posterior probability;
(4) improved naive Bayes based on weighting
In the feature weighting module, the weight $W(X_k)$ of each feature attribute is calculated with the TF-IDF algorithm. These weights are used to remove redundancy among the permission features and, at the same time, to improve the naive Bayes algorithm: taking the TF-IDF weights calculated above as the basis, the weight $W(X_k)$ is substituted into the naive Bayes posterior probability formula (1-6) to obtain the new posterior probability:
[Formula (1-9): weighted posterior probability; the equation image is not reproduced in the text]
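Because formula (1-9) survives only as an image reference, the sketch below shows one common way to weight naive Bayes and should not be read as the patent's exact formula. It assumes that each weight acts as an exponent on the corresponding conditional probability, so the score is log P(Ci) + Σ W(Xk)·log P(Xk|Ci), and it assumes binary features with Laplace smoothing. The function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def train(samples, labels):
    """samples: binary feature vectors; labels: class names.

    Returns (priors, cond) where cond[c][k][v] estimates P(X_k = v | c).
    """
    n_features = len(samples[0])
    by_class = defaultdict(list)
    for x, c in zip(samples, labels):
        by_class[c].append(x)
    priors = {c: len(rows) / len(samples) for c, rows in by_class.items()}
    cond = {}
    for c, rows in by_class.items():
        cond[c] = []
        for k in range(n_features):
            ones = sum(row[k] for row in rows)
            p1 = (ones + 1) / (len(rows) + 2)  # Laplace smoothing (assumption)
            cond[c].append({1: p1, 0: 1 - p1})
    return priors, cond


def classify(x, priors, cond, weights=None):
    """Return (best_class, log_scores); weights=None gives standard naive Bayes."""
    weights = weights or [1.0] * len(x)
    scores = {
        c: math.log(priors[c])
        + sum(w * math.log(cond[c][k][v]) for k, (v, w) in enumerate(zip(x, weights)))
        for c in priors
    }
    return max(scores, key=scores.get), scores


# Toy data: feature 0 = SEND_SMS requested, feature 1 = INTERNET requested.
X_train = [[1, 1], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
y_train = ["malicious"] * 3 + ["benign"] * 3
priors, cond = train(X_train, y_train)

label_plain, scores_plain = classify([1, 1], priors, cond)
label_weighted, scores_weighted = classify([1, 1], priors, cond, weights=[2.0, 0.1])
```

With all weights equal to 1 the decision rule reduces to formula (1-8); a larger weight makes a feature count more heavily in the comparison between classes.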
Preferably, the dynamic features extracted with the automated test tool MonkeyRunner and the dynamic analysis tool Inspeckage are processed as follows:
the dynamic feature vector is processed with a GRU model:
$X_t$ is the input of the current unit, $Y_t$ is the output of the current unit, and $h_t$ is the hidden state of the current unit; $h_{t-1}$ is the hidden state output by the previous unit and passed to the current unit, and it carries the relevant information of the previous unit;
the internal calculation of the GRU is as follows:
$\odot$ is the Hadamard product, that is, element-wise multiplication of matrices; $+$ is matrix addition, that is, element-wise addition of matrices; there are two gates, the reset gate r and the update gate z; $W_r$, $W_z$, and $W$ are weight matrices;
activation functions:
$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{1-10}$$
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{1-11}$$
Step 1: compute the two gate states r and z;
$$r = \sigma(W_r X_t + W_r h_{t-1}) \tag{1-12}$$
$$z = \sigma(W_z X_t + W_z h_{t-1}) \tag{1-13}$$
Step 2: reset the data with the reset gate r and compute h';
$$h_{t-1}' = h_{t-1} \odot r \tag{1-14}$$
$$h' = \tanh(W X_t + W h_{t-1}') \tag{1-15}$$
Step 3: update the memory; the closer the gate signal z is to 1, the more data is remembered, and the closer it is to 0, the more data is forgotten;
$$h_t = z \odot h_{t-1} + (1 - z) \odot h' \tag{1-16}$$
As described above, the GRU combines $X_t$ and $h_{t-1}$ to obtain the output $Y_t$ of the current hidden unit and passes the hidden state $h_t$ to the next unit, where $Y_t$ and $h_t$ are numerically identical.
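Steps 1 to 3 can be traced in a small pure-Python sketch. One interpretive assumption is made: the text applies the same symbol (for example W_r) to both X_t and h_{t-1}, and the sketch reads this as a single matrix acting on the concatenation of the two vectors, which is the usual way such a formula is implemented. Bias terms are omitted because the text does not mention them, and the helper names and weight values are hypothetical.

```python
import math


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def tanh(v):
    return [math.tanh(a) for a in v]


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def hadamard(a, b):
    """Element-wise product of two vectors."""
    return [x * y for x, y in zip(a, b)]


def gru_step(x_t, h_prev, W_r, W_z, W):
    """One GRU update following equations (1-12) to (1-16).

    Each weight matrix has shape (hidden, input + hidden) and acts on the
    concatenation [x_t, h]; this reading of the formulas is an assumption.
    """
    r = sigmoid(matvec(W_r, x_t + h_prev))    # reset gate, (1-12)
    z = sigmoid(matvec(W_z, x_t + h_prev))    # update gate, (1-13)
    h_reset = hadamard(h_prev, r)             # (1-14)
    h_cand = tanh(matvec(W, x_t + h_reset))   # candidate state h', (1-15)
    kept = hadamard(z, h_prev)
    new = hadamard([1.0 - a for a in z], h_cand)
    return [k + n for k, n in zip(kept, new)]  # (1-16)


def run_sequence(xs, hidden_size, W_r, W_z, W):
    """Feed a time-ordered list of input vectors through the GRU, starting from h = 0."""
    h = [0.0] * hidden_size
    for x_t in xs:
        h = gru_step(x_t, h, W_r, W_z, W)
    return h


# Hypothetical weights: input size 1, hidden size 2.
W_demo = [[0.1, -0.2, 0.3], [0.05, 0.1, -0.1]]
final_state = run_sequence([[1.0], [0.0], [1.0]], 2, W_demo, W_demo, W_demo)
```

The final hidden state would normally be passed to a classification layer; that layer is outside what the text describes and is not shown.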
The beneficial effects of the invention are as follows:
Static analysis performs poorly on Android malware that uses obfuscation, and more and more Android malware is now disguised with obfuscation techniques. The invention therefore analyzes several kinds of static features, namely resource-type features and syntax-type features. Resource-type features are extracted from the resource files stored in the APK and include certificate-related features, package-name-related features, features related to malicious code hiding, and so on. Syntax-type features are extracted from the code in the APK and include features related to permissions, intents, and sensitive API calls.
The invention uses association rule mining (ARM) and the TF-IDF algorithm to process the features, removing strongly correlated features, extracting the effective features, and weighting the naive Bayes algorithm; this effectively reduces the feature dimension, removes redundant features, and improves computational efficiency. In dynamic detection, the dynamic features are correlated in time and form a time series after entity embedding encoding. A recurrent neural network (RNN) is particularly suited to sequences and time-correlated features and handles time-varying features well. The GRU is a widely used special kind of recurrent neural network that addresses the vanishing and exploding gradient problems that arise when training on long sequences. Compared with an ordinary RNN it performs better on longer time series, and compared with other RNN models it has fewer parameters and is easier to train, which greatly improves training efficiency. A gated recurrent unit (GRU) is therefore used for dynamic detection.
The proposed method was verified on the Android Malware Dataset (AMD); it shows good classification performance, can effectively detect malware that uses obfuscation techniques, and improves detection accuracy.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic illustration of static feature detection according to the present invention;
FIG. 3 is a schematic diagram of the internal structure of the GRU according to the present invention.
Detailed Description
The first embodiment is as follows:
In this embodiment, as shown in FIG. 1, the malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit is implemented by the following steps:
Step one: decompile the software sample files to be detected with apktool to obtain the decompiled resource files of each application, which comprise the AndroidManifest.xml manifest file and the smali bytecode files;
Step two: extract feature sets from the decompiled resource files, namely a Permission set, an Intent set, and a sensitive API set; sort the extracted feature sets by usage count in ascending order, select the high-frequency features, merge them into a single feature set, and quantize that feature set; wherein
the Permission set and the Intent set are obtained from the AndroidManifest.xml file;
the sensitive API set is obtained from the smali files;
the classes.dex file is decompiled and analyzed with the baksmali tool to obtain the API interfaces that are called (for example, chmod is a sensitive API for changing user permissions), and the sensitive API features are obtained from this analysis.
Step three: process the features that vary over time with a gated recurrent unit in order to detect dynamic features.
The second embodiment is as follows:
This embodiment differs from the first embodiment in that, in the malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit, the feature set is extracted from the decompiled resource files by processing the static features with a method that combines an association rule mining (ARM) algorithm with the TF-IDF algorithm: strongly correlated features are removed, the static features are extracted, and the naive Bayes algorithm is weighted, which effectively reduces the feature dimension, removes redundant features, and improves computational efficiency; the static features include requested permissions, components, intents, and sensitive APIs;
Although the naive Bayes algorithm generally classifies well, its performance degrades when the number of attributes is large or the attributes are strongly correlated, and the traditional naive Bayes algorithm assumes by default that every feature attribute contributes equally to the decision attribute. The invention addresses these problems by performing static detection of Android malware with an improved naive Bayes algorithm.
The third embodiment is as follows:
This embodiment differs from the first or second embodiment in that the time-varying features are processed with a gated recurrent unit to detect dynamic features as follows: the Xposed framework is installed on an emulator or a mobile device, root permission is obtained, and the dynamic features are extracted with the automated test tool MonkeyRunner together with the dynamic analysis tool Inspeckage;
the dynamic features are the runtime behaviors of the Android application, including file read and write operations, call requests, SMS requests, data encryption and decryption operations, network data input and output, and reading of private information; these behaviors express the behavioral intent of the application;
The dynamic feature action_sendnet indicates data transfer over the network, the dynamic feature actions_telephones indicates use of the call service, and the dynamic feature action_sendmss indicates sending of a short message. Dynamic feature extraction is based mainly on monitoring the relevant API calls, and each dynamic feature corresponds to a group of API calls.
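The correspondence between monitored API calls and dynamic features can be sketched as a lookup followed by time ordering. The three feature names come from the text; the grouping of API methods under each feature, the log format of (timestamp, API name) pairs, and the padding scheme are assumptions made for the example and are not specified in the patent.

```python
# Hypothetical grouping: each dynamic feature corresponds to a set of monitored API calls.
FEATURE_APIS = {
    "action_sendnet": {"java.net.Socket.connect", "java.net.URL.openConnection"},
    "actions_telephones": {"android.telecom.TelecomManager.placeCall"},
    "action_sendmss": {"android.telephony.SmsManager.sendTextMessage"},
}

# Reverse lookup from API name to feature name.
API_TO_FEATURE = {api: feat for feat, apis in FEATURE_APIS.items() for api in apis}

# Integer ids for the features; 0 is reserved for padding.
FEATURE_IDS = {feat: i + 1 for i, feat in enumerate(FEATURE_APIS)}


def to_feature_sequence(call_log, max_len):
    """Turn a log of (timestamp, api_name) records into a fixed-length id sequence.

    Records are ordered by timestamp, calls outside the monitored groups are
    skipped, and the result is truncated or zero-padded to max_len.
    """
    ordered = sorted(call_log, key=lambda rec: rec[0])
    seq = [FEATURE_IDS[API_TO_FEATURE[api]] for _, api in ordered if api in API_TO_FEATURE]
    seq = seq[:max_len]
    return seq + [0] * (max_len - len(seq))


# Hypothetical monitoring output.
log = [
    (2.0, "android.telephony.SmsManager.sendTextMessage"),
    (1.0, "java.net.Socket.connect"),
    (1.5, "java.lang.String.length"),
]
sequence = to_feature_sequence(log, max_len=4)
```

A sequence of this kind is the sort of time-ordered input that an embedding layer and a GRU would consume.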
Tools are currently available for monitoring dynamic behavior. In previous studies, researchers collected dynamic features with DroidBox, a TaintDroid-based dynamic analysis tool for Android applications that monitors system APIs through hooking and records the dynamic behavior of an application for a period of time after it starts. Although DroidBox is functionally complete, it cannot run detection automatically and cannot exercise all the functions of an Android application, so its monitoring results may be incomplete. The automated test tool MonkeyRunner and the dynamic analysis tool Inspeckage are therefore chosen to extract the dynamic features.
The fourth embodiment is as follows:
This embodiment differs from the third embodiment in that, in the malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit, the static features are processed with the method that combines the association rule mining (ARM) algorithm with the TF-IDF algorithm: strongly correlated features are removed, the feature dimension is reduced, and the naive Bayes algorithm is weighted; the weighted naive Bayes algorithm improves the accuracy of identifying malicious Android applications. The specific steps are as follows:
(1) constructing a feature weighting algorithm based on the TF-IDF algorithm;
TF-IDF (Term Frequency-Inverse Document Frequency) is a commonly used feature weighting algorithm. It is usually used to evaluate the importance of a text item within a document set and is widely applied in text retrieval and classification. TF-IDF consists of two parts, the term frequency (TF) and the inverse document frequency (IDF): the term frequency is how often a given text item appears in a document, and the inverse document frequency reflects how often files containing that item appear in the file set. The main idea of TF-IDF is to find text items that discriminate well between categories: if an item appears frequently in one document and rarely in the other files of the file set, the item is more important, that is, it is an item with good category discrimination. The TF-IDF weight is calculated as follows:
$$\mathrm{Weight} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i \tag{1-1}$$
Therefore, the TF-IDF value of the i-th text item is the product of its TF value and its IDF value. The term frequency TF is given by formula (1-2), where $N_{i,j}$ is the number of times the i-th text item appears in the j-th document and $\sum_k N_{k,j}$ is the total number of occurrences of all words in the j-th document;
$$\mathrm{TF}_{i,j} = \frac{N_{i,j}}{\sum_k N_{k,j}} \tag{1-2}$$
The inverse document frequency IDF is given by formula (1-3), where $D$ is the total number of files in the file set and $D_i$ is the number of files in the file set in which the i-th text item appears; 1 is added to $D_i$ to prevent the denominator from being 0 when $D_i$ is 0.
$$\mathrm{IDF}_i = \log\frac{D}{D_i + 1} \tag{1-3}$$
For the sensitive API feature vector, following the idea of the TF-IDF algorithm, each sensitive API is treated as a text item, the set of all sensitive APIs called by a sample is treated as a file, and the whole software sample library is treated as the file set; the weight of each sensitive API feature is then calculated. The higher the weight, the more commonly the sensitive API is called in malware, which yields a weighted sensitive API call feature vector. Sensitive APIs with larger weights are kept, and those with smaller weights, which are not discriminative, are removed, thereby reducing the feature dimension of the permissions, sensitive APIs, and so on. The result is a feature vector carrying weight information, and these feature weights serve as an important basis for the subsequent improvement of the naive Bayes algorithm;
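As a worked illustration of formulas (1-1) to (1-3), using made-up counts and assuming a base-10 logarithm (the text does not state the base): suppose the sample library holds $D = 1000$ samples, a given sensitive API is called by $D_i = 99$ of them, and in sample $j$ it accounts for $N_{i,j} = 2$ of the 20 sensitive API calls recorded.

$$\mathrm{TF}_{i,j} = \frac{2}{20} = 0.1, \qquad \mathrm{IDF}_i = \log_{10}\frac{1000}{99 + 1} = 1, \qquad \mathrm{Weight} = 0.1 \times 1 = 0.1$$

For comparison, an API with the same term frequency that is called by 899 of the 1000 samples has $\mathrm{IDF} = \log_{10}(1000/900) \approx 0.046$ and a weight of about $0.0046$, so it would be among the first features removed during dimension reduction.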
(2) selecting characteristics based on association rules;
there is often a significant correlation between rights characteristic limits. For example, READ _ CONTACTS and WRITE _ CONTACTS have high correlation with the READ-WRITE rights of the contact information, and are almost paired or missing in the sample. If the authority features include a large number of features with strong correlation, the classification overhead is increased, and the classification accuracy is interfered. The method performs redundancy removal processing on the authority characteristics with strong correlation by using association rule learning.
The APriori algorithm is one of classical association rule learning algorithms in the field of data mining and the field of computers, and is mainly used for processing association rules among content information related to transactions such as commodity lists, access lists and the like in a database. The algorithm adopts an iteration mode of searching layer by layer, namely, a (k +1) authority set is explored through a k authority set;
firstly, scanning an authority test set, counting the occurrence times of each authority, collecting the authorities meeting the minimum support degree to obtain a set of frequent 1 authority sets, and recording the set as Q1;
then, finding out a set Q2 of the frequent 2 authority sets by means of Q1, and so on until the frequent k authority sets can not be found;
association rules are obtained through the Apriori algorithm, and the strength of an association rule is measured by its support and confidence, which are expressed as:
Support:
$$\mathrm{support}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N} \quad (1\text{-}4)$$
Confidence:
$$\mathrm{confidence}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)} \quad (1\text{-}5)$$
where X and Y are subsets of the permission set, σ(X) denotes the number of APKs containing the permission subset X, and N denotes the total number of APKs in the data set. The support gives the proportion of the whole data set in which the subset occurs, and the confidence gives the proportion of the samples containing X that also contain Y. If the association rules meeting the set support and confidence thresholds include both X → Y and Y → X, only one of the two permission subsets is kept, which greatly weakens the correlation among the permissions;
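The level-by-level search and the support and confidence measures can be illustrated with the following Python sketch. It is a simplified illustration rather than the implementation of the invention: all function names and thresholds are hypothetical, candidate sets are filtered only by their support, and the choice of which permission of a mutually implied pair to drop (here the alphabetically later one) is an assumption.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of samples (permission sets) that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """Fraction of samples containing x that also contain y."""
    n_x = sum(1 for t in transactions if x <= t)
    n_xy = sum(1 for t in transactions if (x | y) <= t)
    return n_xy / n_x if n_x else 0.0

def frequent_itemsets(transactions, min_support):
    """Level-wise search: frequent k-sets generate the candidate (k+1)-sets."""
    items = sorted({p for t in transactions for p in t})
    level = [frozenset([p]) for p in items
             if support(frozenset([p]), transactions) >= min_support]
    found = []
    while level:
        found.extend(level)
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c, transactions) >= min_support]
    return found

def redundant_permissions(transactions, min_support, min_conf):
    """Permissions to drop because X -> Y and Y -> X both hold strongly."""
    drop = set()
    for itemset in frequent_itemsets(transactions, min_support):
        if len(itemset) != 2:
            continue
        a, b = sorted(itemset)
        x, y = frozenset([a]), frozenset([b])
        if (confidence(x, y, transactions) >= min_conf
                and confidence(y, x, transactions) >= min_conf):
            drop.add(b)  # assumption: keep a, drop the alphabetically later one
    return drop
```

With READ_CONTACTS and WRITE_CONTACTS appearing together in most samples, both rules reach a high confidence and only one of the two permissions is retained as a detection feature.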
(3) traditional naive Bayes algorithm:
the principle of the naive Bayes algorithm is as follows:
assume that a sample has n attribute features and that the vector X represents the attribute set composed of these n attributes, so that the attribute feature information of the sample can be represented by the vector X = (X1, X2, X3, …, Xn), and that the samples are divided into the classes C1, C2, …, Cm, where each Ci is a class. The conditional probability of each attribute given a class Ci, namely P(X1 | Ci), P(X2 | Ci), …, P(Xn | Ci), is obtained by calculation on the training sample set. Finally, when a sample X is classified, its posterior probability for each class, namely P(C1 | X), P(C2 | X), …, P(Cm | X), is calculated, and the class Ci with the highest posterior probability is taken as the final class of the sample;
from the above definition, the naive Bayes algorithm gives the posterior probability as:
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)} \quad (1\text{-}6)$$
where P(X) is constant for all classes, so the maximum posterior probability is determined solely by the maximum of P(Ci) P(X | Ci); P(Ci) is the prior probability of class Ci in the training samples, and P(X | Ci) is the conditional probability of the attribute set X given class Ci. Since the naive Bayes algorithm assumes that the attributes are independent of each other:
$$P(X \mid C_i) = \prod_{k=1}^{n} P(X_k \mid C_i) \quad (1\text{-}7)$$
In summary, to decide which class the sample X belongs to, it is only necessary to find the class with the maximum posterior probability (MAP), which is defined as:
$$C_{MAP} = \arg\max_{C_i} P(C_i) \prod_{k=1}^{n} P(X_k \mid C_i) \quad (1\text{-}8)$$
where C_MAP is the final classification decided by the naive Bayes algorithm according to the maximum posterior probability;
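The maximum a posteriori decision described above can be sketched in Python for binary features (for example, whether a permission is requested). This is only an illustrative sketch: the function names are hypothetical, the Laplace smoothing is an assumption not stated in the description, and log-probabilities are used purely to avoid numerical underflow.

```python
import math
from collections import defaultdict

def train_nb(samples, labels):
    """samples: tuples of 0/1 features; labels: class name of each sample."""
    n_features = len(samples[0])
    class_count = defaultdict(int)
    feat_count = defaultdict(lambda: [0] * n_features)
    for x, c in zip(samples, labels):
        class_count[c] += 1
        for k, v in enumerate(x):
            feat_count[c][k] += v
    total = len(samples)
    prior = {c: n / total for c, n in class_count.items()}        # P(Ci)
    # P(Xk = 1 | Ci) with Laplace smoothing (an assumption of this sketch).
    cond = {c: [(feat_count[c][k] + 1) / (class_count[c] + 2)
                for k in range(n_features)]
            for c in class_count}
    return prior, cond

def predict_nb(x, prior, cond):
    """Return the class maximising log P(Ci) + sum_k log P(Xk | Ci)."""
    best, best_score = None, -math.inf
    for c in prior:
        score = math.log(prior[c])
        for k, v in enumerate(x):
            p = cond[c][k] if v else 1.0 - cond[c][k]
            score += math.log(p)
        if score > best_score:
            best, best_score = c, score
    return best
```

Because P(X) is the same for every class, it is omitted from the score, exactly as in the MAP definition.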
(4) improved naive Bayes based on weighting
The naive Bayes algorithm performs well overall but also has shortcomings: by default all features are given the same weight and are assumed to be mutually independent. The invention therefore uses a feature weighting module based on the TF-IDF algorithm to improve the accuracy of the naive Bayes classification algorithm.
The naive Bayes classification algorithm does not take the correlation between features into account. To reduce the influence of strongly correlated features on the classification algorithm, the invention provides feature selection based on association rule mining, which can effectively discover strong association rules among permissions; only one permission in each frequent permission set is retained as a detection feature, which reduces the number of permissions again;
in the feature weighting module, the weight value W(X_k) of each feature attribute is calculated with the TF-IDF algorithm; these weight values are used both to remove redundancy among the permission features and to improve the naive Bayes algorithm. Because the traditional naive Bayes classification algorithm does not consider that different features influence the malware classification result to different degrees, the TF-IDF weight values calculated above are taken as the basis, and the weight value W(X_k) is substituted into the naive Bayes posterior probability formula (1-6) to obtain the new posterior probability:
Figure BDA0003142110560000101
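One possible way of substituting feature weights into the naive Bayes decision is sketched below in Python. The exact weighted formula of the invention is given only in the figure above and is not reproduced here; this sketch assumes the commonly used form in which each conditional probability is raised to the power W(X_k), which in log space multiplies each log-probability by its weight. The function name and the example numbers are hypothetical.

```python
import math

def weighted_nb_predict(x, prior, cond, weights):
    """
    x:       tuple of 0/1 features
    prior:   {class: P(Ci)}
    cond:    {class: [P(Xk = 1 | Ci), ...]}
    weights: [W(X_k), ...], e.g. TF-IDF weights of the features
    Assumed score: log P(Ci) + sum_k W(X_k) * log P(Xk | Ci).
    """
    scores = {}
    for c in prior:
        s = math.log(prior[c])
        for k, v in enumerate(x):
            p = cond[c][k] if v else 1.0 - cond[c][k]
            s += weights[k] * math.log(p)
        scores[c] = s
    return max(scores, key=scores.get), scores
```

With all weights equal to 1 the sketch reduces to the traditional naive Bayes decision; a larger weight lets a distinctive feature dominate the decision, which is the intended effect of the weighting.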
The overall static detection flow is shown as the static feature detection in Fig. 2.
The fifth concrete implementation mode:
This embodiment differs from the specific embodiment in that, in the malware detection method based on the hybrid of the improved naive Bayes algorithm and the gated recurrent unit, the dynamic features extracted by the automated testing tool Monkey Runner together with the dynamic analysis tool Inspeckage are processed as follows:
The GRU (Gated Recurrent Unit) is a special type of recurrent neural network (RNN), designed mainly to address the vanishing and exploding gradient problems that arise when training on long sequences. A GRU performs better than an ordinary RNN on longer sequences. It controls the transmitted state through gating states, memorizing information that needs to be kept for a long time and forgetting unimportant information. Unlike an ordinary RNN, which has only one way of superimposing memory, the GRU performs better on longer time sequences; compared with other RNN models, it has fewer parameters, trains faster, is less difficult to train, and needs less data to generalize well. The GRU model is therefore better suited to processing dynamic feature vectors of low dimension;
the dynamic feature vector is processed with a GRU model:
the internal structure of the GRU is shown in Fig. 3. X_t is the input of the current unit, Y_t is the output of the current unit, and h_t is the hidden state of the current unit; h_{t-1} is the hidden state output by the previous unit and passed to the current unit, and it contains information about the previous unit;
the GRU internal calculation process is as follows:
⊙ is the Hadamard product, representing multiplication of corresponding elements of the matrices; + is matrix addition, representing addition of corresponding elements of the matrices; there are two gating states, the reset gate r and the update gate z; W_r, W_z and W are all weight matrices;
activation functions:
$$\sigma(x) = \frac{1}{1 + e^{-x}} \quad (1\text{-}10)$$
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \quad (1\text{-}11)$$
Step 1: obtain the two gating states r and z;
$$r = \sigma(W_r X_t + W_r h_{t-1}) \quad (1\text{-}12)$$
$$z = \sigma(W_z X_t + W_z h_{t-1}) \quad (1\text{-}13)$$
Step 2: reset the data with the reset gate r and calculate h';
$$h_{t-1}' = h_{t-1} \odot r \quad (1\text{-}14)$$
$$h' = \tanh(W X_t + W h_{t-1}') \quad (1\text{-}15)$$
Step 3: update the memory; the closer the gating signal z is to 1, the more data is memorized, and the closer it is to 0, the more data is forgotten;
$$h_t = z \odot h_{t-1} + (1 - z) \odot h' \quad (1\text{-}16)$$
As described above, from X_t and h_{t-1} the GRU obtains the output Y_t of the current hidden unit and passes the hidden state h_t on to the next unit, where Y_t and h_t are numerically identical.
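The three steps above can be traced with the following pure-Python sketch of a single GRU step. It is an illustration under assumptions and not the implementation of the invention: the formulas above write the same weight symbol for the input term and the hidden-state term, whereas this sketch assumes separate input matrices (`Wr`, `Wz`, `W`) and hidden-state matrices (`Ur`, `Uz`, `U`), a common formulation that allows the input and hidden dimensions to differ; bias terms are omitted, and all names are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def gru_step(x_t, h_prev, p):
    """One GRU step; p maps 'Wr', 'Ur', 'Wz', 'Uz', 'W', 'U' to matrices."""
    # Step 1: reset gate r and update gate z.
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x_t), matvec(p["Ur"], h_prev))]
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x_t), matvec(p["Uz"], h_prev))]
    # Step 2: reset the previous hidden state and compute the candidate h'.
    h_reset = [h * g for h, g in zip(h_prev, r)]
    h_cand = [math.tanh(a + b)
              for a, b in zip(matvec(p["W"], x_t), matvec(p["U"], h_reset))]
    # Step 3: z close to 1 keeps more of h_prev, z close to 0 takes more of h'.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]

def gru_run(sequence, h0, p):
    """Feed a time-ordered sequence of feature vectors; return the last hidden state."""
    h = h0
    for x_t in sequence:
        h = gru_step(x_t, h, p)
    return h
```

The value returned by `gru_step` plays the role of both Y_t and h_t, since the two are numerically identical; the final hidden state returned by `gru_run` could then be passed to a classifier layer, which is outside the scope of this sketch.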
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit, characterized in that the method is realized by the following steps:
step one, decompiling the software sample set files to be detected using apktool to obtain the decompiled resource files of the application program, which include the AndroidManifest.xml manifest file and the smali bytecode files;
step two, extracting feature sets from the decompiled resource files, the feature sets comprising a Permission set, an Intent set and a sensitive API set; sorting the extracted feature sets in ascending order of number of uses, selecting the high-frequency features, merging them into one feature set, and quantizing the feature set; wherein,
the Permission set and the Intent set are obtained from the AndroidManifest.xml file;
the sensitive API set is obtained from the smali files; the classes.dex file is decompiled and analyzed with the baksmali tool to obtain the called API interfaces, where chmod is a sensitive API for changing user permissions; the sensitive API features are obtained by analyzing the classes.dex file;
and step three, processing the features that change over time with a gated recurrent unit so as to detect the dynamic features.
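By way of illustration only, and forming no part of the claims, the extraction of the Permission set and the Intent set in step two can be sketched in Python as follows. The sketch assumes that apktool has already decoded the APK so that AndroidManifest.xml is plain-text XML; the function names and the binary encoding used to quantize the features are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace prefix under which Android attributes such as android:name are stored.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def extract_manifest_features(manifest_xml):
    """Return (permissions, intent actions) found in a decoded manifest text."""
    root = ET.fromstring(manifest_xml)
    permissions = sorted({e.get(ANDROID_NS + "name")
                          for e in root.iter("uses-permission")
                          if e.get(ANDROID_NS + "name")})
    intents = sorted({e.get(ANDROID_NS + "name")
                      for e in root.iter("action")
                      if e.get(ANDROID_NS + "name")})
    return permissions, intents

def to_binary_vector(present, vocabulary):
    """Quantize a feature set: 1 if the vocabulary entry is present, else 0."""
    present = set(present)
    return [1 if feature in present else 0 for feature in vocabulary]
```

The vocabulary passed to `to_binary_vector` would be the list of high-frequency features selected over the whole sample set, so that every sample is mapped to a vector of the same length.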
2. The malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit according to claim 1, characterized in that: in step two, extracting the feature set from the decompiled resource files specifically comprises processing the static features with a method combining an association rule mining algorithm and the TF-IDF algorithm, removing highly correlated features, extracting the static features, weighting the naive Bayes algorithm, reducing the feature dimension, and removing redundant features; wherein the static features include requested permissions, components, intents, and sensitive APIs.
3. The malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit according to claim 1, characterized in that: processing the features that change over time with a gated recurrent unit so as to detect the dynamic features specifically comprises installing the Xposed framework on an emulator or a mobile device, obtaining root permission, and extracting the dynamic features with the automated testing tool Monkey Runner together with the dynamic analysis tool Inspeckage;
the dynamic features are the behavior features of the Android application software at run time, including file read and write operations, call requests, short message requests, data encryption and decryption operations, network data input and output, and reading of private information; these behaviors can express the behavioral intent of the application software.
4. The malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit according to claim 2, characterized in that:
the method of processing the static features with the association rule mining algorithm and the TF-IDF algorithm comprises the following specific steps:
(1) constructing a feature weighting algorithm based on the TF-IDF algorithm;
the aim is to find text information that distinguishes well between categories: if a piece of text information occurs with high frequency in one document and with low frequency in the other files of the file set, it is more important, that is, it is the text information being sought and distinguishes well between categories; the weight of the TF-IDF algorithm is calculated by the following formula:
$$Weight = TF_{i,j} \times IDF_i \quad (1\text{-}1)$$
thus the TF-IDF value of the i-th piece of text information is the product of its TF value and its IDF value, wherein the term frequency TF is given by formula (1-2), in which N_{i,j} denotes the number of occurrences of the i-th piece of text information in the j-th document and Σ_k N_{k,j} denotes the total number of occurrences of all words in the j-th document;
$$TF_{i,j} = \frac{N_{i,j}}{\sum_{k} N_{k,j}} \quad (1\text{-}2)$$
the inverse file frequency IDF is given by formula (1-3), in which D denotes the number of all files in the file set and D_i denotes the number of files in the file set in which the i-th piece of text information appears,
Figure FDA0003142110550000022
each sensitive API is treated as a piece of text information, the set of all sensitive APIs called by each sample is treated as a file, and the whole software sample library is treated as the file set, and the weight value of each sensitive API feature is calculated; the higher the weight value, the more commonly the sensitive API is called in malicious software, which yields a weighted sensitive-API call feature vector; features with larger weights are kept and sensitive APIs with smaller weights are removed, thereby performing feature dimension reduction on the permissions, sensitive APIs and the like; a feature vector carrying weight information is obtained, and the weight information of the features serves as an important basis for the subsequent improvement of the Bayesian algorithm;
(2) feature selection based on association rules:
an iterative level-by-level search is adopted, that is, (k+1)-permission sets are explored from k-permission sets;
first, the permission test set is scanned, the number of occurrences of each permission is counted, and the permissions meeting the minimum support are collected to obtain the set of frequent 1-permission sets, denoted Q1;
then, the set Q2 of frequent 2-permission sets is found from Q1, and so on, until no frequent k-permission set can be found;
association rules are obtained through the Apriori algorithm, and the strength of an association rule is measured by its support and confidence, which are expressed as:
Support:
$$\mathrm{support}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N} \quad (1\text{-}4)$$
Confidence:
$$\mathrm{confidence}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)} \quad (1\text{-}5)$$
where X and Y are subsets of the permission set, σ(X) denotes the number of APKs containing the permission subset X, and N denotes the total number of APKs in the data set; the support gives the proportion of the whole data set in which the subset occurs, and the confidence gives the proportion of the samples containing X that also contain Y; if the association rules meeting the set support and confidence thresholds include both X → Y and Y → X, only one of the two permission subsets is kept, which greatly weakens the correlation among the permissions;
(3) traditional naive Bayes algorithm
Definition: assume that a sample has n attribute features and that the vector X represents the attribute set composed of these n attributes, so that the attribute feature information of the sample can be represented by the vector X = (X1, X2, X3, …, Xn), and that the samples are divided into the classes C1, C2, …, Cm, where each Ci is a class; the conditional probability of each attribute given a class Ci, namely P(X1 | Ci), P(X2 | Ci), …, P(Xn | Ci), is obtained by calculation on the training sample set; finally, when a sample X is classified, its posterior probability for each class, namely P(C1 | X), P(C2 | X), …, P(Cm | X), is calculated, and the class Ci with the highest posterior probability is taken as the final class of the sample;
from the above definition, the naive Bayes algorithm gives the posterior probability as:
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)} \quad (1\text{-}6)$$
where P(X) is constant for all classes, so the maximum posterior probability is determined solely by the maximum of P(Ci) P(X | Ci); P(Ci) is the prior probability of class Ci in the training samples, and P(X | Ci) is the conditional probability of the attribute set X given class Ci; since the naive Bayes algorithm assumes that the attributes are independent of each other:
$$P(X \mid C_i) = \prod_{k=1}^{n} P(X_k \mid C_i) \quad (1\text{-}7)$$
in summary, to decide which class the sample X belongs to, it is only necessary to find the class with the maximum posterior probability, which is defined as:
$$C_{MAP} = \arg\max_{C_i} P(C_i) \prod_{k=1}^{n} P(X_k \mid C_i) \quad (1\text{-}8)$$
where C_MAP is the final classification decided by the naive Bayes algorithm according to the maximum posterior probability;
(4) improved naive Bayes based on weighting
In the feature weighting module, the weight value W(X_k) of each feature attribute is calculated with the TF-IDF algorithm; these weight values are used to remove redundancy among the permission features and at the same time to improve the naive Bayes algorithm; the TF-IDF weight values calculated above are taken as the basis, and the weight value W(X_k) is substituted into the naive Bayes posterior probability formula (1-6) to obtain the new posterior probability:
Figure FDA0003142110550000034
5. The malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit according to claim 3, characterized in that:
the dynamic features extracted by the automated testing tool Monkey Runner together with the dynamic analysis tool Inspeckage are processed as follows:
the dynamic feature vector is processed with a GRU model:
X_t is the input of the current unit, Y_t is the output of the current unit, and h_t is the hidden state of the current unit; h_{t-1} is the hidden state output by the previous unit and passed to the current unit, and it contains information about the previous unit;
the GRU internal calculation process is as follows:
⊙ is the Hadamard product, representing multiplication of corresponding elements of the matrices; + is matrix addition, representing addition of corresponding elements of the matrices; there are two gating states, the reset gate r and the update gate z; W_r, W_z and W are all weight matrices;
activation functions:
$$\sigma(x) = \frac{1}{1 + e^{-x}} \quad (1\text{-}10)$$
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \quad (1\text{-}11)$$
step 1: obtaining the two gating states r and z;
$$r = \sigma(W_r X_t + W_r h_{t-1}) \quad (1\text{-}12)$$
$$z = \sigma(W_z X_t + W_z h_{t-1}) \quad (1\text{-}13)$$
step 2: resetting the data with the reset gate r and calculating h';
$$h_{t-1}' = h_{t-1} \odot r \quad (1\text{-}14)$$
$$h' = \tanh(W X_t + W h_{t-1}') \quad (1\text{-}15)$$
step 3: updating the memory, wherein the closer the gating signal z is to 1, the more data is memorized, and the closer it is to 0, the more data is forgotten;
$$h_t = z \odot h_{t-1} + (1 - z) \odot h' \quad (1\text{-}16)$$
as described above, from X_t and h_{t-1} the GRU obtains the output Y_t of the current hidden unit and passes the hidden state h_t on to the next unit, where Y_t and h_t are numerically identical.
CN202110737511.1A 2021-06-30 2021-06-30 Malicious software detection method based on improved naive Bayes algorithm and gated loop unit mixing Pending CN113378167A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110737511.1A CN113378167A (en) 2021-06-30 2021-06-30 Malicious software detection method based on improved naive Bayes algorithm and gated loop unit mixing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110737511.1A CN113378167A (en) 2021-06-30 2021-06-30 Malicious software detection method based on improved naive Bayes algorithm and gated loop unit mixing

Publications (1)

Publication Number Publication Date
CN113378167A (en) 2021-09-10

Family

ID=77580250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110737511.1A Pending CN113378167A (en) 2021-06-30 2021-06-30 Malicious software detection method based on improved naive Bayes algorithm and gated loop unit mixing

Country Status (1)

Country Link
CN (1) CN113378167A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116166805A (en) * 2023-02-24 2023-05-26 北京青萌数海科技有限公司 Commodity coding prediction method and device
CN116484382A (en) * 2023-04-07 2023-07-25 中国人民解放军61660部队 Dynamic detection method, system, electronic equipment and storage medium for Android vulnerabilities

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107180192A (en) * 2017-05-09 2017-09-19 北京理工大学 Android malicious application detection method and system based on multi-feature fusion
CN108491719A (en) * 2018-03-15 2018-09-04 重庆邮电大学 A kind of Android malware detection methods improving NB Algorithm
CN109753800A (en) * 2019-01-02 2019-05-14 重庆邮电大学 Merge the Android malicious application detection method and system of frequent item set and random forests algorithm
CN112632539A (en) * 2020-12-28 2021-04-09 西北工业大学 Dynamic and static mixed feature extraction method in Android system malicious software detection

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
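张怡婷 et al.: "Intelligent identification of malicious behavior in Android software based on naive Bayes", Journal of Southeast University (Natural Science Edition), vol. 45, no. 2, 20 March 2015 (2015-03-20), pages 224 - 230 *
张骁敏: "Research on Android malware detection based on permissions and behavior", China Master's Theses Full-text Database, Information Science and Technology, no. 2018, 15 July 2018 (2018-07-15), pages 138 - 98 *
欧阳立: "Research on Android malware detection technology based on deep learning", China Master's Theses Full-text Database, Social Sciences I, no. 2020, 15 December 2020 (2020-12-15), pages 113 - 181 *
纪策: "Research on Android-based malware detection and protection technology", China Master's Theses Full-text Database, Information Science and Technology, no. 2019, 15 May 2019 (2019-05-15), pages 138 - 145 *
邢成: "Research on anomaly detection algorithms based on network traffic", China Master's Theses Full-text Database, Information Science and Technology, no. 2021, 15 May 2021 (2021-05-15), pages 139 - 109 *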

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20210910)