CN110162975B - Multi-step abnormal point detection method based on neighbor propagation clustering algorithm

Info

Publication number: CN110162975B
Application number: CN201910452071.8A
Authority: CN
Inventors: 朱会娟; 冯霞; 王良民; 黎洋; 顾伟; 曹晓雯; 房浩
Original assignee: Jiangsu University
Current assignee: Jiangsu University
Priority date: 2019-05-28
Filing date: 2019-05-28
Publication date: 2022-10-25
Anticipated expiration: 2039-05-28
Also published as: CN110162975A

Abstract

The invention discloses a multi-step abnormal point (outlier) detection method based on a neighbor propagation (affinity propagation) clustering algorithm. The method effectively addresses the curse of dimensionality in abnormal point detection, avoiding the interference that redundant features, or the data noise introduced by too many irrelevant features, cause to abnormal point detection. It also overcomes the heavy dependence of traditional clustering-based or distance-based abnormal point detection techniques on the choice of initial values. The effectiveness of the method is verified by ten-fold cross-validation on a real data set collected from VirusShare and Google Play, and the method has broad application prospects in the field of network security.

Description

Multi-step abnormal point detection method based on neighbor propagation clustering algorithm

Technical Field

The invention belongs to the field of network security technology, and particularly relates to a multi-step abnormal point detection method based on a neighbor propagation clustering algorithm.

Background

The rapid development of the Internet has diversified propagation channels and made application environments more complex, which greatly facilitates the spread of, and attacks by, malicious software; such software is also more aggressive and harmful than traditional computer viruses. Because of characteristics of the Android ecosystem, such as lax auditing by application stores and the freedom of users to publish and download Android applications through third-party application markets, Android has become a main attack target of malicious software; according to recent research data, up to 97% of mobile malware targets Android devices. Malware is the generic term for software that is installed without explicit prompting or permission and that has malicious intent or performs malicious functions that violate the legitimate interests of the user. Malware often exhibits distinctive behaviors, such as frequently accessing files, using the network, sending short messages, and reading the user's address book. A research and analysis report on network privacy security and network fraud (second half of 2018) lists network security problems in which private information such as users' address books and geographic locations is stolen through Android applications, such as entertainment and payment applications, and then used for fraud such as counterfeit bank short messages. Therefore, malware analysis and detection on the Android platform plays a crucial role in network security research.

However, conventional malware detection methods tend to be "retrospective": they rely on a sufficient number of known malware samples to mine the corresponding malware patterns, which is only possible after the malware has already spread widely. To address this practical situation in Android malware detection, the invention introduces anomaly detection. Anomaly detection (outlier detection) aims to detect data that does not conform to normal behavior. It is widely applied in databases, data mining, machine learning, statistics and other fields; applications include fraud detection in the credit card and insurance industries, network intrusion detection and fault diagnosis, identification of new features in satellite image analysis, health and medical monitoring, detection of emergencies in public safety, and identification of novel molecular structures in drug research. Distance-based and clustering-based methods are two typical classes of anomaly detection methods, but in practical applications they face two major challenges: (1) the data noise caused by redundant features or too many irrelevant features in high-dimensional data lowers the accuracy of abnormal point detection; (2) the effectiveness of such methods depends heavily on whether the initial values are set reasonably: abnormal point detection techniques based on traditional clustering or distance methods (e.g., KNN, K-means, K-center) require accurate prior knowledge and rely heavily on the choice of initial values, such as the number of clusters and the initialization of the cluster centers.

Disclosure of Invention

Object of the invention: the invention aims to overcome the defects of the prior art by providing a multi-step abnormal point detection method based on a neighbor propagation clustering algorithm.

Technical scheme: the multi-step abnormal point detection method based on a neighbor propagation clustering algorithm according to the invention comprises the following steps:

step 1, obtaining normal Android applications from the official Android application market Google Play, obtaining malicious Apps from a virus sample library (for example http://virusshare.com/), constructing an App sample set (containing normal samples and malicious samples), and dividing the App sample set into a training set and a test set;

step 2, extracting the data streams of the samples in the sample set with the FlowDroid tool, a data stream being, for example, {user information → log}, and thereby constructing a feature set of data stream frequencies $X = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^{m \times n}$, where m is the number of data streams counted, i.e., the original feature dimension of the data set, and n is the number of samples in the sample set;

step 3, constructing feature vectors by taking each data stream as a feature and the frequency with which each sample App invokes that data stream as the feature value; if a sample App does not invoke a given data stream, the corresponding feature value is set to 0;

step 4, reducing the dimension of the high-dimensional data of step 3 with the EstSNE dimension reduction technique;

step 5, dividing the App samples into 13 subclasses related to user sensitive information (such as account information, contact information, database operation and the like); specifically, if an App accesses the contact information stored on the device through an application program interface, the App is assigned to the "contact information" class. This is an overlapping partition, that is, one App may belong to several subclasses at the same time, because the same App may access location information, contact information and other sensitive information;

step 6, clustering part of the normal Apps in each subclass with the neighbor propagation (affinity propagation, AP) algorithm, i.e., dividing the Apps into different themes so as to mine the normal pattern of each theme, and calculating the reference points of each theme;

step 7, calculating the anomaly scores of the candidate sample set with the NPOD method, i.e., calculating the anomaly score of each candidate App in each of the 13 subclasses, taking the 13 groups of reference point sets calculated in step 6 as reference sets; if the App does not belong to a given subclass, the corresponding anomaly score is set to 0; and constructing an anomaly score vector;

step 8, training a 1SVM (one-class Support Vector Machine) classifier model by adopting a pre-divided training set (all normal samples);

step 9, performing Android malware prediction and evaluation on the pre-divided test set (comprising normal samples and malicious samples) with the 1SVM classifier trained in step 8.
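As an illustration of steps 2 and 3, the following minimal Python sketch builds the data stream frequency matrix from per-App lists of (source, sink) flows. The app names and flows are hypothetical, and parsing real FlowDroid output is not shown; the sketch only assumes that such a list is available for each App.

```python
from collections import Counter

def build_feature_matrix(app_flows):
    """Return (flow_names, app_names, X); X[i][j] counts flow i in app j."""
    # The union of all observed data flows defines the m original feature dimensions.
    flow_names = sorted({flow for flows in app_flows.values() for flow in flows})
    app_names = sorted(app_flows)
    counts = {app: Counter(app_flows[app]) for app in app_names}
    # A Counter yields 0 for a flow the app never exhibits, as required in step 3.
    X = [[counts[app][flow] for app in app_names] for flow in flow_names]
    return flow_names, app_names, X

# Hypothetical (source, sink) flows per App; not taken from the patent's data set.
app_flows = {
    "app_a": [("user information", "log"),
              ("user information", "log"),
              ("location", "network")],
    "app_b": [("contacts", "sms")],
}
flow_names, app_names, X = build_feature_matrix(app_flows)
```

Each column of `X` is the feature vector of one App, so `X` has m rows (data streams) and n columns (samples), matching the $m \times n$ layout described in step 2.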

Further, the detailed process of performing dimension reduction on the high-dimensional data in step 4 is as follows:

Let $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$ denote the high-dimensional data set. The EstSNE dimension reduction method constructs a probability distribution P over pairs of high-dimensional objects and a probability distribution Q over the corresponding points in the low-dimensional space, and then obtains the optimal low-dimensional representation of the points by minimizing the KL divergence between P and Q.

$p_{ij}$ represents the similarity of samples $x_i$ and $x_j$ in the high-dimensional space X and is computed from the conditional probabilities $p_{j|i}$ and $p_{i|j}$; $\delta_i$ represents the variance of the Gaussian distribution, and $p_{i|j}$ is calculated in the same way as $p_{j|i}$.

$q_{ij}$ represents the similarity of samples $y_i$ and $y_j$ in the low-dimensional space $Y = [y_1, y_2, \ldots, y_n] \in \mathbb{R}^{d \times n}$ (i.e., the space obtained by reducing X), where d is the dimension of the data after reduction. It is calculated as $q_{ij} = \left( (1 + \lVert y_i - y_j \rVert^2)\, K \right)^{-1}$, where K is a normalization term in whose summation p and q serve as the running indices.
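The explicit expressions for the objective, for $p_{ij}$ and for K are not reproduced in this text. Assuming that EstSNE keeps the standard t-SNE objective and similarities (an assumption; the patent's exact expressions may differ, for example in the normalization of $p_{ij}$), they would read as follows, with $\delta_i$ entering as $2\delta_i^2$ by the usual t-SNE convention:

```latex
\begin{aligned}
C &= \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \\
p_{j|i} &= \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\delta_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\delta_i^2\right)},
\qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \\
K &= \sum_{p \neq q} \left(1 + \lVert y_p - y_q \rVert^2\right)^{-1},
\qquad q_{ij} = \left( (1 + \lVert y_i - y_j \rVert^2)\, K \right)^{-1}.
\end{aligned}
```

Under this reading, K makes the $q_{ij}$ sum to 1 over all pairs, which is consistent with Q being a probability distribution.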

Further, in step 6, the reference point calculation method is as follows:

(6.1) using the negative squared Euclidean distance $s(i, j) = -\lVert x_i - x_j \rVert^2$ to calculate the similarity matrix s between every pair of the N samples in the normal sample set, and setting the reference degree (preference) p to the median of s;

(6.2) initializing the attribution degree (availability) matrix $A_{N \times N}$ and the attraction degree (responsibility) matrix $R_{N \times N}$ to 0;

(6.3) updating the attraction degree matrix and the attribution degree matrix by their respective update rules, wherein the attraction degree r(i, j) represents how well suited data point j is to serve as the class representative of data point i, and the attribution degree a(i, j) represents how appropriate it is for data point i to select data point j as its class representative;

if the number of iterations exceeds the set maximum, or the cluster centers remain unchanged over several iterations, the calculation stops and the class centers and the sample points of each class are determined; otherwise the attraction degree r(i, j) and the attribution degree a(i, j) continue to be updated iteratively;

(6.4) setting each cluster center as a reference point, wherein k is the automatically determined number of clusters and h is the total number of cluster centers.
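Steps (6.1) to (6.4) can be sketched in plain Python as below. The update rules of step (6.3) are not reproduced in this text, so the sketch assumes the standard affinity propagation message-passing rules; the damping factor, the iteration limits and the choice to take the median over off-diagonal similarities only are likewise assumptions made for illustration.

```python
from statistics import median

def affinity_propagation(points, max_iter=200, stable_iter=15, damping=0.5):
    """Return the indices of the exemplars (cluster centers) among `points`."""
    n = len(points)
    # (6.1) similarity = negative squared Euclidean distance.
    s = [[-sum((u - v) ** 2 for u, v in zip(points[i], points[j]))
          for j in range(n)] for i in range(n)]
    # Preference p = median similarity (assumption: off-diagonal entries only).
    pref = median(s[i][j] for i in range(n) for j in range(n) if i != j)
    for i in range(n):
        s[i][i] = pref
    # (6.2) attraction degree r and attribution degree a start at zero.
    r = [[0.0] * n for _ in range(n)]
    a = [[0.0] * n for _ in range(n)]
    last, unchanged = None, 0
    for _ in range(max_iter):
        # (6.3) r(i,j) <- s(i,j) - max_{j' != j} (a(i,j') + s(i,j')), damped.
        for i in range(n):
            for j in range(n):
                best = max(a[i][k] + s[i][k] for k in range(n) if k != j)
                r[i][j] = damping * r[i][j] + (1 - damping) * (s[i][j] - best)
        # a(i,j) <- min(0, r(j,j) + sum of positive r(i',j), i' not in {i,j});
        # a(j,j) <- sum of positive r(i',j), i' != j (no min applied).
        for j in range(n):
            for i in range(n):
                pos = sum(max(0.0, r[k][j]) for k in range(n) if k != i and k != j)
                new = pos if i == j else min(0.0, r[j][j] + pos)
                a[i][j] = damping * a[i][j] + (1 - damping) * new
        # A point is an exemplar when its self-attraction plus self-attribution is positive.
        exemplars = [j for j in range(n) if r[j][j] + a[j][j] > 0]
        unchanged = unchanged + 1 if exemplars == last else 0
        last = exemplars
        # Stop once the (non-empty) set of centers has been stable for a while.
        if exemplars and unchanged >= stable_iter:
            break
    return last

# Hypothetical one-dimensional data with two well-separated groups.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
exemplars = affinity_propagation(points)
# (6.4) each cluster center becomes a reference point.
reference_points = [points[j] for j in exemplars]
```

The number of clusters is not passed in; it follows from the preference value, which is the property the method relies on to avoid choosing an initial cluster count.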

Further, the method for calculating the anomaly score by using the NPOD in the step 7 comprises the following steps:

(7.1) traversing the candidate sample set $X_c$ for which anomaly scores are to be calculated;

(7.2) calculating by formula the reference set $C_{ref}(x_c)$ of each candidate sample $x_c$;

(7.3) calculating the anomaly score $\mathrm{OutScr}_g(x_c)$ of the candidate sample $x_c$ by the formula $\mathrm{OutScr}(x_c) = (\mathrm{locDist}(x_c) + \mathrm{gloDist}(x_c))/2$,

wherein $\mathrm{locDist}(x_c) = [lo/(l-2)] \times [o(x_c)/l]$, l being the number of elements in the reference set, and $\mathrm{gloDist}(x_c) = gl/(k-2)$, k being the number of reference points calculated in step 6, the elements of the reference set being those obtained in (7.2);

(7.4) traversing the 13 subclasses related to user sensitive information to construct the anomaly score vector $\mathrm{OutScrVector}(x) \leftarrow \{\mathrm{OutScr}_1(x), \ldots, \mathrm{OutScr}_{catNum}(x)\}$.
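The construction of the anomaly score vector in steps (7.3) and (7.4) can be sketched as follows. Because the exact locDist and gloDist terms are not fully recoverable from this text, the sketch substitutes simple stand-ins (mean distance to the l nearest reference points, and mean distance to all reference points); these stand-ins, the parameter `l`, the subclass names and the data are assumptions for illustration and are not the patent's NPOD formula.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def outlier_score(x, reference_points, l=2):
    """Stand-in score: average of a local and a global distance (not the exact NPOD terms)."""
    dists = sorted(euclidean(x, ref) for ref in reference_points)
    nearest = dists[:l]
    loc = sum(nearest) / len(nearest)   # local: mean distance to the l nearest reference points
    glo = sum(dists) / len(dists)       # global: mean distance to all reference points
    return (loc + glo) / 2

def score_vector(x, app_subclasses, references_by_subclass, subclass_names):
    """One score per subclass; 0 where the App does not belong to the subclass."""
    return [
        outlier_score(x, references_by_subclass[name]) if name in app_subclasses else 0.0
        for name in subclass_names
    ]

# Hypothetical example with 3 subclasses instead of 13.
subclass_names = ["account", "contact", "database"]
references_by_subclass = {
    "account": [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)],
    "contact": [(1.0, 1.0)],
    "database": [(0.0, 0.0)],
}
vec = score_vector((0.0, 0.0), {"account", "contact"}, references_by_subclass, subclass_names)
```

The resulting vector has one entry per subclass in a fixed order, so vectors of different Apps are directly comparable as inputs to the classifier of step 8.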

Beneficial effects: based on the neighbor propagation clustering algorithm, the invention constructs a multi-step anomaly detection model for Android malware using the EstSNE dimension reduction method, the anomaly score calculation method NPOD, the 1SVM (one-class support vector machine) classification algorithm and the like; compared with the prior art, the invention has the following advantages:

1) High efficiency: by extracting data stream features and computing their frequencies, malware can be characterized comprehensively and at fine granularity; a multi-step abnormal point detection technique is realized by combining the dimension reduction techniques PCA and t-SNE from machine learning with the AP clustering algorithm and the 1SVM algorithm, thereby achieving efficient detection of Android malware;

2) Easy extension: in any environment that supports the Android platform, the method can effectively detect newly emerging malware or malware variants;

3) Intelligence: detection does not depend on known malware patterns; instead, normal behavior patterns are mined and abnormal points are detected against them, so that malware is identified effectively. This overcomes the problems of traditional abnormal point detection techniques, namely the curse of dimensionality and the excessive dependence on initial value settings, and also addresses the low detection accuracy that results from insufficient known samples in the early stage after new malware or malware variants appear.

Drawings

FIG. 1 is a general framework schematic of the present invention;

FIG. 2 is a schematic diagram of data flow features extracted in the present invention;

FIG. 3 is a schematic diagram of a reference point and an abnormal point according to the present invention;

FIG. 4 is a schematic illustration of the anomaly score vector calculated by the present invention;

FIG. 5 is a schematic diagram of a 1SVM classification model in the present invention.

Detailed Description

The technical solution of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.

As shown in fig. 1, the invention comprises the following three parts: (1) a hybrid dimension reduction technique, EstSNE, which combines the advantages of PCA and t-SNE; (2) a parameter-free anomaly score calculation method, NPOD, built on the AP clustering algorithm, whose anomaly score function considers both the local distance and the global distance between a candidate sample and the reference clusters; (3) a one-class SVM classifier trained to predict malware. The specific steps are as follows:

step 1, obtaining normal Android applications from the official Android application market Google Play, obtaining malicious Apps from a virus sample library (for example http://virusshare.com/), and constructing an App sample set;

step 2, extracting the data streams of the samples in the sample set with the FlowDroid tool, a data stream being, for example, {user information → log}, and thereby constructing a feature set of data stream frequencies $X = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^{m \times n}$, where m is the number of data streams counted, i.e., the original feature dimension of the data set;

step 3, constructing feature vectors by taking each data stream as a feature and the frequency with which each sample App invokes that data stream as the feature value; if a sample App does not invoke a given data stream, the corresponding feature value is set to 0; fig. 2 shows an example of the original features of this embodiment, namely data streams (the invocation frequencies of these data streams are the input features of EstSNE);

step 4, reducing the dimension of the high-dimensional data in the step 3 by adopting EstSNE dimension reduction technology;

step 5, dividing the App samples into 13 subclasses related to user sensitive information (e.g., account information, contact information, database operation and the like); as shown in fig. 4, if an App accesses the contact information stored on the device through an application program interface, the App is assigned to the "contact information" class. This is an overlapping partition, that is, one App may belong to several subclasses at the same time, because the same App may access location information, contact information and other sensitive information;

step 6, as shown in fig. 3, clustering part of the normal Apps in each subclass with the neighbor propagation (affinity propagation, AP) algorithm, i.e., dividing the Apps into different themes so as to mine the normal pattern of each theme, and calculating the reference points of each theme;

step 7, calculating abnormal scores of the candidate sample set by adopting an NPOD method, namely calculating the abnormal scores of the candidate App in the 13 subclasses according to the 13 groups of reference point sets calculated in the step 6, marking the abnormal scores as 0 if the App is not divided into the corresponding subclasses, and constructing an abnormal score vector;

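step 8, training a 1SVM (one-class support vector machine) classifier model with the pre-divided training set (all normal samples);

step 9, performing Android malware prediction and evaluation on the pre-divided test set (comprising normal samples and malicious samples) with the 1SVM classifier trained in step 8.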
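The overlapping partition of step 5 can be illustrated with a short Python sketch in which each App is mapped to the set of subclasses whose APIs it calls. The API identifiers and the mapping table below are hypothetical placeholders; the actual assignment of APIs to the 13 subclasses is not given in this text.

```python
def assign_subclasses(app_apis, api_to_subclass):
    """Map each App to the set of sensitive-information subclasses it touches."""
    return {
        app: {api_to_subclass[api] for api in apis if api in api_to_subclass}
        for app, apis in app_apis.items()
    }

# Hypothetical mapping from API identifiers to subclass names.
api_to_subclass = {
    "read_contacts": "contact information",
    "get_location": "location information",
    "get_accounts": "account information",
}
# Hypothetical per-App API usage.
app_apis = {
    "app_a": ["read_contacts", "get_location", "draw_ui"],
    "app_b": ["get_accounts"],
    "app_c": ["draw_ui"],
}
membership = assign_subclasses(app_apis, api_to_subclass)
```

Here `app_a` falls into two subclasses at once, while `app_c` falls into none, so all of its subclass anomaly scores in step 7 would be 0.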

As shown in fig. 5, to evaluate the effectiveness of the method in detecting Android malware, this embodiment adopts the following evaluation criteria: Precision, Accuracy and F-measure, which are defined in terms of TP, TN, FP and FN,

wherein TP (True Positive) is a positive sample correctly classified by the classifier; TN (True Negative) is a negative sample correctly classified by the classifier; FP (False Positive) is a negative sample incorrectly labeled as a positive sample; and FN (False Negative) is a positive sample incorrectly labeled as a negative sample.
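The defining formulas are not reproduced in this text. Assuming the standard definitions of these metrics (with recall as the intermediate quantity needed for the F-measure), they can be computed as in the following sketch; the counts used are hypothetical and are not the patent's experimental results.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Standard Precision, Recall, Accuracy and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # F-measure is the harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f_measure": f_measure}

# Hypothetical counts for illustration only.
metrics = evaluation_metrics(tp=90, tn=95, fp=5, fn=10)
```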

Under the same experimental environment (for example, the 1SVM is configured with c = 256, g = 0.0658 and nu = 0.06 and uses a polynomial kernel function), the comparison of experimental results in Table 1 shows that the invention outperforms the conventional ORCA abnormal point detection method, which is based on the K-nearest neighbor (KNN) algorithm: the Accuracy of the invention reaches 95.74%, whereas the Accuracy of the ORCA method is 90.09%, i.e., the invention improves Accuracy by 5.65 percentage points under the same experimental environment.

TABLE 1 Experimental comparison of the method of the present invention and the ORCA anomaly detection method in the aspect of Android malware detection

In this embodiment, the effectiveness of the method is verified by ten-fold cross-validation on a real data set collected from VirusShare and Google Play, and the experimental results show that the method achieves an accuracy of 95.74%. Moreover, the method is compared with the traditional ORCA anomaly detection model under the same experimental conditions, and the comparison shows that the multi-step abnormal point detection method of the invention performs significantly better than the ORCA method.

In conclusion, the method simultaneously addresses two problems, the curse of dimensionality and the excessive dependence on initial parameter settings, and is applied to Android malware detection for the first time. The data stream invocation frequencies of each application are extracted as the original features and reduced in dimension with EstSNE; the samples are then divided into subclasses, the NPOD method is used to calculate the anomaly score of each sample in every subclass, and finally a 1SVM classifier is trained to predict malware.

Claims

1. A multi-step abnormal point detection method based on a neighbor propagation clustering algorithm, characterized in that the method comprises the following steps:

step 1, acquiring a normal Android application program from Google Play of an Android official website, acquiring a malicious application program from a virus data sample library, constructing an application program sample set, wherein the application program sample set contains normal samples and malicious samples, and dividing the application program sample set into a training set and a testing set;

step 2, extracting the data streams of the samples in the sample set with the FlowDroid tool, thereby constructing a high-dimensional data set of data stream frequencies $X = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^{m \times n}$, where m is the number of data streams counted, i.e., the original feature dimension of the data set, and n is the number of samples in the sample set;

step 3, constructing feature vectors by taking each data stream as a feature and the frequency with which each sample application program invokes that data stream as the feature value, the feature value being set to 0 if the sample application program does not invoke the corresponding data stream;

step 4, adopting EstSNE dimension reduction technology to reduce the dimension of the high-dimensional data set constructed in the step 2;

step 5, dividing application program samples into 13 subclasses related to user sensitive information; the subclasses are divided according to the SUSI standard;

step 6, clustering part of the normal application programs in each subclass with the neighbor propagation (affinity propagation, AP) algorithm, i.e., dividing the application programs into different themes so as to mine the normal pattern of each theme, and calculating the reference points of each theme;

step 7, calculating the abnormal score of the candidate sample set by adopting an NPOD method, namely calculating the abnormal score of the candidate application program in the 13 subclasses according to the 13 groups of reference point sets calculated in the step 6, marking the abnormal score as 0 if the App is not divided into the corresponding subclasses, and finally constructing an abnormal score vector;

step 8, training a One-Class SVM classifier model with the pre-divided training set;

step 9, predicting, on the pre-divided test set, whether each Android application program is malicious software with the One-Class SVM classifier trained in step 8.

2. The multi-step abnormal point detection method based on the neighbor propagation clustering algorithm according to claim 1, wherein: the detailed process of performing dimensionality reduction on the high-dimensional data set in the step 4 is as follows:

constructing, by the EstSNE dimension reduction method, a probability distribution P over pairs of high-dimensional objects and a probability distribution Q over the corresponding points in the low-dimensional space, and then obtaining the optimal low-dimensional representation of the points by minimizing the KL divergence between P and Q, wherein:

$p_{ij}$ represents the similarity of samples $x_i$ and $x_j$ in the high-dimensional space X, $\delta_i$ represents the variance of the Gaussian distribution, and $x_i$ and $x_j$ are samples in the high-dimensional space X;

$q_{ij}$ represents the similarity of samples $y_i$ and $y_j$ in the low-dimensional space $Y = [y_1, y_2, \ldots, y_n] \in \mathbb{R}^{d \times n}$, d is the dimension of the data after reduction, $q_{ij} = \left( (1 + \lVert y_i - y_j \rVert^2)\, K \right)^{-1}$, and $y_i$ and $y_j$ are samples in the low-dimensional space.

3. The multi-step abnormal point detection method based on the neighbor propagation clustering algorithm according to claim 1, wherein: the reference point calculation method in the step 6 comprises the following specific steps:

(6.2) initializing the attribution degree (availability) matrix $A_{N \times N}$ and the attraction degree (responsibility) matrix $R_{N \times N}$ to 0;

(6.3) updating the attraction degree matrix and the attribution degree matrix by their respective update rules,

wherein the attraction degree r(i, j) represents how well suited data point j is to serve as the class representative of data point i, and the attribution degree a(i, j) represents how appropriate it is for data point i to select data point j as its class representative;

(6.4) setting each cluster center as a reference point.

4. The multi-step abnormal point detection method based on the neighbor propagation clustering algorithm according to claim 3, wherein: the method for calculating the abnormal score by adopting the NPOD in the step 7 comprises the following steps:

(7.1) traversing the candidate sample set $X_c$ for which anomaly scores are to be calculated;

(7.2) calculating by formula the reference set $C_{ref}(x_c)$, whose elements are reference points obtained in (6.4);

(7.3) calculating the anomaly score $\mathrm{OutScr}_g(x_c)$ of the candidate sample $x_c$ by the formula $\mathrm{OutScr}(x_c) = (\mathrm{locDist}(x_c) + \mathrm{gloDist}(x_c))/2$,

wherein $\mathrm{locDist}(x_c) = [lo/(l-2)] \times [o(x_c)/l]$, l being the number of elements in the reference set, and $\mathrm{gloDist}(x_c) = gl/(k-2)$, k being the number of reference points calculated in (6.4), the elements of the reference set being those obtained in (7.2);

(7.4) traversing the 13 subclasses related to user sensitive information to construct the anomaly score vector $\mathrm{OutScrVector}(x) \leftarrow \{\mathrm{OutScr}_1(x), \ldots, \mathrm{OutScr}_{catNum}(x)\}$.