CN110162975B - Multi-step abnormal point detection method based on neighbor propagation clustering algorithm - Google Patents
- Publication number
- CN110162975B (application CN201910452071.8A)
- Authority
- CN
- China
- Prior art keywords
- sample
- data
- application program
- abnormal
- point detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/03—Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
- G06F2221/033—Test or assess software
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Virology (AREA)
- Debugging And Monitoring (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a multi-step abnormal point detection method based on a neighbor propagation (affinity propagation) clustering algorithm. The method effectively addresses the curse of dimensionality in abnormal point detection, avoiding the interference that data noise from redundant or excessive irrelevant features causes to abnormal point detection. It also overcomes the excessive dependence of traditional clustering-based or distance-based abnormal point detection techniques on the choice of initial values. The effectiveness of the method is verified by cross-validation on real data collected from VirusShare and Google Play, and the method has broad application prospects in the field of network security.
Description
Technical Field
The invention belongs to the field of network security technology, and particularly relates to a multi-step abnormal point detection method based on a neighbor propagation clustering algorithm.
Background
The diversified propagation channels and complex application environments brought by the rapid development of the Internet have made it much easier for malware to spread and attack, and malware is now more aggressive and harmful than traditional computer viruses. Because of characteristics of the Android ecosystem, such as lax review by application stores and the ability of users to publish and download Android applications freely through third-party application markets, Android has become the main target of malware; according to recent research data, up to 97% of mobile malware targets Android devices. Malware is the generic term for software that is installed without explicit notice or permission and that has malicious intent or performs malicious functions that infringe the legitimate interests of the user. Malware often exhibits distinctive behaviors, such as frequently accessing files, using the network, sending short messages, and reading the user's address book. A research and analysis report on network privacy security and network fraud (second half of 2018) lists a series of network security problems, such as fraud by counterfeit bank short messages, that arise when Android applications steal private information such as user address books, geographic locations, and entertainment and payment data. Therefore, malware analysis and detection on the Android platform plays a crucial role in network security research.
However, conventional malware detection methods tend to be "retrospective": they rely on a sufficient number of known malware samples and mine the corresponding malware patterns only after the malware has spread widely. To address this practical situation in Android malware detection, the invention introduces anomaly detection. Anomaly Detection (Outlier Detection) aims to detect data that does not conform to normal behavior. It is widely applied in databases, data mining, machine learning, statistics, and related fields, including fraud detection for credit cards and insurance, network intrusion detection and fault diagnosis, identification of new features in satellite image analysis, health and medical monitoring, detection of public safety emergencies, and identification of novel molecular structures in drug research. Distance-based and cluster-based methods are two typical kinds of anomaly detection, but in practice they face two major challenges: (1) the accuracy of abnormal point detection is low because of data noise caused by redundant or excessive irrelevant features in high-dimensional data; (2) the effectiveness of such solutions depends heavily on whether initial values are set reasonably: traditional clustering-based or distance-based outlier detection techniques (e.g., KNN, K-means, K-center) require accurate prior knowledge and rely heavily on the choice of initial values, such as the number of clusters and the initialization of the cluster centers.
Disclosure of Invention
Object of the invention: to overcome the defects of the prior art, the invention provides a multi-step abnormal point detection method based on a neighbor propagation clustering algorithm.
Technical scheme: the multi-step abnormal point detection method based on a neighbor propagation clustering algorithm according to the invention comprises the following steps:
Step 9: using the pre-divided test set (containing normal samples and malicious samples), perform Android malware prediction and evaluation with the 1SVM classifier trained in step 8.
Further, the detailed process of performing dimension reduction on the high-dimensional data in step 4 is as follows:
Let $X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{m \times n}$ denote the high-dimensional data set. The EstSNE dimension reduction method constructs a probability distribution $P$ over pairs of high-dimensional objects and a probability distribution $Q$ over the corresponding points in the low-dimensional space, and then obtains the optimal low-dimensional representation of the points by minimizing the KL divergence between $P$ and $Q$, where:
$p_{ij}$ represents the similarity between samples $x_i$ and $x_j$ in the high-dimensional space $X$; it is computed from the conditional probabilities $p_{j|i}$ and $p_{i|j}$, where $\delta_i$ denotes the variance of the Gaussian distribution and $p_{i|j}$ is computed in the same way as $p_{j|i}$;
$q_{ij}$ represents the similarity between samples $y_i$ and $y_j$ in the low-dimensional space (i.e., the space obtained by reducing $X$), $Y = [y_1, y_2, \dots, y_n] \in \mathbb{R}^{d \times n}$, where $d$ is the dimension of the data after dimensionality reduction; it is computed as $q_{ij} = \left((1 + \|y_i - y_j\|^2)\,K\right)^{-1}$, where $K$ is a normalization term in which $p$ and $q$ serve as summation indices.
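As a point of reference, standard t-SNE defines these quantities as shown below. That EstSNE uses exactly these forms is an assumption; the expression for $K$ in particular is inferred from the stated form of $q_{ij}$ and from the remark that $p$ and $q$ are summation indices. Here $\delta_i$ is written as the variance, matching the text (standard t-SNE writes the same denominator as $2\sigma_i^2$ with $\sigma_i$ the standard deviation).

```latex
\begin{aligned}
p_{j|i} &= \frac{\exp\left(-\|x_i - x_j\|^2 / 2\delta_i\right)}
                {\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\delta_i\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \\
K &= \sum_{p \neq q} \left(1 + \|y_p - y_q\|^2\right)^{-1},
\qquad
q_{ij} = \left(\left(1 + \|y_i - y_j\|^2\right) K\right)^{-1}, \\
C &= \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}.
\end{aligned}
```

With $K$ defined this way the $q_{ij}$ sum to 1, so $Q$ is a proper probability distribution and the KL divergence $C$ is well defined.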
Further, in step 6, the reference point calculation method is as follows:
(6.1) using the negative squared Euclidean distance $s(i,j) = -\|x_i - x_j\|^2$, compute the $N \times N$ similarity matrix between every pair of samples in the normal sample set, and set the reference degree (preference) $p$ to the median of $s$;
(6.2) initialize the attribution degree (availability) matrix $A_{N \times N}$ and the attraction degree (responsibility) matrix $R_{N \times N}$ to 0;
(6.3) update the attraction degree matrix and the attribution degree matrix by their respective update rules, where the attraction degree $r(i,j)$ represents how well suited data point $j$ is to serve as the class representative (exemplar) of data point $i$, and the attribution degree $a(i,j)$ represents how appropriate it is for data point $i$ to select data point $j$ as its class representative;
if the number of iterations exceeds the set maximum, or the cluster centers remain unchanged over several iterations, stop the calculation and determine the class centers and the sample points of each class; otherwise, continue to iteratively update $r(i,j)$ and $a(i,j)$;
(6.4) set each cluster center as a reference point, where $k$ is the automatically determined number of clusters and $h$ is the total number of cluster centers.
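Steps (6.1) to (6.4) can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the two update rules are the standard affinity propagation message updates, which the rules in (6.3) are assumed to correspond to, and the `damping` factor, the function name, and the default iteration limits are assumptions added for numerical stability and illustration.

```python
from statistics import median


def affinity_propagation(points, max_iter=200, stable_iters=15, damping=0.5):
    """Cluster `points` (sequences of floats, at least two of them).

    Returns (exemplars, labels): indices of the cluster centers and, for
    every point, the index of the center it is assigned to.
    """
    n = len(points)
    # (6.1) similarity = negative squared Euclidean distance
    s = [[-sum((u - v) ** 2 for u, v in zip(points[i], points[j]))
          for j in range(n)] for i in range(n)]
    # preference p = median of the off-diagonal similarities, put on the diagonal
    pref = median(s[i][j] for i in range(n) for j in range(n) if i != j)
    for i in range(n):
        s[i][i] = pref
    # (6.2) both message matrices start at zero
    r = [[0.0] * n for _ in range(n)]
    a = [[0.0] * n for _ in range(n)]
    last, stable = None, 0
    for _ in range(max_iter):
        # (6.3) r(i,k) <- s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        for i in range(n):
            vals = [a[i][k] + s[i][k] for k in range(n)]
            for k in range(n):
                rival = max(v for kk, v in enumerate(vals) if kk != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - rival)
        # (6.3) a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        #       a(k,k) <- sum_{i' != k} max(0, r(i',k))
        for k in range(n):
            pos = [max(0.0, r[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                new = total if i == k else min(0.0, r[k][k] + total - pos[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
        exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
        # stop once the set of centers has been unchanged for several iterations
        if exemplars and exemplars == last:
            stable += 1
            if stable >= stable_iters:
                break
        else:
            last, stable = exemplars, 0
    if not exemplars:  # fallback so that a result is always returned
        exemplars = [max(range(n), key=lambda k: r[k][k] + a[k][k])]
    # (6.4) every non-center point joins its most similar center
    labels = [i if i in exemplars else max(exemplars, key=lambda k: s[i][k])
              for i in range(n)]
    return exemplars, labels
```

The returned `exemplars` play the role of the reference points in (6.4); their number is not supplied by the caller, which is the property the method relies on to avoid choosing the number of clusters in advance.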
Further, the method for calculating the anomaly score with NPOD in step 7 is as follows:
(7.1) traverse the candidate sample set $X_c$ for which anomaly scores need to be calculated;
(7.3) compute the anomaly score $\mathrm{OutScr}_g(x_c)$ of candidate sample $x_c$ by the formula $\mathrm{OutScr}(x_c) = (\mathrm{locDist}(x_c) + \mathrm{gloDist}(x_c))/2$,
where $\mathrm{locDist}(x_c) = [lo/(l-2)] \times [o(x_c)/l]$ and $l$ is the number of elements in the reference set;
(7.4) traverse the 13 subclasses related to user-sensitive information to construct the anomaly score vector $\mathrm{OutScrVector}(x) \leftarrow \{\mathrm{OutScr}_1(x), \dots, \mathrm{OutScr}_{catNum}(x)\}$.
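The assembly of the score vector in (7.3) and (7.4) can be sketched as below. This is only an illustration under stated assumptions: the score is read as the average of the local and global distance terms, the distance terms themselves are taken as given inputs because their full definitions are not reproduced here, and the function names and the dictionary layout are hypothetical. The rule that a subclass the application does not belong to contributes 0 follows step 7 of the claims.

```python
CAT_NUM = 13  # number of subclasses related to user-sensitive information


def out_scr(loc_dist, glo_dist):
    """Anomaly score for one subclass, read as the mean of the two distances."""
    return (loc_dist + glo_dist) / 2


def out_scr_vector(dist_by_subclass, cat_num=CAT_NUM):
    """Build the anomaly score vector of one application.

    dist_by_subclass maps a subclass index (1..cat_num) to a pair
    (locDist, gloDist); subclasses the app does not belong to are
    simply absent and receive a score of 0.
    """
    return [out_scr(*dist_by_subclass[g]) if g in dist_by_subclass else 0.0
            for g in range(1, cat_num + 1)]


# An app assigned to subclasses 1 and 13 only (distance values are made up).
vector = out_scr_vector({1: (0.2, 0.4), 13: (1.0, 0.0)})
```

The resulting fixed-length vector is what the 1SVM classifier consumes, so every application yields 13 entries regardless of how many subclasses it falls into.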
Beneficial effects: the invention constructs a multi-step anomaly detection model for Android malware using the EstSNE dimension reduction method, the NPOD anomaly score calculation method based on the neighbor propagation clustering algorithm, and the one-class SVM (1SVM) classification algorithm. Compared with the prior art, the invention has the following advantages:
1) High efficiency: by extracting data-flow features and computing their frequencies, malware can be characterized comprehensively and at fine granularity; a multi-step abnormal point detection technique is realized by combining the dimensionality reduction techniques PCA and t-SNE, the AP clustering algorithm, and the 1SVM algorithm from machine learning, thereby achieving efficient detection of Android malware;
2) Easy extension: in any environment that supports the Android platform, the method can effectively detect newly emerging malware or malware variants;
3) Intelligence: because detection does not depend on known malware patterns but instead mines normal behavior patterns to detect abnormal points, malware is identified effectively. This addresses the curse of dimensionality and the excessive dependence on initial value settings that affect traditional abnormal point detection techniques, and it also addresses the low detection accuracy that results from insufficient known samples when new malware or malware variants first appear.
Drawings
FIG. 1 is a general framework schematic of the present invention;
FIG. 2 is a schematic diagram of data flow features extracted in the present invention;
FIG. 3 is a schematic diagram of a reference point and an abnormal point according to the present invention;
FIG. 4 is a schematic illustration of the anomaly score vector calculated by the present invention;
FIG. 5 is a schematic diagram of a 1SVM classification model in the present invention.
Detailed Description
The technical solution of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.
As shown in FIG. 1, the invention comprises the following three parts: (1) a hybrid dimensionality reduction technique, EstSNE, which combines the advantages of PCA and t-SNE; (2) a parameter-free anomaly score calculation method, NPOD, built on the AP clustering algorithm, whose score function considers both the local distance and the global distance between a candidate sample and the reference clusters; (3) a one-class SVM classifier trained to predict malware. The specific steps are as follows:
Step 9: using the pre-divided test set (containing normal samples and malicious samples), predict and evaluate Android malware with the 1SVM classifier trained in step 8.
As shown in FIG. 5, in order to evaluate the effectiveness of the method in detecting Android malware, this embodiment introduces the following evaluation criteria: Precision, Accuracy, and F-measure, which are defined in terms of the following quantities:
TP (True Positive): a positive sample correctly classified by the classifier; TN (True Negative): a negative sample correctly classified by the classifier; FP (False Positive): a negative sample incorrectly labeled as positive; FN (False Negative): a positive sample incorrectly labeled as negative.
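For reference, the usual textbook definitions of these criteria can be written as a short function. It is assumed here that the embodiment uses these standard forms, and that F-measure is the harmonic mean of precision and recall; the counts in the example call are invented for illustration and are not taken from Table 1.

```python
def evaluate(tp, tn, fp, fn):
    """Standard classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Illustrative counts only.
scores = evaluate(tp=90, tn=85, fp=15, fn=10)
```

Accuracy counts both classes, whereas precision and F-measure focus on the positive class, which is why they are reported together when the two classes are imbalanced.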
Under the same experimental environment (for example, with c = 256, g = 0.0658, and nu = 0.06 set in the 1SVM and a polynomial kernel function adopted), the comparison of experimental results in Table 1 shows that the invention outperforms the traditional ORCA abnormal point detection method, which is based on the K-nearest neighbor (KNN) algorithm. The Accuracy of the invention reaches 95.74%, whereas that of the ORCA method is 90.09%; that is, the invention improves Accuracy by 5.65 percentage points under the same experimental environment.
TABLE 1 Experimental comparison of the method of the present invention and the ORCA anomaly detection method in the aspect of Android malware detection
In this embodiment, the effectiveness of the method is verified by ten-fold cross-validation on a real data set collected from VirusShare and Google Play, and the experimental results show that the method achieves an accuracy of 95.74%. In addition, the method is compared with the traditional ORCA anomaly detection model under the same experimental conditions, and the comparison shows that the multi-step abnormal point detection method of the invention performs clearly better than the ORCA method.
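The ten-fold protocol mentioned above can be sketched as an index splitter. How the embodiment actually shuffles or stratifies the samples is not stated, so the seeded shuffle and the function name below are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # assumed: a fixed-seed shuffle
    folds = [idx[i::k] for i in range(k)]     # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


# Each sample is used for testing exactly once across the ten folds.
for train, test in k_fold_indices(25):
    pass  # train a model on `train`, evaluate it on `test`
```

The reported accuracy would then be the average over the ten test folds.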
In conclusion, the method simultaneously addresses the curse of dimensionality and the excessive dependence on initial parameter settings, and it is applied to Android malware detection for the first time. The data-flow calling frequency of each application is extracted as the original feature, EstSNE is used for dimensionality reduction, the samples are then divided into subclasses, the NPOD method is used to compute the anomaly score of each sample in every subclass, and finally a 1SVM classifier is trained to predict malware.
Claims (4)
1. A multi-step abnormal point detection method based on a neighbor propagation clustering algorithm, characterized in that the method comprises the following steps:
step 1, acquiring normal Android applications from Google Play, the official Android application market, and malicious applications from a virus sample library, and constructing an application sample set that contains normal samples and malicious samples; dividing the application sample set into a training set and a test set;
step 2, extracting the data flows in the sample set with the FlowDroid tool, thereby constructing a high-dimensional data set of data-flow frequencies $X = (x_1, x_2, \dots, x_n) \in \mathbb{R}^{m \times n}$, where $m$ is the number of data flows counted, i.e., the original feature dimension of the data set, and $n$ is the number of samples in the sample set;
step 3, constructing feature vectors with the data flows as features: the frequency with which each sample application invokes the corresponding data-flow feature is taken as the feature value, and the value is set to 0 if the sample application does not invoke that data flow;
step 4, reducing the dimensionality of the high-dimensional data set constructed in step 2 with the EstSNE dimension reduction technique;
step 5, dividing the application samples into 13 subclasses related to user-sensitive information, the subclasses being defined according to the SUSI standard;
step 6, clustering part of the normal applications in each subclass with the neighbor propagation (affinity propagation, AP) algorithm, i.e., dividing the applications into different topics so as to mine the normal pattern of each topic, and calculating the reference points of each topic;
step 7, calculating the anomaly scores of the candidate sample set with the NPOD method, i.e., calculating the anomaly score of each candidate application in the 13 subclasses from the 13 sets of reference points calculated in step 6, setting the anomaly score to 0 if the application does not belong to the corresponding subclass, and finally constructing an anomaly score vector;
step 8, training a One-Class SVM classifier model with the pre-divided training set;
step 9, predicting, on the pre-divided test set, whether each Android application is malware with the One-Class SVM classifier trained in step 8.
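The feature construction of steps 2 and 3 can be illustrated with a small sketch. The flow identifiers are hypothetical strings, and no assumption is made about the actual FlowDroid output format; the function only shows how per-application frequencies, with 0 for flows an application never invokes, form the data set. Rows here are applications, so the result is the transpose of the $m \times n$ matrix $X$ in step 2.

```python
from collections import Counter


def build_feature_matrix(flows_per_app):
    """flows_per_app: one list of data-flow identifiers per application.

    Returns (vocab, rows): the sorted list of distinct flows and, for each
    application, the frequency of every flow (0 when the flow is absent).
    """
    vocab = sorted({f for flows in flows_per_app for f in flows})
    rows = []
    for flows in flows_per_app:
        counts = Counter(flows)
        rows.append([counts.get(f, 0) for f in vocab])
    return vocab, rows


# Three hypothetical apps; the flow names are placeholders.
vocab, rows = build_feature_matrix([
    ["source_A->sink_B", "source_A->sink_B", "source_C->sink_D"],
    ["source_C->sink_D"],
    [],
])
```

Because most applications use only a few of the observed flows, the resulting matrix is sparse and high-dimensional, which is the situation the dimensionality reduction in step 4 is meant to handle.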
2. The multi-step abnormal point detection method based on the neighbor propagation clustering algorithm according to claim 1, wherein: the detailed process of performing dimensionality reduction on the high-dimensional data set in the step 4 is as follows:
constructing, by the EstSNE dimension reduction method, a probability distribution $P$ over pairs of high-dimensional objects and a probability distribution $Q$ over the corresponding points in the low-dimensional space, and then obtaining the optimal low-dimensional representation of the points by minimizing the KL divergence between $P$ and $Q$, where:
$p_{ij}$ represents the similarity between samples $x_i$ and $x_j$ in the high-dimensional space $X$, $\delta_i$ denotes the variance of the Gaussian distribution, and $x_i$ and $x_j$ are samples in the high-dimensional space $X$;
3. The multi-step abnormal point detection method based on the neighbor propagation clustering algorithm according to claim 1, wherein: the reference point calculation method in the step 6 comprises the following specific steps:
(6.1) using the negative squared Euclidean distance $s(i,j) = -\|x_i - x_j\|^2$, computing the $N \times N$ similarity matrix between every pair of samples in the normal sample set, and setting the reference degree (preference) $p$ to the median of $s$;
(6.2) initializing the attribution degree (availability) matrix $A_{N \times N}$ and the attraction degree (responsibility) matrix $R_{N \times N}$ to 0;
(6.3) updating the attraction degree matrix and the attribution degree matrix by their respective update rules,
where the attraction degree $r(i,j)$ represents how well suited data point $j$ is to serve as the class representative (exemplar) of data point $i$, and the attribution degree $a(i,j)$ represents how appropriate it is for data point $i$ to select data point $j$ as its class representative;
if the number of iterations exceeds the set maximum, or the cluster centers remain unchanged over several iterations, stopping the calculation and determining the class centers and the sample points of each class; otherwise, continuing to iteratively update $r(i,j)$ and $a(i,j)$;
4. The multi-step abnormal point detection method based on the neighbor propagation clustering algorithm according to claim 3, wherein: the method for calculating the abnormal score by adopting the NPOD in the step 7 comprises the following steps:
(7.1) traversing the candidate sample set $X_c$ for which anomaly scores need to be calculated;
(7.2) obtaining the reference set $C_{ref}(x_c)$ by formula, in which the reference points are those obtained in (6.4);
(7.3) computing the anomaly score $\mathrm{OutScr}_g(x_c)$ of candidate sample $x_c$ by the formula $\mathrm{OutScr}(x_c) = (\mathrm{locDist}(x_c) + \mathrm{gloDist}(x_c))/2$,
where $\mathrm{locDist}(x_c) = [lo/(l-2)] \times [o(x_c)/l]$, $l$ is the number of elements in the reference set,
$\mathrm{gloDist}(x_c) = gl/(k-2)$, and $k$ is the number of reference points calculated in (6.4);
(7.4) traversing the 13 subclasses related to user-sensitive information to construct the anomaly score vector $\mathrm{OutScrVector}(x) \leftarrow \{\mathrm{OutScr}_1(x), \dots, \mathrm{OutScr}_{catNum}(x)\}$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910452071.8A CN110162975B (en) | 2019-05-28 | 2019-05-28 | Multi-step abnormal point detection method based on neighbor propagation clustering algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910452071.8A CN110162975B (en) | 2019-05-28 | 2019-05-28 | Multi-step abnormal point detection method based on neighbor propagation clustering algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110162975A (en) | 2019-08-23 |
CN110162975B (en) | 2022-10-25 |
Family
ID=67629654
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910452071.8A Active CN110162975B (en) | 2019-05-28 | 2019-05-28 | Multi-step abnormal point detection method based on neighbor propagation clustering algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110162975B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110991508A (en) * | 2019-11-25 | 2020-04-10 | 珠海复旦创新研究院 | Anomaly detector recommendation method, device and equipment |
CN112839327B (en) * | 2021-01-21 | 2022-08-16 | 河北工程大学 | Personnel validity detection method and device based on WiFi signals |
CN113288122B (en) * | 2021-05-21 | 2023-12-19 | 河南理工大学 | Wearable sitting posture monitoring device and sitting posture monitoring method |
CN113569920B (en) * | 2021-07-06 | 2024-05-31 | 上海顿飞信息科技有限公司 | Second neighbor anomaly detection method based on automatic coding |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106599686B (en) * | 2016-10-12 | 2019-06-21 | 四川大学 | A kind of Malware clustering method based on TLSH character representation |
CN106845240A (en) * | 2017-03-10 | 2017-06-13 | 西京学院 | A kind of Android malware static detection method based on random forest |
CN106919841A (en) * | 2017-03-10 | 2017-07-04 | 西京学院 | A kind of efficient Android malware detection model DroidDet based on rotation forest |
Also Published As
Publication number | Publication date |
---|---|
CN110162975A (en) | 2019-08-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110162975B (en) | Multi-step abnormal point detection method based on neighbor propagation clustering algorithm | |
US11188649B2 (en) | System and method for classification of objects of a computer system | |
Biggio et al. | Poisoning behavioral malware clustering | |
Frank et al. | Mining permission request patterns from android and facebook applications | |
Zarni Aung | Permission-based android malware detection | |
Jerome et al. | Using opcode-sequences to detect malicious Android applications | |
US20070136455A1 (en) | Application behavioral classification | |
CN105229661B (en) | Method, computing device and the storage medium for determining Malware are marked based on signal | |
US11580222B2 (en) | Automated malware analysis that automatically clusters sandbox reports of similar malware samples | |
JP2020115320A (en) | System and method for detecting malicious file | |
Shezan et al. | Read between the lines: An empirical measurement of sensitive applications of voice personal assistant systems | |
WO2019061664A1 (en) | Electronic device, user's internet surfing data-based product recommendation method, and storage medium | |
Imran et al. | Using hidden markov model for dynamic malware analysis: First impressions | |
CN107402957B (en) | Method and system for constructing user behavior pattern library and detecting user behavior abnormity | |
Sanz et al. | Anomaly detection using string analysis for android malware detection | |
Chen et al. | More semantics more robust: Improving android malware classifiers | |
Allix et al. | Machine learning-based malware detection for Android applications: History matters! | |
Wolfe et al. | Comprehensive behavior profiling for proactive Android malware detection | |
Li et al. | Novel Android Malware Detection Method Based on Multi-dimensional Hybrid Features Extraction and Analysis. | |
CN106998336B (en) | Method and device for detecting user in channel | |
CN105631336A (en) | System and method for detecting malicious files on mobile device, and computer program product | |
Morcos et al. | A surrogate-based technique for Android malware detectors' explainability | |
Ndagi et al. | Machine learning classification algorithms for adware in android devices: a comparative evaluation and analysis | |
Lajevardi et al. | Markhor: malware detection using fuzzy similarity of system call dependency sequences | |
Mirzaei et al. | Scrutinizer: Detecting code reuse in malware via decompilation and machine learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||