CN116611063A - Graph convolution neural network malicious software detection method based on multi-feature fusion - Google Patents

Graph convolution neural network malicious software detection method based on multi-feature fusion

Info

Publication number
CN116611063A
Authority
CN
China
Prior art keywords
graph
neural network
file
apk
steps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310541941.5A
Other languages
Chinese (zh)
Inventor
姚烨
朱怡安
刘瑞亮
李联
段俊花
钟冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202310541941.5A priority Critical patent/CN116611063A/en
Publication of CN116611063A publication Critical patent/CN116611063A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Probability & Statistics with Applications (AREA)
  • Virology (AREA)
  • Storage Device Security (AREA)

Abstract

The invention relates to a malware detection method using a graph convolutional neural network based on multi-feature fusion, and belongs to the technical field of software security. The method takes the function call graph of an APK file as its basis, combines it with sensitive permission features and opcode features, fuses these features, and feeds them into a model for training and detection, thereby achieving malware detection. During APK preprocessing, the influence of APK size on the training result is taken into account, and APK size equalization is performed to reduce its effect on the detection result. During function call graph processing, the graph is optimized to remove redundant ordinary nodes, which greatly reduces data complexity and speeds up data processing. Compared with traditional malware detection techniques, the method improves the training efficiency and accuracy of the subsequent detection model to a certain extent, and is of significance for detection on very large malware datasets.

Description

Graph convolution neural network malicious software detection method based on multi-feature fusion
Technical Field
The invention belongs to the technical field of software security and relates to a malware detection method for the Android system, in particular to a novel static malware detection method based on a graph convolutional network (GCN) combined with function call graphs (FCG).
Background
With the continuous development of Internet technology, mobile devices, most of which run the Android system, are becoming increasingly popular. At the same time, security problems are growing: malicious behaviors such as placing calls without authorization and leaking private information cause great losses to users' data and property. The continuous increase in malware types and the constant change of attack methods bring huge economic losses to the mobile terminal market, in which Android is the dominant system. Research on malware detection and security protection for the Android system therefore has good application value. However, conventional static malware detection techniques have some problems, mainly that the software feature information is too limited: it is restricted to software permission features and ignores the relationships between software features.
Disclosure of Invention
The technical problems to be solved by the invention are as follows:
At present, more and more Android software illegally steals users' private information. When large numbers of malicious applications are installed, they pose a serious security threat to users; some malware obtains user information by requesting many permissions unrelated to the software's functions, in order to collect private user data. Traditional malware detection methods have low accuracy because they extract software feature information incompletely. To address the low detection accuracy caused by incomplete feature extraction, the invention provides a malware detection method using a graph convolutional neural network based on multi-feature fusion, which detects whether software is malicious on the basis of the function call graph.
In order to solve the technical problems, the invention adopts the following technical scheme:
A malware detection method using a graph convolutional neural network based on multi-feature fusion, characterized by comprising the following steps:
step1: preprocessing a sample, performing APK size equalization, and extracting and optimizing a function call graph;
step2: extracting auxiliary features, and merging the features of the processed data; the auxiliary features comprise sensitive permission features and opcode features;
step3: after feature extraction and merging are completed, training a malicious software detection model, and adjusting related parameters to enable the malicious software detection model to achieve an optimal training effect; the malicious software detection model is a graph convolution neural network;
step4: detecting malicious codes by using a trained malicious software detection model;
step5: classification is performed using softmax to distinguish benign software from malware.
The invention further adopts the technical scheme that: the APK size equalization in step1 is specifically: if the APK size distribution of the initial dataset is unbalanced, additional APKs are added to balance it, so that the dataset is distributed within a fixed interval and large imbalances are avoided.
The invention further adopts the technical scheme that: the extracting and optimizing function call graph in the step1 comprises the following steps:
step1: decompiling the APK file using an ApkTool tool;
step2: extracting a function relation call graph by running a script file;
step3: using a Gephi tool to visualize the extracted function relation call graph;
step4: reading a gml file from the function relation call graph;
step5: the read sensitive API List is stored in a List;
step6: an optimization algorithm is called to optimize the function relation call graph;
step7: outputting the optimized function relation call graph.
The invention further adopts the technical scheme that: the rule of the optimization algorithm is as follows: if a node is not a sensitive API node and its two adjacent nodes in the same direction are not sensitive nodes either, the node is deleted; otherwise, the API node is retained.
The invention further adopts the technical scheme that: the sensitive permission extraction in step2 comprises the following steps:
step1: decompiling the APK file using the ApkTool tool;
step2: reading the AndroidManifest.xml file and parsing it;
step3: extracting permission features by parsing the AndroidManifest.xml file.
The invention further adopts the technical scheme that: the opcode feature extraction in step2 comprises the following steps:
step1: decompiling the APK file using the ApkTool tool;
step2: running the get_opcode.py script file to extract an opcode sequence file;
step3: storing the opcode sequence in the file extract_data;
step4: mapping the opcodes to the corresponding hexadecimal numbers in a dictionary, the content of which is stored in DavlinkOpcodes.txt;
step5: checking the /decoding_data/ directory, which temporarily stores the decoded files of the application;
step6: checking the opseq.log file, which records the process of decompiling the APK file.
The invention further adopts the technical scheme that: the inputs of the graph convolutional neural network described in step3 are as follows: assuming that the graph data has N nodes, the features of each node form an N × D matrix X, and the relationships between the nodes form an N × N matrix A, i.e., the adjacency matrix; the matrix X and the matrix A are the inputs of the model.
The invention further adopts the technical scheme that: classification training is carried out with a two-layer graph convolutional neural network, which takes the aggregated neighbor vectors output by the first GCN layer as the input features of the second layer; this is equivalent to aggregating the features of two-hop neighbors.
The invention further adopts the technical scheme that: step3 also includes graph embedding after feature merging, i.e., mapping nodes or subgraphs as points into a low-dimensional vector space.
The invention further adopts the technical scheme that: the Node2vec algorithm is adopted for graph embedding; in Node2vec, two hyperparameters, p and q, control the random walk strategy, i.e., the probability of walking to different vertices.
The invention has the beneficial effects that:
To address the low detection accuracy caused by incomplete feature extraction, the invention provides a malware detection method using a graph convolutional neural network based on multi-feature fusion. The method takes the function call graph of an APK file as its basis, combines it with sensitive permission features and opcode features, fuses these features, and feeds them into a model for training and detection, thereby achieving malware detection. During APK preprocessing, the influence of APK size on the training result is taken into account, and APK size equalization is performed to reduce its effect on the detection result. During function call graph processing, the graph is optimized to remove redundant ordinary nodes, which greatly reduces data complexity and speeds up data processing. Compared with traditional malware detection techniques, the method improves the training efficiency and accuracy of the subsequent detection model to a certain extent, and is of significance for detection on very large malware datasets.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
FIG. 1 is an internal structure diagram of FCG extraction;
FIG. 2 is a comparison of the FCG before and after optimization;
FIG. 3 is a diagram of the detection technique architecture;
FIG. 4 is a schematic diagram of a graph convolutional neural network;
FIG. 5 is a schematic diagram of Node2vec;
FIG. 6 is a classification flow chart.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
A graph convolutional neural network is essentially a feature extractor for graph data. It can capture global information of the graph, represents node features well, has a strong capability for learning graph data, and can be used for tasks such as edge prediction and graph classification. Traditional static feature detection identifies malware before program execution, typically by decompiling the application file, extracting permission features, and detecting on them. However, as malware keeps changing, it is increasingly difficult to detect by common detection methods, so exploring new malware detection methods is very important. The invention therefore provides a malware detection method using a graph convolutional neural network based on multi-feature fusion.
The method detects software mainly from the perspective of static features: the function call graph is the main feature, and sensitive permission features and opcode features are auxiliary features that are fused with it. Because function call graphs are large and contain much redundant information, a method for optimizing the function call graph is provided to reduce data complexity. In addition, because the size of an APK (Android application package) affects the number of nodes in the FCG and therefore affects model training, the invention provides an APK size equalization method, which reduces the detection error caused by unbalanced APK sizes and improves the accuracy of model detection. Finally, the extracted features are fused into a feature vector matrix, which is input into the model for training and detection.
The malicious software detection method based on the graph convolution neural network comprises the following steps:
step one: extraction of function relationship call graphs
A function call graph (FCG) reflects the relationships between functions in a program and is typically represented as a binary vector or a directed graph, in which each node represents a function and each edge represents a function call. It can be expressed as:
G=(V,E) (1)
where V is the set of all nodes and E represents the call relationships between functions. The graph representation can provide more information about how an application operates than a vector representation, which facilitates malware detection. The smali code is typically obtained by decompiling the APK with APK-Tool, and a function call graph is then generated directly from the smali code using android-cg.
The FCG extracted by the invention is a directed graph $G = (M, E)$, where $M$ is the set of all methods and $E \subseteq M \times M$ is the set of call edges, with $(m_i, m_j) \in E$ for $m_i, m_j \in M$. The node set $M$ can be divided into two disjoint subsets, $M_{external}$ and $M_{internal}$, corresponding to external methods (the dex file contains only their definitions) and internal methods (the dex file contains both definitions and implementations), respectively; that is, $M = M_{external} \cup M_{internal}$, where $M_{external} \cap M_{internal} = \emptyset$. The internal structure of FCG extraction is shown in FIG. 1.
In order to extract the FCG and the related node features from the APK, the invention uses android to parse the dex file. The FCG is extracted from the parsed method code by traversing the method code according to the call instructions. After the FCG is extracted, features are computed and assigned for each node m in the FCG:
(1) Degree feature: the degree of each node captures the basic structure of the graph. Since the FCG is a directed graph, the node degree is taken as the average of the in-degree (indegree) and the out-degree (outdegree):
$$degree(m) = \frac{indegree(m) + outdegree(m)}{2} \qquad (2)$$
Here, for an external method, the bytecode is not included in the dex code.
(2) Method attributes: for each node m of the FCG, several Boolean and integer properties can be extracted from the method definition in the dex code.
· external(m) = 1 if $m \in M_{external}$;
· native(m) = 1 if m is a method written in a native language (i.e., not a Java virtual machine language; for example C or C++);
· public(m) = 1 if m is declared public in the dex code;
· static(m) = 1 if m is declared static in the dex code;
· codesize(m) is the number of bytes that m occupies in the dex code.
Finally, the attributes are grouped to form the method attribute feature vector shown in formula (3).
method-attributes(m) = [external(m), native(m),
public(m), static(m),
codesize(m)] (3)
The dex code consists of opcodes. Each opcode is represented by 8 bits, giving $2^8 = 256$ possible opcodes, of which only 230 are used; many of them have similar functions and operate on the same type of data.
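As a minimal sketch of the node features described above (the average in/out degree and the method attribute vector of formula (3)), the following assumes a hypothetical `Method` record and an edge list of `(caller, callee)` pairs; these names are illustrative and do not come from the patent.

```python
from dataclasses import dataclass

@dataclass
class Method:
    """Hypothetical record for one FCG node; field names are illustrative."""
    name: str
    external: bool   # only defined, not implemented, in the dex file
    native: bool
    public: bool
    static: bool
    codesize: int    # bytes occupied by the method in the dex code

def degree_feature(name, edges):
    """Average of in-degree and out-degree of a node in a directed edge list."""
    indegree = sum(1 for _, callee in edges if callee == name)
    outdegree = sum(1 for caller, _ in edges if caller == name)
    return (indegree + outdegree) / 2

def method_attributes(m):
    """Attribute vector in the order given by formula (3)."""
    return [int(m.external), int(m.native), int(m.public), int(m.static), m.codesize]

# Toy call graph: main calls send and log, send calls log.
edges = [("main", "send"), ("main", "log"), ("send", "log")]
send = Method("send", external=False, native=False, public=True, static=False, codesize=48)

# Node feature = degree feature followed by the method attributes.
features = [degree_feature(send.name, edges)] + method_attributes(send)
```

Concatenating the degree with the attribute vector is one possible way to build a per-node feature row for the matrix X; the patent does not state the exact layout.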
Step two: equalizing APK size
The number of nodes of the FCG is proportional to the APK size. Because benign and malicious APKs in the initial dataset used to train the model have different size distributions, and the GCN takes the number of nodes into account, the classifier may become biased. To avoid this bias, if the APK size distribution of the initial dataset is markedly unbalanced, it is balanced by adding additional APKs, so that the dataset is distributed within a fixed interval.
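One way to picture the equalization step is to bin APKs by size and count how many extra samples each class needs per bin. The sketch below is only an illustration under that assumption; the bin width, the function names, and the per-bin matching strategy are hypothetical and are not specified by the patent.

```python
from collections import Counter

def size_bin(size_bytes, bin_width):
    """Index of the fixed-width size interval an APK falls into."""
    return size_bytes // bin_width

def balancing_plan(benign_sizes, malicious_sizes, bin_width=1_000_000):
    """Return {bin: (label, count)}: how many extra APKs of which class
    would make the two classes equally frequent in each size bin."""
    benign = Counter(size_bin(s, bin_width) for s in benign_sizes)
    malicious = Counter(size_bin(s, bin_width) for s in malicious_sizes)
    plan = {}
    for b in sorted(set(benign) | set(malicious)):
        diff = benign[b] - malicious[b]
        if diff > 0:
            plan[b] = ("malicious", diff)
        elif diff < 0:
            plan[b] = ("benign", -diff)
    return plan

# Toy dataset: APK sizes in bytes.
benign_sizes = [500_000, 700_000, 2_300_000, 2_900_000, 2_100_000]
malicious_sizes = [400_000, 2_500_000]
plan = balancing_plan(benign_sizes, malicious_sizes)
```

Here the plan asks for one more malicious APK in the 0 to 1 MB bin and two more in the 2 to 3 MB bin.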
Step three: optimizing function relation call graph
Because current APKs are large, the function call graph contains many ordinary nodes that are unnecessary for malware detection. These nodes are invalid information: they increase the complexity of the sample data and greatly reduce detection efficiency. The invention therefore adopts an optimization algorithm to delete the ordinary API nodes.
The rule of the optimization algorithm provided by the invention is as follows: if a node is not a sensitive API node and its two adjacent nodes in the same direction are not sensitive nodes either, the node is deleted; otherwise, the API node is retained.
As shown in FIG. 2, black nodes represent sensitive API nodes and white nodes represent non-sensitive nodes. The left side shows the FCG before optimization; after optimization according to the rule above, the function call graph on the right is obtained. Optimizing the FCG in this way reduces the number of nodes and prevents invalid information from interfering with the experimental results and degrading efficiency.
The specific extraction and optimization steps of the function relation call graph are as follows:
step1: decompiling the APK file using an ApkTool tool;
step2: extracting a function relation call graph by running a script file;
step3: using a Gephi tool to visualize the extracted function relation call graph;
step4: reading a gml file from the function relation call graph;
step5: the read sensitive API List is stored in a List;
step6: an optimization algorithm is called to optimize the function relation call graph;
step7: outputting the optimized function relation call graph.
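The optimization rule can be sketched as follows. The phrase "two adjacent nodes in the same direction" is read here as the node's callers and callees along the call direction, which is an interpretation rather than something the patent spells out; the function name `prune_fcg` and the edge-list representation are also assumptions. Deletion decisions are made against the original graph, and edges touching a deleted node are dropped.

```python
def prune_fcg(edges, sensitive):
    """Keep a node if it is a sensitive API node or is directly connected
    (as caller or callee) to one; delete all other nodes.

    edges: list of (caller, callee) pairs; sensitive: set of node names.
    Returns the pruned edge list and the set of retained nodes.
    """
    nodes = {n for edge in edges for n in edge}
    neighbours = {n: set() for n in nodes}
    for caller, callee in edges:
        neighbours[caller].add(callee)
        neighbours[callee].add(caller)
    keep = {
        n for n in nodes
        if n in sensitive or any(nb in sensitive for nb in neighbours[n])
    }
    pruned = [(u, v) for u, v in edges if u in keep and v in keep]
    return pruned, keep

# Toy FCG: only getDeviceId is treated as a sensitive API here.
edges = [
    ("onCreate", "helper"),
    ("helper", "format"),
    ("onCreate", "collect"),
    ("collect", "getDeviceId"),
]
pruned_edges, kept_nodes = prune_fcg(edges, sensitive={"getDeviceId"})
```

In this toy graph, `onCreate`, `helper`, and `format` have no sensitive neighbour and are removed, leaving the single edge from `collect` to `getDeviceId`.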
Step four: extraction of sensitive permission features and opcode features
(1) Sensitive permission extraction
Analysis of current malware shows that the great majority of malicious programs request sensitive permissions when performing illegal operations and frequently call sensitive API functions when carrying out malicious behavior. The invention takes the permission list of the AndroidManifest.xml file as a feature and expresses it as a one-hot feature vector: $G = \{g_1, g_2, g_3, \ldots, g_{24}\}$, $g_i \in \{0, 1\}$, where $g_i$ is 0 when the APK does not request the corresponding permission and 1 when it does.
Step1: decompiling the APK file using the ApkTool tool;
Step2: reading the AndroidManifest.xml file and parsing it;
Step3: extracting permission features by parsing the AndroidManifest.xml file.
The main permissions extracted in this process are listed in annex 1.
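The permission vector can be illustrated with a short sketch that parses a decoded AndroidManifest.xml using Python's standard `xml.etree.ElementTree`. The four-entry `SENSITIVE_PERMISSIONS` list is a hypothetical stand-in for the 24 permissions of annex 1, which is not reproduced here, and the manifest string is a made-up example.

```python
import xml.etree.ElementTree as ET

# Namespace prefix that ElementTree uses for android:* attributes.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Hypothetical short list; the patent's actual 24 permissions are in its annex.
SENSITIVE_PERMISSIONS = [
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
    "android.permission.ACCESS_FINE_LOCATION",
    "android.permission.RECORD_AUDIO",
]

def permission_vector(manifest_xml):
    """Return a 0/1 list with one entry per permission in SENSITIVE_PERMISSIONS."""
    root = ET.fromstring(manifest_xml)
    requested = {
        elem.get(ANDROID_NS + "name")
        for elem in root.iter("uses-permission")
    }
    return [1 if p in requested else 0 for p in SENSITIVE_PERMISSIONS]

manifest = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.app">
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-permission android:name="android.permission.RECORD_AUDIO"/>
</manifest>"""

vector = permission_vector(manifest)
```

Permissions outside the tracked list, such as `INTERNET` in this example, simply do not contribute to the vector.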
(2) Opcode feature extraction
The opcode features contain the original information of the application. They are generally obtained by disassembling the APK file and then matching opcodes in the smali files, taking the function as the basic unit. The specific steps are as follows:
step1: decompiling the APK file using the ApkTool tool;
step2: running the get_opcode.py script file to extract an opcode sequence file;
step3: storing the opcode sequence in the file extract_data;
step4: mapping the opcodes to the corresponding hexadecimal numbers in a dictionary, the content of which is stored in DavlinkOpcodes.txt;
step5: checking the /decoding_data/ directory, which temporarily stores the decoded files of the application;
step6: checking the opseq.log file, which records the process of decompiling the APK file.
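The mapping from smali mnemonics to hexadecimal opcode values in step4 might look like the sketch below. It does not reproduce the patent's get_opcode.py or its dictionary file, whose formats are not given; `OPCODE_TABLE` is a small hand-written excerpt of Dalvik opcode values, and the parsing rules (skip directives, comments, and labels) are assumptions.

```python
# Small excerpt of Dalvik opcode values; a full table has roughly 230 entries.
OPCODE_TABLE = {
    "nop": 0x00,
    "move": 0x01,
    "return-void": 0x0E,
    "const/4": 0x12,
    "invoke-virtual": 0x6E,
    "invoke-direct": 0x70,
    "invoke-static": 0x71,
}

def opcode_sequence(smali_text):
    """Return the opcodes of a smali snippet as two-digit hex strings.

    Lines starting with '.', '#', or ':' (directives, comments, labels)
    are skipped, as are mnemonics missing from OPCODE_TABLE.
    """
    sequence = []
    for line in smali_text.splitlines():
        line = line.strip()
        if not line or line[0] in ".#:":
            continue
        mnemonic = line.split()[0]
        if mnemonic in OPCODE_TABLE:
            sequence.append(format(OPCODE_TABLE[mnemonic], "02x"))
    return sequence

smali = """
.method public onCreate()V
    .locals 1
    const/4 v0, 0x1
    invoke-virtual {p0, v0}, Lcom/example/App;->send(I)V
    return-void
.end method
"""
hex_sequence = opcode_sequence(smali)
```

Processing one `.method` block at a time would give the per-function opcode sequences mentioned in the text.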
Step five: malware detection model parameter training
The detection technique uses a graph convolutional neural network to learn from fused features built around the function call graph, with emphasis on the use of the function call graph. On this basis, sensitive permission features and opcode features are added for feature fusion, and the result is finally input into the model for training and detection. The detection architecture is shown in FIG. 3.
As shown in FIG. 3, malware detection mainly includes the following steps. First: preprocess the samples, including APK size equalization and extraction and optimization of the function call graph. Second: extract the auxiliary features and merge the features of the processed data. Third: after feature extraction and merging, train the model and adjust the relevant parameters so that the model achieves the best training effect. Fourth: detect malicious code. Fifth: classify using softmax to distinguish benign software from malware.
The invention combines several kinds of malware features to form the input structure of the GCN. Features are extracted separately from the APK file, then fused and embedded as a graph, input into the GCN model, and finally passed through a classification function to obtain the detection result. The process of the multi-feature-fusion GCN malware detection model is as follows:
Step1: input the APK sample training set $D \in \{APK_1, APK_2, \ldots, APK_{n-1}, APK_n\}$;
Step2: equalize the APK sizes;
Step3: parse the APK file using APK-Tool;
Step4: extract permission features from the AndroidManifest.xml file;
Step5: extract the opcode features from the resource files and the smali code;
Step6: extract and optimize the function call graph (FCG);
Step7: fuse the features and embed the graph using Node2vec;
Step8: classify the features using softmax;
Step9: output the classification result $R(x_i) = x_i$.
The detection method studies malware detection based on GCN technology using the combination of three kinds of features: opcode features, high-risk permission features, and function call relations. Before describing the method, the core of the GCN is introduced. Assume that some graph data has N nodes; the features of each node form an N × D matrix X, and the relationships between the nodes form an N × N matrix A, i.e., the adjacency matrix. X and A are the inputs of the model. For the GCN, the layer-to-layer propagation rule is:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right) \qquad (4)$$
where $\tilde{A} = A + I$ and $I$ is the identity matrix; $\tilde{D}$ is the degree matrix of $\tilde{A}$, given by $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$; $H^{(l)}$ is the feature matrix of layer $l$, and for the input layer $H$ is $X$; $W^{(l)}$ is the weight matrix of layer $l$; and $\sigma$ is a nonlinear activation function.
Based on this graph representation, the basic theoretical model of the GCN is shown in FIG. 4.
In the schematic of a multi-layer graph convolutional network for semi-supervised learning, a graph with C input channels is taken as input, and the output with F output channels is obtained through the intermediate hidden layers. That is, the GCN takes a graph as input, and the features of each node change from X to Z after several layers; however many layers lie in between, the connection relations among nodes, i.e., the graph structure (edges shown as black lines), are shared across layers, and the labels are denoted $Y_i$.
In order to ensure the detection precision, the invention adopts double-layer GCN for classification training. The double-layer GCN takes the aggregated neighbor vector output by the first GCN as the input of the characteristic value of the second GCN, which is equivalent to the characteristic value of the aggregated neighbors of the two layers. The specific model formula is as follows:
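$$H^{(0)} = X \quad (5)$$
First-layer graph convolution:
$$H^{(1)} = \sigma\left(\hat{A} H^{(0)} W^{(0)}\right) \quad (6)$$
Second-layer graph convolution:
$$H^{(2)} = \sigma\left(\hat{A} H^{(1)} W^{(1)}\right) \quad (7)$$
where $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ is the normalized adjacency matrix, with $\tilde{A} = A + I$ and $\tilde{D}$ the degree matrix of $\tilde{A}$.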
As shown in FIG. 4, assume there are four input nodes, $X_1$, $X_2$, $X_3$, $X_4$, and that the feature vector of each node has dimension C. Nodes $X_1$ and $X_4$ are labeled, and nodes $X_2$ and $X_3$ are unlabeled. After the multi-layer convolution operations, the feature vectors of all four nodes have dimension F. The parameters of the GCN are trained using the labeled nodes in the graph: the loss between the true classes and the predicted classes is computed over the labeled nodes only, and the GCN parameters are optimized to minimize it.
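The four-node example can be sketched as a two-layer forward pass with a loss restricted to the labeled nodes. The sketch below is illustrative only: the ring-shaped graph, the random weights, the hidden size, and the helper names (`two_layer_gcn`, `masked_cross_entropy`) are assumptions, and the softmax output is the usual choice for a two-layer GCN classifier rather than something the text specifies at this point.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def two_layer_gcn(A_hat, X, W0, W1):
    H1 = np.maximum(A_hat @ X @ W0, 0.0)   # first graph convolution, ReLU
    return softmax(A_hat @ H1 @ W1)        # second graph convolution, softmax over F classes

def masked_cross_entropy(P, y, labeled):
    """Mean cross-entropy computed over the labeled nodes only."""
    idx = np.flatnonzero(labeled)
    return float(-np.mean(np.log(P[idx, y[idx]] + 1e-12)))

# Assumed toy graph: the four nodes form a ring X1 - X2 - X3 - X4 - X1.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)
d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
C, hidden, F = 3, 5, 2                     # input, hidden, and output dimensions
X = rng.normal(size=(4, C))
W0 = rng.normal(size=(C, hidden))
W1 = rng.normal(size=(hidden, F))

y = np.array([0, -1, -1, 1])               # X1 and X4 labeled, -1 marks unlabeled
labeled = y >= 0

P = two_layer_gcn(A_hat, X, W0, W1)
loss = masked_cross_entropy(P, y, labeled)
print(P.shape, loss)
```

All four nodes receive an F-dimensional output, but only the two labeled nodes contribute to the loss, which is what makes the training semi-supervised.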
Data preprocessing is required before model training. Using the function call graph alone gives only a single kind of feature, so high-risk permission features and sensitive API features are extracted as auxiliary features to improve detection accuracy. Feature fusion is therefore performed before training. After fusion, the features need to be embedded, i.e. nodes or subgraphs are mapped to points in a low-dimensional vector space. The classic graph embedding methods are DeepWalk and Node2vec. DeepWalk is the first deep-learning-based graph embedding method; it uses random walks on the graph to obtain node representations, while Node2vec extends DeepWalk by introducing a biased random walk. The invention uses Node2vec for graph embedding, because it takes the edge weights of the graph into account and preserves the network neighborhood of each node as far as possible. Learning the embedding takes two steps: (1) second-order random walks; (2) learning vertex embeddings with skip-gram. In the random walk, given the current vertex v, the probability of visiting the next vertex x is:
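$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z}, & \text{if } (v, x) \in E \\ 0, & \text{otherwise} \end{cases}$$
where $c_i$ is the i-th vertex of the walk, $E$ is the edge set of the graph, $\pi_{vx}$ is the unnormalized transition probability between vertex v and vertex x, and Z is the normalization constant.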
Node2vec has two hyperparameters that control the random walk strategy, p and q. Suppose the current walk has just traversed the edge (t, v) and is now at vertex v. The unnormalized transition probability is set to $\pi_{vx} = \alpha_{pq}(t, x) \cdot W_{vx}$, where $W_{vx}$ is the weight of the edge between vertices v and x.
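$$\alpha_{pq}(t, x) = \begin{cases} \dfrac{1}{p}, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ \dfrac{1}{q}, & d_{tx} = 2 \end{cases}$$
$\alpha_{pq}$ is determined by p and q, and $d_{tx}$ is the shortest-path distance between vertex t and vertex x.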
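The transition probabilities can be computed directly from these definitions. The following is a minimal sketch that assumes an undirected weighted graph stored as a dictionary of dictionaries; the graph, the vertex names, and the helpers `alpha` and `transition_probs` are hypothetical and chosen only for illustration.

```python
def alpha(p, q, d_tx):
    """Search bias alpha_pq for a shortest-path distance d_tx in {0, 1, 2}."""
    if d_tx == 0:
        return 1.0 / p      # x is the previous vertex t (backtracking)
    if d_tx == 1:
        return 1.0          # x is also a neighbor of t
    return 1.0 / q          # x is two hops away from t

def transition_probs(graph, t, v, p, q):
    """Normalized probabilities of moving from v to each neighbor x, having arrived from t."""
    unnormalized = {}
    for x, w_vx in graph[v].items():
        if x == t:
            d_tx = 0
        elif x in graph[t]:
            d_tx = 1
        else:
            d_tx = 2
        unnormalized[x] = alpha(p, q, d_tx) * w_vx   # pi_vx = alpha_pq(t, x) * W_vx
    Z = sum(unnormalized.values())                   # normalization constant
    return {x: pi / Z for x, pi in unnormalized.items()}

# Assumed toy graph: x1 is adjacent to both t and v, x2 is adjacent to v only.
graph = {
    "t":  {"v": 1.0, "x1": 1.0},
    "v":  {"t": 1.0, "x1": 1.0, "x2": 1.0},
    "x1": {"t": 1.0, "v": 1.0},
    "x2": {"v": 1.0},
}

probs = transition_probs(graph, "t", "v", p=2.0, q=0.5)
print(probs)
```

With p = 2 and q = 0.5 the unnormalized values are 0.5, 1 and 2 for t, x1 and x2, so the walk is most likely to move outward to x2.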
The Node2vec algorithm controls the probability of walking to different vertices through the two hyperparameters p and q. FIG. 5 illustrates a walk that has just moved from the previous vertex t to the current vertex v and is deciding which vertex to move to next.
In FIG. 5, q controls whether the walk moves "inward" or "outward": if q > 1, the walk tends to visit vertices close to t; if q < 1, it tends to visit vertices farther from t. The parameter p controls backtracking: if p is set to a large value, the walk is unlikely to revisit the vertex it has just left, and if p is set to a small value, the walk is likely to step back to it. With the Node2vec graph embedding algorithm, a vector representation of an application's features can be obtained and used for Android malware detection.
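The effect of p on backtracking can be seen by sampling a walk. The sketch below is an illustration under assumptions: the graph is a hypothetical unweighted 4-cycle, `node2vec_walk` is an illustrative helper rather than the code used in the invention, and the first step is drawn from the edge weights alone because no previous vertex exists yet.

```python
import random

def node2vec_walk(graph, start, length, p, q, rng):
    """Sample one second-order biased random walk of at most `length` vertices."""
    walk = [start]
    while len(walk) < length:
        v = walk[-1]
        neighbors = sorted(graph[v])
        if not neighbors:
            break
        if len(walk) == 1:
            # First step: no previous vertex, so use the edge weights only.
            weights = [graph[v][x] for x in neighbors]
        else:
            t = walk[-2]
            weights = []
            for x in neighbors:
                if x == t:
                    bias = 1.0 / p      # return to the previous vertex
                elif x in graph[t]:
                    bias = 1.0          # stay close to t
                else:
                    bias = 1.0 / q      # move away from t
                weights.append(bias * graph[v][x])
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

# Assumed toy graph: an unweighted 4-cycle a - b - c - d - a.
graph = {
    "a": {"b": 1.0, "d": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0, "a": 1.0},
}

walk = node2vec_walk(graph, "a", 10, p=1e9, q=1.0, rng=random.Random(0))
print(walk)
```

With a very large p the walk keeps moving around the cycle instead of stepping back. In a full pipeline, many such walks per vertex would be passed to a skip-gram model to learn the embeddings.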
Step six: malicious software detection method
A GCN is a neural network that operates on a graph and aggregates node features according to the properties of each node's neighborhood. Depending on how many graph convolution layers are used, the GCN can capture information from direct neighbors only (one graph convolution layer) or from any node within at most k hops (where k is the number of graph convolution layers used). The specific classification flow is shown in FIG. 6. The core of the network consists of graph convolution layers and a classification layer. As shown in FIG. 6, the processing is node-level: it does not change the structure of the graph, and the output Z is obtained from the input graph features X. In FIG. 6, the structure of the graph is represented by the adjacency matrix A. In the invention, the graph convolution layers use GCN for node aggregation and feature extraction. The classification layer applies a softmax classifier to the features extracted by the graph convolution layers.
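The k-hop statement can be checked numerically: after k graph convolution layers, node i can only be influenced by nodes reachable within k steps on the graph with self-loops. The following sketch assumes a hypothetical four-node path graph, and `receptive_field` is an illustrative helper name.

```python
import numpy as np

def receptive_field(A, k):
    """R[i, j] is True if node j can influence node i after k graph convolution layers."""
    n = A.shape[0]
    step = ((A + np.eye(n)) > 0).astype(int)    # one layer: neighbors plus the node itself
    reach = np.eye(n, dtype=int)
    for _ in range(k):
        reach = ((reach @ step) > 0).astype(int)
    return reach.astype(bool)

# Assumed toy graph: a path 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

print(receptive_field(A, 1)[0].tolist())  # [True, True, False, False]
print(receptive_field(A, 2)[0].tolist())  # [True, True, True, False]
```

With one layer node 0 sees only itself and node 1; with two layers it also sees node 2, which is two hops away.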
The GCN obtains node vectors by iteratively aggregating the vectors of neighboring nodes, and the invention seeks a representation of the graph by learning to map the entire graph into a vector space. In this space, the geometric relationships between the learned vectors reflect the structural information of the graph and can be used as input to the subsequent classification layer. To prevent the output from differing greatly from the input as it passes through multiple layers, the adjacency matrix A is reconstructed with a normalization operation, as shown in equation (11) and equation (12).
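$$\tilde{A} = A + I \quad (11)$$
$$\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} \quad (12)$$
where I is the identity matrix, used to add self-loop connections. When multiplying by the adjacency matrix $\tilde{A}$, the feature vectors of all neighbors of each node are summed, and the node's own feature vector is included as well; $\tilde{D}$ is the diagonal node degree matrix of $\tilde{A}$; and $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ is a normalization operation performed to avoid numerical instability and exploding or vanishing gradients caused by repeated multiplication.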
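A small worked example may make the normalization concrete. It assumes the usual form $\tilde{A} = A + I$ followed by $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$, applied to a hypothetical three-node path graph 1 - 2 - 3 that is not taken from the patent.

```latex
\[
A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix},
\qquad
\tilde{A} = A + I = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \end{pmatrix},
\qquad
\tilde{D} = \operatorname{diag}(2, 3, 2)
\]
\[
\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} =
\begin{pmatrix}
\frac{1}{2} & \frac{1}{\sqrt{6}} & 0 \\
\frac{1}{\sqrt{6}} & \frac{1}{3} & \frac{1}{\sqrt{6}} \\
0 & \frac{1}{\sqrt{6}} & \frac{1}{2}
\end{pmatrix}
\]
```

Each entry $(i, j)$ is $\tilde{A}_{ij} / \sqrt{\tilde{D}_{ii} \tilde{D}_{jj}}$, so contributions from high-degree nodes are scaled down and repeated multiplication does not inflate the feature magnitudes.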
Each graph convolution layer of the GCN model has a nonlinear activation function defined as follows:
where k is the layer index; the input is the embedding matrix X formed by the embedding vectors of the nodes in the graph; $W^{(k)}$ is the weight matrix of layer k; and $b^{(k)}$ is the intercept of layer k. $N(v)$ denotes the set of neighbors of node v, where v itself belongs to $N(v)$ because of the self-loop; ReLU is the activation function; and the aggregation operator denotes the convolution operation employed.
The classification experiment of the invention uses a two-layer GCN model and feeds the output of the graph convolution layers into the classifier. The objective function of the classification is shown in equation (14).
Z gives the probability distribution over labels, and cross-entropy is used as the loss function. The loss is then backpropagated, and the Adam algorithm updates the weight matrices $W^{(k)}$ and intercepts $b^{(k)}$ (k = 1, 2) of the graph convolution layers. The GCN mainly aggregates neighbor information and learns neighbor representations. The embedding matrix X and the adjacency matrix A form the input of the GCN, and this graph-structured framework can learn an embedding for each node in a way that generalizes. The aggregation operation is described by formula (15).
Finally, through learning, the model integrates the node structure and the node attributes in the last GCN layer so that the two interact. Model parameters are adjusted continuously during training and classification, which yields a better detection model for malicious software. Compared with traditional malware detection methods, it improves detection accuracy and efficiency.
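The training loop described here (softmax output, cross-entropy loss, backpropagated gradients, Adam updates of a weight matrix and an intercept) can be sketched in NumPy. This is a deliberately reduced illustration: it trains a single softmax layer on fixed random features `H` that stand in for the graph convolution output, and the data, hyperparameter values, and the helper `adam_step` are assumptions, not values from the patent.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(P, y):
    return float(-np.mean(np.log(P[np.arange(len(y)), y] + 1e-12)))

def adam_step(param, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameter and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

rng = np.random.default_rng(0)
H = rng.normal(size=(8, 4))                       # stand-in for GCN output features
y = np.array([0, 1, 0, 1, 1, 0, 1, 0])            # assumed labels (benign = 0, malware = 1)

W, b = np.zeros((4, 2)), np.zeros(2)              # weight matrix and intercept
mW, vW = np.zeros_like(W), np.zeros_like(W)
mb, vb = np.zeros_like(b), np.zeros_like(b)

losses = []
for t in range(1, 201):
    P = softmax(H @ W + b)
    losses.append(cross_entropy(P, y))
    dZ = P.copy()
    dZ[np.arange(len(y)), y] -= 1.0               # gradient of the loss w.r.t. the logits
    dZ /= len(y)
    W, mW, vW = adam_step(W, H.T @ dZ, mW, vW, t)
    b, mb, vb = adam_step(b, dZ.sum(axis=0), mb, vb, t)

print(losses[0], losses[-1])
```

In a full two-layer GCN the same update would be applied to the weight matrix and intercept of each layer, with the gradients obtained by backpropagating through both graph convolutions.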
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made without departing from the spirit and scope of the invention.
Appendix 1: permission list

Claims (10)

1. A graph convolutional neural network malware detection method based on multi-feature fusion, characterized by comprising the following steps:
step1: preprocessing the samples, performing APK size equalization, and extracting and optimizing the function call graph;
step2: extracting auxiliary features and merging them with the features of the processed data; the auxiliary features comprise sensitive permission features and opcode features;
step3: after feature extraction and merging are completed, training a malware detection model and adjusting the relevant parameters so that the model achieves the optimal training effect; the malware detection model is a graph convolutional neural network;
step4: detecting malicious code using the trained malware detection model;
step5: classifying with softmax to distinguish benign software from malware.
2. The graph convolutional neural network malware detection method based on multi-feature fusion of claim 1, wherein the APK size equalization in step1 is specifically: if the APK size distribution of the initial data set is unbalanced anywhere, APKs are added to balance it, so that the data set is distributed within a fixed interval and large imbalances are avoided.
3. The graph convolutional neural network malware detection method based on multi-feature fusion of claim 1, wherein extracting and optimizing the function call graph in step1 comprises the following steps:
step1: decompiling the APK file using an ApkTool tool;
step2: extracting a function relation call graph by running a script file;
step3: using a Gephi tool to visualize the extracted function relation call graph;
step4: reading a gml file from the function relation call graph;
step5: the read sensitive API List is stored in a List;
step6: an optimization algorithm is called to optimize the function relation call graph;
step7: outputting the optimized function relation call graph.
4. The graph convolutional neural network malware detection method based on multi-feature fusion of claim 3, wherein the rule of the optimization algorithm is: if a node is not a sensitive API node and its two adjacent nodes in the same direction are not sensitive nodes either, the node is deleted; otherwise, the node is retained.
5. The graph convolutional neural network malware detection method based on multi-feature fusion of claim 1, wherein the opcode feature extraction in step2 comprises the following steps:
step1: decompiling the APK file using the tool;
step2: reading the AndroidManifest.xml file and parsing it;
step3: extracting permission features by parsing AndroidManifest.xml.
6. The graph convolutional neural network malware detection method based on multi-feature fusion of claim 1, wherein the sensitive permission extraction in step2 comprises the following steps:
step1: decompiling the APK file using the ApkTool tool;
step2: running the get_opcode.py script file and extracting the opcode sequence file;
step3: storing the opcode sequence in the file extract_data;
step4: mapping the opcodes to the corresponding hexadecimal numbers in a dictionary, the content of which is stored in DavlinkOpcodes.txt;
step5: checking the /decoding_data/ directory, which temporarily stores the decoded files of the application;
step6: checking the opseq.log file, which records the process of decompiling the APK file.
7. The graph convolutional neural network malware detection method based on multi-feature fusion of claim 1, wherein the inputs of the graph convolutional neural network in step3 are: assuming the graph data has N nodes, the features of the nodes form an N×D matrix X, and the relationships between the nodes form an N×N matrix A, i.e. the adjacency matrix; the matrix X and the matrix A are the inputs of the model.
8. The graph convolutional neural network malware detection method based on multi-feature fusion of claim 1, wherein classification training is performed with a two-layer graph convolutional neural network, which takes the aggregated neighbor vectors output by the first GCN layer as the input features of the second layer, which is equivalent to aggregating features from two layers of neighbors.
9. The graph convolutional neural network malware detection method based on multi-feature fusion of claim 1, wherein step3 further comprises graph embedding after feature merging, i.e. mapping nodes or subgraphs to points in a low-dimensional vector space.
10. The graph convolutional neural network malware detection method based on multi-feature fusion of claim 9, wherein the Node2vec algorithm is used for graph embedding; the Node2vec algorithm has two hyperparameters, p and q, that control the random walk strategy, and the probability of walking to different vertices is controlled through p and q.
CN202310541941.5A 2023-05-15 2023-05-15 Graph convolution neural network malicious software detection method based on multi-feature fusion Pending CN116611063A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310541941.5A CN116611063A (en) 2023-05-15 2023-05-15 Graph convolution neural network malicious software detection method based on multi-feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310541941.5A CN116611063A (en) 2023-05-15 2023-05-15 Graph convolution neural network malicious software detection method based on multi-feature fusion

Publications (1)

Publication Number Publication Date
CN116611063A true CN116611063A (en) 2023-08-18

Family

ID=87673974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310541941.5A Pending CN116611063A (en) 2023-05-15 2023-05-15 Graph convolution neural network malicious software detection method based on multi-feature fusion

Country Status (1)

Country Link
CN (1) CN116611063A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117034273A (en) * 2023-08-28 2023-11-10 山东省计算中心(国家超级计算济南中心) Android malicious software detection method and system based on graph rolling network
CN117574370A (en) * 2023-11-28 2024-02-20 中华人民共和国新疆出入境边防检查总站(新疆维吾尔自治区公安厅边境管理总队) Malicious code detection system
CN117574370B (en) * 2023-11-28 2024-05-31 中华人民共和国新疆出入境边防检查总站(新疆维吾尔自治区公安厅边境管理总队) Malicious code detection system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination