CN113821799A - Multi-label classification method for malicious software based on graph convolution neural network - Google Patents

Multi-label classification method for malicious software based on graph convolution neural network

Info

Publication number
CN113821799A
Authority
CN
China
Prior art keywords: label, graph, function call, extracting, neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111042100.7A
Other languages
Chinese (zh)
Other versions
CN113821799B (en)
Inventor
宋玉蓉
白敬华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202111042100.7A priority Critical patent/CN113821799B/en
Publication of CN113821799A publication Critical patent/CN113821799A/en
Application granted granted Critical
Publication of CN113821799B publication Critical patent/CN113821799B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 - Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 - Detecting local intrusion or implementing counter-measures
    • G06F21/56 - Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 - Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a multi-label classification method for malicious software based on a graph convolution neural network. The classification model comprises the following steps: S100: extracting the features of the function call graph, disassembling the original binary file to obtain assembly code, and extracting the semantic and structural features of the function call graph to obtain the graph embedding vector of the sample; S200: extracting the features of the multi-label relationship, and constructing a model for extracting label relationships from a label relationship graph to obtain a multi-label classifier; S300: computing the dot product of the graph embedding vector and the multi-label classifier, and mapping the result of the dot product to obtain the classification result; S400: constructing a multi-label loss function, and computing the loss value of the classification model from the difference between the classification result and the true result of each label. Compared with the prior art, the method achieves a good multi-label classification effect on malware carrying multiple labels.

Description

Multi-label classification method for malicious software based on graph convolution neural network
Technical Field
The invention relates to the technical field of malicious software detection, in particular to a multi-label classification method for malicious software based on a graph convolution neural network.
Background
As malware protection technology and malware continue to contend with each other, malware in the current network environment is no longer limited to a single type of attack behavior. For example, WannaCry, which broke out in 2017, is commonly classified as ransomware because it encrypts data to extort payment; yet beyond encrypting files for ransom, it also replicates and spreads across the network like a worm and disguises itself like a Trojan horse. In recent years, with the development of Graph Neural Networks (GNNs), remarkable performance has been achieved in extracting the relationships between entities, and many fields have begun to introduce graph neural networks into their research; the field of malware detection has likewise started to take the Control Flow Graph (CFG) and Function Call Graph (FCG) of binary files as entry points.
The Graph Convolutional Network (GCN) is a graph representation learning method. It is a natural generalization of the convolutional neural network to graph data, and it can perform end-to-end learning over node attribute information and topological structure information simultaneously.
In view of the above, it is necessary to provide a new multi-label malware classification method based on a graph convolution neural network to solve the above problems.
Disclosure of Invention
The object of the invention is to provide a multi-label classification method for malicious software based on a graph convolution neural network that achieves a good classification effect on malware carrying multiple labels.
To achieve the above object, the present invention provides a multi-label malware classification method based on a graph convolution neural network, which comprises the following steps:
S100: extracting the features of the function call graph: disassembling the original binary file to obtain assembly code, and extracting the semantic and structural features of the function call graph to obtain the graph embedding vector of the sample;
S200: extracting the features of the multi-label relationship: constructing a model for extracting label relationships from a label relationship graph to obtain a multi-label classifier;
S300: computing the dot product of the graph embedding vector and the multi-label classifier, and mapping the result of the dot product to obtain the classification result;
S400: constructing a multi-label loss function, and computing the loss value of the classification model from the difference between the classification result and the true result of each label.
As a further improvement of the present invention, step S100 specifically includes:
S1: extracting the semantic and structural features of the function call graph: disassembling the original binary file to obtain assembly code, and constructing a model for extracting the semantic and structural features of the function call graph, wherein the function call graph input to the model is represented as G1 = (V1, ε1, H^(0)), where V1 denotes the set of statement blocks of the function call graph, ε1 denotes the set of edges connecting the statement blocks of the function call graph, and H^(0) denotes the matrix of statement-block semantic features obtained by word embedding, expressed as H^(0) ∈ R^(n×r), where n is the number of statement blocks of the function call graph and r is the vector dimension of each statement block after word embedding;
the graph embedding vector X of the function call graph is then obtained through GCN training.
As a further improvement of the present invention, step S1 specifically includes:
S11: obtaining the function call graph of the binary file: disassembling the original binary file to obtain assembly code, and analyzing the jump relationships in the assembly code to obtain the function call graph of the binary file G1 = (V1, ε1), where V1 is the set of statement-block nodes in the function call graph and ε1 is the set of edges connecting the statement blocks in the function call graph;
S12: counting all the statement blocks of the function call graph and extracting the operation codes (opcodes) in them as words, where each opcode corresponds to a word in a Natural Language Processing (NLP) task and each statement block corresponds to a sentence; the statement blocks are trained with a word vectorization (Word2Vec) model to obtain the vector of each statement block, so that the function call graph is further expressed as G1 = (V1, ε1, H^(0)), where H^(0) ∈ R^(n×r) is the matrix formed by stacking the statement-block vectors;
S13: the obtained function call graph is processed by the GCN model of formula (1) to update the node information after each convolution layer:
H^(l+1) = f(H^(l), A)   (1)
wherein the Kipf graph convolution, which the self-attention graph pooling mechanism (SAGPool) also uses to compute attention scores, gives formula (2):
H^(l+1) = σ(D̃^(-1/2) Ã D̃^(-1/2) H^(l) W^(l))   (2)
where σ is a nonlinear activation function, Ã = A + I is the adjacency matrix with self-loops added, D̃ is the degree matrix of Ã, and W^(l) denotes the learnable weight of each layer;
the result of the graph convolution is globally pooled with SAGPool, and node selection is performed as idx = top-rank(s, ⌈kN⌉), where s is the self-attention score of each node computed with the graph convolution of formula (2), the pooling ratio k ∈ (0, 1) is a hyper-parameter that determines the number of nodes to retain, and N is the number of nodes; the globally pooled result is then obtained through a masking operation;
S14: a readout operation is applied to the globally pooled result to obtain the graph embedding vector X.
As a further improvement of the present invention, step S200 specifically includes:
S2: extracting the behavioral features of the multi-label relationship: constructing a model for extracting label relationships from a label relationship graph, wherein the input of the model is the label relationship graph, represented as G2 = (V2, ε2, Z), where V2 denotes the set of all labels of the samples, ε2 denotes the set of edges representing the label relationships, and Z denotes the matrix of node vectors obtained by one-hot encoding, expressed as Z ∈ R^(C×d), where C and d respectively denote the number of label categories and the dimension of each label vector after encoding;
after multi-label training, the multi-label classifier is obtained, represented as W ∈ R^(C×D), where D is the dimension of the graph embedding vector.
As a further improvement of the present invention, step S2 specifically includes:
S21: counting the labels of the samples to obtain the conditional probability and the joint probability of each label, and obtaining the probability between different labels according to the formula p(A|B) = p(A, B)/p(B);
S22: constructing the correlation coefficient matrix A_ij:
A_ij = p(L_i | L_j) = p(L_i, L_j)/p(L_j)
where A_ij represents the probability that label i occurs when label j is present;
each label is then one-hot encoded to obtain the label relationship graph G2 = (V2, ε2, Z), where Z denotes the matrix of node vectors obtained by one-hot encoding, Z ∈ R^(C×d), and C and d respectively denote the number of label categories and the dimension of each label vector after encoding;
S23: semi-supervised learning is performed on the obtained label relationship graph G2 using the GCN, and the relationships between different nodes are mapped into vectors, wherein the target classifier to be learned is W ∈ R^(C×D) and the convolution formula is:
W^(l+1) = h(Â W^(l) Θ^(l))
where W^(l+1) is the multi-label classifier of layer l (with W^(0) = Z), Â is the normalized label correlation matrix, Θ^(l) is the learnable weight matrix of layer l, and h is a nonlinear activation function.
As a further improvement of the present invention, step S300 specifically includes:
S3: a dot-product operation is performed between the graph embedding vector X and the multi-label classifier W, expressed as W·X, to obtain the multi-label classification scores, and the final classification result Ŷ is then obtained through a nonlinear operation:
Ŷ = Sigmoid(W·X)
where Ŷ represents the multi-class prediction of the sample after training, and the Sigmoid activation function maps the multi-class scores into the interval [0, 1].
As a further improvement of the present invention, the optimization objective of step S400 is:
the label of each sample is Y = {y^1, y^2, …, y^C},
where y^c ∈ {0, 1}; 0 indicates that the sample does not have the corresponding behavioral characteristic and 1 indicates that it does;
the loss function of the model is the sum of multiple binary classification losses, expressed as:
L = -Σ_{c=1}^{C} [ y^c·log(ŷ^c) + (1 - y^c)·log(1 - ŷ^c) ]
and the loss value of the model is computed from the difference between the classification result of each label and its true value.
The beneficial effects of the invention are as follows: through analysis of the function call graph constructed from the assembly code of the binary file, the invention extracts the semantic information of each block in the function call graph with Word2Vec and extracts the structural information of the function call graph with a GCN, obtaining a semantic and graph-structure embedding of each binary file that effectively reflects the operations the binary performs at run time. Second, a multi-label classifier is constructed through multi-label classification by establishing the relationships among different labels, which fully accounts for the fact that malware may exhibit multiple types of behaviors, analyzes malware behavior comprehensively, and yields a good classification effect on malware carrying multiple labels.
Drawings
FIG. 1 is a flowchart of the multi-label malware classification method based on the graph convolution neural network of the present invention.
FIG. 2 is a flowchart of the function call graph feature extraction in FIG. 1.
FIG. 3 is a flowchart of obtaining the multi-label classifier in FIG. 1.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings of the embodiments of the present disclosure.
Referring to FIGS. 1-3, the multi-label malware classification method based on a graph convolution neural network of the present invention comprises the following steps:
S100: extracting the features of the function call graph: disassembling the original binary file to obtain assembly code, and extracting the semantic and structural features of the function call graph to obtain the graph embedding vector of the sample;
S200: extracting the features of the multi-label relationship: constructing a model for extracting label relationships from a label relationship graph to obtain a multi-label classifier;
S300: computing the dot product of the graph embedding vector and the multi-label classifier, and mapping the result of the dot product to obtain the classification result;
S400: constructing a multi-label loss function, and computing the loss value of the classification model from the difference between the classification result and the true result of each label.
Specifically, step S100 includes:
S1: extracting the semantic and structural features of the function call graph: disassembling the original binary file to obtain assembly code, and constructing a model for extracting the semantic and structural features of the function call graph, wherein the function call graph input to the model is represented as G1 = (V1, ε1, H^(0)), where V1 denotes the set of statement blocks of the function call graph, ε1 denotes the set of edges connecting the statement blocks of the function call graph, and H^(0) denotes the matrix of statement-block semantic features obtained by word embedding, expressed as H^(0) ∈ R^(n×r), where n is the number of statement blocks of the function call graph and r is the vector dimension of each statement block after word embedding;
the graph embedding vector X of the function call graph is then obtained through GCN training.
Further, step S1 specifically includes:
S11: obtaining the function call graph of the binary file: disassembling the original binary file to obtain assembly code, and analyzing the jump relationships in the assembly code to obtain the function call graph of the binary file G1 = (V1, ε1), where V1 is the set of statement-block nodes in the function call graph and ε1 is the set of edges connecting the statement blocks in the function call graph;
S12: counting all the statement blocks of the function call graph and extracting the opcodes in them as words, where each opcode corresponds to a word in an NLP task and each statement block corresponds to a sentence; the statement blocks are trained with a Word2Vec model to obtain the vector of each statement block, so that the function call graph is further expressed as G1 = (V1, ε1, H^(0)), where H^(0) ∈ R^(n×r) is the matrix formed by stacking the statement-block vectors;
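A minimal sketch of step S12 follows, assuming the gensim implementation of Word2Vec and the block_opcodes mapping from the previous sketch; the vector size of 64 and the mean pooling of opcode vectors into one block vector are illustrative choices rather than values fixed by the patent.

    # Hypothetical sketch of step S12: embed opcode "sentences" with Word2Vec.
    import numpy as np
    from gensim.models import Word2Vec

    # Each statement block is a sentence whose words are opcodes, e.g. ["push", "mov", "call"].
    sentences = list(block_opcodes.values())
    w2v = Word2Vec(sentences, vector_size=64, window=5, min_count=1, workers=4)

    # One r-dimensional vector per statement block (here the mean of its opcode vectors),
    # stacked into the node feature matrix H^(0) of shape (n, r).
    def block_vector(opcodes, model, dim=64):
        vecs = [model.wv[op] for op in opcodes if op in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    H0 = np.stack([block_vector(ops, w2v) for ops in block_opcodes.values()])
    print(H0.shape)   # (number of statement blocks n, embedding dimension r)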
S13: the obtained function call graph is processed by the GCN model of formula (1) to update the node information after each convolution layer:
H^(l+1) = f(H^(l), A)   (1)
wherein the Kipf graph convolution, which the self-attention graph pooling mechanism (SAGPool) also uses to compute attention scores, gives formula (2):
H^(l+1) = σ(D̃^(-1/2) Ã D̃^(-1/2) H^(l) W^(l))   (2)
where σ is a nonlinear activation function, Ã = A + I is the adjacency matrix with self-loops added, D̃ is the degree matrix of Ã, and W^(l) denotes the learnable weight of each layer;
the result of the graph convolution is globally pooled with SAGPool, and node selection is performed as idx = top-rank(s, ⌈kN⌉), where s is the self-attention score of each node computed with the graph convolution of formula (2), the pooling ratio k ∈ (0, 1) is a hyper-parameter that determines the number of nodes to retain, and N is the number of nodes; the globally pooled result is then obtained through a masking operation;
S14: a readout operation is applied to the globally pooled result to obtain the graph embedding vector X.
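The sketch below illustrates steps S13 and S14 with PyTorch Geometric: two Kipf graph-convolution layers, SAGPool node selection, and a readout that yields the graph embedding vector. The layer sizes, the pooling ratio of 0.5, and the concatenated mean/max readout are assumptions made for illustration and are not fixed by the patent.

    # Hypothetical sketch of steps S13-S14: GCN + SAGPool + readout for the function call graph.
    import torch
    from torch_geometric.nn import GCNConv, SAGPooling, global_mean_pool, global_max_pool

    class FCGEncoder(torch.nn.Module):
        def __init__(self, in_dim, hidden_dim, embed_dim, ratio=0.5):
            super().__init__()
            self.conv1 = GCNConv(in_dim, hidden_dim)        # formula (2): Kipf graph convolution
            self.conv2 = GCNConv(hidden_dim, hidden_dim)
            self.pool = SAGPooling(hidden_dim, ratio=ratio) # self-attention pooling keeps the top-scoring nodes
            self.lin = torch.nn.Linear(2 * hidden_dim, embed_dim)

        def forward(self, x, edge_index, batch):
            h = torch.relu(self.conv1(x, edge_index))
            h = torch.relu(self.conv2(h, edge_index))
            h, edge_index, _, batch, _, _ = self.pool(h, edge_index, batch=batch)
            # Readout: concatenate mean and max pooling over the retained nodes.
            g = torch.cat([global_mean_pool(h, batch), global_max_pool(h, batch)], dim=1)
            return self.lin(g)                              # graph embedding vector X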
In the present application, step S200 specifically includes:
S2: extracting the behavioral features of the multi-label relationship: constructing a model for extracting label relationships from a label relationship graph, wherein the input of the model is the label relationship graph, represented as G2 = (V2, ε2, Z), where V2 denotes the set of all labels of the samples, ε2 denotes the set of edges representing the label relationships, and Z denotes the matrix of node vectors obtained by one-hot encoding, expressed as Z ∈ R^(C×d), where C and d respectively denote the number of label categories and the dimension of each label vector after encoding;
after multi-label training, the multi-label classifier is obtained, represented as W ∈ R^(C×D), where D is the dimension of the graph embedding vector.
Further, step S2 specifically includes:
S21: counting the labels of the samples to obtain the conditional probability and the joint probability of each label, and obtaining the probability between different labels according to the formula p(A|B) = p(A, B)/p(B).
S22: constructing the correlation coefficient matrix A_ij:
A_ij = p(L_i | L_j) = p(L_i, L_j)/p(L_j)
where A_ij represents the probability that label i occurs when label j is present;
each label is then one-hot encoded to obtain the label relationship graph G2 = (V2, ε2, Z), where Z denotes the matrix of node vectors obtained by one-hot encoding, Z ∈ R^(C×d), and C and d respectively denote the number of label categories and the dimension of each label vector after encoding.
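A minimal sketch of steps S21 and S22 follows; the toy label matrix, the example label names in the comments, and the guard against division by zero are illustrative assumptions.

    # Hypothetical sketch of steps S21-S22: build A_ij = p(L_i | L_j) from multi-hot labels.
    import numpy as np

    # labels: shape (num_samples, C); labels[s, i] = 1 if sample s carries label i.
    labels = np.array([
        [1, 1, 0],   # e.g. ransomware + worm
        [1, 0, 1],   # e.g. ransomware + trojan
        [0, 1, 1],
        [1, 1, 1],
    ])

    count_joint = labels.T @ labels                 # count_joint[i, j]: samples carrying both label i and label j
    count_single = np.diag(count_joint)             # count_single[j]: samples carrying label j
    A = count_joint / np.maximum(count_single, 1)   # A[i, j] = p(label i | label j)

    Z = np.eye(labels.shape[1])                     # one-hot node features of the label relationship graph
    print(np.round(A, 2))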
S23: label relationship graph obtained using GCN pairs
Figure BDA0003249704780000085
Semi-supervised learning is carried out, different node relations are mapped into one vector,
wherein the object classifier of the learned function is
Figure BDA0003249704780000086
The convolution formula is:
Figure BDA0003249704780000087
wherein the content of the first and second substances,
Figure BDA0003249704780000088
namely a multi-label classifier of the layer l.
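The following sketch illustrates step S23: two graph-convolution layers over the label relationship graph turn the one-hot label nodes Z into a classifier matrix W of shape (C, D), where D matches the dimension of the graph embedding vector. The two-layer depth, the LeakyReLU activation, and the hidden size are assumptions for illustration.

    # Hypothetical sketch of step S23: W^(l+1) = h(A_hat @ W^(l) @ Theta^(l)) over the label graph.
    import torch

    class LabelGCN(torch.nn.Module):
        def __init__(self, num_labels, hidden_dim, embed_dim):
            super().__init__()
            self.theta1 = torch.nn.Linear(num_labels, hidden_dim, bias=False)  # Theta^(0)
            self.theta2 = torch.nn.Linear(hidden_dim, embed_dim, bias=False)   # Theta^(1)

        def forward(self, Z, A_hat):
            # Z: one-hot label features, shape (C, C); A_hat: normalized correlation matrix, shape (C, C).
            h = torch.nn.functional.leaky_relu(A_hat @ self.theta1(Z))
            return A_hat @ self.theta2(h)    # multi-label classifier W, shape (C, D)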
Step S300 specifically includes:
S3: a dot-product operation is performed between the graph embedding vector X and the multi-label classifier W, expressed as W·X, to obtain the multi-label classification scores, and the final classification result Ŷ is then obtained through a nonlinear operation:
Ŷ = Sigmoid(W·X)
where Ŷ represents the multi-class prediction of the sample after training, and the Sigmoid activation function maps the multi-class scores into the interval [0, 1].
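A short sketch of step S3, assuming X holds a batch of D-dimensional graph embeddings and W is the (C, D) classifier matrix produced by the label graph convolution:

    # Hypothetical sketch of step S3: per-label scores via dot product, then Sigmoid.
    import torch

    def classify(X, W):
        scores = X @ W.t()               # (batch, C) multi-label classification scores
        return torch.sigmoid(scores)     # each score mapped into [0, 1]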
The optimization objective of step S400 is as follows: the label of each sample is Y = {y^1, y^2, …, y^C},
where y^c ∈ {0, 1}; 0 indicates that the sample does not have the corresponding behavioral characteristic and 1 indicates that it does;
the loss function of the model is the sum of multiple binary classification losses, expressed as:
L = -Σ_{c=1}^{C} [ y^c·log(ŷ^c) + (1 - y^c)·log(1 - ŷ^c) ]
and the loss value of the model is computed from the difference between the classification result of each label and its true value.
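The loss above is the sum of per-label binary cross-entropy terms; the sketch below realizes it with PyTorch's BCELoss, with toy label and score tensors used purely for illustration.

    # Hypothetical sketch of step S400: sum of per-label binary classification losses.
    import torch

    criterion = torch.nn.BCELoss(reduction="sum")

    y_true = torch.tensor([[1., 0., 1.]])                          # ground-truth multi-hot labels (C = 3)
    y_pred = torch.tensor([[0.9, 0.2, 0.7]], requires_grad=True)   # Sigmoid scores from step S3
    loss = criterion(y_pred, y_true)
    loss.backward()                                                # drives training of both GCN branches
    print(loss.item())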
In summary, through analysis of the function call graph constructed from the assembly code of the binary file, the invention extracts the semantic information of each block in the function call graph with Word2Vec and then extracts the structural information of the function call graph with the GCN, obtaining a semantic and graph-structure embedding of each binary file that effectively reflects the operations the binary performs at run time. Second, a multi-label classifier is constructed through multi-label classification by establishing the relationships among different labels, which fully accounts for the fact that malware may exhibit multiple types of behaviors, analyzes malware behavior comprehensively, and yields a good classification effect on malware carrying multiple labels.
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims (7)

1. A multi-label classification method for malicious software based on a graph convolution neural network, characterized by comprising the following steps:
S100: extracting the features of the function call graph: disassembling the original binary file to obtain assembly code, and extracting the semantic and structural features of the function call graph to obtain the graph embedding vector of the sample;
S200: extracting the features of the multi-label relationship: constructing a model for extracting label relationships from a label relationship graph to obtain a multi-label classifier;
S300: computing the dot product of the graph embedding vector and the multi-label classifier, and mapping the result of the dot product to obtain the classification result;
S400: constructing a multi-label loss function, and computing the loss value of the classification model from the difference between the classification result and the true result of each label.
2. The multi-label malware classification method based on the graph convolution neural network according to claim 1, wherein step S100 specifically includes:
S1: extracting the semantic and structural features of the function call graph: disassembling the original binary file to obtain assembly code, and constructing a model for extracting the semantic and structural features of the function call graph, wherein the function call graph input to the model is represented as G1 = (V1, ε1, H^(0)), where V1 denotes the set of statement blocks of the function call graph, ε1 denotes the set of edges connecting the statement blocks of the function call graph, and H^(0) denotes the matrix of statement-block semantic features obtained by word embedding, expressed as H^(0) ∈ R^(n×r), where n is the number of statement blocks of the function call graph and r is the vector dimension of each statement block after word embedding;
the graph embedding vector X of the function call graph is then obtained through graph convolution neural network training.
3. The multi-label malware classification method based on the graph convolution neural network according to claim 2, wherein step S1 specifically includes:
S11: obtaining the function call graph of the binary file: disassembling the original binary file to obtain assembly code, and analyzing the jump relationships in the assembly code to obtain the function call graph of the binary file G1 = (V1, ε1), where V1 is the set of statement-block nodes in the function call graph and ε1 is the set of edges connecting the statement blocks in the function call graph;
S12: counting all the statement blocks of the function call graph and extracting the operation codes in them as words, where each operation code corresponds to a word in a natural language processing task and each statement block corresponds to a sentence; the statement blocks are trained with a word vectorization model to obtain the vector of each statement block, so that the function call graph is further expressed as G1 = (V1, ε1, H^(0)), where H^(0) ∈ R^(n×r);
S13: the obtained function call graph is processed by the graph convolution neural network model of formula (1) to update the node information after each convolution layer:
H^(l+1) = f(H^(l), A)   (1)
wherein the Kipf graph convolution, which the self-attention graph pooling mechanism also uses to compute attention scores, gives formula (2):
H^(l+1) = σ(D̃^(-1/2) Ã D̃^(-1/2) H^(l) W^(l))   (2)
where σ is a nonlinear activation function, Ã = A + I is the adjacency matrix with self-loops added, D̃ is the degree matrix of Ã, and W^(l) denotes the learnable weight of each layer;
the nodes after the graph convolution are globally pooled with the self-attention graph pooling mechanism, and node selection is performed as idx = top-rank(s, ⌈kN⌉), where s is the self-attention score of each node computed with the graph convolution of formula (2), the pooling ratio k ∈ (0, 1) is a hyper-parameter that determines the number of nodes to retain, and N is the number of nodes; the globally pooled result is obtained through a mask operation;
S14: a readout operation is applied to the globally pooled result to obtain the graph embedding vector X.
4. The multi-label malware classification method based on the graph convolution neural network according to claim 1, wherein step S200 specifically includes:
S2: extracting the behavioral features of the multi-label relationship: constructing a model for extracting label relationships from a label relationship graph, wherein the input of the model is the label relationship graph, represented as G2 = (V2, ε2, Z), where V2 denotes the set of all labels of the samples, ε2 denotes the set of edges representing the label relationships, and Z denotes the matrix of node vectors obtained by one-hot encoding, expressed as Z ∈ R^(C×d), where C and d respectively denote the number of label categories and the dimension of each label vector after encoding;
after multi-label training, the multi-label classifier is obtained, represented as W ∈ R^(C×D), where D is the dimension of the graph embedding vector.
5. The multi-label malware classification method based on the graph convolution neural network according to claim 4, wherein step S2 specifically includes:
S21: counting the labels of the samples to obtain the conditional probability and the joint probability of each label, and obtaining the probability between different labels according to the formula p(A|B) = p(A, B)/p(B);
S22: constructing the correlation coefficient matrix A_ij:
A_ij = p(L_i | L_j) = p(L_i, L_j)/p(L_j)
where A_ij represents the probability that label i occurs when label j is present;
each label is then one-hot encoded to obtain the label relationship graph G2 = (V2, ε2, Z), where Z denotes the matrix of node vectors obtained by one-hot encoding, Z ∈ R^(C×d), and C and d respectively denote the number of label categories and the dimension of each label vector after encoding;
S23: semi-supervised learning is performed on the obtained label relationship graph G2 using the graph convolution neural network, and the relationships between different nodes are mapped into vectors, wherein the target classifier to be learned is W ∈ R^(C×D) and the convolution formula is:
W^(l+1) = h(Â W^(l) Θ^(l))
where W^(l+1) is the multi-label classifier of layer l (with W^(0) = Z), Â is the normalized label correlation matrix, Θ^(l) is the learnable weight matrix of layer l, and h is a nonlinear activation function.
6. The multi-label malware classification method based on the graph convolution neural network according to claim 1, wherein step S300 specifically includes:
S3: a dot-product operation is performed between the graph embedding vector X and the multi-label classifier W, expressed as W·X, to obtain the multi-label classification scores, and the final classification result Ŷ is then obtained through a nonlinear operation:
Ŷ = Sigmoid(W·X)
where Ŷ represents the multi-class prediction of the sample after training, and the Sigmoid activation function maps the multi-class scores into the interval [0, 1].
7. The multi-label malware classification method based on the graph convolution neural network according to claim 1, wherein the optimization objective of step S400 is:
the label of each sample is Y = {y^1, y^2, …, y^C},
where y^c ∈ {0, 1}; 0 indicates that the sample does not have the corresponding behavioral characteristic and 1 indicates that it does;
the loss function of the model is the sum of multiple binary classification losses, expressed as:
L = -Σ_{c=1}^{C} [ y^c·log(ŷ^c) + (1 - y^c)·log(1 - ŷ^c) ]
and the loss value of the model is computed from the difference between the classification result of each label and its true value.
CN202111042100.7A 2021-09-07 2021-09-07 Malicious software multi-label classification method based on graph convolution neural network Active CN113821799B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111042100.7A CN113821799B (en) 2021-09-07 2021-09-07 Malicious software multi-label classification method based on graph convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111042100.7A CN113821799B (en) 2021-09-07 2021-09-07 Malicious software multi-label classification method based on graph convolution neural network

Publications (2)

Publication Number Publication Date
CN113821799A true CN113821799A (en) 2021-12-21
CN113821799B CN113821799B (en) 2023-07-28

Family

ID=78921925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111042100.7A Active CN113821799B (en) 2021-09-07 2021-09-07 Malicious software multi-label classification method based on graph convolution neural network

Country Status (1)

Country Link
CN (1) CN113821799B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114373484A (en) * 2022-03-22 2022-04-19 南京邮电大学 Voice-driven small sample learning method for Parkinson disease multi-symptom characteristic parameters
CN114640502A (en) * 2022-02-17 2022-06-17 南京航空航天大学 Android malicious software detection method and detection system based on traffic fingerprint and graph data characteristics
CN115758370A (en) * 2022-09-09 2023-03-07 中国人民解放军军事科学院系统工程研究院 Software source code defect detection method, device and storage medium
CN117610002A (en) * 2024-01-22 2024-02-27 南京众智维信息科技有限公司 Multi-mode feature alignment-based lightweight malicious software threat detection method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487143A (en) * 2020-11-30 2021-03-12 重庆邮电大学 Public opinion big data analysis-based multi-label text classification method
CN112966271A (en) * 2021-03-18 2021-06-15 中山大学 Malicious software detection method based on graph convolution network
US20210271822A1 (en) * 2020-02-28 2021-09-02 Vingroup Joint Stock Company Encoder, system and method for metaphor detection in natural language processing

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210271822A1 (en) * 2020-02-28 2021-09-02 Vingroup Joint Stock Company Encoder, system and method for metaphor detection in natural language processing
CN112487143A (en) * 2020-11-30 2021-03-12 重庆邮电大学 Public opinion big data analysis-based multi-label text classification method
CN112966271A (en) * 2021-03-18 2021-06-15 中山大学 Malicious software detection method based on graph convolution network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Xinyan et al., "A Weighted Graph Convolutional Neural Network Method for Detecting Rumors on Sina Weibo", Journal of Chinese Computer Systems (小型微型计算机系统), vol. 42, no. 8 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114640502A (en) * 2022-02-17 2022-06-17 南京航空航天大学 Android malicious software detection method and detection system based on traffic fingerprint and graph data characteristics
CN114373484A (en) * 2022-03-22 2022-04-19 南京邮电大学 Voice-driven small sample learning method for Parkinson disease multi-symptom characteristic parameters
CN115758370A (en) * 2022-09-09 2023-03-07 中国人民解放军军事科学院系统工程研究院 Software source code defect detection method, device and storage medium
CN115758370B (en) * 2022-09-09 2024-06-25 中国人民解放军军事科学院系统工程研究院 Software source code defect detection method, device and storage medium
CN117610002A (en) * 2024-01-22 2024-02-27 南京众智维信息科技有限公司 Multi-mode feature alignment-based lightweight malicious software threat detection method
CN117610002B (en) * 2024-01-22 2024-04-30 南京众智维信息科技有限公司 Multi-mode feature alignment-based lightweight malicious software threat detection method

Also Published As

Publication number Publication date
CN113821799B (en) 2023-07-28

Similar Documents

Publication Publication Date Title
CN113821799B (en) Malicious software multi-label classification method based on graph convolution neural network
CN111371806B (en) Web attack detection method and device
CN110084296B (en) Graph representation learning framework based on specific semantics and multi-label classification method thereof
Hoogeboom et al. Argmax flows and multinomial diffusion: Learning categorical distributions
CN111274134B (en) Vulnerability identification and prediction method, system, computer equipment and storage medium based on graph neural network
Torralba et al. Contextual models for object detection using boosted random fields
CN113596007B (en) Vulnerability attack detection method and device based on deep learning
CN113010683B (en) Entity relationship identification method and system based on improved graph attention network
CN113535964B (en) Enterprise classification model intelligent construction method, device, equipment and medium
CN111931935A (en) Network security knowledge extraction method and device based on One-shot learning
CN115033890A (en) Comparison learning-based source code vulnerability detection method and system
CN107908757B (en) Website classification method and system
CN115344863A (en) Malicious software rapid detection method based on graph neural network
CN115017879A (en) Text comparison method, computer device and computer storage medium
CN114329474A (en) Malicious software detection method integrating machine learning and deep learning
CN116956289B (en) Method for dynamically adjusting potential blacklist and blacklist
CN113378178A (en) Deep learning-based graph confidence learning software vulnerability detection method
CN113392929B (en) Biological sequence feature extraction method based on word embedding and self-encoder fusion
CN115310019A (en) Webpage classification method and device, electronic equipment and storage medium
CN113343235B (en) Application layer malicious effective load detection method, system, device and medium based on Transformer
CN114817516A (en) Sketch mapping method, device and medium based on reverse matching under zero sample condition
CN114742572A (en) Abnormal flow identification method and device, storage medium and electronic device
CN112800435A (en) SQL injection detection method based on deep learning
CN113971282A (en) AI model-based malicious application program detection method and equipment
CN112770323A (en) Mobile malicious application family classification method based on network traffic space time characteristics

Legal Events

Code - Description
PB01 - Publication
SE01 - Entry into force of request for substantive examination
GR01 - Patent grant