CN113360912A - Malicious software detection method, device, equipment and storage medium - Google Patents

Info

Publication number
CN113360912A
Authority
CN
China
Prior art keywords
software
graph
detected
function call
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110905738.2A
Other languages
Chinese (zh)
Inventor
贾鹏
王炎
方勇
吴小王
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202110905738.2A priority Critical patent/CN113360912A/en
Publication of CN113360912A publication Critical patent/CN113360912A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The embodiment of the application provides a malicious software detection method, device, equipment and storage medium, which relate to the technical field of network information security. The method comprises the following steps: statically analyzing a binary file of the software to be detected to obtain an assembly code and a function call graph of the software to be detected; converting the assembly code to obtain a semantic feature vector of each function in the software to be detected; combining the semantic feature vector and the function call graph to generate an attribute function call graph; and inputting the attribute function call graph into a graph neural network classification model to obtain malicious attribute information of the software to be detected. The method automatically extracts the semantic features and the structural features of the binary program and judges them jointly through the graph neural network. It thereby addresses the incomplete feature representation, high false alarm rate and high missed alarm rate of existing detection methods, so that malicious software can be detected quickly and accurately.

Description

Malicious software detection method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of network information security, in particular to a malicious software detection method, a malicious software detection device, malicious software detection equipment and a malicious software detection storage medium.
Background
The rapid development of computer technology has made people's daily life more convenient, but it has also driven continuous improvement in hackers' attack means and techniques. As a result, network attacks are increasingly serious, the number of security threats in cyberspace grows year by year, and malicious software is the largest category among all security threats. Malware refers to programs that cause damage to, or perform undesirable operations on, a computer system. An attacker can attack a computer system through malicious software to achieve privilege elevation, remote control, privacy theft and the like, and can then attack other terminals in the computer network.
Early malware was simple in structure and did not employ complex protection techniques, so security vendors could detect it using techniques such as signatures. However, as computer technology has advanced, malware has continually adopted new attack methods. First, a new generation of malware may use multiple different processes simultaneously and use obfuscation techniques to hide itself so that it can persist in the system. Second, in order to bypass or disable various security mechanisms, new malware uses more complex code, structures and techniques, making it more destructive and more difficult to detect.
Existing malicious software detection methods cannot deal with unknown malicious software, and they produce many false alarms when detecting the new generation of malicious software. Therefore, in order to solve the increasingly serious cyberspace security problem, introducing new and effective malware detection methods has become increasingly urgent.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for detecting malicious software, and aims to solve at least one technical problem.
A first aspect of an embodiment of the present application provides a method for detecting malware, where the method includes:
statically analyzing a binary file of the software to be detected to obtain an assembly code and a function call graph of the software to be detected;
converting the assembly code to obtain a semantic feature vector of each function in the software to be detected;
combining the semantic feature vector and the function call graph to generate an attribute function call graph;
and inputting the attribute function call graph into a graph neural network classification model to obtain malicious attribute information of the software to be detected.
Optionally, the statically analyzing the binary file of the software to be detected to obtain the assembly code and the function call graph of the software to be detected includes:
judging whether the binary file adopts a shell adding (packing) technology;
when the binary file adopts a shell adding technology, processing the binary file by using an automatic shelling (unpacking) technology;
disassembling the unshelled binary file to obtain an assembly code of the software to be detected;
and constructing a function call graph of the software to be detected according to the assembly code.
Optionally, the converting the assembly code to obtain a semantic feature vector of each function in the software to be detected includes:
normalizing the assembly code to obtain a normalized assembly code;
converting the normalized assembly code of each function into a plurality of lexical units (tokens) with a standard format;
mapping each token to a vector representation;
and aggregating the vector representations of all tokens in the function to obtain the semantic feature vector of the function.
Optionally, the graph neural network classification model includes a graph neural network layer, a full connection layer and an activation layer, wherein the graph neural network layer is followed by the full connection layer, and the full connection layer is followed by the activation layer;
the step of inputting the attribute function call graph into the graph neural network classification model to obtain malicious attribute information of the software to be detected comprises the following steps:
inputting the attribute function call graph into a graph neural network layer to obtain program embedded vector representation;
the full connection layer is used for judging the nonlinear relation between the program embedded vector representation and the malicious attribute;
and the activation layer is used for predicting malicious attribute information of the software to be detected according to the nonlinear relation.
A second aspect of the embodiments of the present application provides a malware detection apparatus, where the apparatus includes:
the static analysis information extractor is used for statically analyzing the binary file of the software to be detected to obtain the assembly code and the function call graph of the software to be detected;
the semantic feature extractor is used for converting the assembly code to obtain a semantic feature vector of each function in the software to be detected;
the structural feature combiner is used for combining the semantic feature vector and the function call graph to generate an attribute function call graph;
and the graph neural network classification model is used for detecting the malicious attribute information of the software to be detected according to the attribute function call graph.
Optionally, the static analysis information extractor includes:
the shell checking unit is used for judging whether the binary file adopts a shell adding (packing) technology;
the shelling unit is used for processing the binary file by utilizing an automatic shelling (unpacking) technology when the binary file adopts a shell adding technology;
the disassembling unit is used for disassembling the unshelled binary file to obtain an assembly code of the software to be detected;
and the function call graph extracting unit is used for constructing the function call graph of the software to be detected according to the assembly code.
Optionally, the semantic feature extractor includes:
the normalization submodule is used for carrying out normalization processing on the assembly code to obtain a normalized assembly code;
the lexical unit token conversion sub-module is used for converting the normalized assembly codes of each function into a plurality of lexical unit tokens with standard formats;
a vector representation conversion submodule for mapping each token to a vector representation;
and the aggregation submodule is used for aggregating the vector representation of all tokens in the function to obtain the semantic feature vector of the function.
Optionally, the graph neural network classification model includes a graph neural network layer, a full connection layer and an activation layer, wherein the graph neural network layer is followed by the full connection layer, and the full connection layer is followed by the activation layer;
the graph neural network layer is used for receiving the attribute function call graph and acquiring program embedded vector representation;
the full connection layer is used for receiving the program embedded vector representation transmitted by the graph neural network layer and judging the nonlinear relation between the program embedded vector representation and the malicious attribute;
and the activation layer is used for receiving the nonlinear relation transmitted by the full connection layer and predicting malicious attribute information of the software to be detected according to the nonlinear relation.
A third aspect of embodiments of the present application provides a readable storage medium, on which a computer program is stored, which, when executed by a processor, implements a method as described in the first aspect of the present application.
A fourth aspect of the embodiments of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the method according to the first aspect of the present application.
With the malicious software detection method provided by the application, the detection object is a binary program, so no program source code is needed and no heavyweight program analysis technology is relied on. The semantic feature vectors and the function call graph of the binary program are extracted automatically and quickly. The extracted function feature vectors represent the semantic features of the program, and the extracted function call graph represents its structural features. These two deep features are combined and used for prediction by a graph neural network classification model to obtain malicious attribute information, so that malicious software can be detected quickly and accurately, and the problems of incomplete feature representation, high false alarm rate and high missed alarm rate in existing detection methods are solved.
In addition to the above advantages, the method of the present application has the following advantages. First, because the detection object is a binary program and the method does not depend on program source code, it meets the requirements of practical application; it supports both PE files under Windows and ELF files under Linux, and can therefore detect most binary files encountered in practice. Second, the method does not require professional knowledge of the binary field and extracts binary file features automatically, saving a large amount of manpower and material resources. It detects malicious software with a deep learning method whose detection effect can be further improved through incremental training, and compared with traditional malicious software detection methods it saves computing resources and is more general.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1 is a flowchart of a malware detection method according to an embodiment of the present application;
Fig. 2 is a function call graph according to an embodiment of the present application;
Fig. 3 is a diagram illustrating a combination of a function call graph and function semantic feature vectors according to an embodiment of the present application;
Fig. 4 is a schematic diagram of functional modules of a malware detection apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Malware detection techniques have evolved from signature-based methods, through traditional machine learning algorithms, to today's deep learning methods. Signature-based methods detect quickly but cannot cope with unknown malware. Traditional machine learning-based methods can handle large-scale suspicious samples, but rely on manually designed and extracted features. This requires in-depth analysis of malware by analysts with advanced programming knowledge, deep file system knowledge, strong code inspection skills and reverse engineering skills. However, the number of experts in the field is insufficient to cope with the increasing number of malware attacks.
Existing detection methods based on deep learning can automatically extract features from the original input for detection. For example, one method establishes a control flow graph for each function in a program, collects the control flow graphs of all functions into a master control flow graph, and uses a graph neural network on the master control flow graph to judge whether the program is malicious software. Another method, based on grayscale images, converts the original binary file into a grayscale image and judges from the image whether the file is malicious software. Existing deep learning detection methods therefore depend too heavily on training data and use only a single shallow feature, so they suffer from relatively high false alarm and missed alarm rates.
In order to solve the problems existing in the malicious software detection, the application provides a new malicious software detection method, a lightweight static analysis technology is used for extracting the structural information and the semantic information of the malicious software, the binary software is detected based on the structural information and the semantic information through a deep learning method, and the malicious software can be quickly and accurately detected.
Referring to fig. 1, fig. 1 is a flowchart of a malware detection method according to an embodiment of the present application. As shown in fig. 1, the method comprises the steps of:
Step S110, statically analyzing a binary file of the software to be detected to obtain an assembly code and a function call graph of the software to be detected.
Static analysis is performed on the binary file of the software to be detected to extract the detailed assembly code contained in the program. The calling relations between functions are then extracted from the obtained assembly code to construct a function call graph.
In an optional embodiment of the present application, the statically analyzing a binary file of software to be detected to obtain an assembly code and a function call graph of the software to be detected includes:
1. Judging whether the binary file adopts a shell adding (packing) technology;
Shell adding, also known as executable program resource compression, prevents external programs or software from disassembling or dynamically analyzing the program, so that the original program inside the protective shell is not damaged by external programs and continues to run normally. A common approach is to embed a piece of code in the binary program; when the program runs, this code first obtains control of the program and then returns control to the original code, thereby hiding the true OEP (original entry point) of the program. Malware typically uses this approach to evade scanning.
2. When the binary file adopts a shell adding technology, processing the binary file by using an automatic shelling (unpacking) technology;
When the binary file of the software to be detected is found to adopt a shell adding technology, the shell is removed from the binary file; the shelling method can be selected according to the shell adding method that was used. If no shell has been added, no shelling is performed.
3. Disassembling the unshelled binary file to obtain an assembly code of the software to be detected;
The unshelled binary file is converted into assembly code. A linear scanning algorithm and a recursive scanning algorithm are used to read the binary content at each logical address (such as 004011E3), and the binary content read (such as 1000100111011) is converted into assembly instructions such as push and mov.
4. Constructing a function call graph of the software to be detected according to the assembly code.
From the extracted assembly code, all functions are identified by recognizing boundary information such as the head and the tail of each function, and the calling relations between functions are obtained from the transfer (call and jump) instructions. A function call graph is then constructed, which represents the calling relations between the functions inside the program. Fig. 2 shows the function call graph of one program: the main function calls the copy function, and the copy function calls the four functions copy_fifo, copy_file, copy_specific and setfile. A function call graph differs from a control flow graph, which shows the execution process of a program and focuses on information such as the program's execution state.
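As an illustration of step 4, the following is a minimal sketch of how a function call graph might be built from a disassembly listing. It assumes a simplified, hypothetical listing format in which a line ending in ":" marks the start of a function and "call <name>" is the only transfer instruction of interest; real disassembler output is more varied, and the function name `build_call_graph` is introduced here only for illustration.

```python
from collections import defaultdict

def build_call_graph(asm_lines):
    """Return {caller: set(callees)} from a simplified assembly listing.

    Assumed (hypothetical) format: a line ending in ':' opens a function,
    and 'call <name>' transfers control to another function.
    """
    graph = defaultdict(set)
    current = None
    for raw in asm_lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            current = line[:-1]
            graph[current]  # register the function even if it calls nothing
            continue
        parts = line.split()
        if current is not None and parts[0] == "call" and len(parts) > 1:
            graph[current].add(parts[1])
            graph[parts[1]]  # make sure the callee is also a node
    return dict(graph)

# Toy listing loosely mirroring the call relations described for Fig. 2.
listing = [
    "main:", "push ebp", "call copy", "retn",
    "copy:", "call copy_fifo", "call copy_file", "call setfile", "retn",
    "copy_fifo:", "retn",
    "copy_file:", "retn",
    "setfile:", "retn",
]
call_graph = build_call_graph(listing)
```

Representing the graph as a mapping from caller to a set of callees keeps duplicate calls from producing duplicate edges, which matches the idea that the graph records calling relations rather than call counts.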
Step S120, converting the assembly code to obtain the semantic feature vector of each function in the software to be detected.
After the assembly code of each function in the software to be detected is obtained, the assembly code of each function is converted to obtain the semantic feature vector of that function.
In an embodiment of the present application, the converting the assembly code to obtain a semantic feature vector of each function in the software to be detected includes:
A. normalizing the assembly code to obtain a normalized assembly code;
Since the numerical constants and character strings used by each program are different, and these values are not associated with the semantic features of what a function implements but only reflect the programmer's usage habits, including them would not only interfere with malicious software detection but also enlarge the bag of words in the word vector model, which is not conducive to extending the detection method. Therefore, the assembly code needs to be normalized according to the following rule: replace each numerical constant in an assembly instruction with the character "N", replace each character string in an assembly instruction with the character "M", and then connect the operation code and the operands of the assembly instruction with the symbol "-". For example, two original assembly instructions (shown as figures in the original publication; the images are not reproduced here) are normalized according to this rule to obtain the normalized assembly codes push-N and call-M.
B. Converting the normalized assembly code of each function into a plurality of lexical units token with standard formats;
For each function in the program, the normalized assembly code is decomposed into a plurality of tokens. A token (rendered in some contexts as "lexical unit") is the product of structurally scanning the function and represents a syntactic structure; according to the structure of the function, the assembly code is converted into tokens that conform to a format specification. For example, each normalized line of assembly instruction is used directly as one token, giving a token set (push-ebp, mov-ebp-esp, sub-esp-N, push-N, retn, ...).
C. Mapping each token to a vector representation.
The token set of each function is input into a preset word vector model, and each token in the set is mapped to a vector representation. For example, the token set in step B is mapped to a set of vector representations (shown as a figure in the original publication; the image is not reproduced here).
D. And aggregating the vector representations of all tokens in the function to obtain the semantic feature vector of the function.
For each function in the program, the vector representations of all tokens in the function are aggregated to generate the semantic feature vector of the function. The aggregation may use an average, a weighted calculation or the like. For example, aggregating the vector representations from step C yields the semantic feature vector of the function (shown as a figure in the original publication; the image is not reproduced here).
The word vector model is trained in advance: a basic Word2Vec word vector model is trained on an assembly code corpus, so that each token can be mapped into the language space of assembly code and represented as a vector.
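The normalization rule of step A and the one-instruction-per-token convention of step B can be sketched as follows. This is a simplified illustration, not the applicant's implementation: it assumes operands are comma-separated, that numeric constants look like `0x1F`, `18h` or `12`, and that strings are double-quoted without embedded commas. The helper names are hypothetical.

```python
import re

# Assumed numeric literal forms: 0x1F, 18h, 12 (a simplification).
NUMBER = re.compile(r"^(0x[0-9a-fA-F]+|[0-9][0-9a-fA-F]*h?)$")

def normalize_instruction(instr):
    """Replace numeric constants with 'N' and quoted strings with 'M',
    then join the opcode and operands with '-'."""
    opcode, _, rest = instr.strip().partition(" ")
    parts = [opcode.lower()]
    for operand in rest.split(","):
        operand = operand.strip()
        if not operand:
            continue
        if operand.startswith('"') and operand.endswith('"'):
            parts.append("M")
        elif NUMBER.match(operand):
            parts.append("N")
        else:
            parts.append(operand.lower())
    return "-".join(parts)

def tokenize_function(instructions):
    """Use each normalized instruction line directly as one token."""
    return [normalize_instruction(line) for line in instructions]

tokens = tokenize_function(
    ["push ebp", "mov ebp, esp", "sub esp, 18h", "push 0Ch", "retn"]
)
```

With this toy input the resulting token list corresponds to the example token set given in step B.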
The semantic feature vector of the function contains the semantic features of the function, which represent the syntax, linguistic meaning, surface features, etc. of the function.
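Steps C and D can be illustrated with a small sketch that looks up each token in an embedding table and averages the results. The table below is a hand-written stand-in with made-up values for a trained Word2Vec model, and skipping unknown tokens is an assumption of this sketch rather than something the text specifies.

```python
def function_vector(tokens, embeddings, dim):
    """Average the vectors of all known tokens in one function.

    Tokens missing from the embedding table are skipped (an assumption);
    a function with no known tokens gets the zero vector.
    """
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count == 0:
        return total
    return [t / count for t in total]

# Hypothetical 2-dimensional embeddings; real models use far more dimensions.
toy_embeddings = {
    "push-ebp": [1.0, 0.0],
    "mov-ebp-esp": [0.0, 1.0],
    "retn": [2.0, 2.0],
}
vec = function_vector(
    ["push-ebp", "mov-ebp-esp", "retn", "unknown-token"], toy_embeddings, dim=2
)
```

Averaging gives every function a vector of the same length regardless of how many instructions it contains, which is what allows the vectors to be attached to graph nodes later.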
Step S130, combining the semantic feature vector and the function call graph to generate an attribute function call graph.
The function semantic feature vectors generated in step S120 are combined with the program function call graph generated in step S110 to generate a function call graph with attribute information, the attribute function call graph (AFCG). Fig. 3 is a schematic diagram of combining a function call graph with function semantic feature vectors: the function semantic features are combined with the program structure features to generate the attribute function call graph. The attribute function call graph is high-dimensional sparse graph-structured data in which nodes represent functions, each node carries attribute features, and edges between nodes represent calling relations between functions.
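A minimal sketch of the combination step might look like the following. The `AttributedCallGraph` class and `build_afcg` function are hypothetical names, and giving a zero vector to functions that have no semantic vector (for example, callees whose bodies were not disassembled) is an assumption of this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class AttributedCallGraph:
    features: dict = field(default_factory=dict)  # function name -> vector
    edges: set = field(default_factory=set)       # (caller, callee) pairs

def build_afcg(call_graph, function_vectors, dim):
    """Attach each function's semantic vector to its node in the call graph."""
    afcg = AttributedCallGraph()
    for caller, callees in call_graph.items():
        for name in (caller, *callees):
            # Assumption: functions without a vector get an all-zero attribute.
            afcg.features.setdefault(name, function_vectors.get(name, [0.0] * dim))
        for callee in callees:
            afcg.edges.add((caller, callee))
    return afcg

toy_graph = {"main": {"copy"}, "copy": {"setfile"}}
toy_vectors = {"main": [0.1, 0.2], "copy": [0.3, 0.4]}
afcg = build_afcg(toy_graph, toy_vectors, dim=2)
```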
Step S140, inputting the attribute function call graph into the graph neural network classification model to obtain malicious attribute information of the software to be detected.
The attribute function call graph is input into the graph neural network classification model. First, the model generates a program embedded vector representation from the attribute function call graph; this is a vector representation derived from the attribute function call graph (an example is shown as a figure in the original publication; the image is not reproduced here).
The program embedded vector representation combines the semantic features and the structural features of the program. The model then generates malicious attribute information, such as 0 or 1, from the program embedded vector representation. Once the malicious attribute information is obtained, whether the software to be detected is a malicious program can be determined according to how the malicious attribute information was defined during model training (for example, 0 represents a normal program and 1 represents a malicious program).
In one embodiment of the present application, the graph neural network classification model includes a graph neural network layer, a full connection layer and an activation layer, wherein the graph neural network layer is followed by the full connection layer, and the full connection layer is followed by the activation layer.
The step of inputting the attribute function call graph into the graph neural network classification model to obtain malicious attribute information of the software to be detected comprises the following steps:
inputting the attribute function call graph into the graph neural network layer to obtain a program embedded vector representation;
A graph network is a general deep learning architecture that operates on graph-based data; it is a generalized neural network based on graph structure. Graph networks typically use the underlying graph as the computational graph and learn neural network primitives that pass, transform and aggregate node feature information across the graph to generate an embedded vector for each node.
The graph neural network layer adopts a Struc2vec graph neural network structure, and an attention mechanism is added to each layer of the graph neural network, so that node information is weighted by attention during aggregation. The output of the graph neural network layer is a program embedded vector containing both semantic information and structural information.
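To make the idea of attention-weighted aggregation concrete, the following sketch performs one generic message-passing round and then pools node vectors into a program-level vector. It is not the Struc2vec structure named above: the dot-product attention scores, the residual update and the mean pooling are all assumptions chosen for brevity, and there are no learned parameters.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_round(features, edges):
    """One round: each node adds an attention-weighted average of the
    vectors of the functions it calls (dot-product scores, an assumption)."""
    updated = {}
    for node, vec in features.items():
        neighbors = [features[dst] for src, dst in edges if src == node]
        if not neighbors:
            updated[node] = list(vec)
            continue
        scores = [sum(a * b for a, b in zip(vec, nb)) for nb in neighbors]
        weights = softmax(scores)
        message = [sum(w * nb[i] for w, nb in zip(weights, neighbors))
                   for i in range(len(vec))]
        updated[node] = [v + m for v, m in zip(vec, message)]
    return updated

def program_embedding(features):
    """Mean-pool all node vectors into one program-level vector."""
    dim = len(next(iter(features.values())))
    n = len(features)
    return [sum(vec[i] for vec in features.values()) / n for i in range(dim)]

toy_features = {"main": [1.0, 0.0], "a": [1.0, 0.0], "b": [1.0, 0.0]}
toy_edges = [("main", "a"), ("main", "b")]
updated = attention_round(toy_features, toy_edges)
embedding = program_embedding(updated)
```

In a trained model the attention scores and the update would be computed with learned weight matrices, and several rounds would be stacked so that information propagates along longer call chains.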
The structural features capture the relations between the parts of a program during execution and can represent information related to program functionality. Because malicious software often needs to perform specific functions that differ from those of benign programs, its structural features often differ from those of benign programs, and software from different malicious families also has different features. However, because obfuscation techniques or malware authors may alter the code, using structural features alone may lead to a higher false alarm rate. Conversely, semantic features alone are not enough to deal with the variety of complex malicious software.
The method therefore considers structural features together with semantic features. No matter how the code changes, the semantics of the specific functions that a program must execute remain unchanged. The structural features and the semantic features of the program are combined through the attribute function call graph and the graph neural network to generate the program embedded vector representation used for judgment. Compared with training on structural features and semantic features separately, this extracts richer and more accurate hidden information and improves the performance of malicious software detection.
The full connection layer is used for judging the nonlinear relation between the program embedded vector representation and the malicious attribute;
and the activation layer is used for predicting malicious attribute information of the software to be detected according to the nonlinear relation.
The full connection layer is followed by the activation layer, and the two layers jointly form a discriminator. The program embedded vector representation is input into the full connection layer. In the training stage, the full connection layer learns the relationship between malicious attributes and input features; after receiving the program embedded vector from the graph neural network layer, it judges the nonlinear relation between the current input features and the malicious attribute according to the learned relationship. The activation layer applies an activation to the nonlinear relation output by the full connection layer and outputs the malicious attribute information of the software to be detected. The attribute information is preset and can represent various kinds of malicious information about the software to be detected, such as whether it is malicious software and which malicious family it belongs to.
The discriminator can be a binary classification model or a multi-class classification model. When a binary classification model is adopted, the software to be detected is judged to be normal software or malware. For example, the graph neural network layer outputs program embedding vector representations such as
[formula image: example program embedding vectors]
These vectors are then input into the binary discriminator one at a time; an output of 0 indicates that the program is a normal program, and an output of 1 indicates that the program is a malicious program. When a multi-class model is used, it determines the malicious family of the software to be detected. For example, a program embedding vector representation such as
[formula image: example program embedding vector]
is input into the multi-class classifier, which outputs the probability of each malicious family, such as (A: 0.1, B: 0.2, C: 0.6, D: 0.1). The malicious family with the highest probability is selected as the prediction result; in this example the malware is assigned to malicious family C. With these two kinds of classifier, the method can complete both the malware detection task and the malware family classification task.
Through flexible selection of the classifier, the method supports the malware detection task and can also classify malware families. Whether to use binary or multi-class classification in practice should be decided according to the specific situation. In more complex scenarios, for example when the malware family needs to be determined, multi-class classification is required. Whichever is chosen, the classification model must be trained before use.
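The way the two discriminator outputs are read can be sketched as follows. This is only an illustrative sketch: the function names and the 0.5 decision threshold are assumptions and do not come from the application; the family probabilities are the example values given above.

```python
from typing import Dict


def interpret_binary(score: float, threshold: float = 0.5) -> int:
    """Map a binary discriminator score to 0 (normal) or 1 (malicious).

    The 0.5 threshold is an assumption made for this sketch.
    """
    return 1 if score >= threshold else 0


def interpret_family(probabilities: Dict[str, float]) -> str:
    """Return the malicious family with the highest predicted probability."""
    return max(probabilities, key=probabilities.get)


# Example values taken from the description above.
family_probabilities = {"A": 0.1, "B": 0.2, "C": 0.6, "D": 0.1}
predicted_family = interpret_family(family_probabilities)
```

Switching between the two tasks then amounts to choosing how many output classes the discriminator has and which of the two helpers reads its output.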
With the malware detection method provided by the application, the detection object is a binary program. No program source code is needed and no heavyweight program analysis technology is relied upon. The semantic feature vectors and the function call graph of the binary program are extracted automatically, and feature extraction is fast. The extracted function feature vectors and the extracted function call graph respectively represent the semantic features and the structural features of the program. These two kinds of deep features are combined and passed to a graph neural network classification model to predict the malicious attribute information, so that malware can be detected quickly and accurately. This addresses the problems of incomplete feature representation, high false positive rate and high false negative rate in existing detection methods.
In addition to the above advantages, the method of the present application has the following advantages. First, because the detection object is a binary program and the method does not depend on program source code, it meets the requirements of practical application. It supports both PE files under Windows and ELF files under Linux, and can therefore detect most binary files encountered in practice. Second, the method does not require expert knowledge of the binary analysis field and extracts binary file features automatically, which saves a large amount of manpower and material resources. It uses deep learning to detect malware, and the detection performance of the model can be further improved through incremental training. Compared with traditional malware detection methods, it saves computing resources and is more widely applicable.
Based on the same inventive concept, an embodiment of the present application provides a malware detection apparatus. Referring to fig. 4, fig. 4 is a schematic diagram of a malware detection apparatus according to an embodiment of the present application. As shown in fig. 4, the apparatus includes:
the static analysis information extractor 410 is used for statically analyzing the binary file of the software to be detected to obtain the assembly code and the function call graph of the software to be detected;
the static analysis information extractor is responsible for extracting assembly code and function call graphs in the target binary program.
In an embodiment of the present application, the static analysis information extractor includes:
the shell checking unit is used for judging whether the binary file has been packed, that is, whether it adopts a shell adding technology;
the shelling unit is used for unpacking the binary file by means of an automatic shelling technology when the binary file adopts a shell adding technology;
the disassembling unit is used for disassembling the unshelled binary file to obtain an assembly code of the software to be detected;
and the function call graph extracting unit is used for constructing the function call graph of the software to be detected according to the assembly code.
The static analysis information extractor consists of a shell-checking and shelling unit, a disassembling unit and a function call graph extraction unit.
The shell checking and shelling unit is responsible for identifying binary programs that are protected by a shell adding (packing) technology and unpacking them for further analysis. It performs shell checking on the target binary program; if the binary program is found to adopt a shell adding technology, it is processed with an automatic shelling technology, otherwise no processing is performed. The disassembling unit is responsible for reverse-engineering the binary program, extracting its assembly code through a linear scanning algorithm and a recursive scanning algorithm. The function call graph extracting unit extracts all functions from the assembly code by identifying boundary information such as the head and the tail of each function, obtains the call relations between functions from the transfer instruction information, and constructs the function call graph.
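The last step, deriving call relations from transfer instructions, can be illustrated with a small sketch. It assumes the functions have already been delimited and are given as a mapping from a function name to its list of assembly instruction strings; that input format, the function name `build_call_graph`, and the restriction to direct `call <name>` instructions are assumptions made for illustration, not details from the application.

```python
from typing import Dict, List, Set


def build_call_graph(functions: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Build a caller -> callees mapping from per-function assembly listings.

    Only direct calls of the form "call <known function name>" are
    recognised; indirect calls and jumps are ignored in this sketch.
    """
    graph: Dict[str, Set[str]] = {name: set() for name in functions}
    for caller, instructions in functions.items():
        for instruction in instructions:
            parts = instruction.split()
            if len(parts) == 2 and parts[0] == "call" and parts[1] in functions:
                graph[caller].add(parts[1])
    return graph


# Hypothetical three-function program used only to exercise the sketch.
listing = {
    "main": ["push ebp", "call decrypt", "call send", "ret"],
    "decrypt": ["xor eax, eax", "ret"],
    "send": ["call decrypt", "ret"],
}
call_graph = build_call_graph(listing)
```

A real extractor would work on disassembler output and would also have to resolve call targets given as addresses, which this sketch leaves out.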
A semantic feature extractor 420, configured to convert the assembly code to obtain a semantic feature vector of each function in the to-be-detected software;
The semantic feature extractor performs assembly instruction embedding conversion on the assembly code information obtained from the static analysis information extractor to obtain an embedding vector for each assembly instruction.
In this way the semantic feature extractor extracts the semantic features of the binary program. It treats assembly instructions as words and the assembly code of a function as a text composed of assembly instructions, converts the assembly instructions into embedding vectors with a word embedding technique, and then obtains the semantic feature vector of each function by aggregating the embedding vectors of the assembly instructions in that function.
In one embodiment of the present application, the semantic feature extractor includes:
the normalization submodule is used for carrying out normalization processing on the assembly code to obtain a normalized assembly code;
the token conversion sub-module is used for converting the normalized assembly code of each function into a plurality of tokens with a standard format;
a vector representation conversion submodule for mapping each token to a vector representation;
and the aggregation submodule is used for aggregating the vector representation of all tokens in the function to obtain the semantic feature vector of the function.
The assembly instructions are normalized according to the following rules: numerical constants in an assembly instruction are replaced with the character 'N', character strings in an assembly instruction are replaced with the character 'M', and the opcode and operands of the assembly instruction are joined with the symbol '-'. After this processing, each assembly instruction has been converted into a token with a standard format.
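A minimal sketch of these normalization rules is shown below. The regular expressions, the assumption that string operands appear as double-quoted literals, and the function name `normalize_instruction` are illustrative choices and are not specified by the application.

```python
import re

# Assumption: string operands appear as double-quoted literals.
_STRING = re.compile(r'"[^"]*"')
# Assumption: numeric constants are 0x-prefixed hex, h-suffixed hex, or decimal.
_NUMBER = re.compile(r"\b(?:0x[0-9a-fA-F]+|[0-9][0-9a-fA-F]*h|\d+)\b")


def normalize_instruction(instruction: str) -> str:
    """Turn one assembly instruction into a normalized token."""
    text = _STRING.sub("M", instruction.strip())  # strings -> 'M'
    text = _NUMBER.sub("N", text)                 # numeric constants -> 'N'
    opcode, _, operand_text = text.partition(" ")
    operands = [op.strip() for op in operand_text.split(",") if op.strip()]
    return "-".join([opcode] + operands)          # join opcode and operands


tokens = [normalize_instruction(i) for i in ["mov eax, 0x10", 'push "hello"', "ret"]]
```

Strings are replaced before numbers so that digits inside a string literal are not rewritten separately.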
A corpus is constructed from the tokens produced by the token conversion submodule and used to train a Word2Vec word vector model. Each token is then mapped to a vector representation based on the trained word vector model, and the vector representations of all tokens within a function are aggregated to generate the vector representation of that function.
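The aggregation step can be sketched as follows. The passage does not state which aggregation operation is used, so the element-wise mean below is an assumption; the plain dictionary `embeddings` stands in for a trained Word2Vec model, and its values are made up for illustration.

```python
from typing import Dict, List, Sequence


def function_vector(
    tokens: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Aggregate token vectors into one function vector by averaging.

    Tokens missing from the lookup table are skipped; a function with no
    known tokens gets a zero vector of length `dim`.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical 2-dimensional embeddings standing in for a trained model.
embeddings = {"push-ebp": [1.0, 0.0], "ret": [0.0, 1.0]}
vector = function_vector(["push-ebp", "ret"], embeddings, dim=2)
```

Averaging keeps the function vector the same length as the token vectors regardless of how many instructions the function contains.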
A structural feature combiner 430, configured to combine the semantic feature vector and the function call graph to generate an attribute function call graph;
The structural feature combiner combines the function semantic features with the function call graph to generate a function call graph with attribute information, the attribute function call graph (AFCG). The function call graph is the structural feature produced by the static analysis information extractor 410, and the function semantic features are the function vectors generated by the semantic feature extractor 420. The attribute function call graph is high-dimensional sparse graph-structured data: its nodes represent functions, each node carries attribute features, and the edges between nodes represent call relations between functions. The constructed graph-structured data is input into a graph neural network for learning, whose output is a program embedding vector representation containing both semantic information and structural information. Finally, the binary program is classified by classifying this embedding vector representation.
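One possible in-memory form of such an attribute function call graph is sketched below. The class name `AttributedCallGraph`, the helper `build_afcg`, and the choice of a zero vector for functions without a feature vector are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class AttributedCallGraph:
    """Nodes are functions with feature vectors; edges are call relations."""
    features: Dict[str, List[float]] = field(default_factory=dict)
    edges: Set[Tuple[str, str]] = field(default_factory=set)


def build_afcg(
    call_graph: Dict[str, Set[str]],
    function_vectors: Dict[str, List[float]],
    dim: int,
) -> AttributedCallGraph:
    """Attach each function's semantic vector to its call graph node."""
    afcg = AttributedCallGraph()
    for name in call_graph:
        # Assumption: functions without a vector get a zero vector.
        afcg.features[name] = list(function_vectors.get(name, [0.0] * dim))
    for caller, callees in call_graph.items():
        for callee in callees:
            if callee in afcg.features:
                afcg.edges.add((caller, callee))
    return afcg


afcg = build_afcg(
    {"main": {"decrypt"}, "decrypt": set()},
    {"main": [0.5, 0.5]},
    dim=2,
)
```

Storing edges as (caller, callee) pairs keeps the call direction, which a graph neural network layer can use when aggregating neighbour information.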
And the graph neural network classification model 440 is used for detecting malicious attribute information of the software to be detected according to the attribute function call graph.
In one embodiment of the present application, the graph neural network classification model includes a graph neural network layer, a fully connected layer and an activation layer, wherein the graph neural network layer is followed by the fully connected layer, and the fully connected layer is followed by the activation layer;
the graph neural network layer is used for receiving the attribute function call graph and acquiring program embedded vector representation;
the full connection layer is used for receiving the program embedded vector representation transmitted by the graph neural network layer and judging the nonlinear relation between the program embedded vector representation and the malicious attribute;
and the activation layer is used for receiving the nonlinear relation transmitted by the full connection layer and predicting malicious attribute information of the software to be detected according to the nonlinear relation.
The AFCG generated by the structural feature combiner 430 is input into a trained Struc2vec graph neural network model, and an attention mechanism is used when aggregating function information.
The graph neural network is followed by a fully connected layer and an activation layer, which perform the classification task.
The classification tasks comprise malware detection and malware family classification. Malware detection builds a binary classification model that judges whether a binary program is normal software or malware; malware family classification builds a multi-class model that outputs the malware family of the binary program.
Therefore, the malware classifier composed of the fully connected layer and the activation layer can act as either a malware detection unit or a malicious family classification unit. The malware classifier builds a deep learning model to classify the program embedding vector representations generated by the graph neural network layer and outputs the classification result for the binary program. Malware detection is essentially a binary classification task and malicious family classification is a multi-class task; in practical application, the specific task is selected by setting the number of output classes of the classification model.
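The layer ordering (graph neural network layer, then fully connected layer, then activation layer) can be illustrated with a deliberately simplified sketch. It does not implement Struc2vec or an attention mechanism: neighbour aggregation is a plain average, the graph readout is a mean over nodes, the weights are hand-picked rather than trained, and all names are assumptions made for illustration.

```python
import math
from typing import Dict, List, Set, Tuple

Vector = List[float]


def propagate(features: Dict[str, Vector], edges: Set[Tuple[str, str]]) -> Dict[str, Vector]:
    """One message-passing round: average each node with its callees."""
    neighbours = {node: [node] for node in features}
    for caller, callee in edges:
        neighbours[caller].append(callee)
    return {
        node: [sum(features[m][i] for m in members) / len(members)
               for i in range(len(features[node]))]
        for node, members in neighbours.items()
    }


def readout(features: Dict[str, Vector]) -> Vector:
    """Mean-pool node vectors into one program embedding (non-empty graph)."""
    vectors = list(features.values())
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def softmax(logits: Vector) -> Vector:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embedding: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Fully connected layer followed by a softmax activation."""
    logits = [sum(w * x for w, x in zip(row, embedding)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


node_features = {"main": [1.0, 0.0], "decrypt": [0.0, 1.0]}
call_edges = {("main", "decrypt")}
embedding = readout(propagate(node_features, call_edges))
probabilities = classify(embedding, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Changing the number of rows in `weights` changes the number of output classes, which is how the same structure would serve either the binary detection task or the family classification task.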
Before practical application, a large number of binary program samples can be collected to train the graph neural network classification model, so that the graph neural network and the discriminator are suited to the binary program prediction scenario. The pre-trained model is then deployed in the actual detection environment. Because the model has already been trained, the processing time for each sample during actual detection is very short, which effectively improves the efficiency of malware detection. In addition, the deep learning model can be incrementally trained to further improve the accuracy of malware detection.
The working flow of the malicious software detection device is as follows: after the binary target program is determined, the static analysis information extractor can convert the target binary program into assembly codes, and then extract assembly instruction information and function call information contained in the assembly codes. The assembly instruction information is extracted to obtain the semantic features of the binary program through a semantic feature extractor, and the features and the function calling information are used as the input of the structural feature combiner. The structural feature combiner combines the functional semantic features with the structural features of the functional call graph and generates an embedded vector representation of the binary program using graph neural network training. The malware classifier completes the classification of the binary file by classifying the embedded vector.
Based on the same inventive concept, another embodiment of the present application provides a readable storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the malware detection method according to any of the above embodiments of the present application.
Based on the same inventive concept, another embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, and when the processor executes the computer program, the electronic device implements the malware detection method according to any of the above embodiments of the present application.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The method, the device, the equipment and the storage medium for detecting the malicious software provided by the application are introduced in detail, a specific example is applied in the description to explain the principle and the implementation mode of the application, and the description of the embodiment is only used for helping to understand the method and the core idea of the application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A malware detection method, the method comprising:
statically analyzing a binary file of the software to be detected to obtain an assembly code and a function call graph of the software to be detected;
converting the assembly code to obtain a semantic feature vector of each function in the software to be detected;
combining the semantic feature vector and the function call graph to generate an attribute function call graph;
and inputting the attribute function call graph into a graph neural network classification model to obtain malicious attribute information of the software to be detected.
2. The method according to claim 1, wherein the statically analyzing the binary file of the software to be detected to obtain the assembly code and the function call graph of the software to be detected comprises:
judging whether the binary file adopts a shell adding technology;
when the binary file adopts a shell adding technology, processing the binary file by using an automatic shelling technology;
disassembling the unshelled binary file to obtain an assembly code of the software to be detected;
and constructing a function call graph of the software to be detected according to the assembly code.
3. The method according to claim 1, wherein the converting the assembly code to obtain the semantic feature vector of each function in the software to be detected comprises:
normalizing the assembly code to obtain a normalized assembly code;
converting the normalized assembly code of each function into a plurality of lexical units (tokens) with a standard format;
mapping each token to a vector representation;
and aggregating the vector representations of all tokens in the function to obtain the semantic feature vector of the function.
4. The method of claim 1, wherein the graph neural network classification model comprises a graph neural network layer, a full connection layer and an activation layer, wherein the graph neural network layer is followed by the full connection layer, and the full connection layer is followed by the activation layer;
the step of inputting the attribute function call graph into the graph neural network classification model to obtain malicious attribute information of the software to be detected comprises the following steps:
inputting the attribute function call graph into a graph neural network layer to obtain program embedded vector representation;
the full connection layer is used for judging the nonlinear relation between the program embedded vector representation and the malicious attribute;
and the activation layer is used for predicting malicious attribute information of the software to be detected according to the nonlinear relation.
5. An apparatus for malware detection, the apparatus comprising:
the static analysis information extractor is used for statically analyzing the binary file of the software to be detected to obtain the assembly code and the function call graph of the software to be detected;
the semantic feature extractor is used for converting the assembly code to obtain a semantic feature vector of each function in the software to be detected;
the structural feature combiner is used for combining the semantic feature vector and the function call graph to generate an attribute function call graph;
and the graph neural network classification model is used for detecting the malicious attribute information of the software to be detected according to the attribute function call graph.
6. The apparatus of claim 5, wherein the static analysis information extractor comprises:
the shell checking unit is used for judging whether the binary file adopts a shell adding technology;
the shelling unit is used for processing the binary file by utilizing an automatic shelling technology when the binary file adopts a shell adding technology;
the disassembling unit is used for disassembling the unshelled binary file to obtain an assembly code of the software to be detected;
and the function call graph extracting unit is used for constructing the function call graph of the software to be detected according to the assembly code.
7. The apparatus of claim 5, wherein the semantic feature extractor comprises:
the normalization submodule is used for carrying out normalization processing on the assembly code to obtain a normalized assembly code;
the lexical unit (token) conversion sub-module is used for converting the normalized assembly code of each function into a plurality of tokens with a standard format;
a vector representation conversion submodule for mapping each token to a vector representation;
and the aggregation submodule is used for aggregating the vector representation of all tokens in the function to obtain the semantic feature vector of the function.
8. The apparatus of claim 5, wherein the graph neural network classification model comprises a graph neural network layer, a full connection layer and an activation layer, wherein the graph neural network layer is followed by the full connection layer, and the full connection layer is followed by the activation layer;
the graph neural network layer is used for receiving the attribute function call graph and acquiring program embedded vector representation;
the full connection layer is used for receiving the program embedded vector representation transmitted by the graph neural network layer and judging the nonlinear relation between the program embedded vector representation and the malicious attribute;
and the activation layer is used for receiving the nonlinear relation transmitted by the full connection layer and predicting malicious attribute information of the software to be detected according to the nonlinear relation.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 4.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 4 when executing the computer program.
CN202110905738.2A 2021-08-09 2021-08-09 Malicious software detection method, device, equipment and storage medium Pending CN113360912A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110905738.2A CN113360912A (en) 2021-08-09 2021-08-09 Malicious software detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110905738.2A CN113360912A (en) 2021-08-09 2021-08-09 Malicious software detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113360912A true CN113360912A (en) 2021-09-07

Family

ID=77540696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110905738.2A Pending CN113360912A (en) 2021-08-09 2021-08-09 Malicious software detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113360912A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688393A (en) * 2021-10-27 2021-11-23 南京聚铭网络科技有限公司 Malicious software type detection method and device and storage medium
CN113918171A (en) * 2021-10-19 2022-01-11 哈尔滨理工大学 Novel disassembling method using extended control flow graph
CN114077741A (en) * 2021-11-01 2022-02-22 清华大学 Software supply chain safety detection method and device, electronic equipment and storage medium
CN114139153A (en) * 2021-11-02 2022-03-04 武汉大学 Graph representation learning-based malware interpretability classification method
CN114817925A (en) * 2022-05-19 2022-07-29 电子科技大学 Android malicious software detection method and system based on multi-modal graph features
CN114817924A (en) * 2022-05-19 2022-07-29 电子科技大学 AST (AST) and cross-layer analysis based android malicious software detection method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107908963A (en) * 2018-01-08 2018-04-13 北京工业大学 A kind of automatic detection malicious code core feature method
CN111538989A (en) * 2020-04-22 2020-08-14 四川大学 Malicious code homology analysis method based on graph convolution network and topic model
CN112329013A (en) * 2019-08-05 2021-02-05 四川大学 Malicious code classification method based on graph convolution network and topic model
CN112800423A (en) * 2021-01-26 2021-05-14 北京航空航天大学 Binary code authorization vulnerability detection method
CN112966271A (en) * 2021-03-18 2021-06-15 中山大学 Malicious software detection method based on graph convolution network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
凌云等: "《智能信息检索》", 31 December 2006, 中国科学技术出版社 *
杨频等: "基于汇编指令词向量特征的恶意软件检测研究", 《信息安全研究》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113918171A (en) * 2021-10-19 2022-01-11 哈尔滨理工大学 Novel disassembling method using extended control flow graph
CN113688393A (en) * 2021-10-27 2021-11-23 南京聚铭网络科技有限公司 Malicious software type detection method and device and storage medium
CN114077741A (en) * 2021-11-01 2022-02-22 清华大学 Software supply chain safety detection method and device, electronic equipment and storage medium
CN114139153A (en) * 2021-11-02 2022-03-04 武汉大学 Graph representation learning-based malware interpretability classification method
CN114817925A (en) * 2022-05-19 2022-07-29 电子科技大学 Android malicious software detection method and system based on multi-modal graph features
CN114817924A (en) * 2022-05-19 2022-07-29 电子科技大学 AST (AST) and cross-layer analysis based android malicious software detection method and system
CN114817924B (en) * 2022-05-19 2023-04-07 电子科技大学 AST (AST) and cross-layer analysis based android malicious software detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210907)