CN112989347A - Method, device and equipment for identifying malicious software - Google Patents


Publication number
CN112989347A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110404930.3A
Other languages
Chinese (zh)
Other versions
CN112989347B (en)
Inventor
周庆
杨盾
葛亮
仲元红
黄智勇
钟代笛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CN202110404930.3A
Publication of CN112989347A
Application granted
Publication of CN112989347B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to the technical field of artificial intelligence and discloses a method for identifying malicious software, comprising: acquiring a symbolized feature map of the software to be identified; inputting the symbolized feature map into a malicious software identification model to obtain an identification index, wherein the graph convolution layer of the model acquires an input feature map, acquires a target node, an inflow node subset and an outflow node subset for each node in the input feature map, and then performs a convolution operation to obtain a convolution feature map, the input feature map being either the symbolized feature map or a convolution feature map; and identifying whether the software to be identified is malicious software according to the identification index. Because the method takes the direction information among the nodes into account, the identification index produced by the model identifies the software more accurately, which improves the accuracy of identifying malicious software. The application also discloses a device and equipment for identifying malicious software.

Description

Method, device and equipment for identifying malicious software
Technical Field
The present application relates to the technical field of artificial intelligence, and for example, to a method, an apparatus, and a device for identifying malware.
Background
At present, internet-connected devices such as tablet computers and smart televisions are in widespread public use, and the software on these devices constantly affects daily life. However, malicious code writers can use malicious software to attack devices and users, leaking user data and threatening internet security. How to identify and guard against malicious software is therefore of great practical significance.
In the process of implementing the embodiments of the present disclosure, it is found that at least the following problems exist in the related art:
In the prior art, methods that identify malicious software with a neural network model do not consider the direction information among calls, so the accuracy of identifying malicious software is low.
Disclosure of Invention
The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview nor is intended to identify key/critical elements or to delineate the scope of such embodiments but rather as a prelude to the more detailed description that is presented later.
The embodiment of the disclosure provides a method, a device and equipment for identifying malicious software, so that the accuracy rate of identifying the malicious software can be improved.
In some embodiments, a method for identifying malware includes:
acquiring a calling feature map of software to be identified;
symbolizing the calling feature diagram to obtain a symbolized feature diagram;
inputting the symbolic feature map into a preset malicious software recognition model to obtain a recognition index; wherein the malware identification model comprises a graph convolutional layer; the graph convolution layer is used for acquiring an input feature graph, acquiring a node direction set of each node in the input feature graph, and then performing convolution operation according to the node direction set and the input feature graph to acquire a convolution feature graph; the node direction set comprises a target node, an inflow node subset and an outflow node subset; the input feature map is a symbolic feature map or a convolution feature map;
and identifying whether the software to be identified is malicious software according to the identification index.
In some embodiments, the means for identifying malware comprises: a processor and a memory storing program instructions, the processor being configured to, upon execution of the program instructions, perform the above-described method for identifying malware.
In some embodiments, an apparatus includes the above-described means for identifying malware.
The method, device and equipment for identifying malware provided by the embodiments of the present disclosure can achieve the following technical effects: the calling feature map of the software to be identified is symbolized to obtain a symbolized feature map, the symbolized feature map is input into a preset malware identification model to obtain an identification index, and the software to be identified is identified according to the identification index. The graph convolution layer of the malware identification model acquires the symbolized feature map or a convolution feature map and acquires a node direction set of each node in it, the node direction set comprising an inflow node subset, an outflow node subset and a target node. Direction information among the nodes is thereby taken into account, so the software is identified more accurately and the accuracy of identifying malware is improved.
The foregoing general description and the following description are exemplary and explanatory only and are not restrictive of the application.
Drawings
One or more embodiments are illustrated by way of example, and not by way of limitation, in the accompanying drawings, in which elements having the same reference numerals denote like elements, and wherein:
FIG. 1 is a schematic diagram of a method for identifying malware according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an input feature map provided by embodiments of the present disclosure;
FIG. 3 is a schematic diagram of an apparatus for identifying malware according to an embodiment of the present disclosure.
Detailed Description
So that the manner in which the features and elements of the disclosed embodiments can be understood in detail, a more particular description of the disclosed embodiments, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. In the following description of the technology, for purposes of explanation, numerous details are set forth in order to provide a thorough understanding of the disclosed embodiments. However, one or more embodiments may be practiced without these details. In other instances, well-known structures and devices may be shown in simplified form in order to simplify the drawing.
The terms "first," "second," and the like in the description and in the claims, and the above-described drawings of embodiments of the present disclosure, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the present disclosure described herein may be made. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions.
The term "plurality" means two or more unless otherwise specified.
In the embodiments of the present disclosure, the character "/" indicates that the preceding and following objects are in an "or" relationship. For example, A/B represents: A or B.
The term "and/or" describes an association between objects and indicates that three relationships may exist. For example, A and/or B represents: A alone, B alone, or both A and B.
As shown in fig. 1, an embodiment of the present disclosure provides a method for identifying malware, including:
S01, acquiring a calling feature map of the software to be identified;
S02, symbolizing the calling feature map to obtain a symbolized feature map;
S03, inputting the symbolized feature map into a preset malware identification model to obtain an identification index; the malware identification model comprises a graph convolution layer; the graph convolution layer is used for acquiring an input feature map, acquiring a node direction set of each node in the input feature map, and then performing a convolution operation according to the node direction set and the input feature map to obtain a convolution feature map; the node direction set comprises a target node, an inflow node subset and an outflow node subset; the input feature map is the symbolized feature map or a convolution feature map;
S04, identifying whether the software to be identified is malware according to the identification index.
With the method for identifying malware provided by the embodiments of the present disclosure, the calling feature map of the software to be identified is symbolized to obtain a symbolized feature map, the symbolized feature map is input into a preset malware identification model to obtain an identification index, and the software to be identified is identified according to the identification index. The graph convolution layer of the malware identification model acquires the symbolized feature map or a convolution feature map and acquires a node direction set of each node in it, the node direction set comprising an inflow node subset, an outflow node subset and a target node. Direction information among the nodes is thereby taken into account, so the software is identified more accurately and the accuracy of identifying malware is improved.
In some embodiments, as in the schematic diagram of the input feature map shown in FIG. 2, the node direction set of the node δ includes: the target node δ, an inflow node subset α and an outflow node subset γ.
Optionally, acquiring the calling feature map of the software to be identified includes: acquiring a component list of the software to be identified; acquiring a system call list according to the component list; and acquiring the calling feature map of the software to be identified according to the system call list. In some embodiments, the software to be identified is Android platform software.
Optionally, acquiring the component list of the software to be identified includes: decompressing an installation file of the software to be identified to obtain a manifest file; and acquiring the component list of the software to be identified according to the manifest file. The component list includes the package names and the runnable component names in the manifest file.
Optionally, decompressing the installation file of the software to be identified to obtain the manifest file includes: decompiling the installation file of the software to be identified with a decompilation tool to obtain the manifest file. In some embodiments, the APK (Android Application Package) installation file of the software to be identified is decompiled with APKTool (an Android application package decompilation tool) to obtain the manifest file of the software to be identified; package names and runnable component names are retrieved from the Services and Activities of the manifest file; and all the retrieved package names and runnable component names are determined as the component list.
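As a rough illustration of the component-list step, the sketch below reads a decoded manifest and collects the package name together with the Activity and Service names. It assumes the manifest has already been decoded to plain XML (for example by APKTool); the function name `component_list`, the `package/component` output format and the sample manifest are illustrative assumptions, not part of the disclosure.

```python
import xml.etree.ElementTree as ET

# Namespace used by the android:* attributes in an Android manifest.
ANDROID_NS = "http://schemas.android.com/apk/res/android"

def component_list(manifest_xml: str) -> list[str]:
    """Return "package/component" strings for every Activity and Service."""
    root = ET.fromstring(manifest_xml)
    package = root.get("package", "")
    components = []
    for tag in ("activity", "service"):
        for elem in root.iter(tag):
            name = elem.get(f"{{{ANDROID_NS}}}name")
            if not name:
                continue
            # Resolve shorthand names such as ".MainActivity" against the package.
            if name.startswith("."):
                name = package + name
            components.append(f"{package}/{name}")
    return components

# Hypothetical manifest used only for illustration.
SAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
  <application>
    <activity android:name=".MainActivity"/>
    <service android:name="com.example.app.SyncService"/>
  </application>
</manifest>"""

components = component_list(SAMPLE_MANIFEST)
```

Each resulting entry can then be launched in turn when the components are traversed and executed.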
Optionally, acquiring the system call list according to the component list of the software to be identified includes: traversing and executing each component of the software to be identified according to the component list to obtain the system call list.
Optionally, acquiring the system call list according to the component list of the software to be identified includes: while the software to be identified is loaded through a virtual machine tool, extracting the system calls of each component of the component list with a tracing tool, and determining all the extracted system calls as the system call list. In some embodiments, while the software to be identified is loaded in Genymotion (an Android virtual machine), the system calls of each component of the component list are extracted with strace (a tracing tool), all the extracted system calls are determined as the system call list using ADB (Android Debug Bridge) and Visual Studio file pipes, and the system call list is downloaded to the host. In some embodiments, the system calls are obtained by a first thread and a second thread: the first thread loads the software to be identified through the virtual machine tool, and the second thread extracts the system calls of the components of the component list through the tracing tool. With one thread loading the software and the other recording system calls, the software can be executed automatically without depending on user interaction or a random event generator, which improves the efficiency of obtaining the system call list.
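A minimal sketch of turning a captured trace into a system call list, assuming the tracing tool writes strace-style lines of the form `name(args) = result`. The function name `parse_syscalls` and the sample trace are hypothetical; launching components, the virtual machine and the two-thread arrangement are outside this sketch.

```python
import re

# Matches lines such as 'read(3, "...", 4096) = 12', optionally prefixed by
# "[pid 123] " or by a bare process id, and captures the system call name.
_SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(?:\d+\s+)?([a-z_][a-z0-9_]*)\(")

def parse_syscalls(trace_text: str) -> list[str]:
    """Return system call names in the order they appear in the trace."""
    calls = []
    for line in trace_text.splitlines():
        match = _SYSCALL_RE.match(line.strip())
        if match:  # signal and exit notices do not match and are skipped
            calls.append(match.group(1))
    return calls

# Hypothetical trace excerpt used only for illustration.
SAMPLE_TRACE = "\n".join([
    'read(3, "abc", 4096) = 3',
    "clock_gettime(CLOCK_MONOTONIC, {tv_sec=1, tv_nsec=0}) = 0",
    "--- SIGCHLD {si_signo=SIGCHLD} ---",
    "[pid  1234] ioctl(5, BINDER_WRITE_READ, 0x7ffe) = 0",
    "+++ exited with 0 +++",
])

syscall_list = parse_syscalls(SAMPLE_TRACE)
```

Keeping the calls in trace order matters because the later graph construction derives its directed edges from that order.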
Optionally, after determining all the extracted system calls as the system call list, the method further includes: resetting the virtual machine tool. Resetting prevents the malware from continuing its attack and prevents subsequent extractions from failing.
Optionally, the system call list comprises several Linux kernel system calls. On the Android platform, the Linux system interacts with the hardware of the device, and the Application Programming Interface (API) runs on top of the Linux system. A malicious code writer can evade identification by replacing the API framework, but the requested services still depend on the system calls provided by the Linux kernel. Compared with prior-art approaches that identify malware through API calls, identifying malware through Linux kernel system calls improves accuracy.
Optionally, the calling feature graph includes a plurality of nodes and directed edges between the nodes; acquiring a calling feature map of software to be identified according to a system calling list, wherein the method comprises the following steps: acquiring a system call ID, a system call frequency and a system call sequence according to the system call list; determining a node of the calling feature graph according to the system calling ID and the system calling frequency; and acquiring directed edges among all nodes in the calling feature graph according to the system calling sequence. Optionally, the calling feature graph is obtained according to all nodes and directed edges between all nodes.
In some embodiments, the system call IDs are used as nodes, and the system call frequency corresponding to each system call ID is determined as the size of an integer value of the corresponding node.
Optionally, the calling feature graph is $G = (V, E)$, where $G$ is the calling feature graph, $V$ is the node set of the calling feature graph, and $E$ is the directed edge set of the calling feature graph. $V = \{v_i,\ i = 1, \dots, n\}$ and $E = \{e_j,\ j = 1, \dots, l\}$, where $v_i$ is the $i$-th node of the node set, $e_j$ is the $j$-th directed edge of the directed edge set, $n$ is the number of nodes in the node set, and $l$ is the number of directed edges in the directed edge set. Optionally, the weight of a directed edge is the number of calls between the corresponding two nodes; for example, the weight of the directed edge from node $v_p$ to node $v_q$ is the number of times $v_p$ calls $v_q$, where $1 \le p \le n$, $1 \le q \le n$, and $p$ and $q$ are positive integers.
In some embodiments, the system call list includes 5 system call IDs, which in system call order are read, clockgettime, getuid32, ioctl, and fcntl64, with corresponding system call frequencies of 1, 4, 4, 4, and 1, respectively. The integer value of node read is therefore 1, that of node clockgettime is 4, that of node getuid32 is 4, that of node ioctl is 4, and that of node fcntl64 is 1. According to the system call order, read calls clockgettime once, so the weight of the directed edge from read to clockgettime is determined to be 1.
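The construction of nodes and weighted directed edges can be sketched as follows. This is a minimal sketch under one assumption that the text does not spell out: a directed edge is drawn from each system call to the call that immediately follows it in the list. The function name `build_call_graph` and the sample sequence are hypothetical.

```python
from collections import Counter

def build_call_graph(calls: list[str]):
    """Return (node frequencies, directed edge weights) for a call sequence."""
    # Node value: how often each system call ID occurs.
    node_freq = Counter(calls)
    # Edge weight: how often call a is immediately followed by call b
    # (assumed reading of "according to the system call order").
    edge_weight = Counter(zip(calls, calls[1:]))
    return dict(node_freq), dict(edge_weight)

# Hypothetical sequence used only for illustration.
sequence = ["read", "clock_gettime", "read", "clock_gettime", "ioctl"]
nodes, edges = build_call_graph(sequence)
```

Because the edge key is an ordered pair, `("read", "clock_gettime")` and `("clock_gettime", "read")` are counted separately, which is what preserves the direction information.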
Optionally, symbolizing the calling feature graph to obtain the symbolized feature graph includes: symbolizing the calling feature graph $G = (V, E)$ to obtain the symbolized feature graph $\Gamma = (V_\Gamma, A_\Gamma, M_\Gamma, X_\Gamma)$, where $V_\Gamma$ is the system call set, $A_\Gamma$ is the call relationship matrix, $M_\Gamma$ is the call-count matrix (the matrix of numbers of calls), and $X_\Gamma$ is the node feature representation set.
Optionally, the node set $V$ of the calling feature graph is determined as the system call set $V_\Gamma$ of the symbolized feature graph.
Optionally, the call relationship matrix $A_\Gamma$ is a two-dimensional matrix. Optionally, $A_\Gamma = (a_{xy})_{N \times N}$, where $x$ and $y$ are positive integers not greater than $N$, $N$ is the number of system calls in the system call set $V_\Gamma$ (that is, the number of nodes), and $a_{xy}$ is the call relationship of the $x$-th node calling the $y$-th node. Optionally, $a_{xy}$ is 1 when the directed edge from the $x$-th node to the $y$-th node belongs to the directed edge set $E$ of the calling feature graph, and 0 when it does not.
Optionally, the call-count matrix $M_\Gamma$ is a two-dimensional matrix. Optionally, $M_\Gamma = (m_{xy})_{N \times N}$, where $x$ and $y$ are positive integers not greater than $N$, and $m_{xy}$ is the number of times the $x$-th node calls the $y$-th node. Optionally, the number of calls between two nodes in the symbolized feature graph is the weight of the corresponding directed edge between those two nodes in the calling feature graph.
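The two matrices can be filled directly from the weighted directed edges. The sketch below uses plain nested lists; the function name `to_matrices`, the node ordering and the sample edges are illustrative assumptions.

```python
def to_matrices(nodes: list[str], edge_weight: dict):
    """Build the 0/1 call relationship matrix and the call-count matrix."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    relation = [[0] * n for _ in range(n)]  # a_xy: 1 if x calls y, else 0
    counts = [[0] * n for _ in range(n)]    # m_xy: number of times x calls y
    for (src, dst), weight in edge_weight.items():
        x, y = index[src], index[dst]
        relation[x][y] = 1
        counts[x][y] = weight
    return relation, counts

# Hypothetical graph used only for illustration.
node_names = ["read", "clock_gettime", "ioctl"]
edge_weights = {
    ("read", "clock_gettime"): 2,
    ("clock_gettime", "read"): 1,
    ("clock_gettime", "ioctl"): 1,
}
A_gamma, M_gamma = to_matrices(node_names, edge_weights)
```

Neither matrix is symmetric in general, since row x and column y encode the caller and the callee respectively.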
Optionally, the node feature representation set is $X_\Gamma = \{x_{v_i},\ i = 1, \dots, N\}$, where $x_{v_i}$ is the node feature representation of the $i$-th node $v_i$ in the node set $V$. Optionally, a feature vector of length $N$ is determined as the node feature representation $x_{v_i}$ of node $v_i$, where the $i$-th element of the feature vector is 1 and the remaining elements are 0. Optionally, the node feature representation set is used by a neural network model to update the node features of each node in the symbolized feature graph. Updating the node features through the node feature representation set makes the resulting identification index discriminative, which facilitates evaluating malware and improves the accuracy of the method.
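The initial node features described here are one-hot vectors, so for N nodes they stack into an N by N identity matrix. A minimal sketch (the function name `one_hot_features` is hypothetical):

```python
def one_hot_features(n: int) -> list[list[float]]:
    """Row i is the length-n feature vector of node i: 1 at position i, 0 elsewhere."""
    return [[1.0 if col == row else 0.0 for col in range(n)] for row in range(n)]

X_gamma = one_hot_features(3)
```

With this choice the feature dimension equals the number of nodes, and multiplying any N by N matrix by these features leaves that matrix unchanged.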
Optionally, the call-count matrix $M_\Gamma = (m_{xy})_{N \times N}$ is mapped according to a preset real number set $\mathbb{R}$ to obtain $M_\Gamma \in \mathbb{R}^{N \times N}$. Optionally, the node feature representation set $X_\Gamma$ is mapped according to the preset real number set $\mathbb{R}$ to obtain $X_\Gamma \in \mathbb{R}^{N \times D}$, where the node feature representation $x_{v_i}$ of node $v_i$ satisfies $x_{v_i} \in \mathbb{R}^{D}$, $D$ is the dimension of the node feature representation, and $D = N$. Mapping into the real number set $\mathbb{R}$ makes it convenient for the neural network model to update the node features of the nodes, which improves the efficiency of the model. Optionally, the real number set $\mathbb{R}$ includes several elements, all of which are real numbers.
The system call IDs of the system call list are determined as nodes, and the system call order in the system call list is determined as the directed edges between the nodes, which yields the calling feature graph. Symbolizing the calling feature graph yields the symbolized feature graph, which therefore contains the call order information between system calls. The symbolized feature graph is input into the preset malware identification model, whose graph convolution layer acquires the symbolized feature graph or a convolution feature graph and acquires the node direction set of each node in it; the node direction set comprises an inflow node subset, an outflow node subset and a target node. Direction information between nodes, that is, call order information between system calls, is thus taken into account, so the identification index obtained from the malware identification model identifies the software to be identified more accurately, improving the accuracy of identifying malware.
Optionally, inputting the symbolized feature map into the preset malware identification model to obtain the identification index includes: inputting the symbolized feature map into a preset number of graph convolution layers for graph convolution operations, and determining the convolution feature map output by the last graph convolution layer as a first feature representation; adjusting the dimensionality of the first feature representation to a preset number of dimensions through a fully connected layer of the malware identification model to obtain a second feature representation; and polarizing the second feature representation into two classes through a softmax function of the malware identification model to obtain the identification index. Optionally, the preset number of layers is greater than 2 and less than 11. Optionally, the preset number of dimensions is 2.
Optionally, performing the convolution operation according to the node direction set and the input feature map to obtain the convolution feature map includes: acquiring the adjacency matrices corresponding to the target node, the inflow node subset and the outflow node subset in the node direction set, respectively, to obtain a target node adjacency matrix, an inflow node adjacency matrix and an outflow node adjacency matrix; and fusing the input feature map, the target node adjacency matrix, the inflow node adjacency matrix and the outflow node adjacency matrix to obtain the convolution feature map.
Optionally, the adjacency matrix corresponding to the target node in the node direction set is acquired to obtain the target node adjacency matrix.
Optionally, acquiring the adjacency matrix corresponding to the inflow node subset in the node direction set to obtain the inflow node adjacency matrix includes: acquiring the inflow node adjacency matrix as $A_{in} = (a^{in}_{x'y'})_{N_{in} \times N_{in}}$, where $A_{in}$ is the inflow node adjacency matrix, $N_{in}$ is the number of nodes in the inflow node subset, and $a^{in}_{x'y'}$ is the element in row $x'$ and column $y'$ of the inflow node adjacency matrix. When there is a directed edge from the $x'$-th node to the $y'$-th node of the inflow node subset, the value of $a^{in}_{x'y'}$ is determined to be 1; when there is no such directed edge, the value of $a^{in}_{x'y'}$ is determined to be 0.
Optionally, acquiring the adjacency matrix corresponding to the outflow node subset in the node direction set to obtain the outflow node adjacency matrix includes: acquiring the outflow node adjacency matrix as $A_{out} = (a^{out}_{x''y''})_{N_{out} \times N_{out}}$, where $A_{out}$ is the outflow node adjacency matrix, $N_{out}$ is the number of nodes in the outflow node subset, and $a^{out}_{x''y''}$ is the element in row $x''$ and column $y''$ of the outflow node adjacency matrix. When there is a directed edge from the $x''$-th node to the $y''$-th node of the outflow node subset, the value of $a^{out}_{x''y''}$ is determined to be 1; when there is no such directed edge, the value of $a^{out}_{x''y''}$ is determined to be 0.
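To make the node direction set concrete, the sketch below derives, for one target node, the inflow subset (nodes with an edge into the target) and the outflow subset (nodes the target has an edge to), and then restricts the call relationship matrix to a subset. The helper names `node_direction_set` and `sub_adjacency`, and the reading of "inflow" and "outflow" as direct predecessors and successors, are assumptions made for illustration.

```python
def node_direction_set(adjacency: list[list[int]], target: int) -> dict:
    """Split the neighbours of `target` by edge direction."""
    n = len(adjacency)
    inflow = [j for j in range(n) if j != target and adjacency[j][target] == 1]
    outflow = [j for j in range(n) if j != target and adjacency[target][j] == 1]
    return {"target": target, "inflow": inflow, "outflow": outflow}

def sub_adjacency(adjacency: list[list[int]], subset: list[int]) -> list[list[int]]:
    """Adjacency matrix restricted to the nodes of `subset`, in subset order."""
    return [[adjacency[x][y] for y in subset] for x in subset]

# Hypothetical graph with edges 0->1, 0->2, 1->2, 2->3, 3->0.
A = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]
direction = node_direction_set(A, target=2)
A_in = sub_adjacency(A, direction["inflow"])
A_out = sub_adjacency(A, direction["outflow"])
```

For node 2 the inflow subset is nodes 0 and 1, and the restricted matrix keeps the edge 0->1 that exists between them.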
Optionally, the convolution feature map is obtained by calculating [Figure BDA0003021876420000089], where $X^{l+1}$ is the convolution feature map output by the graph convolution layer, $X^{l}$ is the input feature map input to the graph convolution layer, ReLU is the nonlinear activation function, $K_B$ is the number of subsets in the node direction set, with $K_B = 3$, $M_k$ is the subset call-count matrix corresponding to the $k$-th subset in the node direction set, $k$ is a positive integer not greater than 3, $\tilde{B}_k$ is the normalized representation of the adjacency matrix corresponding to the $k$-th subset in the node direction set, and $W_k$ is a trainable parameter of a preset convolution function.
Optionally, when $k = 1$, $M_k$ is the subset call-count matrix corresponding to the inflow node subset, and $\tilde{B}_k$ is the normalized representation of the inflow node adjacency matrix. Optionally, the subset call-count matrix corresponding to the inflow node subset is acquired as $M_{in} = (m^{in}_{pq})_{N_{in} \times N_{in}}$, where $M_{in}$ is the subset call-count matrix corresponding to the inflow node subset, $m^{in}_{pq}$ is the number of times the $p$-th node calls the $q$-th node, $N_{in}$ is the number of nodes in the inflow node subset, and the $p$-th node and the $q$-th node are both nodes in the inflow node subset.
Optionally, when $k = 2$, $M_k$ is the subset call-count matrix corresponding to the target node, and $\tilde{B}_k$ is the normalized representation of the target node adjacency matrix. Optionally, $M_{self}$ is the subset call-count matrix corresponding to the target node, where $M_{self} = \{0\}$.
Optionally, when $k = 3$, $M_k$ is the subset call-count matrix corresponding to the outflow node subset, and $\tilde{B}_k$ is the normalized representation of the outflow node adjacency matrix. Optionally, the subset call-count matrix corresponding to the outflow node subset is acquired as $M_{out} = (m^{out}_{p'q'})_{N_{out} \times N_{out}}$, where $M_{out}$ is the subset call-count matrix corresponding to the outflow node subset, $m^{out}_{p'q'}$ is the number of times the $p'$-th node calls the $q'$-th node, $N_{out}$ is the number of nodes in the outflow node subset, and the $p'$-th node and the $q'$-th node are both nodes in the outflow node subset.
Optionally, the normalized representation of the adjacency matrix corresponding to the $k$-th subset in the node direction set is acquired through [Figure BDA0003021876420000098], where $\tilde{B}_k$ is the normalized representation of the adjacency matrix corresponding to the $k$-th subset in the node direction set, $B_k$ is the adjacency matrix corresponding to the $k$-th subset in the node direction set, $D_k \in \mathbb{R}^{N' \times N'}$ is the diagonal matrix corresponding to $B_k$, $I \in \mathbb{R}^{N' \times N'}$, and $N'$ is the number of nodes of the $k$-th subset in the node direction set. Optionally, when $k = 1$, $B_k$ is the inflow node adjacency matrix in the node direction set. Optionally, when $k = 2$, $B_k$ is the target node adjacency matrix in the node direction set. Optionally, when $k = 3$, $B_k$ is the outflow node adjacency matrix in the node direction set.
Optionally, the diagonal matrix corresponding to [Figure BDA0003021876420000102] is obtained through [Figure BDA0003021876420000101], where [Figure BDA0003021876420000103] is a diagonal matrix.
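The layer described above can be sketched in plain Python, but the sketch rests on assumptions, because the convolution and normalization formulas are given only as images in this text. It assumes (1) the common graph-convolution normalization $D^{-1/2}(B + I)D^{-1/2}$, with $D$ holding the row sums of $B + I$; (2) that each subset's call-count matrix is added to its normalized adjacency matrix before aggregation; and (3) full-size N by N matrices per subset rather than subset-sized ones. All function names are hypothetical.

```python
import math

def matmul(P, Q):
    """Plain-Python matrix product of P (a x b) and Q (b x c)."""
    return [[sum(P[i][t] * Q[t][j] for t in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def normalize(B):
    """Assumed normalization: D^(-1/2) (B + I) D^(-1/2), D = row sums of B + I."""
    n = len(B)
    S = [[B[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    d = [sum(row) for row in S]  # >= 1 for non-negative B because of the added I
    return [[S[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]

def directional_graph_conv(X, subsets, weights):
    """ReLU of the sum over subsets k of (normalize(B_k) + M_k) X W_k (assumed form)."""
    n, d_out = len(X), len(weights[0][0])
    out = [[0.0] * d_out for _ in range(n)]
    for (B, M), W in zip(subsets, weights):
        Bn = normalize(B)
        K = [[Bn[i][j] + M[i][j] for j in range(n)] for i in range(n)]
        term = matmul(matmul(K, X), W)
        for i in range(n):
            for j in range(d_out):
                out[i][j] += term[i][j]
    return [[max(0.0, v) for v in row] for row in out]

# Toy graph with a single directed edge 0 -> 1 and zero call-count matrices.
A = [[0, 1], [0, 0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
ZERO = [[0.0, 0.0], [0.0, 0.0]]
B_in = [[A[j][i] for j in range(2)] for i in range(2)]  # row i marks callers of i
subsets = [(B_in, ZERO), (I2, ZERO), (A, ZERO)]          # k = 1, 2, 3
out = directional_graph_conv(I2, subsets, [I2, I2, I2])
```

Using a separate weight matrix per subset is what lets the layer treat callers, the node itself and callees differently; a single shared adjacency matrix would discard that distinction.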
Inputting the symbolic feature map into a first layer of map convolutional layer, acquiring a node direction set of the symbolic feature map by the first layer of map convolutional layer, fusing the node direction set, and outputting a convolutional feature map; inputting the convolution characteristic graph output by the first layer graph convolution layer into a second layer graph convolution layer, acquiring a node direction set of the input convolution characteristic graph by the second layer graph convolution layer, fusing the node direction set, and outputting a convolution characteristic graph; inputting the convolution characteristic graph output by the second layer of graph convolution layer into the next layer of graph convolution layer; and after the graph convolution layer with the preset number of layers passes, determining the convolution characteristic graph output by the last layer as a first characteristic representation. Through the superposition of multilayer graph convolution layers, the modeling of complex interaction information between nodes is realized, and meanwhile, the direction information between the nodes is considered, so that the identification of the software to be identified is more accurate according to the identification indexes obtained by the malicious software identification model, and the accuracy rate of identifying the malicious software is improved.
Optionally, bipolarizing the second feature representation through the softmax function of the malware recognition model to obtain the recognition index includes: calculating
Figure BDA0003021876420000104
to obtain the recognition index; wherein
Figure BDA0003021876420000105
is the recognition index, X_out is the second feature representation, and W is a preset parameter matrix.
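The softmax step can be illustrated as follows. The formula itself is in a figure that is not reproduced here, so this sketch assumes the common form softmax(X_out · W) with a two-column W; the function name `recognition_index` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def recognition_index(x_out, w):
    """x_out: length-d feature vector; w: d x 2 parameter matrix (assumed shape)."""
    logits = [sum(x_out[i] * w[i][c] for i in range(len(x_out)))
              for c in range(len(w[0]))]
    return softmax(logits)


# Equal scores for both classes give equal probabilities.
probs = recognition_index([1.0, 0.0], [[0.0, 0.0], [1.0, 1.0]])
```

The two outputs are non-negative and sum to 1, so either one can be read as a class probability.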
Optionally, identifying whether the software to be identified is malware according to the recognition index includes: by
Figure BDA0003021876420000106
obtaining the recognition result; wherein r is the recognition result and
Figure BDA0003021876420000107
is the probability that the software to be identified is malware. When the recognition result is J, the software to be identified is determined to be malware; when the recognition result is not J, the software to be identified is determined to be normal software.
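As a rough illustration of this decision step, the sketch below maps the malware probability to a recognition result. The actual rule is defined in a figure that is not reproduced in this text, so the 0.5 threshold is purely an assumption, and the placeholder values for J and for the non-J result are hypothetical parameters.

```python
def identify(p_malware, malware_result="J", normal_result="not J", threshold=0.5):
    """Return (r, is_malware); the threshold rule is an assumed stand-in."""
    r = malware_result if p_malware >= threshold else normal_result
    return r, r == malware_result


result, is_malware = identify(0.9)
```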
Optionally, the method for obtaining the malware recognition model includes: collecting a plurality of software samples; acquiring a sample calling feature map of each software sample; symbolizing the sample calling feature map to obtain a sample symbolic feature map; and inputting the sample symbolic feature map with its sample label into a preset neural network model for training to obtain the malware recognition model. The sample labels include a normal software label and a malware label. The neural network model comprises a graph convolutional layer; the graph convolutional layer is used for acquiring a sample input feature map, acquiring a sample node direction set of each sample node in the sample input feature map, and then performing a convolution operation according to the sample node direction set and the sample input feature map to obtain a sample convolution feature map. The sample node direction set comprises a sample target node, a sample inflow node subset and a sample outflow node subset; the sample input feature map is a sample symbolic feature map or a sample convolution feature map.
Optionally, obtaining the sample calling feature map of the software sample includes: decompiling the installation file of the software sample with a decompilation tool to obtain a sample access manifest file; acquiring a sample component list of the software sample according to the sample access manifest file, the sample component list comprising the package name and the executable component names in the sample access manifest file; while the software sample is loaded through a virtual machine tool, extracting the sample system calls of each component in the sample component list through a tracing tool, and determining all the extracted sample system calls as a sample system call list; obtaining sample system call IDs, sample system call frequencies and a sample system call sequence according to the sample system call list; determining the sample nodes according to the sample system call IDs and the sample system call frequencies; obtaining the sample directed edges between the sample nodes according to the sample system call sequence; and obtaining the sample calling feature map from all the sample nodes and the sample directed edges between them.
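The last steps, turning a system call list into nodes and directed edges, can be sketched as follows. The sketch assumes that a directed edge runs from each call to the call that immediately follows it in the traced sequence, which is one plausible reading of "according to the system call sequence"; the function name `build_call_graph` and the numeric call IDs are hypothetical.

```python
from collections import Counter


def build_call_graph(syscall_sequence):
    """Build nodes {call ID: frequency} and directed edges from a traced sequence."""
    nodes = dict(Counter(syscall_sequence))
    edges = set()
    for prev_call, next_call in zip(syscall_sequence, syscall_sequence[1:]):
        edges.add((prev_call, next_call))  # prev_call is followed by next_call
    return nodes, sorted(edges)


# Hypothetical trace of system call IDs.
nodes, edges = build_call_graph([3, 5, 3, 5, 7])
# nodes == {3: 2, 5: 2, 7: 1}
# edges == [(3, 5), (5, 3), (5, 7)]
```

Repeated transitions collapse into a single edge here; an implementation that needs edge weights could count them instead.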
Optionally, inputting the sample symbolic feature map with the sample label into a preset neural network model for training to obtain the malware recognition model includes: inputting the sample symbolic feature map with the sample label into the preset neural network model, and recording the training loss value of each training period; and when a preset number of consecutive training loss values are not lower than the lowest of all the recorded training loss values, stopping model training and determining the model obtained in the last training period as the malware recognition model. Optionally, the preset number is a positive integer greater than 3. Optionally, obtaining the training loss value of one training period includes: inputting the sample symbolic feature map into a preset number of graph convolutional layers for graph convolution operations, and determining the sample convolution feature map output by the last graph convolutional layer as a first sample feature representation; adjusting the dimensionality of the first sample feature representation to a preset number of dimensions through a fully connected layer of the preset neural network model to obtain a second sample feature representation; bipolarizing the second sample feature representation through the softmax function of the preset neural network model to obtain a sample index; and obtaining the training loss value of the period according to the sample index and the loss function.
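The stopping criterion can be expressed as a small loop over the recorded loss values. This is a sketch of one reading of the rule, in which a period counts as "not lower" when its loss is greater than or equal to the best loss seen so far; the function name `stopping_period` and the example loss values are made up for illustration.

```python
def stopping_period(losses, patience):
    """Return the index of the period at which training stops, or None.

    Training stops once `patience` consecutive losses fail to go below the
    lowest loss recorded so far.
    """
    best = float("inf")
    stale = 0
    for period, loss in enumerate(losses):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return period
    return None


# Illustrative loss curve; the preset number is 4, i.e. greater than 3.
stop_at = stopping_period([0.9, 0.7, 0.6, 0.65, 0.61, 0.6, 0.62, 0.7], 4)
# stop_at == 6
```

Note that the text keeps the model from the last training period rather than the period with the lowest loss, so no checkpoint of the best model is needed in this reading.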
Optionally, obtaining a sample label of the sample symbolic feature map includes: analyzing a software sample corresponding to the sample symbolic feature map through an anti-malware engine or manual identification to obtain an analysis result; determining a malicious software label as a sample label of the sample symbolic feature map under the condition that the analysis result of the software sample is malicious software; and determining the common software label as the sample label of the sample symbolized feature map when the analysis result of the software sample is common software.
As shown in Fig. 3, an apparatus for identifying malware according to an embodiment of the present disclosure includes a processor 100 and a memory 101. Optionally, the apparatus may also include a communication interface 102 and a bus 103. The processor 100, the communication interface 102, and the memory 101 may communicate with each other via the bus 103. The communication interface 102 may be used for information transfer. The processor 100 may call logic instructions in the memory 101 to perform the method for identifying malware of the above-described embodiments.
In addition, the logic instructions in the memory 101 may be implemented in the form of software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products.
The memory 101, which is a computer-readable storage medium, may be used for storing software programs, computer-executable programs, such as program instructions/modules corresponding to the methods in the embodiments of the present disclosure. The processor 100 executes functional applications and data processing, i.e., implements the method for identifying malware in the above-described embodiments, by executing program instructions/modules stored in the memory 101.
The memory 101 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal device, and the like. In addition, the memory 101 may include a high-speed random access memory, and may also include a nonvolatile memory.
By adopting the device for identifying the malicious software, the symbolic feature diagram is obtained by symbolizing the calling feature diagram of the software to be identified, the symbolic feature diagram is input into a preset malicious software identification model to obtain the identification index, and the software to be identified is identified according to the identification index. The graph convolution layer of the malware identification model obtains a symbolic feature graph or a convolutional feature graph, and obtains a node direction set of each node in the symbolic feature graph or the convolutional feature graph, wherein the node direction set comprises an inflow node subset, an outflow node subset and a target node.
An embodiment of the present disclosure provides a device comprising the above apparatus for identifying malware. Optionally, the device is a smartphone, a tablet, a computer, a server, or the like.
By adopting the device provided by the embodiment of the disclosure, the symbolic feature diagram is obtained by symbolizing the calling feature diagram of the software to be recognized, the symbolic feature diagram is input into a preset malicious software recognition model to obtain the recognition index, and the software to be recognized is recognized according to the recognition index. The graph convolution layer of the malware identification model obtains a symbolic feature graph or a convolutional feature graph, and obtains a node direction set of each node in the symbolic feature graph or the convolutional feature graph, wherein the node direction set comprises an inflow node subset, an outflow node subset and a target node.
Embodiments of the present disclosure provide a computer-readable storage medium storing computer-executable instructions configured to perform the above-described method for identifying malware.
Embodiments of the present disclosure provide a computer program product comprising a computer program stored on a computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the above-described method for identifying malware.
The computer-readable storage medium described above may be a transitory computer-readable storage medium or a non-transitory computer-readable storage medium.
The technical solution of the embodiments of the present disclosure may be embodied in the form of a software product, where the computer software product is stored in a storage medium and includes one or more instructions to enable a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method of the embodiments of the present disclosure. The aforementioned storage medium may be a non-transitory storage medium, including a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code, or it may be a transitory storage medium.
The above description and drawings sufficiently illustrate embodiments of the disclosure to enable those skilled in the art to practice them. Other embodiments may incorporate structural, logical, electrical, process, and other changes. The examples merely typify possible variations. Individual components and functions are optional unless explicitly required, and the sequence of operations may vary. Portions and features of some embodiments may be included in or substituted for those of others. Furthermore, the words used in the specification are words of description only and are not intended to limit the claims. As used in the description of the embodiments and the claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Similarly, the term "and/or" as used in this application is meant to encompass any and all possible combinations of one or more of the associated listed items. Furthermore, the terms "comprises" and/or "comprising," when used in this application, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Without further limitation, an element defined by the phrase "comprising an …" does not exclude the presence of other like elements in a process, method or apparatus that comprises the element. In this document, each embodiment may be described with emphasis on its differences from other embodiments, and the same or similar parts of the respective embodiments may be referred to each other. Where the methods, products, etc. disclosed in an embodiment correspond to a method section disclosed in the embodiments, reference may be made to the description of that method section for the relevant details.
Those of skill in the art would appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software may depend upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments. It can be clearly understood by the skilled person that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments disclosed herein, the disclosed methods, products (including but not limited to devices, apparatuses, etc.) may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units may be merely a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to implement the present embodiment. In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. In the description corresponding to the flowcharts and block diagrams in the figures, operations or steps corresponding to different blocks may also occur in different orders than disclosed in the description, and sometimes there is no specific order between the different operations or steps. For example, two sequential operations or steps may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Claims (10)

1. A method for identifying malware, comprising:
acquiring a calling feature map of software to be identified;
symbolizing the calling feature map to obtain a symbolic feature map;
inputting the symbolic feature map into a preset malicious software recognition model to obtain a recognition index; wherein the malware identification model comprises a graph convolutional layer; the graph convolution layer is used for acquiring an input feature graph, acquiring a node direction set of each node in the input feature graph, and then performing convolution operation according to the node direction set and the input feature graph to acquire a convolution feature graph; the node direction set comprises a target node, an inflow node subset and an outflow node subset; the input feature map is a symbolic feature map or a convolution feature map;
and identifying whether the software to be identified is malicious software according to the identification index.
2. The method of claim 1, wherein obtaining the call feature map of the software to be identified comprises:
acquiring a component list of the software to be identified;
acquiring a system call list according to the component list;
and acquiring the calling characteristic diagram of the software to be identified according to the system calling list.
3. The method of claim 2, wherein obtaining a list of components of software to be identified comprises:
decompressing an installation file of software to be identified to obtain an access list file;
acquiring a component list of the software to be identified according to the access list file; the component list includes package names and executable component names in the access manifest file.
4. The method of claim 2, wherein obtaining a list of system calls from the list of components comprises:
and traversing and executing each component of the software to be identified according to the component list to obtain a system call list.
5. The method of claim 2, wherein the calling feature map comprises a plurality of nodes and directed edges between the nodes; and acquiring the calling feature map of the software to be identified according to the system call list comprises:
acquiring a system call ID, a system call frequency and a system call sequence according to the system call list;
determining the nodes of the calling feature graph according to the system calling ID and the system calling frequency;
and acquiring directed edges among the nodes in the calling feature graph according to the system calling sequence.
6. The method of claim 1, wherein inputting the symbolic feature map into a preset malware recognition model to obtain a recognition index comprises:
inputting the symbolic feature map into a preset number of graph convolutional layers for graph convolution operations, and determining the convolution feature map output by the last graph convolutional layer as a first feature representation;
adjusting the dimensionality of the first feature representation to a preset number of dimensions through a fully connected layer of the malware recognition model to obtain a second feature representation;
and bipolarizing the second feature representation through a softmax function of the malware recognition model to obtain the recognition index.
7. The method of claim 1, wherein obtaining a malware recognition model comprises:
collecting a plurality of software samples;
acquiring a sample calling feature map of the software sample;
symbolizing the sample calling feature map to obtain a sample symbolic feature map;
inputting the sample symbolic feature map with the sample label into a preset neural network model for training to obtain a malicious software identification model; the sample tags include normal software tags and malware tags.
8. The method according to any one of claims 1 to 7, wherein performing a convolution operation on the set of node directions and the input feature map to obtain a convolution feature map comprises:
respectively acquiring the adjacency matrices corresponding to the target node, the inflow node subset and the outflow node subset in the node direction set, to obtain a target node adjacency matrix, an inflow node adjacency matrix and an outflow node adjacency matrix;
and fusing the input feature graph, the target node adjacency matrix, the inflow node adjacency matrix and the outflow node adjacency matrix to obtain a convolution feature graph.
9. An apparatus for identifying malware comprising a processor and a memory storing program instructions, wherein the processor is configured to perform the method for identifying malware according to any one of claims 1 to 8 when executing the program instructions.
10. A device comprising the apparatus for identifying malware of claim 9.
CN202110404930.3A 2021-04-15 2021-04-15 Method, device and equipment for identifying malicious software Active CN112989347B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110404930.3A CN112989347B (en) 2021-04-15 2021-04-15 Method, device and equipment for identifying malicious software

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110404930.3A CN112989347B (en) 2021-04-15 2021-04-15 Method, device and equipment for identifying malicious software

Publications (2)

Publication Number Publication Date
CN112989347A true CN112989347A (en) 2021-06-18
CN112989347B CN112989347B (en) 2023-06-09

Family

ID=76340590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110404930.3A Active CN112989347B (en) 2021-04-15 2021-04-15 Method, device and equipment for identifying malicious software

Country Status (1)

Country Link
CN (1) CN112989347B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101013452A (en) * 2007-02-05 2007-08-08 江苏大学 Symbolized model detection method
CN103839005A (en) * 2013-11-22 2014-06-04 北京智谷睿拓技术服务有限公司 Malware detection method and malware detection system of mobile operating system
CN104931263A (en) * 2015-06-18 2015-09-23 东南大学 Bearing fault diagnosis method based on symbolic probabilistic finite state machine
CN109829306A (en) * 2019-02-20 2019-05-31 哈尔滨工程大学 A kind of Malware classification method optimizing feature extraction
CN110135157A (en) * 2019-04-04 2019-08-16 国家计算机网络与信息安全管理中心 Malware homology analysis method, system, electronic equipment and storage medium
CN111259388A (en) * 2020-01-09 2020-06-09 中山大学 Malicious software API (application program interface) calling sequence detection method based on graph convolution
CN111382428A (en) * 2018-12-29 2020-07-07 北京奇虎科技有限公司 Malicious software recognition model training method, malicious software recognition method and device
CN112163222A (en) * 2020-10-10 2021-01-01 哈尔滨工业大学(深圳) Malicious software detection method and device
CN112651024A (en) * 2020-12-29 2021-04-13 重庆大学 Method, device and equipment for malicious code detection

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
NIKOLOPOULOS SD: "A graph-based model for malware detection and classification", 《JOURNAL OF COMPUTER VIROLOGY AND HACKING TECHNIQUES》, vol. 1, no. 13, pages 29 - 46 *
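Li Zhoujun: "Software security vulnerability detection techniques", 《Chinese Journal of Computers》, vol. 38, no. 4, pages 717 - 732 *
Wang Run: "DeepRD: An Android repackaged application detection method based on Siamese LSTM network", 《Journal on Communications》, vol. 39, no. 8, pages 69 - 82 *
Wang Zhiqiang: "An Android malicious behavior detection algorithm", 《Journal of Xidian University (Natural Science Edition)》, vol. 42, no. 3, pages 8 - 14 *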

Also Published As

Publication number Publication date
CN112989347B (en) 2023-06-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant