CN112989347A

CN112989347A - Method, device and equipment for identifying malicious software

Info

Publication number: CN112989347A
Application number: CN202110404930.3A
Authority: CN
Inventors: 周庆; 杨盾; 葛亮; 仲元红; 黄智勇; 钟代笛
Original assignee: Chongqing University
Current assignee: Chongqing University
Priority date: 2021-04-15
Filing date: 2021-04-15
Publication date: 2021-06-18
Anticipated expiration: 2041-04-15
Also published as: CN112989347B

Abstract

The application relates to the technical field of artificial intelligence and discloses a method for identifying malicious software, comprising: acquiring a symbolic feature map of the software to be identified; inputting the symbolic feature map into a malicious software identification model to obtain an identification index, wherein a graph convolution layer of the model acquires an input feature map, acquires a target node, an inflow node subset and an outflow node subset for each node in the input feature map, and then performs a convolution operation to obtain a convolution feature map, the input feature map being either the symbolic feature map or a convolution feature map; and identifying whether the software to be identified is malicious software according to the identification index. Because the method takes the direction information between nodes into account, the identification index obtained from the model characterizes the software to be identified more accurately, which improves the accuracy of identifying malicious software. The application also discloses a device and equipment for identifying malicious software.

Description

Method, device and equipment for identifying malicious software

Technical Field

The present application relates to the technical field of artificial intelligence, and for example, to a method, an apparatus, and a device for identifying malware.

Background

At present, internet-connected devices such as tablet computers and smart televisions are widely used by the public, and the software on these devices affects daily life. However, malicious code writers can use malicious software to attack devices and users, causing user data leakage and threatening internet security. How to identify and guard against malicious software is therefore of great significance.

In the process of implementing the embodiments of the present disclosure, it is found that at least the following problems exist in the related art:

In the prior art, methods that identify malicious software with a neural network model do not consider the direction information between calls, so the accuracy of identifying malicious software is low.

Disclosure of Invention

The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview nor is intended to identify key/critical elements or to delineate the scope of such embodiments but rather as a prelude to the more detailed description that is presented later.

The embodiment of the disclosure provides a method, a device and equipment for identifying malicious software, so that the accuracy rate of identifying the malicious software can be improved.

In some embodiments, a method for identifying malware includes:

acquiring a calling feature map of software to be identified;

symbolizing the calling feature map to obtain a symbolic feature map;

inputting the symbolic feature map into a preset malicious software recognition model to obtain a recognition index; wherein the malware identification model comprises a graph convolutional layer; the graph convolution layer is used for acquiring an input feature graph, acquiring a node direction set of each node in the input feature graph, and then performing convolution operation according to the node direction set and the input feature graph to acquire a convolution feature graph; the node direction set comprises a target node, an inflow node subset and an outflow node subset; the input feature map is a symbolic feature map or a convolution feature map;

and identifying whether the software to be identified is malicious software according to the identification index.

In some embodiments, the means for identifying malware comprises: a processor and a memory storing program instructions, the processor being configured to, upon execution of the program instructions, perform the above-described method for identifying malware.

In some embodiments, an apparatus includes the above-described means for identifying malware.

The method, the device and the equipment for identifying malicious software provided by the embodiments of the disclosure can achieve the following technical effects: the calling feature map of the software to be identified is symbolized to obtain a symbolic feature map, the symbolic feature map is input into a preset malicious software identification model to obtain an identification index, and the software to be identified is identified according to the identification index. The graph convolution layer of the malicious software identification model acquires the symbolic feature map or a convolution feature map and acquires the node direction set of each node in it, the node direction set comprising an inflow node subset, an outflow node subset and a target node. Because the direction information between nodes is thereby taken into account, the software to be identified is identified more accurately according to the identification index, which improves the accuracy of identifying malicious software.

The foregoing general description and the following description are exemplary and explanatory only and are not restrictive of the application.

Drawings

One or more embodiments are illustrated by way of example in the accompanying drawings; these illustrations do not limit the embodiments. Elements having the same reference numerals in the drawings denote like elements, and wherein:

FIG. 1 is a schematic diagram of a method for identifying malware according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of an input feature map provided by embodiments of the present disclosure;

FIG. 3 is a schematic diagram of an apparatus for identifying malware according to an embodiment of the present disclosure.

Detailed Description

So that the manner in which the features and elements of the disclosed embodiments can be understood in detail, a more particular description of the disclosed embodiments, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. In the following description of the technology, for purposes of explanation, numerous details are set forth in order to provide a thorough understanding of the disclosed embodiments. However, one or more embodiments may be practiced without these details. In other instances, well-known structures and devices may be shown in simplified form in order to simplify the drawing.

The terms "first," "second," and the like in the description and in the claims, and the above-described drawings of embodiments of the present disclosure, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the present disclosure described herein may be made. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions.

The term "plurality" means two or more unless otherwise specified.

In the embodiments of the present disclosure, the character "/" indicates that the preceding and following objects are in an "or" relationship. For example, A/B represents: A or B.

The term "and/or" describes an associative relationship between objects and means that three relationships may exist. For example, A and/or B represents: A, or B, or A and B.

As shown in fig. 1, an embodiment of the present disclosure provides a method for identifying malware, including:

S01, acquiring a calling feature map of the software to be identified;

S02, symbolizing the calling feature map to obtain a symbolic feature map;

S03, inputting the symbolic feature map into a preset malicious software identification model to obtain an identification index; the malicious software identification model comprises a graph convolution layer; the graph convolution layer is used for acquiring an input feature map, acquiring a node direction set of each node in the input feature map, and then performing a convolution operation according to the node direction set and the input feature map to obtain a convolution feature map; the node direction set comprises a target node, an inflow node subset and an outflow node subset; the input feature map is the symbolic feature map or a convolution feature map;

S04, identifying whether the software to be identified is malicious software according to the identification index.

With the method for identifying malicious software provided by the embodiments of the disclosure, the calling feature map of the software to be identified is symbolized to obtain a symbolic feature map, the symbolic feature map is input into a preset malicious software identification model to obtain an identification index, and the software to be identified is identified according to the identification index. The graph convolution layer of the malicious software identification model acquires the symbolic feature map or a convolution feature map and acquires the node direction set of each node in it, the node direction set comprising an inflow node subset, an outflow node subset and a target node. Because the direction information between nodes is thereby taken into account, the software to be identified is identified more accurately according to the identification index, which improves the accuracy of identifying malicious software.

In some embodiments, as in the schematic diagram of the input feature map shown in FIG. 2, the node direction set of the node δ includes: the target node δ, an inflow node subset α and an outflow node subset γ.
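The node direction set described above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: it assumes that the inflow subset of a node is the set of nodes with an edge pointing to it and that the outflow subset is the set of nodes it points to. The function name `node_direction_sets` and the four-node graph are hypothetical.

```python
from collections import defaultdict


def node_direction_sets(nodes, edges):
    """Return {node: (target, inflow_subset, outflow_subset)} for a directed graph.

    Assumption (not spelled out in the source): inflow = predecessors of the
    node, outflow = successors of the node.
    """
    inflow = defaultdict(set)
    outflow = defaultdict(set)
    for src, dst in edges:
        outflow[src].add(dst)   # src points to dst
        inflow[dst].add(src)    # dst receives an edge from src
    return {n: (n, inflow[n], outflow[n]) for n in nodes}


# Hypothetical graph: alpha -> delta, beta -> delta, delta -> gamma
nodes = ["alpha", "beta", "gamma", "delta"]
edges = [("alpha", "delta"), ("beta", "delta"), ("delta", "gamma")]
sets = node_direction_sets(nodes, edges)
target, inflow_subset, outflow_subset = sets["delta"]
print(target, sorted(inflow_subset), sorted(outflow_subset))
```

Keeping the three parts separate is what lets a later layer apply a different weight to each direction.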

Optionally, acquiring the calling feature map of the software to be identified includes: acquiring a component list of the software to be identified; acquiring a system call list according to the component list; and acquiring the calling feature map of the software to be identified according to the system call list. In some embodiments, the software to be identified is Android platform software.

Optionally, acquiring the component list of the software to be identified includes: decompressing an installation file of the software to be identified to obtain a manifest file (access list file); and acquiring the component list of the software to be identified according to the manifest file. The component list of the software to be identified includes the package name and the runnable component names in the manifest file.

Optionally, decompressing the installation file of the software to be identified to obtain the manifest file (access list file) includes: decompiling the installation file of the software to be identified with a decompilation tool to obtain the manifest file. In some embodiments, the APK (Android Application Package) installation file of the software to be identified is decompiled with APKTool (an Android application package decompilation tool) to obtain the manifest file of the software to be identified; package names and runnable component names are retrieved from the Services and Activities of the manifest file; and all the retrieved package names and runnable component names are determined as the component list.
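As a rough illustration of the retrieval step, the sketch below reads the package name and the Activity and Service names from a manifest that has already been decoded to text XML. The function name `component_list` and the sample manifest are hypothetical; decoding the APK itself is assumed to have been done beforehand by a decompilation tool and is not shown.

```python
import xml.etree.ElementTree as ET

# Namespace used by android:* attributes in a decoded manifest.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"


def component_list(manifest_xml):
    """Return (package_name, [activity and service names]) from manifest text."""
    root = ET.fromstring(manifest_xml)
    package = root.get("package")
    names = []
    app = root.find("application")
    if app is not None:
        for tag in ("activity", "service"):
            for elem in app.findall(tag):
                name = elem.get(ANDROID_NS + "name")
                if name:
                    names.append(name)
    return package, names


# Hypothetical decoded manifest, for illustration only.
SAMPLE = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
  <application>
    <activity android:name=".MainActivity"/>
    <service android:name=".SyncService"/>
  </application>
</manifest>"""

package, components = component_list(SAMPLE)
print(package, components)
```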

Optionally, the obtaining the system call list according to the component list of the software to be recognized includes: and traversing and executing each component of the software to be identified according to the component list of the software to be identified to obtain a system call list.

Optionally, obtaining the system call list according to the component list of the software to be identified includes: while the software to be identified is loaded through a virtual machine tool, extracting the system calls of each component in the component list through a tracing tool, and determining all the extracted system calls as the system call list. In some embodiments, while the software to be identified is loaded in Genymotion (an Android virtual machine), the system calls of each component in the component list are extracted by strace (a tracing tool), all the extracted system calls are determined as the system call list using ADB (Android Debug Bridge) and visual studio file pipes, and the system call list is downloaded to the host. In some embodiments, the system calls are obtained by a first thread and a second thread: the first thread loads the software to be identified through the virtual machine tool, and the second thread extracts the system calls of the components in the component list through the tracing tool. With one thread loading the software and the other recording system calls, the software can be executed automatically without depending on user interaction or a random event generator, which improves the efficiency of obtaining the system call list.
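Once a trace log has been pulled to the host, turning it into an ordered system call list is a small parsing step. The sketch below assumes strace-style lines of the form `name(args) = result`, optionally preceded by a PID column; the log excerpt and the function name `parse_trace` are hypothetical, and real logs may need additional cases (for example, unfinished or resumed calls).

```python
import re

# Matches the syscall name at the start of an strace-style line,
# optionally preceded by a "[pid N]" or bare PID column.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def parse_trace(lines):
    """Return the ordered list of system call names found in the log lines."""
    calls = []
    for line in lines:
        match = SYSCALL_RE.match(line.strip())
        if match:  # signal and exit lines do not match and are skipped
            calls.append(match.group(1))
    return calls


# Hypothetical log excerpt, for illustration only.
LOG = """\
read(3, "abc", 4096) = 3
ioctl(5, BINDER_WRITE_READ, 0x7ffc) = 0
--- SIGCHLD {si_signo=SIGCHLD} ---
[pid  1234] getuid32() = 10057
1234 fcntl64(7, F_GETFL) = 0x2
+++ exited with 0 +++
""".splitlines()

call_list = parse_trace(LOG)
print(call_list)
```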

Optionally, after all the extracted system calls are determined as the system call list, the method further includes: resetting the virtual machine tool. This prevents the malicious software from attacking the environment and prevents subsequent extraction from failing.

Optionally, the system call list comprises several Linux kernel system calls. On the Android platform, the Linux system interacts with the hardware of the device, and the application programming interfaces (APIs) run on top of the Linux system. A malicious code writer can evade identification by replacing the API framework, but the corresponding service requests still depend on the system calls provided by the Linux kernel. Compared with prior-art approaches that identify malicious software through API calls, identifying it through Linux kernel system calls improves the accuracy of identification.

Optionally, the calling feature graph includes a plurality of nodes and directed edges between the nodes. Acquiring the calling feature graph of the software to be identified according to the system call list includes: acquiring system call IDs, system call frequencies and a system call order according to the system call list; determining the nodes of the calling feature graph according to the system call IDs and the system call frequencies; and acquiring the directed edges between the nodes of the calling feature graph according to the system call order. Optionally, the calling feature graph is obtained from all the nodes and the directed edges between them.

In some embodiments, the system call IDs are used as nodes, and the system call frequency corresponding to each system call ID is determined as the integer value of the corresponding node.

Optionally, the calling feature graph is $G = (V, E)$, where $G$ is the calling feature graph, $V$ is its node set, and $E$ is its directed edge set: $V = \{v_i,\ i = 1, \dots, n\}$ and $E = \{e_j,\ j = 1, \dots, l\}$, where $v_i$ is the $i$-th node of the node set, $e_j$ is the $j$-th directed edge of the directed edge set, $n$ is the number of nodes in the node set, and $l$ is the number of directed edges in the directed edge set. Optionally, the weight of a directed edge is the number of calls between the corresponding two nodes; for example, the weight of the directed edge from node $v_p$ to node $v_q$ is the number of times node $v_p$ calls node $v_q$, where $1 \le p \le n$, $1 \le q \le n$, and $p$ and $q$ are positive integers.

In some embodiments, the system call list includes 5 system call IDs, which in system call order are read, clockgettime, getuid32, ioctl and fcntl64, and the corresponding system call frequencies are 1, 4, 4, 4 and 1, respectively. The integer value of node read is therefore determined to be 1, that of node clockgettime 4, that of node getuid32 4, that of node ioctl 4, and that of node fcntl64 1. According to the system call order, read calls clockgettime once, so the weight of the directed edge from read to clockgettime is determined to be 1.
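A small sketch of how node values and edge weights might be derived from an ordered system call list follows. It assumes that a directed edge u -> v is recorded whenever v immediately follows u in the list, which is one plausible reading of "system call order". The function name `build_call_graph` is hypothetical, and the sequence is made up so that its counts match the node values 1, 4, 4, 4, 1 of the embodiment above; the source gives only the counts, not the sequence.

```python
from collections import Counter


def build_call_graph(call_sequence):
    """Return (node_values, edge_weights) for an ordered list of syscall names.

    node_values[name]      -> how often the call occurs (the node's integer value)
    edge_weights[(u, v)]   -> how often v immediately follows u (assumed edge rule)
    """
    node_values = Counter(call_sequence)
    edge_weights = Counter(zip(call_sequence, call_sequence[1:]))
    return dict(node_values), dict(edge_weights)


# Hypothetical sequence consistent with the counts in the embodiment.
sequence = ["read"] + ["clockgettime", "getuid32", "ioctl"] * 4 + ["fcntl64"]
node_values, edge_weights = build_call_graph(sequence)
print(node_values)
print(edge_weights[("read", "clockgettime")])
```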

Optionally, symbolizing the calling feature map to obtain the symbolic feature map includes: symbolizing the calling feature map $G = (V, E)$ to obtain the symbolic feature map $\Gamma = (V_\Gamma, A_\Gamma, M_\Gamma, X_\Gamma)$, where $V_\Gamma$ is the system call set, $A_\Gamma$ is the call relationship matrix, $M_\Gamma$ is the call count matrix (number-of-calls matrix), and $X_\Gamma$ is the node feature representation set.

Optionally, the node set $V$ of the calling feature graph is determined as the system call set $V_\Gamma$ of the symbolic feature map.

Optionally, the call relationship matrix $A_\Gamma$ is a two-dimensional matrix. Optionally, $A_\Gamma \in (a_{xy})^{N \times N}$, where $x$ and $y$ are positive integers not greater than $N$, $N$ is the number of system calls in the system call set $V_\Gamma$ (that is, the number of nodes), and $a_{xy}$ is the call relationship of the $x$-th node calling the $y$-th node. Optionally, $a_{xy}$ is determined to be 1 when the directed edge from the $x$-th node to the $y$-th node belongs to the directed edge set $E$ of the calling feature graph, and 0 when it does not.

Optionally, the call count matrix $M_\Gamma$ is a two-dimensional matrix. Optionally, $M_\Gamma \in (m_{xy})^{N \times N}$, where $x$ and $y$ are positive integers not greater than $N$, and $m_{xy}$ is the number of times the $x$-th node calls the $y$-th node. Optionally, the number of calls between two nodes in the symbolic feature map is the weight of the corresponding directed edge between those two nodes in the calling feature graph.

Optionally, the node feature representation set $X_\Gamma$ contains one node feature representation for each node $v_i$ in the node set $V$. Optionally, a feature vector of length $N$ is determined as the node feature representation of node $v_i$, where the $i$-th element of the feature vector is 1 and the remaining elements are 0. Optionally, the node feature representation set is used by the neural network model to update the node features of each node in the symbolic feature map. Because the node features are updated from the node feature representation set, the resulting identification index is discriminative, which makes malicious software easier to evaluate and improves the accuracy of the method.
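The symbolization step can be illustrated with plain nested lists. The sketch below builds the call relationship matrix (1 where a directed edge exists), the call count matrix (the edge weights) and one-hot node features from a node list and weighted edges. The function name `symbolize`, the variable names and the three-node input are hypothetical.

```python
def symbolize(node_names, edge_weights):
    """Return (V, A, M, X) for a weighted directed graph.

    A[x][y] = 1 if an edge x -> y exists, else 0   (call relationship matrix)
    M[x][y] = number of calls from x to y          (call count matrix)
    X[i]    = one-hot vector of length N for node i (node feature representation)
    """
    n = len(node_names)
    index = {name: i for i, name in enumerate(node_names)}
    A = [[0] * n for _ in range(n)]
    M = [[0] * n for _ in range(n)]
    for (src, dst), weight in edge_weights.items():
        x, y = index[src], index[dst]
        A[x][y] = 1
        M[x][y] = weight
    X = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    return list(node_names), A, M, X


# Hypothetical three-node graph, for illustration only.
names = ["read", "ioctl", "fcntl64"]
weights = {("read", "ioctl"): 3, ("ioctl", "fcntl64"): 1}
V, A, M, X = symbolize(names, weights)
print(A)
print(M)
print(X)
```

With one-hot features the feature dimension equals the number of nodes, which matches the statement that D = N.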

Optionally, the call count matrix $M_\Gamma \in (m_{xy})^{N \times N}$ is mapped according to a preset real number set $R$ to obtain $M_\Gamma \in R^{N \times N}$. Optionally, the node feature representation set is mapped according to the preset real number set $R$ to obtain $X_\Gamma \in R^{N \times D}$, where $D$ is the dimension of a node feature representation and $D = N$. Mapping onto the real number set $R$ makes it convenient for the neural network model to update the node features, which improves the efficiency of the model. Optionally, the real number set $R$ includes several elements, all of which are real numbers.

The system call IDs of the system call list are determined as nodes, and the system call order in the system call list determines the directed edges between the nodes, which yields the calling feature graph. The calling feature graph is symbolized to obtain the symbolic feature map, so that the symbolic feature map contains the call order information between the system calls. The symbolic feature map is input into the preset malicious software identification model, whose graph convolution layer acquires the symbolic feature map or a convolution feature map and acquires the node direction set of each node in it; the node direction set comprises an inflow node subset, an outflow node subset and a target node. Because the direction information between nodes, that is, the call order information between system calls, is taken into account, the identification index obtained from the model identifies the software to be identified more accurately, which improves the accuracy of identifying malicious software.

Optionally, inputting the symbolic feature map into the preset malicious software identification model to obtain the identification index includes: inputting the symbolic feature map into a preset number of graph convolution layers for graph convolution operations, and determining the convolution feature map output by the last graph convolution layer as a first feature representation; adjusting the dimensionality of the first feature representation to a preset dimension number through a fully connected layer of the malicious software identification model to obtain a second feature representation; and bipolarizing the second feature representation through the softmax function of the malicious software identification model to obtain the identification index. Optionally, the preset number of layers is greater than 2 and less than 11. Optionally, the preset dimension number is 2.

Optionally, performing the convolution operation according to the node direction set and the input feature map to obtain the convolution feature map includes: acquiring the adjacency matrices corresponding to the target node, the inflow node subset and the outflow node subset in the node direction set, respectively, to obtain a target node adjacency matrix, an inflow node adjacency matrix and an outflow node adjacency matrix; and fusing the input feature map, the target node adjacency matrix, the inflow node adjacency matrix and the outflow node adjacency matrix to obtain the convolution feature map.

Optionally, an adjacency matrix corresponding to the target node in the node direction set is obtained, and a target node adjacency matrix is obtained.

Optionally, obtaining the adjacency matrix corresponding to the inflow node subset in the node direction set to obtain the inflow node adjacency matrix includes: constructing the inflow node adjacency matrix $A_{in}$ element by element, where $(A_{in})_{x'y'}$ is the element in row $x'$ and column $y'$ of the inflow node adjacency matrix; $(A_{in})_{x'y'}$ is determined to be 1 when there is a directed edge from the $x'$-th node to the $y'$-th node of the inflow node subset, and 0 when there is no such directed edge.

Optionally, obtaining the adjacency matrix corresponding to the outflow node subset in the node direction set to obtain the outflow node adjacency matrix includes: constructing the outflow node adjacency matrix $A_{out}$ element by element, where $(A_{out})_{x''y''}$ is the element in row $x''$ and column $y''$ of the outflow node adjacency matrix; $(A_{out})_{x''y''}$ is determined to be 1 when there is a directed edge from the $x''$-th node to the $y''$-th node of the outflow node subset, and 0 when there is no such directed edge.

Optionally, the convolution feature map is obtained by calculation, where $X^{l+1}$ is the convolution feature map output by the graph convolution layer, $X^{l}$ is the input feature map fed to the graph convolution layer, ReLU is the nonlinear activation function, $K_B$ is the number of subsets in the node direction set and equals 3, $M_k$ is the subset call count matrix corresponding to the $k$-th subset in the node direction set, $k$ is a positive integer with $k \le 3$, $\hat{B}_k$ is the normalized representation of the adjacency matrix corresponding to the $k$-th subset in the node direction set, and $W_k$ is a trainable parameter of the preset convolution function.

Optionally, when $k = 1$, $M_k$ is the subset call count matrix corresponding to the inflow node subset, and $\hat{B}_k$ is the normalized representation of the inflow node adjacency matrix. Optionally, the subset call count matrix $M_{in}$ corresponding to the inflow node subset is an $N_{in} \times N_{in}$ matrix whose element in row $p$ and column $q$ is the number of times the $p$-th node calls the $q$-th node, where $N_{in}$ is the number of nodes in the inflow node subset and the $p$-th and $q$-th nodes are both nodes in the inflow node subset.

Optionally, when $k = 2$, $M_k$ is the subset call count matrix corresponding to the target node, and $\hat{B}_k$ is the normalized representation of the target node adjacency matrix. Optionally, $M_{self}$ is the subset call count matrix corresponding to the target node, where $M_{self} = \{0\}$.

Optionally, when $k = 3$, $M_k$ is the subset call count matrix corresponding to the outflow node subset, and $\hat{B}_k$ is the normalized representation of the outflow node adjacency matrix. Optionally, the subset call count matrix $M_{out}$ corresponding to the outflow node subset is an $N_{out} \times N_{out}$ matrix whose element in row $p'$ and column $q'$ is the number of times the $p'$-th node calls the $q'$-th node, where $N_{out}$ is the number of nodes in the outflow node subset and the $p'$-th and $q'$-th nodes are both nodes in the outflow node subset.

Optionally, the normalized representation $\hat{B}_k$ of the adjacency matrix corresponding to the $k$-th subset in the node direction set is obtained by calculation from $B_k$, $D_k$ and $I$, where $B_k$ is the adjacency matrix corresponding to the $k$-th subset in the node direction set, $D_k \in R^{N' \times N'}$ is the diagonal matrix corresponding to $B_k$, $I \in R^{N' \times N'}$, and $N'$ is the number of nodes of the $k$-th subset in the node direction set. Optionally, when $k = 1$, $B_k$ is the inflow node adjacency matrix in the node direction set; when $k = 2$, $B_k$ is the target node adjacency matrix in the node direction set; and when $k = 3$, $B_k$ is the outflow node adjacency matrix in the node direction set.

Optionally, the corresponding diagonal matrix $D_k$ is obtained by calculation.
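The normalization formula itself is not reproduced in the text above. As a hedged illustration only, the sketch below applies the symmetric normalization commonly used in graph convolutional networks, $\hat{B} = D^{-1/2}(B + I)D^{-1/2}$, with $I$ taken to be the identity matrix and $D$ the diagonal matrix of row sums of $B + I$. Both the formula and the function name `normalize_adjacency` are assumptions, not the patent's stated definition.

```python
import math


def normalize_adjacency(B):
    """Symmetric normalization D^{-1/2} (B + I) D^{-1/2} (assumed form).

    B is a square adjacency matrix given as nested lists with non-negative
    entries; self-loops are added so every row sum is at least 1.
    """
    n = len(B)
    b_self = [[B[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    degrees = [sum(row) for row in b_self]            # diagonal of D
    inv_sqrt = [1.0 / math.sqrt(d) for d in degrees]  # diagonal of D^{-1/2}
    return [[inv_sqrt[i] * b_self[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]


# Hypothetical two-node subset with a single edge 0 -> 1.
B = [[0, 1], [0, 0]]
B_hat = normalize_adjacency(B)
print(B_hat)
```

Adding the self-loop before computing degrees avoids division by zero for nodes without edges inside the subset.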

The symbolic feature map is input into the first graph convolution layer, which acquires the node direction sets of the symbolic feature map, fuses them, and outputs a convolution feature map. The convolution feature map output by the first graph convolution layer is input into the second graph convolution layer, which acquires the node direction sets of the input convolution feature map, fuses them, and outputs a convolution feature map; that output is in turn input into the next graph convolution layer. After the preset number of graph convolution layers has been passed, the convolution feature map output by the last layer is determined as the first feature representation. Stacking multiple graph convolution layers models the complex interaction information between nodes while taking the direction information between nodes into account, so that the identification index obtained from the malicious software identification model identifies the software to be identified more accurately, which improves the accuracy of identifying malicious software.

Optionally, bipolarizing the second feature representation through the softmax function of the malicious software identification model to obtain the identification index includes: calculating the identification index by applying the softmax function to the second feature representation $X_{out}$ together with a preset parameter matrix $W$.

Optionally, identifying whether the software to be identified is malicious software according to the identification index includes: acquiring an identification result $r$ from the probability, given by the identification index, that the software to be identified is malicious software. When the identification result is $J$, the software to be identified is determined to be malicious software; when the identification result is not $J$, the software to be identified is determined to be common software.
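To make the last two steps concrete, the sketch below applies a softmax to a two-dimensional feature vector and derives a decision from the resulting malware probability. It is illustrative only: the ordering of the two outputs (index 1 = malware), the 0.5 threshold, and the omission of the parameter matrix $W$ (treated here as already applied) are all assumptions, and the names `softmax` and `identify` are hypothetical. The patent's exact decision rule is not reproduced in the text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def identify(second_feature, malware_index=1, threshold=0.5):
    """Return (malware_probability, is_malware) for a 2-dimensional feature.

    Assumptions: second_feature already includes any learned projection,
    position `malware_index` is the malware class, and the decision is a
    simple threshold on its probability.
    """
    probabilities = softmax(second_feature)
    p_malware = probabilities[malware_index]
    return p_malware, p_malware > threshold


p_even, flag_even = identify([0.0, 0.0])
p_high, flag_high = identify([0.0, 2.0])
print(p_even, flag_even)
print(round(p_high, 4), flag_high)
```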

Optionally, the method for obtaining the malicious software identification model includes: collecting a plurality of software samples; acquiring a sample calling feature map of each software sample; symbolizing the sample calling feature map to obtain a sample symbolic feature map; and inputting the sample symbolic feature map with its sample label into a preset neural network model for training to obtain the malicious software identification model. The sample labels include a common software label and a malicious software label. The neural network model comprises a graph convolution layer; the graph convolution layer is used for acquiring a sample input feature map, acquiring a sample node direction set of each sample node in the sample input feature map, and then performing a convolution operation according to the sample node direction set and the sample input feature map to obtain a sample convolution feature map; the sample node direction set comprises a sample target node, a sample inflow node subset and a sample outflow node subset; the sample input feature map is the sample symbolic feature map or a sample convolution feature map.

Optionally, acquiring the sample calling feature map of a software sample includes: decompiling the installation file of the software sample with a decompilation tool to obtain a sample manifest file (sample access list file); acquiring a sample component list of the software sample according to the sample manifest file, the sample component list comprising the package name and the runnable component names in the sample manifest file; while the software sample is loaded through a virtual machine tool, extracting the sample system calls of each component in the sample component list through a tracing tool, and determining all the extracted sample system calls as a sample system call list; acquiring sample system call IDs, sample system call frequencies and a sample system call order according to the sample system call list; determining sample nodes according to the sample system call IDs and the sample system call frequencies; acquiring the sample directed edges between the sample nodes according to the sample system call order; and obtaining the sample calling feature map from all the sample nodes and the sample directed edges between them.

Optionally, inputting the sample symbolic feature map with the sample label into the preset neural network model for training to obtain the malware identification model includes: inputting the sample symbolic feature map with the sample label into the preset neural network model, and recording the training loss value of each training period; and when a preset number of consecutive training loss values are all not lower than the lowest of all the training loss values, stopping model training and determining the model obtained in the last training period as the malware identification model. Optionally, the preset number is a positive integer greater than 3. Optionally, obtaining the training loss value of one training period includes: inputting the sample symbolic feature map into a preset number of graph convolution layers for graph convolution operations, and determining the sample convolution feature map output by the last graph convolution layer as a first sample feature representation; adjusting the dimensionality of the first sample feature representation to a preset number of dimensions through a fully connected layer of the preset neural network model to obtain a second sample feature representation; applying two-class polarization to the second sample feature representation through a softmax function of the preset neural network model to obtain a sample index; and obtaining the training loss value of the period according to the sample index and a loss function.
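The stopping rule above can be illustrated with a short Python sketch. It takes one possible reading of the rule as an assumption: training stops once the most recent `patience` loss values all fail to fall below the lowest loss recorded before them. The names `should_stop`, `train_with_early_stopping` and `patience`, as well as the loss values, are hypothetical; the default of 4 merely satisfies the optional condition that the preset number is greater than 3.

```python
def should_stop(losses, patience):
    """True when the last `patience` losses are all >= the earlier minimum."""
    if len(losses) <= patience:
        return False  # not enough history to compare against
    best_before = min(losses[:-patience])
    return all(loss >= best_before for loss in losses[-patience:])


def train_with_early_stopping(loss_per_epoch, patience=4):
    """Consume per-period loss values and return how many periods were run.

    `loss_per_epoch` stands in for a real training loop that would produce
    one loss value per training period.
    """
    history = []
    for loss in loss_per_epoch:
        history.append(loss)
        if should_stop(history, patience):
            break
    return len(history)


# Hypothetical loss curve: the minimum 0.6 is not improved on for 4 periods
epochs_run = train_with_early_stopping(
    [1.0, 0.8, 0.6, 0.7, 0.65, 0.6, 0.9, 0.5]
)
```

With this loss curve the loop ends after the seventh period, so the final value 0.5 is never reached; a larger preset number would tolerate longer plateaus.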

Optionally, obtaining the sample label of the sample symbolic feature map includes: analyzing the software sample corresponding to the sample symbolic feature map through an anti-malware engine or manual identification to obtain an analysis result; determining the malware label as the sample label of the sample symbolic feature map when the analysis result indicates that the software sample is malware; and determining the normal software label as the sample label of the sample symbolic feature map when the analysis result indicates that the software sample is normal software.

As shown in Fig. 3, an apparatus for identifying malware according to an embodiment of the present disclosure includes a processor 100 and a memory 101. Optionally, the apparatus may also include a communication interface 102 and a bus 103. The processor 100, the communication interface 102, and the memory 101 may communicate with each other via the bus 103. The communication interface 102 may be used for information transfer. The processor 100 may call logic instructions in the memory 101 to perform the method for identifying malware of the above-described embodiments.

In addition, when sold or used as an independent product, the logic instructions in the memory 101 may be implemented in the form of software functional units and stored in a computer-readable storage medium.

The memory 101, which is a computer-readable storage medium, may be used for storing software programs, computer-executable programs, such as program instructions/modules corresponding to the methods in the embodiments of the present disclosure. The processor 100 executes functional applications and data processing, i.e., implements the method for identifying malware in the above-described embodiments, by executing program instructions/modules stored in the memory 101.

The memory 101 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal device, and the like. In addition, the memory 101 may include a high-speed random access memory, and may also include a nonvolatile memory.

By adopting the apparatus for identifying malware, the calling feature map of the software to be identified is symbolized to obtain a symbolic feature map, the symbolic feature map is input into a preset malware identification model to obtain an identification index, and the software to be identified is identified according to the identification index. The graph convolution layer of the malware identification model obtains a symbolic feature map or a convolution feature map, and obtains a node direction set of each node in that feature map, the node direction set comprising an inflow node subset, an outflow node subset and a target node.
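To make the role of the node direction set concrete, the following Python sketch shows one way a graph convolution layer could use direction information. The fusion rule is an assumption made for illustration only: the output is ReLU(A_t·X·W_t + A_in·X·W_in + A_out·X·W_out), where A_t is the identity matrix for the target node, A_in marks edges flowing into each node and A_out marks edges flowing out of it. The names `matmul` and `directed_graph_conv`, the ReLU activation, and all weights are hypothetical and are not taken from the embodiments.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def directed_graph_conv(x, edges, w_target, w_in, w_out):
    """One direction-aware graph convolution over node features `x`.

    x: n rows of node features; edges: (src, dst) index pairs.
    Assumed fusion rule: ReLU(A_t X W_t + A_in X W_in + A_out X W_out).
    """
    n = len(x)
    a_target = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    a_in = [[0.0] * n for _ in range(n)]
    a_out = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        a_in[dst][src] = 1.0   # src flows into dst
        a_out[src][dst] = 1.0  # src flows out to dst
    parts = [
        matmul(matmul(a, x), w)
        for a, w in ((a_target, w_target), (a_in, w_in), (a_out, w_out))
    ]
    width = len(parts[0][0])
    return [
        [max(0.0, sum(p[i][k] for p in parts)) for k in range(width)]
        for i in range(n)
    ]


# Hypothetical chain 0 -> 1 -> 2 with one feature per node; the distinct
# weights make the three contributions easy to tell apart in the output.
features = [[1.0], [2.0], [3.0]]
out = directed_graph_conv(
    features, [(0, 1), (1, 2)], w_target=[[1.0]], w_in=[[10.0]], w_out=[[100.0]]
)
```

Because the inflow and outflow subsets use separate weights, reversing an edge changes the output, which an undirected adjacency matrix would not capture.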

An embodiment of the present disclosure provides a device comprising the above apparatus for identifying malware. Optionally, the device is a smartphone, a tablet, a computer, a server, or the like.

By adopting the device provided by the embodiment of the present disclosure, the calling feature map of the software to be identified is symbolized to obtain a symbolic feature map, the symbolic feature map is input into a preset malware identification model to obtain an identification index, and the software to be identified is identified according to the identification index. The graph convolution layer of the malware identification model obtains a symbolic feature map or a convolution feature map, and obtains a node direction set of each node in that feature map, the node direction set comprising an inflow node subset, an outflow node subset and a target node.
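The final step, turning a feature representation into an identification index and a decision, can be sketched in Python as follows. The sketch assumes a two-output fully connected layer followed by softmax, with the second output read as the malware probability and compared against a threshold of 0.5; the decision rule, the names `softmax`, `identification_index` and `is_malware`, and all weights are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def identification_index(features, weights, bias):
    """Fully connected layer to two outputs, followed by softmax.

    weights holds one weight vector per output; by assumption, output 0
    scores normal software and output 1 scores malware.
    """
    logits = [
        sum(f * w for f, w in zip(features, row)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


def is_malware(index, threshold=0.5):
    """Assumed decision rule: malware if its probability exceeds the threshold."""
    return index[1] > threshold


# Hypothetical feature vector and layer parameters
index = identification_index([1.0, 2.0], [[0.0, 0.0], [1.0, 1.0]], [0.0, 0.0])
verdict = is_malware(index)
```

The two entries of the index sum to 1, so a single threshold on the malware entry is sufficient to separate the two classes in this sketch.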

Embodiments of the present disclosure provide a computer-readable storage medium storing computer-executable instructions configured to perform the above-described method for identifying malware.

Embodiments of the present disclosure provide a computer program product comprising a computer program stored on a computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the above-described method for identifying malware.

The computer-readable storage medium described above may be a transitory computer-readable storage medium or a non-transitory computer-readable storage medium.

The technical solution of the embodiments of the present disclosure may be embodied in the form of a software product, where the computer software product is stored in a storage medium and includes one or more instructions to enable a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method of the embodiments of the present disclosure. The aforementioned storage medium may be a non-transitory storage medium, including a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code; it may also be a transitory storage medium.

The above description and drawings sufficiently illustrate embodiments of the disclosure to enable those skilled in the art to practice them. Other embodiments may incorporate structural, logical, electrical, process, and other changes. The examples merely typify possible variations. Individual components and functions are optional unless explicitly required, and the sequence of operations may vary. Portions and features of some embodiments may be included in or substituted for those of others. Furthermore, the words used in the specification are words of description only and are not intended to limit the claims. As used in the description of the embodiments and the claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Similarly, the term "and/or" as used in this application is meant to encompass any and all possible combinations of one or more of the associated listed items. Furthermore, the terms "comprises" and/or "comprising," when used in this application, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Without further limitation, an element defined by the phrase "comprising an …" does not exclude the presence of other like elements in a process, method or apparatus that comprises the element. In this document, each embodiment may be described with emphasis on its differences from other embodiments, and the same or similar parts of the respective embodiments may be referred to each other. For the methods, products, etc. disclosed in the embodiments, where they correspond to a method section disclosed in the embodiments, reference may be made to the description of that method section for the relevant details.

Those of skill in the art would appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software may depend upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments. It can be clearly understood by the skilled person that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the embodiments disclosed herein, the disclosed methods and products (including but not limited to devices, apparatuses, etc.) may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the units may be merely a logical division, and in actual implementation there may be other divisions; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units can be selected according to actual needs to implement the present embodiment. In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. In the description corresponding to the flowcharts and block diagrams in the figures, operations or steps corresponding to different blocks may also occur in different orders than disclosed in the description, and sometimes there is no specific order between the different operations or steps. For example, two sequential operations or steps may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Claims

1. A method for identifying malware, comprising:

acquiring a calling feature map of software to be identified;

symbolizing the calling feature map to obtain a symbolic feature map;

2. The method of claim 1, wherein obtaining the call feature map of the software to be identified comprises:

acquiring a component list of the software to be identified;

acquiring a system call list according to the component list;

and acquiring the calling feature map of the software to be identified according to the system call list.

3. The method of claim 2, wherein obtaining a list of components of software to be identified comprises:

decompressing an installation file of the software to be identified to obtain an access manifest file;

acquiring a component list of the software to be identified according to the access manifest file; the component list includes the package name and the executable component names in the access manifest file.

4. The method of claim 2, wherein obtaining a list of system calls from the list of components comprises:

and traversing and executing each component of the software to be identified according to the component list to obtain a system call list.

5. The method of claim 2, wherein the calling feature map comprises a number of nodes and directed edges between the nodes, and acquiring the calling feature map of the software to be identified according to the system call list comprises:

acquiring a system call ID, a system call frequency and a system call sequence according to the system call list;

determining the nodes of the calling feature graph according to the system calling ID and the system calling frequency;

and acquiring directed edges among the nodes in the calling feature graph according to the system calling sequence.

6. The method of claim 1, wherein inputting the symbolic feature map into a preset malware recognition model to obtain a recognition index comprises:

inputting the symbolic feature map into a preset number of graph convolution layers for graph convolution operations, and determining the convolution feature map output by the last graph convolution layer as a first feature representation;

adjusting the dimensionality of the first feature representation to a preset number of dimensions through a fully connected layer of the malware identification model to obtain a second feature representation;

and applying two-class polarization to the second feature representation through a softmax function of the malware identification model to obtain an identification index.

7. The method of claim 1, wherein obtaining a malware recognition model comprises:

collecting a plurality of software samples;

acquiring a sample calling feature map of the software sample;

symbolizing the sample calling feature map to obtain a sample symbolic feature map;

inputting the sample symbolic feature map with the sample label into a preset neural network model for training to obtain the malware identification model; the sample labels include a normal software label and a malware label.

8. The method according to any one of claims 1 to 7, wherein performing a convolution operation on the set of node directions and the input feature map to obtain a convolution feature map comprises:

respectively acquiring the adjacency matrices corresponding to the target node, the inflow node subset and the outflow node subset in the node direction set, to obtain a target node adjacency matrix, an inflow node adjacency matrix and an outflow node adjacency matrix;

and fusing the input feature graph, the target node adjacency matrix, the inflow node adjacency matrix and the outflow node adjacency matrix to obtain a convolution feature graph.

9. An apparatus for identifying malware comprising a processor and a memory storing program instructions, wherein the processor is configured to perform the method for identifying malware according to any one of claims 1 to 8 when executing the program instructions.

10. A device comprising the apparatus for identifying malware of claim 9.