CN117574370A - Malicious code detection system - Google Patents

Malicious code detection system

Info

Publication number
CN117574370A
Authority
CN
China
Prior art keywords
feature
component
graph
sensitive
manifest file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311599310.5A
Other languages
Chinese (zh)
Other versions
CN117574370B (en)
Inventor
皮锋
陈鹏
王欣
田生伟
裴新军
农卫涛
王晓炜
龚军超
马丽娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinjiang Entry Exit Border Inspection Station Of People's Republic Of China Border Management Team Of Xinjiang Uygur Autonomous Region Public Security Department
Original Assignee
Xinjiang Entry Exit Border Inspection Station Of People's Republic Of China Border Management Team Of Xinjiang Uygur Autonomous Region Public Security Department
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinjiang Entry Exit Border Inspection Station Of People's Republic Of China Border Management Team Of Xinjiang Uygur Autonomous Region Public Security Department filed Critical Xinjiang Entry Exit Border Inspection Station Of People's Republic Of China Border Management Team Of Xinjiang Uygur Autonomous Region Public Security Department
Priority to CN202311599310.5A priority Critical patent/CN117574370B/en
Publication of CN117574370A publication Critical patent/CN117574370A/en
Application granted granted Critical
Publication of CN117574370B publication Critical patent/CN117574370B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • G06F21/563Static detection by source code analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/53Decompilation; Disassembly
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Hardware Design (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Virology (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention provides a malicious code detection system comprising: a decompilation module for obtaining a manifest file and source code from the application program to be detected; a call-subgraph analysis module for extracting a call-subgraph feature set from the source code; a component analysis module for extracting a component feature set from the manifest file; a feature fusion module for fusing the call-subgraph feature set and the component feature set into a feature matrix suitable for a deep learning model; and a malware detection module for classifying the feature matrix with a bidirectional independently recurrent neural network based on a class-sensitive mechanism to obtain a malware detection result. By analyzing program execution logic and extracting program semantic information from call-subgraph features, and by fusing in information about the application's components, the system improves the detection accuracy for malicious Internet of Things applications and thereby the security of Internet of Things systems.

Description

Malicious code detection system
Technical Field
The invention relates to the field of software code detection and analysis, in particular to a malicious code detection system.
Background
With the rapid development of Internet of Things technology, mobile Internet of Things devices such as smartphones and tablet computers are now widely used. Their widespread adoption has greatly increased convenience and comfort in daily life, and the proliferation of these devices has made more and more Internet of Things services remotely accessible over Internet of Things networks. The rapid spread of mobile Internet of Things devices also brings serious security challenges, especially attacks by malicious Internet of Things application software. While Internet of Things applications provide convenient services to users, attackers motivated by profit exploit vulnerabilities in those applications or distribute malicious applications in order to steal users' personal information, money, and so on.
In the Internet of Things era, mobile platforms process large amounts of sensitive data, including personal email and communications, voice calls, text messages, and corporate and financial data. A mobile Internet of Things device typically has GPS functionality, so it can determine the physical location of its holder at any time. Such devices also contain built-in cameras, microphones, accelerometers, magnetometers, and similar sensors, and can therefore be hijacked by malware and used to eavesdrop on their surroundings. Malware developers have accordingly targeted these smart mobile devices.
Traditional malware detection methods rely on signature-based detection, that is, matching samples against a library of known malware signatures. This approach can only detect known malware, depends heavily on the size and completeness of the signature library, suffers from a high false alarm rate, and cannot cope with threats from the latest malware variants. Behavior-based malware detection systems focus on detecting suspicious behavior rather than specific signatures, by monitoring the behavior of software and finding patterns that indicate malicious activity. However, such schemes consume substantial computing resources and time, and cannot meet the real-time requirements of malware detection in practice. The present invention analyzes program execution logic and extracts program semantic information based on application programming interface (API) features, and additionally extracts component analysis features to capture more malicious behaviors, which effectively improves the detection accuracy for malicious Internet of Things applications and thereby the security of Internet of Things systems. Furthermore, in the real world malware makes up only a small fraction of all software compared with benign software. This creates a serious class imbalance problem that can mislead a malware detection model into misclassifying samples. To address this, the invention provides a class-sensitive bidirectional independently recurrent neural network model, which introduces a cross-entropy function based on a class-sensitive mechanism to handle the data bias caused by imbalanced data sets. The proposed strategy helps improve performance, alleviates overfitting and underfitting, and enables malware to be detected effectively.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a novel malware detection technique for Internet of Things application scenarios, based on call-subgraph analysis and component communication analysis.
According to a first aspect of the present invention, there is provided a malicious code detection system comprising:
A decompilation module for obtaining the manifest file and the source code from the application program to be detected.
A call-subgraph analysis module for extracting the call-subgraph feature set from the source code.
The call-subgraph analysis module comprises a program call graph generation module and a sensitive API feature generation module.
The program call graph generation module constructs a function call graph (FCG) from the input source code and extracts all sensitive API nodes using the PScout tool. The FCG is expressed as $G = \{V, E\}$, consisting of a set of nodes $V$ and a set of edges $E \subseteq V \times V$. Each node in $V$ represents a function in the application program, including the sensitive APIs, which form the sensitive API set; each edge in $E$ represents a calling relationship between a caller and a callee.
The sensitive API feature generation module analyzes the FCG and extracts a sensitive function call subgraph feature set, as follows. A TF-IDF method assigns each sensitive API in the sensitive API set its own malicious degree value, giving the sensitive API malicious degree set. The malicious degree of the i-th sensitive API is computed as:
[formula image not reproduced in this text]
where $Q_{total}$ is the number of benign applications in the training set, $R_{total}$ is the number of malicious applications in the training set, and the two remaining quantities are the number of benign training applications that call the i-th sensitive API and the number of malicious training applications that call it. The FCG corresponds to a set of sensitive function call subgraphs, in which each subgraph consists of one sensitive API and its adjacent nodes $u_1, \dots, u_k$. A distance function controls the size of each subgraph, where $k$ is a user-defined control parameter that limits the number of neighbor nodes the subgraph contains. The sensitive function call subgraph feature set is then computed from these subgraphs.
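Because the weighting formula itself is not legible in this text, the following is only a sketch of one TF-IDF-style score that is consistent with the four counts defined above. The function name `malice_score`, the add-one smoothing, and the particular way the ratios are combined are all assumptions made for illustration; they are not the formula of the invention.

```python
import math


def malice_score(r_i: int, q_i: int, r_total: int, q_total: int) -> float:
    """Illustrative TF-IDF-style weight for one sensitive API (assumed form).

    r_i      malicious training apps that call the API
    q_i      benign training apps that call the API
    r_total  all malicious apps in the training set
    q_total  all benign apps in the training set
    """
    if r_total <= 0 or q_total <= 0:
        raise ValueError("training set needs both benign and malicious samples")
    # "Term frequency": how common the API is among malware.
    tf = r_i / r_total
    # "Inverse document frequency": APIs rarely called by benign apps score higher.
    # The +1 smoothing (an assumption) avoids division by zero.
    idf = math.log((q_total + 1) / (q_i + 1))
    return tf * idf
```

Under this assumed form, an API that most malware calls and few benign applications call receives a high score, while an API that malware never calls scores zero.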
A component analysis module for extracting the component feature set from the manifest file.
The component analysis module extracts several categories of component features from the manifest file: activity component features, service component features, content provider component features, and broadcast receiver component features.
Extracting activity component features includes: querying the activity tag information (activity) in the manifest file, and taking all names that carry activity tag information in the manifest file as the activity component features $T = \{t_1, \dots, t_{n_T}\}$, where $t$ denotes an activity, which is responsible for presenting a user interface in the application.
Extracting service component features includes: querying the service tag information (service) in the manifest file, and taking all names that carry service tag information in the manifest file as the service component features $R = \{r_1, \dots, r_{n_R}\}$, where $r$ denotes a service, which is responsible for background processing in the application.
Extracting content provider component features includes: querying the content provider tag information (Content Provider) in the manifest file, and taking all names that carry content provider tag information in the manifest file as the content provider component features $P = \{p_1, \dots, p_{n_P}\}$, where $p$ denotes a content provider, which is responsible for sharing structured data in the application.
Extracting broadcast receiver component features includes: querying the broadcast receiver tag information (broadcast receiver) in the manifest file, and taking all names that carry broadcast receiver tag information in the manifest file as the broadcast receiver component features $C = \{c_1, \dots, c_{n_C}\}$, where $c$ denotes a broadcast receiver, which is responsible for receiving information in the application.
The component feature set $COM = \{T, C, P, R\}$ is constructed from these component features.
A feature fusion module for fusing the call-subgraph feature set and the component feature set into a feature matrix suitable for a deep learning model.
A malware detection module for classifying the feature matrix with a bidirectional independently recurrent neural network based on a class-sensitive mechanism to obtain a malware detection result.
The class-sensitive bidirectional independently recurrent neural network in the malware detection module comprises a bidirectional independently recurrent neural network layer, a fully connected network layer, and a Sigmoid function output layer. The bidirectional independently recurrent neural network layer extracts information from the input feature matrix in both the past-to-future and future-to-past directions and then splices the results; the fully connected network layer processes the output of the bidirectional independently recurrent neural network layer; and the Sigmoid function output layer applies a Sigmoid function to the output of the fully connected network layer.
Further, in the system provided by the invention, the decompilation module takes the APK file of the application program to be detected as input, extracts the manifest file and the DEX file using the decompilation tool android, and recovers the source code from the DEX file.
Further, in the system provided by the invention, the feature fusion module merges the call-subgraph feature set and the component feature set, and uses the Word2vec tool to extract, by word embedding, a semantic feature vector for every element of the merged set, forming the feature matrix $FV = \{x_1, \dots, x_t, \dots, x_{n_F}\}$, where each $x_t$ is a semantic feature vector.
Further, in the system provided by the invention, the bidirectional independently recurrent neural network layer operates as follows:
Features are extracted in the past-to-future direction as $\overrightarrow{h}_t = \sigma(\overrightarrow{W} x_t + \overrightarrow{u} \odot \overrightarrow{h}_{t-1} + \overrightarrow{b})$, where the semantic feature vector $x_t$ is the current input of the network, $\overrightarrow{h}_{t-1}$ is the previous hidden state of the network, $\overrightarrow{W}$ is the weight matrix of the past-to-future direction, $\overrightarrow{u}$ is the recurrent unit weight of the past-to-future direction, $\overrightarrow{b}$ is the bias of the past-to-future direction, $\sigma$ is the activation function, and $\odot$ denotes the Hadamard product.
Features are extracted in the future-to-past direction as $\overleftarrow{h}_t = \sigma(\overleftarrow{W} x_t + \overleftarrow{u} \odot \overleftarrow{h}_{t+1} + \overleftarrow{b})$, where $\overleftarrow{h}_{t+1}$ is the future hidden state of the network, $\overleftarrow{W}$ is the weight matrix of the future-to-past direction, $\overleftarrow{u}$ is the recurrent unit weight of the future-to-past direction, and $\overleftarrow{b}$ is the bias of the future-to-past direction.
The features extracted in the two directions are spliced to obtain the output $h_{BiInd}$ of the bidirectional independently recurrent neural network layer, where $W_f$ and $W_b$ are the two different weight matrices used in the splice.
The fully connected network layer outputs $h_{FCN} = W_{FCN} h_{BiInd} + b_{FCN}$, where $W_{FCN}$ is the weight matrix of the fully connected layer neurons and $b_{FCN}$ is their bias.
The Sigmoid function output layer outputs the classification result $y = \mathrm{Sigmoid}(h_{FCN})$ as the malware detection result, in which one class represents malware and the other represents benign software.
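The layer stack described above can be sketched in plain Python as a forward pass only. This is a minimal illustration under stated assumptions: ReLU is used as the activation, the splice is taken to be a plain concatenation of the final forward and final backward hidden states (the separate matrices $W_f$ and $W_b$ are omitted), and all names such as `indrnn_pass` and `detect` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[Vector]
Params = Tuple[Matrix, Vector, Vector]  # (W, u, b) for one direction


def matvec(w: Matrix, x: Sequence[float]) -> Vector:
    """Multiply matrix w by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def indrnn_pass(xs: Sequence[Vector], w: Matrix, u: Vector, b: Vector,
                reverse: bool = False) -> List[Vector]:
    """Run h_t = relu(W x_t + u * h_prev + b); u acts element-wise (Hadamard)."""
    h = [0.0] * len(u)
    states = []
    for x in (reversed(xs) if reverse else xs):
        wx = matvec(w, x)
        h = [max(0.0, wx[j] + u[j] * h[j] + b[j]) for j in range(len(u))]
        states.append(h)
    if reverse:
        states.reverse()  # states[t] again lines up with xs[t]
    return states


def detect(xs: Sequence[Vector], fwd: Params, bwd: Params,
           w_fcn: Vector, b_fcn: float) -> float:
    """Bidirectional pass, concatenation, one dense unit, then Sigmoid."""
    h_fwd = indrnn_pass(xs, *fwd)[-1]               # has seen x_1 .. x_n
    h_bwd = indrnn_pass(xs, *bwd, reverse=True)[0]  # has seen x_n .. x_1
    h = h_fwd + h_bwd                               # list concatenation
    z = sum(wi * hi for wi, hi in zip(w_fcn, h)) + b_fcn
    return 1.0 / (1.0 + math.exp(-z))
```

The returned value lies in (0, 1); a threshold such as 0.5 (also an assumption) would separate the malware class from the benign class.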
Further, in the system provided by the invention, the class-sensitive bidirectional independently recurrent neural network is obtained through multiple training iterations, and the iterative training process is based on a loss function:
[formula image not reproduced in this text]
where $y_k$ is the actual output of the fully connected network for the k-th class; $t_k$ is the expected output of the fully connected network for the k-th class; $C$ is a cost term; $sam \in dataset$ ranges over every sample in the training set; $p$ is the true label of the sample and $k$ is the predicted label of the sample.
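Since the loss formula is not legible in this text, the sketch below shows only the general idea of a cost-weighted cross entropy for the two-class case. The per-class cost table indexed by the true label, the clipping constant, and the name `class_sensitive_bce` are assumptions for illustration and should not be read as the loss of the invention.

```python
import math
from typing import Mapping, Sequence


def class_sensitive_bce(y_pred: Sequence[float], y_true: Sequence[int],
                        cost: Mapping[int, float], eps: float = 1e-12) -> float:
    """Mean binary cross entropy, each sample scaled by the cost of its true class."""
    total = 0.0
    for y, t in zip(y_pred, y_true):
        y = min(max(y, eps), 1.0 - eps)  # keep log() finite
        ce = -(t * math.log(y) + (1 - t) * math.log(1.0 - y))
        total += cost[t] * ce            # cost term: rare class weighted up
    return total / len(y_true)
```

With a hypothetical cost table such as `{0: 1.0, 1: 5.0}`, missing a malware sample (label 1) is penalized five times as heavily as an equally confident mistake on a benign sample, which is one way to counter class imbalance.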
According to a second aspect of the present invention, there is provided a computer device characterized by comprising:
a memory for storing instructions; and
and a processor for invoking the instructions stored in the memory to implement the system of the first aspect.
Compared with the prior art, the technical scheme of the invention has at least the following beneficial effects:
1. Call-subgraph features (CSG): in malware detection, application programming interfaces can be used to model malicious patterns and detect malware instances. For example, real-world malware often uses existing APIs of the Android system to imitate the normal behavior of benign applications and evade the checks of malicious-behavior detection systems. The invention therefore constructs a call-subgraph feature extraction module that extracts the sensitive application programming interfaces (APIs) frequently used by malware to build call-subgraph features, which effectively reflect the caller-callee relationships in the application.
2. Component feature (COM): the invention constructs a feature extraction module based on the component features, and systematically analyzes the component calling modes presented by benign application programs and malicious application programs. Such a set of features represents one of the most common attack vectors used by malware.
3. Improved feature fusion: on the one hand, call-subgraph features (CSG) effectively reflect the malicious behavior patterns of Internet of Things malware and greatly help in detecting malware instances, but they ignore information about interactions among different components. On the other hand, component features (COM) fully cover interactions between different components within an application and between different applications, for example through broadcast receivers and content providers, which provides great flexibility and generality. Given the complementary representation capabilities of the two feature types, the invention provides a feature fusion scheme that combines their advantages to give a more comprehensive representation of malicious behavior patterns.
4. Internet of Things malware detection classifier based on a bidirectional independently recurrent neural network model: the invention provides a new class-sensitive cross-entropy function. By introducing a cost term into the back-propagation learning process of the independently recurrent neural network model, it takes the relative importance of recognizing each class into account, which effectively addresses the data bias caused by imbalanced data sets.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic diagram of the overall framework design, shown according to an exemplary embodiment.
FIG. 2 is a schematic diagram of the class-sensitive bidirectional independently recurrent neural network, according to an exemplary embodiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The invention provides a novel malware detection technique for Internet of Things application scenarios, based on call-subgraph analysis and component communication analysis, and provides a malicious code detection system. The framework comprises five main modules:
A decompilation module for obtaining the manifest file and the source code from the application program to be detected.
A call-subgraph analysis module for extracting the call-subgraph feature set from the source code.
A component analysis module for extracting the component feature set from the manifest file.
A feature fusion module for fusing the call-subgraph feature set and the component feature set into a feature matrix suitable for a deep learning model.
A malware detection module for classifying the feature matrix with a bidirectional independently recurrent neural network based on a class-sensitive mechanism to obtain a malware detection result.
Part one: decompilation module. The decompilation module takes the APK file of the application program to be detected as input, extracts the manifest file and the DEX file using the decompilation tool android, and recovers the source code from the DEX file.
At this stage we decompile the APK file to extract the resources, the manifest file, and the classes.dex file. Internet of Things applications are typically developed in the Java programming language and then compiled into Dalvik code (DEX), which is stored in the classes.dex file. The compiled code and the necessary resources are packaged into an APK file. During decompilation, we extract the Dalvik code from the APK and recreate the source code and class files of each application using an open-source tool (the android tool). We then extract the call-subgraph features (CSG) and the component features (COM).
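As a rough illustration of the first step only, an APK is a ZIP archive, so the manifest and DEX entries can be pulled out with Python's standard `zipfile` module. The helper name `extract_apk_parts` is hypothetical. Note that the manifest inside an APK is stored as binary XML and the DEX files hold Dalvik bytecode, so a real decompilation tool is still needed to decode them; this sketch does not attempt that.

```python
import io
import zipfile
from typing import Dict


def extract_apk_parts(apk_bytes: bytes) -> Dict[str, bytes]:
    """Return the raw AndroidManifest.xml and every classes*.dex entry of an APK."""
    parts: Dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        for name in apk.namelist():
            is_manifest = name == "AndroidManifest.xml"
            is_dex = name.startswith("classes") and name.endswith(".dex")
            if is_manifest or is_dex:
                parts[name] = apk.read(name)  # still binary; decoding comes later
    return parts
```

Applications with many methods may ship several DEX files (classes.dex, classes2.dex, and so on), which is why the sketch matches on the name pattern rather than a single file name.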
Part two: call-subgraph analysis module. The call-subgraph analysis module comprises a program call graph generation module and a sensitive API feature generation module.
The program call graph generation module constructs a function call graph (FCG) from the input source code and extracts all sensitive API nodes using the PScout tool. The FCG is expressed as $G = \{V, E\}$, consisting of a set of nodes $V$ and a set of edges $E \subseteq V \times V$. Each node in $V$ represents a function in the application program, including the sensitive APIs, which form the sensitive API set; each edge in $E$ represents a calling relationship between a caller and a callee.
The sensitive API feature generation module analyzes the FCG and extracts a sensitive function call subgraph feature set, as follows. A TF-IDF method assigns each sensitive API in the sensitive API set its own malicious degree value, giving the sensitive API malicious degree set. The malicious degree of the i-th sensitive API is computed as:
[formula image not reproduced in this text]
where $Q_{total}$ is the number of benign applications in the training set, $R_{total}$ is the number of malicious applications in the training set, and the two remaining quantities are the number of benign training applications that call the i-th sensitive API and the number of malicious training applications that call it. The FCG corresponds to a set of sensitive function call subgraphs, in which each subgraph consists of one sensitive API and its adjacent nodes $u_1, \dots, u_k$. A distance function controls the size of each subgraph, where $k$ is a user-defined control parameter that limits the number of neighbor nodes the subgraph contains. The sensitive function call subgraph feature set is then computed from these subgraphs.
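The call graph and the subgraph around one sensitive API can be sketched as follows. This is an illustration under assumptions: the distance function is taken to be hop count in the call graph with edge direction ignored, $k$ is read as the maximum hop distance, and the names `build_fcg` and `sensitive_subgraph` are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def build_fcg(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build G = {V, E} as an adjacency map from (caller, callee) pairs."""
    graph: Dict[str, Set[str]] = {}
    for caller, callee in edges:
        graph.setdefault(caller, set()).add(callee)
        graph.setdefault(callee, set())  # callee is a node even if it calls nothing
    return graph


def sensitive_subgraph(graph: Dict[str, Set[str]], api: str, k: int) -> Set[str]:
    """Nodes within k hops of a sensitive API, ignoring edge direction."""
    undirected = {node: set(callees) for node, callees in graph.items()}
    for caller, callees in graph.items():
        for callee in callees:
            undirected[callee].add(caller)
    seen = {api}
    frontier = deque([(api, 0)])
    while frontier:  # breadth-first search bounded by k
        node, dist = frontier.popleft()
        if dist == k:
            continue
        for neighbor in undirected.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return seen
```

For example, with edges main -> send -> sendTextMessage, the subgraph around `sendTextMessage` with k = 1 contains `send`, and with k = 2 it also contains `main`.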
An application programming interface (API) acts as a high-level programming interface that encapsulates system calls and reflects the behavior of code segments in a program. Internet of Things applications use many APIs to implement their functions. Malicious programs need to communicate with external programs and often use sensitive APIs to carry out malicious behavior, such as obtaining personal user information, manipulating device functions, or performing illegal operations. Combinations of sensitive API calls that are common in malware reveal the presence of malicious activity such as credential theft and the sale of user data. These API calls may therefore indicate potentially malicious intent, and can also expose vulnerabilities of the application.
All sensitive API nodes contained in the APK installation package under analysis are extracted using the open-source PScout tool. For how PScout resolves sensitive APIs, see Kathy Wain Yee Au et al., "PScout: Analyzing the Android Permission Specification," in Proceedings of the 19th ACM Conference on Computer and Communications Security (CCS 2012), October 2012.
Part three: component analysis module. The component analysis module extracts the component feature set from the manifest file.
The component analysis module extracts several categories of component features from the manifest file: activity component features, service component features, content provider component features, and broadcast receiver component features.
Extracting activity component features includes: querying the activity tag information (activity) in the manifest file, and taking all names that carry activity tag information in the manifest file as the activity component features $T = \{t_1, \dots, t_{n_T}\}$, where $t$ denotes an activity, which is responsible for presenting a user interface in the application.
Extracting service component features includes: querying the service tag information (service) in the manifest file, and taking all names that carry service tag information in the manifest file as the service component features $R = \{r_1, \dots, r_{n_R}\}$, where $r$ denotes a service, which is responsible for background processing in the application.
Extracting content provider component features includes: querying the content provider tag information (Content Provider) in the manifest file, and taking all names that carry content provider tag information in the manifest file as the content provider component features $P = \{p_1, \dots, p_{n_P}\}$, where $p$ denotes a content provider, which is responsible for sharing structured data in the application.
Extracting broadcast receiver component features includes: querying the broadcast receiver tag information (broadcast receiver) in the manifest file, and taking all names that carry broadcast receiver tag information in the manifest file as the broadcast receiver component features $C = \{c_1, \dots, c_{n_C}\}$, where $c$ denotes a broadcast receiver, which is responsible for receiving information in the application.
The component feature set $COM = \{T, C, P, R\}$ is constructed from these component features.
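The extraction steps above can be sketched with Python's standard XML parser, assuming the manifest has already been decoded to text XML. In an Android manifest the four component kinds appear as `activity`, `service`, `provider`, and `receiver` elements whose name is held in the namespaced `android:name` attribute. The function name `component_features` and the dictionary layout are hypothetical, and the sketch collects component names only, not intent-filter strings.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Namespace that Android manifests bind to the "android:" prefix.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Feature set letter (as in COM = {T, C, P, R}) -> manifest element name.
COMPONENT_TAGS = {"T": "activity", "R": "service", "P": "provider", "C": "receiver"}


def component_features(manifest_xml: str) -> Dict[str, List[str]]:
    """Collect the declared name of every component, grouped by component kind."""
    root = ET.fromstring(manifest_xml)
    com: Dict[str, List[str]] = {}
    for key, tag in COMPONENT_TAGS.items():
        names = (element.get(ANDROID_NS + "name") for element in root.iter(tag))
        com[key] = [name for name in names if name]  # skip elements without a name
    return com
```

Keeping the four lists separate mirrors the sets T, R, P, and C; they can be merged later when the features are fused.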
At this stage, we extract component features, which are critical to understanding the malware mode of operation, significantly enhancing the ability of the detection system to detect complex malware.
Behavior patterns of benign applications and malware are systematically analyzed because components are the core of an application and cannot be easily modified or deleted. In addition, the component feature extraction scheme is more flexible and universal, is not limited by specific API calls, can identify and capture more malicious behaviors, and enhances the reliability and robustness of the detection system.
The basic building blocks of the internet of things application include activities, services, content providers, and broadcast receivers. The activity is responsible for presenting the user interface. The service is responsible for background processing. The content provider facilitates sharing of structured data. The broadcast receiver provides the ability to receive information. Development of malware typically involves code reuse, resulting in shared malicious behavior and component reuse among the same family of malware.
We extract the following classes of features:
Activities are responsible for presenting a user interface. For example, an activity may declare the category android.intent.category.BROWSABLE, which means that the activity can be opened by a browser through a custom protocol, so that the browser can launch the app. Here, android.intent.category.BROWSABLE is a component feature t1. We therefore take the names of all activities as activity features.
Services are responsible for background processing. For example, a developer must declare all services in the application's configuration file, and malware may start a service with the startService() method, which puts the service in the started state. We therefore take the names of all services as service features.
Content providers facilitate sharing of structured data. For example, the Android platform provides the Content Provider mechanism so that one application can expose a specified data set to other applications, which obtain or store data through the ContentResolver class. We therefore take the names of all content providers as content provider features.
Broadcast receivers provide the ability to receive information. For example, malware may define a broadcast receiver class mBroadcastreceiver to receive the broadcast "android.net.conn.CONNECTIVITY_CHANGE", which is issued when the network state changes. Here, "android.net.conn.CONNECTIVITY_CHANGE" is a broadcast receiver feature c1. We therefore take the names of all broadcast receivers as broadcast receiver features.
Part four: feature fusion module. The feature fusion module merges the call-subgraph feature set and the component feature set, and uses the Word2vec tool to extract, by word embedding, a semantic feature vector for every element of the merged set, forming the feature matrix $FV = \{x_1, \dots, x_t, \dots, x_{n_F}\}$, where each $x_t$ is a semantic feature vector.
In the feature fusion stage, we construct feature vectors for each internet of things application by embedding these features into feature space in an attempt to extract potential semantic patterns. After the feature vectors are generated, we use a neural network model to classify malware, modeling the high-level concepts and facts in malware.
Given a feature set F, we use word embedding methods to extract semantic vectors, reflecting potential behavioral patterns of malware. This approach is spatially insensitive, independent of the order of words or local patterns.
In our example, the semantic features of each malware are mapped into a fixed two-dimensional matrix FV. We set the maximum number of semantic features to be M and the embedding size of the vector to be K. Finally, we convert the fusion feature F into semantic features FV and use it as input to a detection model for learning semantic knowledge. These semantic patterns are aggregated at a lower level, helping to better identify patterns of features and facilitating the representation of high-level domain knowledge.
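The mapping of a fused feature set to a fixed M by K matrix can be sketched as below. This is a minimal illustration, not the Word2vec pipeline itself: toy_embedding is a hypothetical, hash-based stand-in for a trained Word2vec lookup, and the truncate-then-zero-pad policy is an assumption about how a variable number of features is fitted to M rows.

```python
import hashlib


def toy_embedding(token, k):
    """Hypothetical stand-in for a trained Word2vec lookup.

    Derives a deterministic K-dimensional vector with values in [-1, 1)
    from a hash of the token, so the example runs without a trained model.
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [digest[i % len(digest)] / 128.0 - 1.0 for i in range(k)]


def build_feature_matrix(features, m, k, embed=toy_embedding):
    """Map a fused feature list to a fixed M x K matrix FV (list of rows)."""
    rows = [embed(token, k) for token in features[:m]]   # keep at most M features
    rows += [[0.0] * k for _ in range(m - len(rows))]    # zero-pad up to M rows
    return rows


# Hypothetical fused feature set: component names plus a sensitive API name.
fused = [
    "android.intent.category.BROWSABLE",
    "android.net.conn.CONNECTIVITY_CHANGE",
    "sendTextMessage",
]
FV = build_feature_matrix(fused, m=5, k=4)
```

With a real embedding model, only the embed argument would change; the matrix shape stays M by K regardless of how many features an application has.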
Part five: the malware detection module. The class-sensitive bidirectional independent recurrent neural network in the malware detection module comprises a bidirectional independent recurrent neural network layer, a fully connected network layer, and a Sigmoid function output layer. The bidirectional independent recurrent neural network layer extracts information from the input feature matrix in both the past-to-future and future-to-past directions simultaneously and then splices the results; the fully connected network layer processes the output of the bidirectional independent recurrent neural network layer; the Sigmoid function output layer processes the output of the fully connected network layer through a Sigmoid function.
At this stage, the fused features are embedded into the feature space to build semantic feature vectors. The semantic feature vectors are then fed into a malware detection model based on a class-sensitive independent recurrent neural network, which extracts sequence features bidirectionally, from both the past and the future.
In this section, we propose a malware classifier based on a class-sensitive bidirectional independent recurrent neural network to describe the behavior of malware and model semantic knowledge. An independent recurrent neural network is a variant of the recurrent neural network. In a conventional recurrent neural network, the weight matrix is shared across the time steps of the input sequence. An independent recurrent neural network instead assigns an individual weight to each recurrent unit; this strategy enables it to handle longer time sequences and reduces the effects of vanishing or exploding gradients.
The bidirectional independent recurrent neural network model extracts information from the input features in both the past and the future directions.
Features are extracted in the past-to-future direction as $\overrightarrow{h}_t = \sigma(\overrightarrow{W} x_t + \overrightarrow{u} \odot \overrightarrow{h}_{t-1} + \overrightarrow{b})$, where the semantic feature vector $x_t$ is the current input of the network, $\overrightarrow{h}_{t-1}$ is the previous hidden state of the network, $\overrightarrow{W}$ is the weight matrix of the past-to-future direction, $\overrightarrow{u}$ is the recurrent unit weight of the past-to-future direction, $\overrightarrow{b}$ is the bias of the past-to-future direction, $\odot$ denotes the Hadamard product, and $\sigma$ is the activation function.
Features are extracted in the future-to-past direction as $\overleftarrow{h}_t = \sigma(\overleftarrow{W} x_t + \overleftarrow{u} \odot \overleftarrow{h}_{t+1} + \overleftarrow{b})$, where $\overleftarrow{h}_{t+1}$ is the future hidden state of the network, $\overleftarrow{W}$ is the weight matrix of the future-to-past direction, $\overleftarrow{u}$ is the recurrent unit weight of the future-to-past direction, and $\overleftarrow{b}$ is the bias of the future-to-past direction.
The features extracted in the two directions are spliced to obtain the output $h_{BInd}$ of the bidirectional independent recurrent neural network layer; the splicing uses two different weight matrices, $W_f$ and $W_b$.
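The two directional recurrences can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: ReLU is assumed as the activation, the two directions are joined by simple concatenation (the weight matrices W_f and W_b mentioned above are not modelled), and the helper names indrnn_pass and bi_indrnn are hypothetical.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def indrnn_pass(xs, W, u, b):
    """Run one direction over the sequence xs and return every hidden state.

    Each unit i keeps its own scalar recurrent weight u[i], applied
    element-wise (Hadamard product) to the previous hidden state.
    """
    h = [0.0] * len(u)
    states = []
    for x in xs:
        wx = matvec(W, x)
        h = relu([wx_i + u_i * h_i + b_i
                  for wx_i, u_i, h_i, b_i in zip(wx, u, h, b)])
        states.append(h)
    return states


def bi_indrnn(xs, fwd_params, bwd_params):
    """Concatenate past-to-future and future-to-past states per time step."""
    forward = indrnn_pass(xs, *fwd_params)
    # Process the reversed sequence, then restore the original time order.
    backward = indrnn_pass(xs[::-1], *bwd_params)[::-1]
    # "+" on Python lists concatenates the two state vectors.
    return [f + bk for f, bk in zip(forward, backward)]


# Tiny example: a one-dimensional input sequence and one hidden unit.
xs = [[1.0], [2.0]]
params = ([[1.0]], [0.5], [0.0])  # (W, u, b), shared here for brevity
h_bind = bi_indrnn(xs, params, params)
```

Because the recurrent weight is a vector rather than a full matrix, each hidden unit only sees its own past state; mixing across units happens in the input weights and in later layers.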
We introduce a fully connected network (FCN) as an integral part of our architecture. The FCN layer processes the output vectors of the bidirectional independent recurrent neural network. The FCN complements the bidirectional independent recurrent neural network, so that the model can capture local and global information in a coordinated manner.
The fully connected network layer outputs $h_{FCN} = W_{FCN} h_{BInd} + b_{FCN}$, where $W_{FCN}$ is the weight matrix of the fully connected layer neurons and $b_{FCN}$ is the bias of the fully connected layer neurons.
The Sigmoid function output layer outputs the classification result $y = \mathrm{Sigmoid}(h_{FCN})$ as the malware detection result; one class represents malware and the other class represents benign software.
Typically, the output y = {1, 0} represents malware and y = {0, 1} represents benign software.
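The fully connected layer and the Sigmoid output can be sketched in a few lines of Python. The function name classify and the example weights are hypothetical; the two-element output follows the convention above, with the first entry scoring malware and the second scoring benign software.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def classify(h_bind, W_fcn, b_fcn):
    """Compute y = Sigmoid(W_FCN h_BInd + b_FCN), one score per class."""
    h_fcn = [sum(w * h for w, h in zip(row, h_bind)) + bias
             for row, bias in zip(W_fcn, b_fcn)]
    return [sigmoid(z) for z in h_fcn]


# Hypothetical weights: the first output unit responds to the first feature.
y = classify([2.0, 0.5], W_fcn=[[1.0, 0.0], [-1.0, 0.0]], b_fcn=[0.0, 0.0])
is_malware = y[0] > y[1]
```

In practice the scores would be thresholded or compared, as in the last line, to arrive at a {1, 0} or {0, 1} decision.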
The class-sensitive bidirectional independent recurrent neural network is obtained through multiple training iterations, and the training process is based on a loss function in which $y_k$ is the actual output of the fully connected network for the k-th class; $t_k$ is the expected output of the fully connected network for the k-th class; $C$ is a cost term; $sam \in dataset$ ranges over every sample in the training set; $p$ is the true label of the sample and $k$ is the predicted label of the sample.
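To make the role of the cost term concrete, the sketch below shows one plausible class-sensitive cross entropy. The exact formula used by the invention is not reproduced in this text, so the form here (each log term scaled by a cost entry indexed by the true label p and the class k, summed over all samples) is an assumption, as are the function name and the example cost matrix.

```python
import math


def class_sensitive_cross_entropy(outputs, targets, cost, eps=1e-12):
    """Assumed form: -sum over samples and classes of C[p][k] * t_k * log(y_k).

    outputs: per-sample class scores y in (0, 1)
    targets: per-sample one-hot expected outputs t
    cost:    cost[p][k], the cost term for true label p and class k
    """
    total = 0.0
    for y, t in zip(outputs, targets):
        p = t.index(max(t))  # true label of this sample
        total -= sum(cost[p][k] * t[k] * math.log(y[k] + eps)
                     for k in range(len(y)))
    return total


# Hypothetical costs: errors on malware (class 0) weigh twice as much.
cost = [[2.0, 1.0],
        [1.0, 1.0]]
loss_malware = class_sensitive_cross_entropy([[0.5, 0.5]], [[1, 0]], cost)
loss_benign = class_sensitive_cross_entropy([[0.5, 0.5]], [[0, 1]], cost)
```

With these costs, an equally uncertain prediction is penalized more when the true class is the minority (malware) class, which is the effect a class-sensitive loss is meant to have on an imbalanced training set.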
Most existing work focuses on improving malware detection performance and overlooks the fact that malicious samples make up only a small proportion of real-world data. This severe imbalance makes a detection model prone to overfitting and reduces its generalization ability.
One existing work uses a class-sensitive cross entropy function to address multi-class imbalance in botnet detection.
The invention draws on that work and introduces a class-sensitive cross entropy function to handle the data skew caused by imbalanced datasets in malware detection. The original independent recurrent neural network is adapted to be cost-sensitive: the model introduces cost terms into the back-propagation learning process to account for the differing importance of identifying each class. As a result, the model is better able to identify potential malware threats and supports a more comprehensive understanding of malware behavior.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (6)

1. A malicious code detection system, comprising:
the decompilation module is used for obtaining a manifest file and a source code according to the application program to be detected;
the calling sub-graph analysis module is used for extracting a calling sub-graph feature set according to the source code;
the calling sub-graph analysis module comprises a program calling graph generation module and a sensitive API feature generation module;
the program call graph generation module builds a program function call graph (FCG) from the input source code and extracts all sensitive API nodes using the PScout tool; the FCG is denoted G = {V, E} and consists of a node set V and an edge set E; each node in V represents a function in the application program, including the sensitive APIs, which form the sensitive API set; each edge in E represents a calling relationship between a caller and a callee;
the sensitive API feature generation module analyzes the FCG and extracts a sensitive function call sub-graph feature set, comprising: assigning a different maliciousness value to each sensitive API in the sensitive API set by a TF-IDF method to obtain the sensitive API maliciousness set, in which the maliciousness of the i-th sensitive API is computed from the following quantities:
$Q_{total}$ is the number of all benign applications in the training set, $R_{total}$ is the number of all malware samples in the training set, and two further counts give, respectively, the number of benign applications and the number of malware samples in the training set that call the i-th sensitive API; the FCG corresponds to a set of sensitive function call sub-graphs, in which each sensitive function call sub-graph consists of a sensitive API and its neighboring nodes $u_1, \dots, u_k$; a distance function controls the size of each sensitive function call sub-graph, where k is a user-defined control parameter that sets the number of neighbor nodes the sub-graph contains; the sensitive function call sub-graph feature set is then computed;
The component analysis module is used for extracting a component feature set according to the manifest file;
the component analysis module extracts a plurality of categories of component features from the manifest file, including: activity component features, service component features, content provider component features, and broadcast receiver component features;
extracting activity component features includes: querying the activity tag information (activity) in the manifest file, and taking all names contained in the activity tag information as the activity component features $T = \{t_1, \dots, t_{nT}\}$, where t denotes an activity, which is responsible for presenting a user interface in the application;
extracting service component features includes: querying the service tag information (service) in the manifest file, and taking all names contained in the service tag information as the service component features $R = \{r_1, \dots, r_{nR}\}$, where r denotes a service, which is responsible for background processing in the application;
extracting content provider component features includes: querying the content provider tag information (provider) in the manifest file, and taking all names contained in the content provider tag information as the content provider component features $P = \{p_1, \dots, p_{nP}\}$, where p denotes a content provider, which is responsible for sharing structured data in the application;
extracting broadcast receiver component features includes: querying the broadcast receiver tag information (receiver) in the manifest file, and taking all names contained in the broadcast receiver tag information as the broadcast receiver component features $C = \{c_1, \dots, c_{nC}\}$, where c denotes a broadcast receiver, which is responsible for providing the capability of receiving information in the application;
constructing a component feature set COM= { T, C, P, R } according to the plurality of component features;
the feature fusion module is used for fusing the calling sub-graph feature set and the component feature set into a feature matrix suitable for the deep learning model;
the malware detection module is used for performing detection on the feature matrix through a class-sensitive bidirectional independent recurrent neural network to obtain a malware detection result;
the class-sensitive bidirectional independent recurrent neural network in the malware detection module comprises: a bidirectional independent recurrent neural network layer, a fully connected network layer, and a Sigmoid function output layer; the bidirectional independent recurrent neural network layer extracts information from the input feature matrix in both the past-to-future and future-to-past directions simultaneously and then splices the results; the fully connected network layer processes the output of the bidirectional independent recurrent neural network layer; the Sigmoid function output layer processes the output of the fully connected network layer through a Sigmoid function.
2. The system of claim 1, wherein the decompilation module takes an APK file of the application to be detected as input, extracts the manifest file and the DEX file using an Android decompilation tool, and parses the source code from the DEX file.
3. The system of claim 1, wherein the feature fusion module merges the call sub-graph feature set and the component feature set, and uses the Word2vec tool to extract, by a word embedding method, the semantic feature vectors of all elements in the merged set to form a feature matrix $FV = \{x_1, \dots, x_t, \dots, x_{nF}\}$, where $x_t$ is a semantic feature vector.
4. The system of claim 3, wherein the bi-directional independent recurrent neural network layer further comprises:
extracting features in the past-to-future direction as $\overrightarrow{h}_t = \sigma(\overrightarrow{W} x_t + \overrightarrow{u} \odot \overrightarrow{h}_{t-1} + \overrightarrow{b})$, where the semantic feature vector $x_t$ is the current input of the network, $\overrightarrow{h}_{t-1}$ is the previous hidden state of the network, $\overrightarrow{W}$ is the weight matrix of the past-to-future direction, $\overrightarrow{u}$ is the recurrent unit weight of the past-to-future direction, $\overrightarrow{b}$ is the bias of the past-to-future direction, $\odot$ denotes the Hadamard product, and $\sigma$ is the activation function;
extracting features in the future-to-past direction as $\overleftarrow{h}_t = \sigma(\overleftarrow{W} x_t + \overleftarrow{u} \odot \overleftarrow{h}_{t+1} + \overleftarrow{b})$, where $\overleftarrow{h}_{t+1}$ is the future hidden state of the network, $\overleftarrow{W}$ is the weight matrix of the future-to-past direction, $\overleftarrow{u}$ is the recurrent unit weight of the future-to-past direction, and $\overleftarrow{b}$ is the bias of the future-to-past direction;
splicing the features extracted in the two directions to obtain the output $h_{BInd}$ of the bidirectional independent recurrent neural network layer, the splicing using two different weight matrices, $W_f$ and $W_b$;
the fully connected network layer outputs $h_{FCN} = W_{FCN} h_{BInd} + b_{FCN}$, where $W_{FCN}$ is the weight matrix of the fully connected layer neurons and $b_{FCN}$ is the bias of the fully connected layer neurons;
the Sigmoid function output layer outputs the classification result $y = \mathrm{Sigmoid}(h_{FCN})$ as the malware detection result, one class representing malware and the other class representing benign software.
5. The system of claim 4, wherein the class-sensitive bidirectional independent recurrent neural network is obtained through multiple training iterations, the training process being based on a loss function in which $y_k$ is the actual output of the fully connected network for the k-th class; $t_k$ is the expected output of the fully connected network for the k-th class; $C$ is a cost term; $sam \in dataset$ ranges over every sample in the training set; $p$ is the true label of the sample and $k$ is the predicted label of the sample.
6. A computer device, comprising:
a memory for storing instructions;
a processor for invoking execution of instructions stored in the memory to implement the system of any of claims 1-4.
CN202311599310.5A 2023-11-28 2023-11-28 Malicious code detection system Active CN117574370B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311599310.5A CN117574370B (en) 2023-11-28 2023-11-28 Malicious code detection system

Publications (2)

Publication Number Publication Date
CN117574370A true CN117574370A (en) 2024-02-20
CN117574370B CN117574370B (en) 2024-05-31

Family

ID=89893471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311599310.5A Active CN117574370B (en) 2023-11-28 2023-11-28 Malicious code detection system

Country Status (1)

Country Link
CN (1) CN117574370B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107180192A (en) * 2017-05-09 2017-09-19 北京理工大学 Android malicious application detection method and system based on multi-feature fusion
US20200137083A1 (en) * 2018-10-24 2020-04-30 Nec Laboratories America, Inc. Unknown malicious program behavior detection using a graph neural network
US20200364334A1 (en) * 2019-05-16 2020-11-19 Cisco Technology, Inc. Detection of malicious executable files using hierarchical models
CN113935033A (en) * 2021-09-13 2022-01-14 北京邮电大学 Feature-fused malicious code family classification method and device and storage medium
CN114817924A (en) * 2022-05-19 2022-07-29 电子科技大学 AST (AST) and cross-layer analysis based android malicious software detection method and system
CN116305125A (en) * 2023-03-09 2023-06-23 中国工商银行股份有限公司 Malicious code detection method and device, electronic equipment and storage medium
CN116611063A (en) * 2023-05-15 2023-08-18 西北工业大学 Graph convolution neural network malicious software detection method based on multi-feature fusion
CN116702143A (en) * 2023-06-15 2023-09-05 北京泛网互联科技有限责任公司 Intelligent malicious software detection method based on API (application program interface) characteristics
CN117113163A (en) * 2023-06-15 2023-11-24 中国人民解放军空军工程大学 Malicious code classification method based on bidirectional time domain convolution network and feature fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Du Jianbin et al.: "Malware classification method based on graph neural networks", Internet Weekly (互联网周刊), 15 September 2023 (2023-09-15), pages 93 - 95 *

Also Published As

Publication number Publication date
CN117574370B (en) 2024-05-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant