CN117574370A - Malicious code detection system - Google Patents
- Publication number
- CN117574370A (CN202311599310.5A)
- Authority
- CN
- China
- Prior art keywords
- feature
- component
- graph
- sensitive
- manifest file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
- G06F21/563—Static detection by source code analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/53—Decompilation; Disassembly
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides a malicious code detection system comprising: a decompilation module for obtaining a manifest file and source code from the application to be detected; a call subgraph analysis module for extracting a call subgraph feature set from the source code; a component analysis module for extracting a component feature set from the manifest file; a feature fusion module for fusing the call subgraph feature set and the component feature set into a feature matrix suitable for a deep learning model; and a malware detection module for feeding the feature matrix to a bidirectional independently recurrent neural network based on a class-sensitive mechanism to obtain a malware detection result. The system analyzes program execution logic and extracts program semantic information from call subgraph features, and additionally fuses and analyzes information about the software's components, which improves the detection accuracy for malicious Internet of Things applications and the security of Internet of Things systems.
Description
Technical Field
The invention relates to the field of software code detection and analysis, in particular to a malicious code detection system.
Background
With the rapid development of Internet of Things technology, mobile Internet of Things devices such as smartphones and tablet computers have come into wide use. Their widespread adoption has greatly increased convenience and comfort in daily life. In addition, the proliferation of mobile Internet of Things devices has made more and more Internet of Things services remotely accessible over Internet of Things networks. The rapid spread of these devices also brings major security challenges, especially attacks by malicious Internet of Things application software. While Internet of Things applications provide convenient services to users, attackers motivated by profit exploit vulnerabilities in these applications or distribute malicious applications in order to steal users' personal information, money, and so on.
In the Internet of Things era, mobile platforms process large amounts of sensitive data, including personal email and communications, voice calls, text messages, and corporate and financial data. A mobile Internet of Things device typically has GPS functionality, meaning that it can determine the physical location of its holder at any time. Such devices also contain built-in cameras, microphones, accelerometers, magnetometers, and similar sensors, and can therefore be hijacked by malware and used to eavesdrop on their surroundings. Malware developers have consequently targeted these smart mobile Internet of Things devices.
Traditional malware detection relies on signature-based detection, that is, matching a sample against a library of known malware signatures. This approach can only detect known malware, depends heavily on the size and completeness of the signature library, suffers from a high false alarm rate, and cannot cope with the threat of new malware variants. Behavior-based malware detection systems instead focus on suspicious behavior rather than specific signatures: they monitor the behavior of software and look for patterns that indicate malicious activity. However, such schemes consume considerable computing resources and time and cannot meet the real-time requirements of malware detection in practice. The present method analyzes program execution logic and extracts program semantic information from application programming interface (API) call features, and also extracts component analysis features to capture more malicious behaviors, which effectively improves the detection accuracy for malicious Internet of Things applications and in turn the security of Internet of Things systems. Furthermore, in the real world malware makes up only a small fraction of all software compared with benign software. This creates a serious class imbalance problem that can mislead a malware detection model into misclassifying samples. To address this, the invention provides a class-sensitive bidirectional independently recurrent neural network model, which introduces a cross-entropy function based on a class-sensitive mechanism to handle the data skew caused by imbalanced data sets. The proposed strategy helps improve the performance of the method, alleviates overfitting and underfitting, and allows malware to be detected effectively.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a novel malware detection technique for Internet of Things application scenarios, based on call subgraph and component communication analysis.
According to a first aspect of the present invention, there is provided a malicious code detection system comprising:
a decompilation module for obtaining the manifest file and the source code from the application to be detected;
a call subgraph analysis module for extracting a call subgraph feature set from the source code.
The call subgraph analysis module comprises a program call graph generation module and a sensitive API feature generation module.
The program call graph generation module constructs a function call graph (FCG) from the input source code and extracts all sensitive API nodes using the PScout tool. The FCG is expressed as $G = \{V, E\}$ and consists of a set of nodes $V$ and a set of edges $E$. Each node in $V$ represents a function in the application, including the sensitive APIs, which together form the sensitive API set. Each edge in $E$ represents a calling relationship between a caller and a callee.
The sensitive API feature generation module analyzes the FCG and extracts a sensitive function call subgraph feature set as follows. First, a TF-IDF method assigns a malicious degree value to each sensitive API in the sensitive API set, yielding the sensitive API malicious degree set. The malicious degree of the $i$-th sensitive API is defined in terms of four training-set counts: $Q_{total}$, the number of all benign software samples in the training set; $R_{total}$, the number of all malware samples in the training set; the number of benign software samples in the training set that call the $i$-th sensitive API; and the number of malware samples in the training set that call it. Second, the FCG corresponds to a set of sensitive function call subgraphs, where each subgraph consists of one sensitive API and its adjacent nodes $u_1, \dots, u_k$. A distance function controls the extent of each subgraph, and $k$ is a user-defined control parameter that sets the number of neighbor nodes the subgraph contains. Finally, the sensitive function call subgraph feature set is computed.
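The four training-set counts that the malicious degree depends on can be gathered in a single pass. The sketch below is illustrative only: the function name `api_call_counts`, the input layout (one set of called APIs plus a malware flag per sample), and the sample API names are assumptions, and the TF-IDF formula that combines the counts is deliberately not shown because it does not survive in the text above.

```python
from collections import Counter

def api_call_counts(samples, sensitive_apis):
    """Count benign/malware totals and per-API caller counts.

    samples: iterable of (called_apis: set[str], is_malware: bool)  (assumed layout)
    Returns (q_total, r_total, q_calls, r_calls), where q_* refer to benign
    samples and r_* to malware samples, mirroring Q_total and R_total.
    """
    q_total = r_total = 0
    q_calls, r_calls = Counter(), Counter()
    for called_apis, is_malware in samples:
        hits = called_apis & sensitive_apis  # sensitive APIs this sample calls
        if is_malware:
            r_total += 1
            r_calls.update(hits)
        else:
            q_total += 1
            q_calls.update(hits)
    return q_total, r_total, q_calls, r_calls

# Hypothetical toy training set with two sensitive APIs.
sensitive = {"sendTextMessage", "getDeviceId"}
training = [
    ({"sendTextMessage", "getDeviceId"}, True),
    ({"getDeviceId", "onCreate"}, False),
    ({"sendTextMessage"}, True),
]
q_total, r_total, q_calls, r_calls = api_call_counts(training, sensitive)
print(q_total, r_total, dict(q_calls), dict(r_calls))
```

An API called mostly by malware and rarely by benign software (here `sendTextMessage`) is the kind of API a TF-IDF style weighting would be expected to rank as more malicious.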
a component analysis module for extracting a component feature set from the manifest file.
The component analysis module extracts several categories of component features from the manifest file: activity component features, service component features, content provider component features, and broadcast receiver component features.
Extracting activity component features comprises: querying the manifest file for the activity tag (activity) and taking the names of all entries carrying the activity tag as the activity component features $T = \{t_1, \dots, t_{n_T}\}$, where $t$ denotes an activity, which is responsible for presenting a user interface in the application.
Extracting service component features comprises: querying the manifest file for the service tag (service) and taking the names of all entries carrying the service tag as the service component features $R = \{r_1, \dots, r_{n_R}\}$, where $r$ denotes a service, which is responsible for background processing in the application.
Extracting content provider component features comprises: querying the manifest file for the content provider tag (provider) and taking the names of all entries carrying the content provider tag as the content provider component features $P = \{p_1, \dots, p_{n_P}\}$, where $p$ denotes a content provider, which is responsible for sharing structured data in the application.
Extracting broadcast receiver component features comprises: querying the manifest file for the broadcast receiver tag (receiver) and taking the names of all entries carrying the broadcast receiver tag as the broadcast receiver component features $C = \{c_1, \dots, c_{n_C}\}$, where $c$ denotes a broadcast receiver, which is responsible for providing the capability to receive information in the application.
The component feature set $COM = \{T, C, P, R\}$ is constructed from these component features.
a feature fusion module for fusing the call subgraph feature set and the component feature set into a feature matrix suitable for a deep learning model; and
a malware detection module for feeding the feature matrix to a bidirectional independently recurrent neural network based on a class-sensitive mechanism to obtain a malware detection result.
The bidirectional independently recurrent neural network based on the class-sensitive mechanism in the malware detection module comprises a bidirectional independently recurrent neural network layer, a fully connected network layer, and a Sigmoid function output layer. The bidirectional independently recurrent neural network layer extracts information from the input feature matrix in both the past-to-future and future-to-past directions and then concatenates the results; the fully connected network layer processes the output of the bidirectional independently recurrent neural network layer; and the Sigmoid function output layer applies a Sigmoid function to the output of the fully connected network layer.
Further, in the system provided by the invention, the decompilation module takes the APK file of the application to be detected as input, uses the decompilation tool android to extract the manifest file and the DEX file, and recovers the source code from the DEX file.
Further, in the system provided by the invention, the feature fusion module merges the call subgraph feature set and the component feature set and uses the Word2vec tool, a word embedding method, to extract a semantic feature vector for every element of the merged set, forming the feature matrix $FV = \{x_1, \dots, x_t, \dots, x_{n_F}\}$, where $x_t$ is a semantic feature vector.
Further, in the system provided by the invention, the bidirectional independently recurrent neural network layer operates as follows.
Features are extracted in the past-to-future direction as
$$\overrightarrow{h}_t = \sigma\left(\overrightarrow{W} x_t + \overrightarrow{u} \odot \overrightarrow{h}_{t-1} + \overrightarrow{b}\right),$$
where the semantic feature vector $x_t$ is the current input of the network, $\overrightarrow{h}_{t-1}$ is the previous hidden state of the network, $\overrightarrow{W}$ is the weight matrix in the past-to-future direction, $\overrightarrow{u}$ is the recurrent unit weight in the past-to-future direction, $\overrightarrow{b}$ is the bias in the past-to-future direction, $\odot$ denotes the Hadamard product, and $\sigma$ is the activation function.
Features are extracted in the future-to-past direction as
$$\overleftarrow{h}_t = \sigma\left(\overleftarrow{W} x_t + \overleftarrow{u} \odot \overleftarrow{h}_{t+1} + \overleftarrow{b}\right),$$
where $\overleftarrow{h}_{t+1}$ is the future hidden state of the network, $\overleftarrow{W}$ is the weight matrix in the future-to-past direction, $\overleftarrow{u}$ is the recurrent unit weight in the future-to-past direction, and $\overleftarrow{b}$ is the bias in the future-to-past direction.
The features extracted in the two directions are concatenated, using two different weight matrices $W_f$ and $W_b$, to obtain the output $h_{BInd}$ of the bidirectional independently recurrent neural network layer.
The fully connected network layer outputs $h_{FCN} = W_{FCN} h_{BInd} + b_{FCN}$, where $W_{FCN}$ is the weight matrix of the fully connected layer neurons and $b_{FCN}$ is their bias.
The Sigmoid function output layer outputs the classification result $y = \mathrm{Sigmoid}(h_{FCN})$ as the malware detection result, where one class represents malware and the other represents benign software.
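The three layers described above can be sketched as a forward pass in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the activation is assumed to be ReLU, the two directions are joined by simple concatenation of their final hidden states (the separate weight matrices W_f and W_b are omitted), and all function names and parameter values are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def indrnn_pass(xs, W, u, b):
    """One direction: h_t = relu(W x_t + u * h_prev + b), with u applied
    element-wise (Hadamard product), so each neuron only sees its own past."""
    h = [0.0] * len(u)
    for x in xs:
        wx = matvec(W, x)
        h = [max(0.0, wx[i] + u[i] * h[i] + b[i]) for i in range(len(u))]
    return h

def bi_indrnn_predict(xs, fwd, bwd, w_fcn, b_fcn):
    """Bidirectional pass, fully connected layer, then Sigmoid output."""
    h_fwd = indrnn_pass(xs, *fwd)                  # past-to-future
    h_bwd = indrnn_pass(list(reversed(xs)), *bwd)  # future-to-past
    h_bind = h_fwd + h_bwd                         # concatenation (assumed)
    h_fcn = sum(w * h for w, h in zip(w_fcn, h_bind)) + b_fcn
    return 1.0 / (1.0 + math.exp(-h_fcn))

# Hypothetical toy parameters: 2-dimensional inputs, 2 hidden units per direction.
identity = [[1.0, 0.0], [0.0, 1.0]]
params = (identity, [0.5, 0.5], [0.0, 0.0])  # (W, u, b)
feature_matrix = [[1.0, 0.0], [0.0, 1.0]]    # rows play the role of x_1, x_2
y = bi_indrnn_predict(feature_matrix, params, params, [1.0, -1.0, 1.0, -1.0], 0.0)
print(y)
```

The element-wise recurrent weight `u` is what distinguishes an independently recurrent layer from an ordinary recurrent layer, where the previous hidden state would be multiplied by a full matrix.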
Further, in the system provided by the invention, the bidirectional independently recurrent neural network based on the class-sensitive mechanism is obtained through multiple training iterations, and the iterative training process is based on a loss function defined over the following quantities: $y_k$ is the actual output of the fully connected network for the $k$-th class; $t_k$ is the expected output of the fully connected network for the $k$-th class; $C$ is the cost term; the sum over $sam \in dataset$ runs over every sample in the training set; $p$ is the true label of a sample; and $k$ is its predicted label.
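To make the role of the cost term concrete, the sketch below shows one common form of cost-weighted binary cross-entropy, in which each sample's loss is scaled by a cost that depends on its true label. This is an assumption for illustration and is not claimed to be the exact loss used by the invention; the function name `class_sensitive_ce` and the cost values are hypothetical.

```python
import math

def class_sensitive_ce(probs, labels, cost):
    """Mean cross-entropy where each sample is weighted by cost[true_label].

    probs:  predicted probability of the malware class for each sample
    labels: true labels (1 = malware, 0 = benign)
    cost:   mapping from true label to its misclassification cost (assumed form)
    """
    eps = 1e-12  # keeps log() finite for predictions of exactly 0 or 1
    total = 0.0
    for y, p in zip(probs, labels):
        y = min(max(y, eps), 1.0 - eps)
        ce = -math.log(y) if p == 1 else -math.log(1.0 - y)
        total += cost[p] * ce
    return total / len(labels)

# Hypothetical costs: an error on the rare malware class counts three times as much.
cost = {0: 1.0, 1: 3.0}
loss = class_sensitive_ce([0.5, 0.5], [1, 0], cost)
print(loss)
```

Raising the cost of the minority class increases the gradient contribution of malware samples during back-propagation, which is the general idea behind using a cost term to counter class imbalance.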
According to a second aspect of the present invention, there is provided a computer device characterized by comprising:
a memory for storing instructions; and
and a processor for invoking the instructions stored in the memory to implement the system of the first aspect.
Compared with the prior art, the technical scheme of the invention has at least the following beneficial effects:
1. Call subgraph features (CSG): in malware detection, application programming interfaces can be used to model malicious patterns and detect malware instances. For example, real-world malware often uses existing APIs of the Android system to imitate the normal behavior of benign applications and so evade the security checks of malicious behavior detection systems. The invention therefore constructs a call subgraph feature extraction module that extracts the sensitive APIs frequently used by malware to build call subgraph features, which effectively reflect the caller-callee relationships in the application.
2. Component features (COM): the invention constructs a feature extraction module based on component features and systematically analyzes the component calling patterns exhibited by benign and malicious applications. This set of features represents one of the most common attack vectors used by malware.
3. Improved feature fusion method: on the one hand, call subgraph features (CSG) effectively reflect the malicious behavior patterns of Internet of Things malware and greatly help in detecting malware instances, but they ignore information about interactions among different components. On the other hand, component features (COM) fully cover interactions between different components within an application and between different applications, for example through broadcast receivers and content providers, which provides great flexibility and generality. Given the complementary representational power of the two feature types, the invention provides a feature fusion scheme that combines their advantages to give a more comprehensive representation of malicious behavior patterns.
4. Internet of Things malware detection classifier based on a bidirectional independently recurrent neural network model: the invention provides a new class-sensitive cross-entropy function that introduces cost terms into the back-propagation learning process of the independently recurrent neural network model to account for the relative importance of recognizing each class, which effectively addresses the data skew caused by imbalanced data sets.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic diagram of the overall framework design, shown according to an exemplary embodiment.
FIG. 2 is a schematic diagram of the bidirectional independently recurrent neural network based on the class-sensitive mechanism, shown according to an exemplary embodiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The invention provides a novel malware detection technique for Internet of Things application scenarios, based on call subgraph and component communication analysis, in the form of a malicious code detection system. The framework comprises five main modules:
a decompilation module for obtaining the manifest file and the source code from the application to be detected;
a call subgraph analysis module for extracting a call subgraph feature set from the source code;
a component analysis module for extracting a component feature set from the manifest file;
a feature fusion module for fusing the call subgraph feature set and the component feature set into a feature matrix suitable for a deep learning model; and
a malware detection module for feeding the feature matrix to a bidirectional independently recurrent neural network based on a class-sensitive mechanism to obtain a malware detection result.
Part one: the decompilation module. The decompilation module takes the APK file of the application to be detected as input, uses the decompilation tool android to extract the manifest file and the DEX file, and recovers the source code from the DEX file.
At this stage, we decompile the APK file to extract its resources, manifest file, and classes.dex file. Internet of Things applications are typically developed in the Java programming language and then compiled into Dalvik code (DEX), which is stored in a classes.dex file. The compiled code and the necessary resources are packaged into an APK file. During decompilation, we extract the Dalvik code from the APK and recreate the source code and class files of each application using an open-source tool (android tool). We then extract the call subgraph features (CSG) and the component features (COM).
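Because an APK is a ZIP archive, the first step of pulling out the manifest and DEX entries can be illustrated with the Python standard library alone. The sketch below is only that first step under stated assumptions: the function name `split_apk` is hypothetical, the demo archive is synthetic, and a real decompiler is still needed afterwards, since the manifest inside a real APK is stored as binary XML and the DEX bytecode must be decompiled to recover source code.

```python
import io
import zipfile

def split_apk(apk_bytes):
    """Return (manifest_bytes, {dex_name: dex_bytes}) from an APK archive."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        manifest = apk.read("AndroidManifest.xml")
        dex_files = {
            name: apk.read(name)
            for name in apk.namelist()
            # classes.dex, classes2.dex, ... live at the archive root
            if name.startswith("classes") and name.endswith(".dex")
        }
    return manifest, dex_files

def make_demo_apk():
    """Build a tiny synthetic archive shaped like an APK (placeholder contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("AndroidManifest.xml", b"<manifest/>")
        z.writestr("classes.dex", b"dex\n035\x00")
        z.writestr("res/layout/main.xml", b"<LinearLayout/>")
    return buf.getvalue()

manifest, dex_files = split_apk(make_demo_apk())
print(len(manifest), sorted(dex_files))
```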
Part two: the call subgraph analysis module. The call subgraph analysis module comprises a program call graph generation module and a sensitive API feature generation module.
The program call graph generation module constructs a function call graph (FCG) from the input source code and extracts all sensitive API nodes using the PScout tool. The FCG is expressed as $G = \{V, E\}$ and consists of a set of nodes $V$ and a set of edges $E$. Each node in $V$ represents a function in the application, including the sensitive APIs, which together form the sensitive API set. Each edge in $E$ represents a calling relationship between a caller and a callee.
The sensitive API feature generation module analyzes the FCG and extracts a sensitive function call subgraph feature set as follows. First, a TF-IDF method assigns a malicious degree value to each sensitive API in the sensitive API set, yielding the sensitive API malicious degree set. The malicious degree of the $i$-th sensitive API is defined in terms of four training-set counts: $Q_{total}$, the number of all benign software samples in the training set; $R_{total}$, the number of all malware samples in the training set; the number of benign software samples in the training set that call the $i$-th sensitive API; and the number of malware samples in the training set that call it. Second, the FCG corresponds to a set of sensitive function call subgraphs, where each subgraph consists of one sensitive API and its adjacent nodes $u_1, \dots, u_k$. A distance function controls the extent of each subgraph, and $k$ is a user-defined control parameter that sets the number of neighbor nodes the subgraph contains. Finally, the sensitive function call subgraph feature set is computed.
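One plausible reading of the subgraph step is sketched below: starting from a sensitive API node, collect up to k of its nearest nodes in the call graph and keep the edges among them. The choice of hop count as the distance function, the breadth-first traversal, the function name `sensitive_subgraph`, and the sample call graph are all assumptions made for illustration.

```python
from collections import deque

def sensitive_subgraph(edges, api, k):
    """Return (nodes, sub_edges) for a sensitive API and up to k nearest nodes.

    edges: list of (caller, callee) pairs forming the call graph
    Distance is the undirected hop count from the API node (assumed).
    """
    neighbors = {}
    for caller, callee in edges:
        neighbors.setdefault(caller, []).append(callee)
        neighbors.setdefault(callee, []).append(caller)

    nodes = [api]              # visit order; the API itself comes first
    queue = deque([api])
    while queue and len(nodes) < k + 1:
        current = queue.popleft()
        for nxt in neighbors.get(current, []):
            if nxt not in nodes:
                nodes.append(nxt)
                queue.append(nxt)
                if len(nodes) == k + 1:
                    break

    kept = set(nodes)
    sub_edges = [(a, b) for a, b in edges if a in kept and b in kept]
    return nodes, sub_edges

# Hypothetical call graph: two handlers reach the SMS API through a helper.
call_graph = [
    ("onCreate", "sendSms"),
    ("sendSms", "sendTextMessage"),
    ("onClick", "sendSms"),
    ("onCreate", "initUi"),
]
nodes, sub_edges = sensitive_subgraph(call_graph, "sendTextMessage", k=2)
print(nodes, sub_edges)
```

With k=2 the subgraph keeps the API, its direct caller, and one caller of that helper, so the control parameter directly bounds how much calling context each sensitive API contributes.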
An application programming interface (API) is a high-level programming interface that encapsulates system calls and reflects the behavior of code segments in a program. Internet of Things applications use various APIs to implement their functions. Malicious programs need to communicate with external programs and often use sensitive APIs to carry out malicious behavior, such as obtaining personal user information, manipulating device functions, or performing illegal operations. The combinations of sensitive API calls most common in malware reveal the presence of malicious activity such as credential theft and the sale of user data. These API calls can therefore indicate potentially malicious intent while also revealing vulnerabilities of the application.
All sensitive API nodes contained in the APK installation package under analysis are extracted using the open-source PScout tool. For how PScout identifies sensitive APIs, see Kathy Wain Yee Au et al., "PScout: Analyzing the Android Permission Specification," in Proceedings of the 19th ACM Conference on Computer and Communications Security (CCS 2012), October 2012.
Part three: the component analysis module. The component analysis module extracts a component feature set from the manifest file.
The component analysis module extracts several categories of component features from the manifest file: activity component features, service component features, content provider component features, and broadcast receiver component features.
Extracting activity component features comprises: querying the manifest file for the activity tag (activity) and taking the names of all entries carrying the activity tag as the activity component features $T = \{t_1, \dots, t_{n_T}\}$, where $t$ denotes an activity, which is responsible for presenting a user interface in the application.
Extracting service component features comprises: querying the manifest file for the service tag (service) and taking the names of all entries carrying the service tag as the service component features $R = \{r_1, \dots, r_{n_R}\}$, where $r$ denotes a service, which is responsible for background processing in the application.
Extracting content provider component features comprises: querying the manifest file for the content provider tag (provider) and taking the names of all entries carrying the content provider tag as the content provider component features $P = \{p_1, \dots, p_{n_P}\}$, where $p$ denotes a content provider, which is responsible for sharing structured data in the application.
Extracting broadcast receiver component features comprises: querying the manifest file for the broadcast receiver tag (receiver) and taking the names of all entries carrying the broadcast receiver tag as the broadcast receiver component features $C = \{c_1, \dots, c_{n_C}\}$, where $c$ denotes a broadcast receiver, which is responsible for providing the capability to receive information in the application.
The component feature set $COM = \{T, C, P, R\}$ is constructed from these component features.
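The extraction of the four component categories can be sketched with the standard-library XML parser, assuming the manifest has already been decoded to plain-text XML. The function name `component_features` and the sample manifest are hypothetical; the tag names (activity, service, provider, receiver) and the android:name attribute follow the Android manifest format.

```python
import xml.etree.ElementTree as ET

# android:name is a namespaced attribute, so ElementTree needs the full URI.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"
# Keys follow the set names T, R, P, C used in the text.
TAGS = {"T": "activity", "R": "service", "P": "provider", "C": "receiver"}

def component_features(manifest_xml):
    """Return {'T': [...], 'R': [...], 'P': [...], 'C': [...]} of component names."""
    root = ET.fromstring(manifest_xml)
    com = {key: [] for key in TAGS}
    app = root.find("application")
    if app is None:
        return com
    for key, tag in TAGS.items():
        for elem in app.findall(tag):
            name = elem.get(ANDROID_NS + "name")
            if name:
                com[key].append(name)
    return com

# Hypothetical decoded manifest used only for illustration.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.demo">
  <application>
    <activity android:name=".MainActivity"/>
    <service android:name=".SyncService"/>
    <provider android:name=".NotesProvider"
              android:authorities="com.example.demo.notes"/>
    <receiver android:name=".NetReceiver"/>
  </application>
</manifest>
"""

com = component_features(SAMPLE_MANIFEST)
print(com)
```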
At this stage, we extract component features, which are critical to understanding how malware operates and significantly enhance the detection system's ability to detect complex malware.
We systematically analyze the behavior patterns of benign applications and malware at the component level, because components are the core of an application and cannot easily be modified or deleted. In addition, the component feature extraction scheme is more flexible and general: it is not limited to specific API calls, can identify and capture more malicious behaviors, and improves the reliability and robustness of the detection system.
The basic building blocks of the internet of things application include activities, services, content providers, and broadcast receivers. The activity is responsible for presenting the user interface. The service is responsible for background processing. The content provider facilitates sharing of structured data. The broadcast receiver provides the ability to receive information. Development of malware typically involves code reuse, resulting in shared malicious behavior and component reuse among the same family of malware.
We extract the following classes of features:
An activity (activity) is responsible for presenting a user interface. For example, an Activity may set the "android.intent.category.BROWSABLE" category, which means that the Activity can be opened by a browser through a custom protocol, so the browser can call into the app. Here, android.intent.category.BROWSABLE is a component feature t1. Therefore, we regard the names of all activities as activity features.
Services (services) are responsible for background processing. For example, the developer needs to declare all services in the application configuration file. Malware may start a service by calling the startService() method, after which the service is in the started state. Therefore, we regard the names of all services as service features.
Content providers (content providers) facilitate sharing of structured data. For example, the Android platform provides the Content Provider so that a specified dataset of one application can be made available to other applications. Other applications may obtain data from, or store data in, the content provider through the ContentResolver class. Therefore, we regard the names of all content providers as content provider features.
The broadcast receiver (broadcast receiver) provides the ability to receive information. For example, malware may define a broadcast receiver class named mBroadcastReceiver, which may be used to receive the broadcast "android.net.conn.CONNECTIVITY_CHANGE" issued when the network state changes. Here, "android.net.conn.CONNECTIVITY_CHANGE" is a feature c1 of the broadcast receiver. Therefore, we regard the names of all broadcast receivers as broadcast receiver features.
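The component extraction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the manifest has already been decoded to plain-text XML by the decompilation step, and the function name `extract_component_features`, the sample manifest, and the dictionary keys are hypothetical. The keys mirror the set symbols T, R, P, and C used in the claims.

```python
import xml.etree.ElementTree as ET

# Namespace that Android uses for attributes such as android:name.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Manifest tag -> component feature set symbol (T, R, P, C as in the claims).
TAG_TO_SET = {"activity": "T", "service": "R", "provider": "P", "receiver": "C"}


def extract_component_features(manifest_xml):
    """Collect the declared names of the four component types."""
    root = ET.fromstring(manifest_xml)
    features = {symbol: [] for symbol in TAG_TO_SET.values()}
    application = root.find("application")
    if application is None:
        return features
    for tag, symbol in TAG_TO_SET.items():
        for node in application.findall(tag):
            name = node.get(ANDROID_NS + "name")
            if name:
                features[symbol].append(name)
    return features


# Hypothetical sample manifest, for illustration only.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.app">
  <application>
    <activity android:name=".MainActivity"/>
    <service android:name=".SyncService"/>
    <provider android:name=".DataProvider"/>
    <receiver android:name=".mBroadcastReceiver"/>
  </application>
</manifest>
"""

com = extract_component_features(SAMPLE_MANIFEST)
print(com)
```

The resulting dictionary corresponds to the component feature set COM = {T, C, P, R}. Intent-filter entries such as android.intent.category.BROWSABLE could be collected in the same loop if they are also to be treated as features.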
Fourth part: the feature fusion module. The feature fusion module merges the call sub-graph feature set and the component feature set, uses the Word2vec tool to extract a semantic feature vector for every element of the merged set by word embedding, and forms the feature matrix $FV = \{x_1, \dots, x_t, \dots, x_{nF}\}$, where $x_t$ is a semantic feature vector.
In the feature fusion stage, we construct feature vectors for each internet of things application by embedding these features into feature space in an attempt to extract potential semantic patterns. After the feature vectors are generated, we use a neural network model to classify malware, modeling the high-level concepts and facts in malware.
Given a feature set F, we use word embedding methods to extract semantic vectors, reflecting potential behavioral patterns of malware. This approach is spatially insensitive, independent of the order of words or local patterns.
In our example, the semantic features of each malware sample are mapped into a two-dimensional matrix FV of fixed size. We set the maximum number of semantic features to M and the embedding size of each vector to K, so FV has M rows and K columns. Finally, we convert the fused feature set F into the semantic features FV and use FV as the input of the detection model for learning semantic knowledge. These semantic patterns are aggregated at a lower level, which helps the model identify feature patterns and represent high-level domain knowledge.
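The mapping from a fused feature set to a fixed M × K matrix can be sketched as below. The sketch assumes the embedding vectors have already been trained (for example with Word2vec) and are available as a lookup table. Zero-padding short feature lists, truncating long ones, and mapping unknown features to zero vectors are assumptions made here for illustration; the text only states that the matrix has a fixed size. All names are hypothetical.

```python
def build_feature_matrix(features, embeddings, max_features, embed_size):
    """Map a fused feature list to a fixed max_features x embed_size matrix."""
    zero = [0.0] * embed_size
    rows = []
    for name in features[:max_features]:               # keep at most M features
        rows.append(list(embeddings.get(name, zero)))  # unknown feature -> zeros
    while len(rows) < max_features:                    # pad up to M rows
        rows.append(list(zero))
    return rows


# Hypothetical toy embeddings with K = 2.
toy_embeddings = {
    ".MainActivity": [0.1, 0.2],
    ".SyncService": [0.3, 0.4],
}

fv = build_feature_matrix(
    [".MainActivity", ".SyncService", ".UnknownReceiver"],
    toy_embeddings,
    max_features=4,
    embed_size=2,
)
print(fv)
```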
Fifth part: the malware detection module. The class-sensitive bidirectional independent recurrent neural network in the malware detection module comprises a bidirectional independent recurrent neural network layer, a fully connected network layer, and a Sigmoid function output layer. The bidirectional independent recurrent neural network layer extracts information from the input feature matrix in the past and future directions at the same time and then splices the results; the fully connected network layer processes the output of the bidirectional independent recurrent neural network layer; and the Sigmoid function output layer applies a Sigmoid function to the output of the fully connected network layer.
At this stage, the fused features are embedded into the feature space to build semantic feature vectors. The semantic feature vectors are then fed into a malware detection model based on a class-sensitive independent recurrent neural network, which extracts sequence features bidirectionally, from both the past and the future.
In this section, we propose a malware classifier based on a class-sensitive bidirectional independent recurrent neural network to describe the behavior of malware and model semantic knowledge. An independent recurrent neural network is a variant of a recurrent neural network. In a conventional recurrent neural network, the weight matrix is shared across the time steps of the input sequence. An independent recurrent neural network instead assigns an individual recurrent weight to each recurrent unit. This strategy enables it to handle longer sequences and reduces the effects of vanishing or exploding gradients.
The bidirectional independent recurrent neural network model extracts information from the input features in both the past and future directions.
Features are extracted in the past-to-future direction as

$$\overrightarrow{h}_t = \sigma\left(\overrightarrow{W} x_t + \overrightarrow{u} \odot \overrightarrow{h}_{t-1} + \overrightarrow{b}\right),$$

where the semantic feature vector $x_t$ is the current input of the network, $\overrightarrow{h}_{t-1}$ is the previous hidden state of the network, $\overrightarrow{W}$ is the weight matrix of the past-to-future direction, $\overrightarrow{u}$ is the recurrent unit weight of the past-to-future direction, $\overrightarrow{b}$ is the bias of the past-to-future direction, $\odot$ denotes the Hadamard product, and $\sigma$ is the activation function.
Features are extracted in the future-to-past direction as

$$\overleftarrow{h}_t = \sigma\left(\overleftarrow{W} x_t + \overleftarrow{u} \odot \overleftarrow{h}_{t+1} + \overleftarrow{b}\right),$$

where $\overleftarrow{h}_{t+1}$ is the future hidden state of the network, $\overleftarrow{W}$ is the weight matrix of the future-to-past direction, $\overleftarrow{u}$ is the recurrent unit weight of the future-to-past direction, and $\overleftarrow{b}$ is the bias of the future-to-past direction.
Splicing the features extracted in the two directions yields the output $h_{BInd}$ of the bidirectional independent recurrent neural network layer, where $W_f$ and $W_b$ are two different weight matrices.
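The two directional passes can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: it assumes the per-step update h_t = ReLU(W x_t + u ⊙ h_{t-1} + b), where ReLU is an assumed choice of activation, and it splices the two directions by simple concatenation at each time step, leaving out the weight matrices W_f and W_b because their exact role is not recoverable from the text. All function names are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def indrnn_pass(xs, W, u, b):
    """Run one directional pass and return the hidden state at every step."""
    h = [0.0] * len(u)
    states = []
    for x in xs:
        pre = matvec(W, x)
        # u * h is element-wise (Hadamard): each unit sees only its own past.
        h = [max(0.0, p + ui * hi + bi) for p, ui, hi, bi in zip(pre, u, h, b)]
        states.append(h)
    return states


def bidirectional_indrnn(xs, fwd_params, bwd_params):
    """Concatenate past-to-future and future-to-past states per time step."""
    forward = indrnn_pass(xs, *fwd_params)
    backward = indrnn_pass(xs[::-1], *bwd_params)[::-1]
    return [f + g for f, g in zip(forward, backward)]


# Toy example: two time steps, input size 1, one hidden unit per direction.
params = ([[1.0]], [0.5], [0.0])  # (W, u, b), reused for both directions
h_bind = bidirectional_indrnn([[1.0], [2.0]], params, params)
print(h_bind)
```

Because the recurrent weight u is a vector rather than a matrix, the units of a layer do not mix through the recurrence, which is the property that distinguishes this network from a conventional recurrent network.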
We introduce a Fully Connected Network (FCN) as an integral part of our architecture. The FCN layer processes the output vectors of the bidirectional independent recurrent neural network. The FCN complements the bidirectional independent recurrent neural network, so that the model can capture local and global information in a coordinated manner.
The fully connected network layer outputs $h_{FCN} = W_{FCN} h_{BInd} + b_{FCN}$, where $W_{FCN}$ is the weight matrix of the fully connected layer neurons and $b_{FCN}$ is the bias of the fully connected layer neurons.
The Sigmoid function output layer outputs the classification result $y = \mathrm{Sigmoid}(h_{FCN})$ as the malware detection result; one class represents malware and the other class represents benign software.
Typically, the output $y = \{1, 0\}$ represents malware and $y = \{0, 1\}$ represents benign software.
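The last two layers can be sketched as follows, using the stated relations h_FCN = W_FCN h_BInd + b_FCN and y = Sigmoid(h_FCN). The weights below are toy values, the function names are hypothetical, and the rule that picks the class with the larger output is an assumption added here to turn the two scores into a label.

```python
import math


def sigmoid(z):
    """Logistic function applied to a single value."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(h_bind, W_fcn, b_fcn):
    """Apply the fully connected layer, then the Sigmoid output layer."""
    h_fcn = [
        sum(w * h for w, h in zip(row, h_bind)) + b
        for row, b in zip(W_fcn, b_fcn)
    ]
    y = [sigmoid(v) for v in h_fcn]
    # Assumed decision rule: output 0 is the malware unit, output 1 the benign unit.
    label = "malware" if y[0] >= y[1] else "benign"
    return y, label


# Toy weights: two inputs, two output classes.
y, label = classify([1.0, 0.0], [[2.0, 0.0], [-2.0, 0.0]], [0.0, 0.0])
print(y, label)
```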
The class-sensitive bidirectional independent recurrent neural network is obtained through multiple training iterations, and the iterative training process is based on a loss function,
wherein $y_k$ is the actual output of the k-th class of the fully connected network; $t_k$ is the expected output of the k-th class of the fully connected network; $C$ denotes the cost term; $sam \in dataset$ denotes each sample in the training set; $p$ is the true label of the sample and $k$ is the predicted label of the sample.
Most existing work focuses on improving malware detection performance and ignores the fact that malicious samples make up only a small proportion of real-world data. This severe imbalance makes a detection model prone to overfitting and reduces its generalization ability.
One existing work uses a class-sensitive cross-entropy function to solve the multi-class imbalance problem in botnet detection.
Drawing on that work, the invention introduces a class-sensitive cross-entropy function to handle the data skew caused by unbalanced datasets in malware detection. The original independent recurrent neural network is adjusted to be cost-sensitive: the model introduces cost terms into the back-propagation learning process to account for the importance of recognizing each class. As a result, the model is better able to identify potential malware threats and supports a more comprehensive understanding of malware behavior.
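A cost-weighted cross-entropy of this kind can be sketched as below. The exact loss formula is not reproduced in this text, so the form used here is an assumption: each output unit contributes a binary cross-entropy term that is multiplied by a cost C[p][k], where p is the true class of the sample and k is the index of the output unit. The function name and the cost values are hypothetical.

```python
import math


def class_sensitive_cross_entropy(outputs, targets, cost, eps=1e-12):
    """Sum a cost-weighted cross-entropy over all samples (assumed form)."""
    total = 0.0
    for y, t in zip(outputs, targets):
        p = t.index(max(t))  # true class of this sample
        for k, (y_k, t_k) in enumerate(zip(y, t)):
            y_k = min(max(y_k, eps), 1.0 - eps)  # avoid log(0)
            term = t_k * math.log(y_k) + (1.0 - t_k) * math.log(1.0 - y_k)
            total -= cost[p][k] * term
    return total


# One malware sample with target {1, 0} and network output (0.9, 0.1).
outputs = [[0.9, 0.1]]
targets = [[1.0, 0.0]]

uniform = class_sensitive_cross_entropy(outputs, targets, [[1.0, 1.0], [1.0, 1.0]])
# Hypothetical costs: errors on malware samples count five times as much.
weighted = class_sensitive_cross_entropy(outputs, targets, [[5.0, 5.0], [1.0, 1.0]])
print(uniform, weighted)
```

With uniform costs the function reduces to an ordinary cross-entropy. Raising the costs in the row of the minority class increases the gradient contribution of those samples during back-propagation, which is the effect the class-sensitive mechanism relies on.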
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
Claims (6)
1. A malicious code detection system, comprising:
the decompilation module is used for obtaining a manifest file and a source code according to the application program to be detected;
the calling sub-graph analysis module is used for extracting a calling sub-graph feature set according to the source code;
the calling sub-graph analysis module comprises a program calling graph generation module and a sensitive API feature generation module;
the program call graph generation module constructs a function call graph FCG from the input source code, denoted G = {V, E} and composed of a set of nodes V and a set of edges E, and extracts all sensitive API nodes using the PScout tool; each node in V represents a function in the application program, including the sensitive APIs, which together form the sensitive API set; each edge in E represents a calling relationship between a caller and a callee;
the sensitive API feature generation module analyzes the FCG and extracts a sensitive function call sub-graph feature set, comprising: assigning a different malicious degree value to each sensitive API in the sensitive API set using a TF-IDF method to obtain the sensitive API malicious degree set, wherein the malicious degree of the i-th sensitive API is determined from the following quantities:
$Q_{total}$ represents the number of all benign software samples in the training set, $R_{total}$ represents the number of all malware samples in the training set, and the remaining two quantities are the number of benign software samples in the training set that call the i-th sensitive API and the number of malware samples in the training set that call the i-th sensitive API; the FCG corresponds to a set of sensitive function call sub-graphs, wherein each sensitive function call sub-graph is composed of a sensitive API and its adjacent nodes $u_1, \dots, u_k$; a distance function controls the extent of the sensitive function call sub-graph, wherein k is a user-defined control parameter that limits the number of neighbor nodes contained in the sensitive function call sub-graph; and computing the sensitive function call sub-graph feature set;
The component analysis module is used for extracting a component feature set according to the manifest file;
the component analysis module extracts a plurality of categories of component features from the manifest file, including: activity component features, service component features, content provider component features, and broadcast receiver component features;
extracting activity component features includes: querying the activity tag information (activity) from the manifest file, and taking all names in the manifest file that carry the activity tag information as the activity component features $T = \{t_1, \dots, t_{nT}\}$, where t represents an activity, which is responsible for presenting the user interface in the application;
extracting service component features includes: querying the service tag information (service) from the manifest file, and taking all names in the manifest file that carry the service tag information as the service component features $R = \{r_1, \dots, r_{nR}\}$, where r represents a service, which is responsible for background processing in the application;
extracting content provider component features includes: querying the content provider tag information (provider) from the manifest file, and taking all names in the manifest file that carry the content provider tag information as the content provider component features $P = \{p_1, \dots, p_{nP}\}$, where p represents a content provider, which is responsible for sharing structured data in the application;
extracting broadcast receiver component features includes: querying the broadcast receiver tag information (receiver) from the manifest file, and taking all names in the manifest file that carry the broadcast receiver tag information as the broadcast receiver component features $C = \{c_1, \dots, c_{nC}\}$, where c represents a broadcast receiver, which is responsible for providing the capability of receiving information in the application;
constructing a component feature set COM= { T, C, P, R } according to the plurality of component features;
the feature fusion module is used for fusing the calling sub-graph feature set and the component feature set into a feature matrix suitable for the deep learning model;
the malware detection module is used for performing detection on the feature matrix through a class-sensitive bidirectional independent recurrent neural network to obtain a malware detection result;
the class-sensitive bidirectional independent recurrent neural network in the malware detection module comprises: a bidirectional independent recurrent neural network layer, a fully connected network layer and a Sigmoid function output layer; the bidirectional independent recurrent neural network layer extracts information from the input feature matrix in the past and future directions at the same time and then splices the results; the fully connected network layer processes the output of the bidirectional independent recurrent neural network layer; the Sigmoid function output layer processes the output of the fully connected network layer through a Sigmoid function.
2. The system of claim 1, wherein the decompilation module inputs an APK file of the application to be detected, extracts the manifest file and the DEX file using a decompilation tool android and parses the source code from the DEX file.
3. The system of claim 1, wherein the feature fusion module merges the call sub-graph feature set and the component feature set, and uses the Word2vec tool to extract semantic feature vectors of all elements in the merged set by word embedding to form the feature matrix $FV = \{x_1, \dots, x_t, \dots, x_{nF}\}$, where $x_t$ is a semantic feature vector.
4. The system of claim 3, wherein the bi-directional independent recurrent neural network layer further comprises:
extracting features in the past-to-future direction as $\overrightarrow{h}_t = \sigma\left(\overrightarrow{W} x_t + \overrightarrow{u} \odot \overrightarrow{h}_{t-1} + \overrightarrow{b}\right)$, wherein the semantic feature vector $x_t$ is the current input of the network, $\overrightarrow{h}_{t-1}$ represents the previous hidden state of the network, $\overrightarrow{W}$ is the weight matrix of the past-to-future direction, $\overrightarrow{u}$ is the recurrent unit weight of the past-to-future direction, $\overrightarrow{b}$ is the bias of the past-to-future direction, $\odot$ denotes the Hadamard product, and $\sigma$ is the activation function;
extracting features in the future-to-past direction as $\overleftarrow{h}_t = \sigma\left(\overleftarrow{W} x_t + \overleftarrow{u} \odot \overleftarrow{h}_{t+1} + \overleftarrow{b}\right)$, wherein $\overleftarrow{h}_{t+1}$ represents the future hidden state of the network, $\overleftarrow{W}$ is the weight matrix of the future-to-past direction, $\overleftarrow{u}$ is the recurrent unit weight of the future-to-past direction, and $\overleftarrow{b}$ is the bias of the future-to-past direction;
splicing the features extracted in the two directions to obtain the output $h_{BInd}$ of the bidirectional independent recurrent neural network layer, wherein $W_f$ and $W_b$ are two different weight matrices;
the fully connected network layer outputs $h_{FCN} = W_{FCN} h_{BInd} + b_{FCN}$, wherein $W_{FCN}$ is the weight matrix of the fully connected layer neurons and $b_{FCN}$ is the bias of the fully connected layer neurons;
the Sigmoid function output layer outputs the classification result $y = \mathrm{Sigmoid}(h_{FCN})$ as the malware detection result, wherein one class represents malware and the other class represents benign software.
5. The system of claim 4, wherein the class-sensitive bidirectional independent recurrent neural network is obtained through multiple training iterations, the iterative training process being based on a loss function,
wherein $y_k$ is the actual output of the k-th class of the fully connected network; $t_k$ is the expected output of the k-th class of the fully connected network; $C$ denotes the cost term; $sam \in dataset$ denotes each sample in the training set; $p$ is the true label of the sample and $k$ is the predicted label of the sample.
6. A computer device, comprising:
a memory for storing instructions;
a processor for invoking execution of instructions stored in the memory to implement the system of any of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311599310.5A CN117574370B (en) | 2023-11-28 | 2023-11-28 | Malicious code detection system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117574370A (en) | 2024-02-20 |
CN117574370B (en) | 2024-05-31 |
Family
ID=89893471
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311599310.5A Active CN117574370B (en) | 2023-11-28 | 2023-11-28 | Malicious code detection system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117574370B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107180192A (en) * | 2017-05-09 | 2017-09-19 | 北京理工大学 | Android malicious application detection method and system based on multi-feature fusion |
US20200137083A1 (en) * | 2018-10-24 | 2020-04-30 | Nec Laboratories America, Inc. | Unknown malicious program behavior detection using a graph neural network |
US20200364334A1 (en) * | 2019-05-16 | 2020-11-19 | Cisco Technology, Inc. | Detection of malicious executable files using hierarchical models |
CN113935033A (en) * | 2021-09-13 | 2022-01-14 | 北京邮电大学 | Feature-fused malicious code family classification method and device and storage medium |
CN114817924A (en) * | 2022-05-19 | 2022-07-29 | 电子科技大学 | AST (AST) and cross-layer analysis based android malicious software detection method and system |
CN116305125A (en) * | 2023-03-09 | 2023-06-23 | 中国工商银行股份有限公司 | Malicious code detection method and device, electronic equipment and storage medium |
CN116611063A (en) * | 2023-05-15 | 2023-08-18 | 西北工业大学 | Graph convolution neural network malicious software detection method based on multi-feature fusion |
CN116702143A (en) * | 2023-06-15 | 2023-09-05 | 北京泛网互联科技有限责任公司 | Intelligent malicious software detection method based on API (application program interface) characteristics |
CN117113163A (en) * | 2023-06-15 | 2023-11-24 | 中国人民解放军空军工程大学 | Malicious code classification method based on bidirectional time domain convolution network and feature fusion |
Non-Patent Citations (1)
Title |
---|
杜建斌 等: "基于图神经网络的恶意软件分类方法", 互联网周刊, 15 September 2023 (2023-09-15), pages 93 - 95 * |
Also Published As
Publication number | Publication date |
---|---|
CN117574370B (en) | 2024-05-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||