CN116361797A - Malicious code detection method and system based on multi-source collaboration and behavior analysis - Google Patents
- Publication number
- CN116361797A (application number CN202310331389.7A)
- Authority
- CN
- China
- Prior art keywords
- malicious
- samples
- malicious code
- code detection
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/566—Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
- H04L63/145—Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/40—Network security protocols
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/03—Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
- G06F2221/033—Test or assess software
Abstract
The invention discloses a malicious code detection method and system based on multi-source collaboration and behavior analysis, and relates to the technical field of malware detection. The method comprises the following steps: respectively acquiring and labeling statically executable benign samples and malicious samples; executing the labeled benign and malicious samples in a sandbox; converting the resulting run-time sample data into three gray images through preprocessing; respectively extracting features from the three gray images, fusing them, and converting the result into a color image; training a neural network model on the color images to generate a malicious code detection model, and using that model to detect and classify malicious code. Because the method works on memory data, detection can be performed without stopping the target system, the authenticity of the data is guaranteed, and the failure modes in which obfuscated or encrypted data cannot be identified are avoided, thereby achieving accurate detection and classification of malicious code.
Description
Technical Field
The invention relates to the technical field of malicious software detection, in particular to a malicious code detection method and system based on multi-source collaboration and behavior analysis.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In the network age, protecting computer security so that daily work can proceed smoothly is indispensable. Traditional malicious code detection based on static files cannot effectively identify encrypted and packed malicious programs, because the data on disk is not the real malicious payload. Dynamic analysis can mitigate the encryption and packing problem to some extent, but it is easily deceived by kernel rootkits. Statistically, most newly appearing malicious code reuses core segments of existing malicious code, modified to produce more threatening variants. From the perspective of family characteristics, analyzing the behavioral features of malicious code families therefore provides a research direction for behavior-based detection of the growing volume of malicious code.
However, facing the diverse and enormous number of malicious code variants produced by network attack techniques, classifying and identifying malicious code families has become difficult. Existing malicious code detection techniques require stopping the target system during identification, so detection efficiency is low, and obfuscated or encrypted data often cannot be identified. Accurately and efficiently identifying and classifying malicious code families has therefore become an important line of defense protecting computers from malicious intrusion.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a malicious code detection method and system based on multi-source collaboration and behavior analysis that can detect memory data without stopping the target system. Because the data is taken from memory, its authenticity is guaranteed and the failure modes in which obfuscated or encrypted data cannot be identified are avoided, thereby achieving classification and accurate detection of malicious code.
In order to achieve the above object, the present invention is realized by the following technical scheme:
the first aspect of the invention provides a malicious code detection method based on multi-source collaboration and behavior analysis, which comprises the following steps:
respectively acquiring statically executable benign samples and malicious samples, and labeling the benign samples and malicious samples by type;
placing the labeled benign samples and malicious samples into a sandbox for execution, obtaining the run-time sample data and the API sequences of the benign samples and malicious samples;
converting the run-time sample data into three gray images through preprocessing;
respectively extracting features from the three gray images, fusing the extracted features, and converting the fused image into a color image;
building a neural network model, training it with the color images, selecting the model weights with the highest evaluation index to generate a malicious code detection model, and using the malicious code detection model to detect and classify malicious code.
Further, the specific steps of placing the labeled benign samples and malicious samples into a sandbox for execution are:
creating an isolated sandbox environment;
running the executable files of the samples in batches in the sandbox while dumping memory to obtain the run-time sample data;
extracting the API sequences from the sandbox analysis reports.
Further, the run-time sample data is converted through preprocessing into three gray images: a Markov image, an entropy graph, and a GIST feature image.
Further, the specific steps of converting the run-time sample data into a Markov image through preprocessing are:
converting the run-time sample data into decimal data;
constructing a byte frequency table in matrix form from the decimal data;
converting the byte frequency table into a byte probability table according to the matrix coordinate frequencies;
constructing the Markov image from the byte frequency table and the byte probability table.
Further, the specific steps of converting the run-time sample data into an entropy graph through preprocessing are:
decompiling the run-time sample data into operation behavior instructions;
dictionary-encoding the operation behavior instructions;
calculating Shannon entropy values from the encoded operation behavior instructions;
constructing a line graph from the Shannon entropy values and converting it into a gray image through normalization.
Further, decompiling the run-time sample data into operation behavior instructions specifically comprises: decompiling the memory-dumped binary file with a decompilation tool, filtering out other data, and retaining only the assembly instructions.
Further, the specific steps of converting the run-time sample data into a GIST feature image through preprocessing are:
normalizing the API sequences of the benign samples and malicious samples to obtain normalized feature vectors;
converting the normalized feature vectors into a gray image and then performing GIST feature extraction to obtain the GIST feature image.
Further, the specific steps of normalizing the API sequences of the benign samples and malicious samples are:
scanning the whole API sequences of the benign and malicious samples, constructing a dictionary encoding of the APIs, and converting the API sequences into numerical vectors; the converted sequences are then normalized.
Further, a neural network model is built, the neural network model is trained by utilizing a color image, and the specific steps of selecting the model weight with the highest evaluation index to generate a malicious code detection model are as follows:
dividing the color image into a training set and a testing set;
inputting the training set into a CSNN deep learning model, sharing weights through a twin neural network, and finally carrying out output classification through a softmax function;
and continuously training the model, storing the best training model, and testing by using the test set until the accuracy of the test set reaches a threshold value to obtain the malicious code detection model.
A second aspect of the present invention provides a malicious code detection system based on multi-source collaboration and behavior analysis, comprising:
a data acquisition module configured to acquire statically executable benign samples and malicious samples, respectively; labeling benign samples and malicious samples according to types;
the sandbox execution module, configured to place the labeled benign samples and malicious samples into the sandbox for execution, obtaining the run-time sample data and the API sequences of the benign samples and malicious samples;
the image conversion module, configured to convert the run-time sample data into three gray images through preprocessing;
the feature fusion module is configured to extract features of the three gray images respectively, fuse the extracted features and convert the fused images into color images;
the model training module is configured to build a neural network model, train the neural network model by utilizing the color image, select the model weight with the highest evaluation index to generate a malicious code detection model, and detect and classify the malicious code by adopting the malicious code detection model.
One or more of the above technical solutions have the following beneficial effects:
the invention discloses a malicious code detection method based on multi-source collaboration and behavior analysis. And decompiling the process data into a behavior instruction, filtering out data information, only retaining instruction information, and reconstructing the gray image. The three gray images are overlapped to form a color image, the data is preprocessed and trained through the MEG-CSNN model, finally a training result is output, and the weight parameters of the training model are saved for detecting and classifying malicious codes, so that efficient and accurate detection and classification of the malicious codes are realized. The invention can detect the memory data without stopping the operation of the target system, and can ensure the authenticity of the data due to the data in the memory, and the condition that the data confusion and the data encryption cannot be identified can not occur, thereby realizing the classification and the accurate detection of the malicious codes.
The invention also discloses a malicious code detection system based on multi-source collaboration and behavior analysis. Malicious data sets from multiple platforms and commonly used benign software are collected, the collected samples are executed in a virtual sandbox, a running snapshot (a mirror image of the running system) is obtained, and the process data and behavior data of the running system are extracted by memory forensics: the invocation behaviors of API interfaces such as process start and exit, library loading, system function calls, thread execution, service registration and startup, file operations, registry operations, and network connections. Because behavior is realized differently across platforms, a semantic mapping between behavior sequences and behavior functions is constructed, converting the memory behavior data of heterogeneous platforms into functional semantic mappings and forming a cross-platform semantic mapping model. In addition, a computer under a cloud platform is not only a physical host but also comprises multiple virtual hosts; the data extracted by this method can be analyzed across those virtual hosts, so virtual machine escape behavior can be effectively detected.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flowchart of a malicious code detection method based on multi-source collaboration and behavior analysis in a first embodiment of the present invention;
fig. 2 is a CSNN structure diagram of a CNN-based twin neural network in accordance with an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It should be noted that, in the embodiments of the present invention, related data such as a statically executable benign sample and a malicious sample is related, when the embodiments of the present invention are applied to specific products or technologies, user permission or consent is required to be obtained, and the collection, use and processing of related data are required to comply with related laws and regulations and standards of related countries and regions.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the present invention. As used herein, the singular is intended to include the plural unless the context clearly indicates otherwise; furthermore, the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
embodiment one:
the first embodiment of the invention provides a malicious code detection method based on multi-source collaboration and behavior analysis, which is shown in fig. 1 and comprises the following steps:
Step 2, placing the labeled benign samples and malicious samples into a sandbox for execution, obtaining the run-time sample data and the API sequences of the benign samples and malicious samples.
Step 2.1, creating an isolated sandboxed environment.
Installing a plurality of virtual systems in the sandbox environment, placing the marked benign samples and malicious samples into the virtual systems, and then saving a snapshot of the virtual systems.
Step 2.2, running the executable files of the samples in batches in the sandbox while dumping memory to obtain the run-time sample data. Because the existing tool procdump.exe cannot acquire the process memory data of a sample set in batches, the invention embeds custom code to acquire the process memory data in batches.
The method for obtaining the process memory in batches comprises the following steps:
step 2.2.1, saving the executable file name of the sample in a list.
Step 2.2.2, creating multiple threads that run cmd instructions via os.system to execute the collected data samples in bulk. Multithreading is needed because, once a sample run is started, the program enters a suspended state until the started sample is shut down.
Step 2.2.3, acquiring the system's process list by calling the psutil library.
Step 2.2.4, comparing the process list acquired from the system with the stored file name list to obtain the PIDs of the processes whose names match the sample file names.
Step 2.2.5, traversing the matched PIDs and dumping each process with procdump.exe.
Because an executable file does not load all of its data into memory at once while running, and considering the memory space of the server, the invention starts ten threads at a time during batch dumping and dumps each process three times at 12-minute intervals. This achieves efficient batch dumping of process memory data and yields the run-time sample data.
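A minimal sketch of the PID-matching logic of steps 2.2.3-2.2.4: in practice the process list would come from `psutil.process_iter()`, but here it is passed in as plain `(pid, name)` pairs so the matching stands on its own. The function name `match_sample_pids` is illustrative, not from the patent.

```python
def match_sample_pids(process_list, sample_names):
    """Return the PIDs of running processes whose name matches a sample file name.

    process_list: iterable of (pid, name) tuples. In practice these would come
    from psutil, e.g. [(p.pid, p.name()) for p in psutil.process_iter()].
    sample_names: the saved list of sample executable file names (step 2.2.1).
    """
    wanted = set(sample_names)
    return [pid for pid, name in process_list if name in wanted]
```

Each matched PID would then be handed to procdump.exe (step 2.2.5), e.g. via `os.system(f"procdump.exe -ma {pid} ...")`.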
Step 2.3, extracting the API sequences of the benign samples and malicious samples from the sandbox analysis reports. In this embodiment, the executable files are submitted to a Cuckoo sandbox to obtain analysis reports, and the called API behavior sequences are extracted from them.
Step 3, converting the run-time sample data into three gray images through preprocessing. The run-time data set is converted through data preprocessing into a Markov image, an entropy graph, and a GIST feature image, which are then combined into an RGB three-channel color image. Color images lend themselves more readily to deep learning and transfer learning, which is one of the reasons the invention converts to an RGB three-channel color image.
Step 3.1, converting the run-time sample data into a Markov image through preprocessing.
Step 3.1.1, converting the run-time sample data into decimal data.
The process data in the data set is binary data, i.e., byte-stream data. The invention treats every 8 binary bits as one byte, so each byte takes a value in the range 0x00-0xff (0-255). Accordingly, a 256×256 matrix is constructed and every coordinate point in the matrix is initialized to 0.
Step 3.1.2, constructing a byte frequency table in a matrix form by using decimal data.
The invention sets the sliding window to 2; that is, each pair of consecutive bytes addresses one matrix coordinate. For example, the window 0x00, 0x01 addresses the cell in the first row, second column, and each occurrence increments that cell by 1. The sliding window advances until a single process data sample has been fully traversed.
Step 3.1.3, converting the byte frequency table into a byte probability table according to the matrix coordinate frequencies.
The invention converts the byte frequency table into a byte probability table by row-normalizing the frequency of each matrix coordinate:

BPT_ij = BFT_ij / S_i

where BPT_ij is the probability at row i, column j, BFT_ij is the frequency at row i, column j, and S_i is the sum of the frequencies in row i. The byte frequency table is traversed to construct the byte probability table.
Step 3.1.4, constructing the Markov image from the byte frequency table and the byte probability table.
After the byte frequency table and the byte probability table are computed, the invention constructs the Markov image, the first gray image: MI_ij, the pixel value at row i, column j, is obtained by traversing the byte frequency table and the byte probability table and mapping each entry into the 0-255 gray range.
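The Markov-image construction of steps 3.1.1-3.1.4 can be sketched as follows. The final mapping of probabilities to gray values is an assumption (the patent gives the exact formula only as an image); here each probability is simply scaled to 0-255.

```python
import numpy as np

def markov_image(data: bytes) -> np.ndarray:
    """Build a 256x256 gray image from a byte stream.

    A sliding window of two consecutive bytes (a, b) indexes matrix cell
    (a, b): the byte frequency table (BFT) counts each pair, the byte
    probability table (BPT) row-normalizes it (BPT_ij = BFT_ij / S_i), and
    each probability is scaled into the 0-255 gray range (assumed mapping).
    """
    bft = np.zeros((256, 256), dtype=np.float64)
    for a, b in zip(data[:-1], data[1:]):  # sliding window of width 2
        bft[a, b] += 1.0
    row_sums = bft.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # empty rows stay all-zero
    bpt = bft / row_sums
    return (bpt * 255).astype(np.uint8)
```

For example, in the byte stream 00 01 00 01 00 the pairs (0x00, 0x01) and (0x01, 0x00) each occur with row probability 1, so cells (0, 1) and (1, 0) receive the maximum gray value.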
Step 3.2, converting the run-time sample data into an entropy graph through preprocessing.
Step 3.2.1, decompiling the run-time sample data into operation behavior instructions. Specifically, the memory-dumped binary file is decompiled with a decompilation tool, other data is filtered out, and only the assembly instructions are retained.
Step 3.2.2, dictionary-encoding the operation behavior instructions.
All assembly instructions in the data set are scanned and dictionary-encoded; there are 111 instructions in total, encoded for example as (MOV: 0; ADD: 1; ...). The instruction sequence of each sample is then scanned and encoded with this dictionary.
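The dictionary encoding of step 3.2.2 can be sketched as a first-seen mapping; the 111-opcode dictionary of the embodiment is built the same way, just over the whole data set.

```python
def dictionary_encode(instructions):
    """Assign each distinct opcode the next free integer code
    (e.g. MOV -> 0, ADD -> 1) and encode the instruction sequence with it."""
    codebook = {}
    encoded = []
    for op in instructions:
        if op not in codebook:
            codebook[op] = len(codebook)
        encoded.append(codebook[op])
    return encoded, codebook
```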
Step 3.2.3, calculating Shannon entropy values from the encoded operation behavior instructions.
This embodiment divides the encoded instructions into groups of 16 and then calculates the Shannon entropy of each group, with the group index on the abscissa and the entropy value on the ordinate. For the instruction sequence of this embodiment, the entropy is calculated as:

H(B_i) = - Σ_k P(b_ki) · log2 P(b_ki)

where B_i is the i-th group of instructions, b_ki is the k-th instruction of the i-th group, H(B_i) is the entropy value of the group, and P(b_ki) is the probability of the k-th instruction occurring in the i-th group.
Step 3.2.4, constructing a line graph from the Shannon entropy values and converting it into a gray image through normalization.
In this embodiment, a line graph is constructed from the per-group Shannon entropy values, the resulting plot is converted into a gray image, and a Python third-party library is then used to convert the variably sized line graph into a 256×256 gray image.
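The grouped entropy computation of step 3.2.3 can be sketched as below (group size 16 as in the embodiment; plotting the line graph and normalizing it to a 256×256 image are omitted):

```python
import math
from collections import Counter

def grouped_entropy(codes, group_size=16):
    """Shannon entropy H(B_i) = -sum_k P(b_ki) * log2(P(b_ki)) for each
    group of `group_size` consecutive encoded instructions."""
    entropies = []
    for i in range(0, len(codes), group_size):
        group = codes[i:i + group_size]
        n = len(group)
        entropies.append(
            -sum((c / n) * math.log2(c / n) for c in Counter(group).values())
        )
    return entropies
```

A group of 16 identical instructions yields entropy 0, while 16 distinct instructions yield the maximum value log2(16) = 4.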
Step 3.3, converting the run-time sample data into a GIST feature image through preprocessing.
Step 3.3.1, normalizing the API sequences of the benign samples and malicious samples to obtain normalized feature vectors.
The whole API sequences of the benign and malicious samples are scanned, a dictionary encoding is built over the distinct APIs, and the API sequences are converted into numerical vectors; the converted values are then normalized to the range 0-255:

S(x) = (x / API_size) × 255

where S(x) is the normalized value, x is the dictionary-encoded value, and API_size is the total number of distinct APIs.
And 3.3.2, converting the normalized feature vector into a gray image with a fixed width of 256, performing GIST feature extraction, and scaling the result into a 256×256 GIST feature gray image through an image scaling technique. GIST is a global feature descriptor that captures the overall characteristics of an image well.
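The dictionary encoding and 0–255 normalization of step 3.3.1 can be sketched as follows. The API names and helper names are illustrative, and the mapping S(x) = x / API_size × 255 (with rounding to integer pixel values) is an assumption inferred from the description:

```python
def dictionary_encode(api_sequence):
    """Assign each distinct API name an integer code in order of first
    appearance, and encode the sequence (step 3.3.1, illustrative)."""
    vocab = {}
    for api in api_sequence:
        vocab.setdefault(api, len(vocab))
    return [vocab[a] for a in api_sequence], len(vocab)

def normalize(encoded, api_size):
    """Map dictionary codes into the pixel range 0-255 via
    S(x) = x / API_size * 255 (rounding is an assumption)."""
    return [round(x / api_size * 255) for x in encoded]

seq = ["CreateFileW", "ReadFile", "CreateFileW", "CloseHandle"]
encoded, api_size = dictionary_encode(seq)
pixels = normalize(encoded, api_size)
print(pixels)  # → [0, 85, 0, 170]
```

The resulting pixel list is then reshaped into a gray image of fixed width 256 before GIST extraction, as step 3.3.2 describes.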
And 4, respectively extracting the features of the three gray images, fusing the extracted features, and converting the fused images into color images.
Because the images obtained in step 3.1, step 3.2 and step 3.3 are all 256×256, the invention superimposes them: the single-channel data of each gray image is extracted, and the three gray images are used as the three channels of a color image through the merge function of OpenCV in python, thereby generating the color image. The three-channel color image is used for training and testing the neural network model.
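A minimal sketch of the channel merge: numpy's `dstack` is used here in place of OpenCV's `cv2.merge` so the example is self-contained, and random arrays stand in for the three real feature images:

```python
import numpy as np

# Three 256x256 single-channel images from steps 3.1-3.3 (random stand-ins)
rng = np.random.default_rng(0)
markov = rng.integers(0, 256, (256, 256), dtype=np.uint8)
entropy = rng.integers(0, 256, (256, 256), dtype=np.uint8)
gist = rng.integers(0, 256, (256, 256), dtype=np.uint8)

# Stacking along the last axis is equivalent to cv2.merge([markov, entropy, gist]):
# each gray image becomes one channel of the color image.
color = np.dstack([markov, entropy, gist])
print(color.shape)  # → (256, 256, 3)
```

The channel order (which feature image maps to which color channel) is not specified in the text and is an arbitrary choice here.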
And 5, building a neural network model, training the neural network model by utilizing the color images, selecting the model weight with the highest evaluation index to generate a malicious code detection model, and detecting and classifying malicious code by adopting the malicious code detection model. This embodiment builds a CNN-based twin (Siamese) neural network model which, as shown in fig. 2, comprises convolution layers, pooling layers, ReLU nonlinear activation functions and a full connection layer, and finally realizes classification prediction through a softmax function.
And 5.1, building a neural network model.
And 5.1.1, setting specific parameters.
Since the color image size of this embodiment is fixed at 256×256, the parameters of the CSNN model of the present invention also take fixed values: the two convolution kernels of the first CNN layer have size 4, the convolution kernel of the second layer has size 5, and the parameters of the pooling layers are all 8.
And 5.1.2, verifying the classification index.
For the detection of malicious code, the present embodiment uses four classification evaluation indexes: accuracy (Accuracy), precision (Precision), recall (Recall), and the composite score F-measure, which reflects precision and recall simultaneously. They are defined from the confusion matrix as:

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F-measure = 2 × Precision × Recall / (Precision + Recall)

Wherein TP represents the number of truly malicious samples predicted as malicious samples; FP represents the number of truly benign samples predicted as malicious samples; TN represents the number of truly benign samples predicted as benign samples; and FN represents the number of truly malicious samples predicted as benign samples.
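Under the standard definitions of these four indexes, the computation can be sketched as follows (the function name is illustrative):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the four evaluation indexes of step 5.1.2 from the
    confusion-matrix counts TP, FP, TN, FN."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Example: 80 true positives, 10 false positives, 90 true negatives,
# 20 false negatives
print(classification_metrics(80, 10, 90, 20))
```

With these counts, accuracy is 0.85, recall is 0.8, precision is 8/9, and the F-measure works out to 16/19.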
And 5.2, training the neural network model by utilizing the color image, and selecting the model weight with the highest evaluation index to generate a malicious code detection model.
Step 5.2.1, dividing the color image into a training set and a test set.
And 5.2.2, inputting the training set into a CSNN deep learning model, sharing weights through a twin neural network, and finally carrying out output classification through a softmax function.
And 5.2.3, continuously training the model, storing the best training model, and testing by using the test set until the accuracy of the test set reaches a threshold value to obtain the malicious code detection model.
In this embodiment, steps 1 to 5 are described in detail by taking as an example a malicious sample A1 (stored as the hash value of the sample) collected from a malicious sample website and a benign sample B1 obtained by downloading genuine copies of common software, such as office, video, audio and game software, through a browser.
In step 1, the hash value of the malicious sample is input to the VirusTotal API interface, which returns the type of malicious code the sample contains. If the malicious code type is not returned, the malicious sample can be submitted directly to the VirusTotal platform for detection, and the malicious type in the detection report is used as the label. The file name of the malicious sample is changed to the malicious label in the detection report, and the labeled sample is denoted A2. The file names of the benign samples are labeled (benign 1, benign2, … benignN), and the benign sample set is denoted B2.
In step 2, A2 and B2 are run in a sandbox and the data files in the memory are dumped.
The collected executable malicious samples are placed into the corresponding operating system, for example .exe files under a Windows operating system and .elf files under a Linux operating system.
The plug-in for acquiring memory data, as improved by the invention, is called to execute the malicious samples A2 in batches, and the memory data of the executed samples is then dumped and denoted A3.
The dump operation is repeated at intervals of 12 minutes, dumping the memory data three times in total. This is because an executable file does not load all of its data into memory at run time, and multiple dumps extract as much useful information as possible.
Sample B2 is run and dumped memory data in the same manner as described above, denoted B3.
And acquiring a report of each executed sample file through the Cuckoo sandbox host, extracting the called API sequence, and denoting the API sequences executed by the malicious samples and by the benign samples as C1 and D1, respectively.
In step 3, A3, B3, C1 and D1 are preprocessed and converted into an RGB three-channel color image.
(1) Markov image
A3 and B3 are dumped memory data in binary file format. The invention converts each eight-bit binary number into a decimal number; eight bits are chosen because their value range of 0x00-0xff is consistent with the value range of image color code values.
And (3) grouping the converted decimal numbers in pairs through a sliding window to construct two-dimensional coordinates. A 256×256 matrix is constructed and initialized so that all its values are 0; the constructed two-dimensional coordinate data points are traversed, and each time a coordinate appears, the corresponding matrix value is increased by 1 until all coordinates are traversed, thereby constructing the byte frequency table. As shown in table 1:
table 1 byte frequency table
0x00 | 0x01 | 0x02 | 0x03 | … | 0xff | |
0x00 | 2 | 1 | 2 | 0 | … | 0 |
0x01 | 25 | 5 | 0 | 50 | … | 20 |
0x02 | 10 | 12 | 24 | 0 | … | 4 |
0x03 | 0 | 2 | 0 | 0 | … | 3 |
… | … | … | … | … | … | … |
0xff | 0 | 0 | 0 | 0 | … | 0 |
The byte frequency table is converted into a byte probability table: the sum of the frequencies in each row is calculated first, and each value in the row is then divided by the row sum to obtain the proportion it occupies, thereby obtaining the byte probability table. As shown in table 2.
Table 2 byte probability table
0x00 | 0x01 | 0x02 | 0x03 | … | 0xff | |
0x00 | 0.4 | 0.2 | 0.4 | 0 | … | 0 |
0x01 | 0.25 | 0.05 | 0.0 | 0.5 | … | 0.2 |
0x02 | 0.2 | 0.24 | 0.48 | 0 | … | 0.08 |
0x03 | 0 | 0.4 | 0 | 0 | … | 0.6 |
… | … | … | … | … | … | … |
0xff | 0 | 0 | 0 | 0 | … | 0 |
And the Markov image is obtained from the byte frequency table and the byte probability table according to the Markov image calculation formula.
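Assuming the sliding window advances one byte at a time over adjacent byte pairs — the usual construction for Markov images, though the step size is not stated explicitly — the two tables can be sketched as:

```python
def byte_probability_table(data: bytes):
    """Build the 256x256 byte frequency table via a width-2 sliding window
    over adjacent bytes, then divide each row by its sum to obtain the byte
    probability (transition) table."""
    freq = [[0] * 256 for _ in range(256)]
    for a, b in zip(data, data[1:]):   # sliding window of adjacent byte pairs
        freq[a][b] += 1
    prob = [[0.0] * 256 for _ in range(256)]
    for i, row in enumerate(freq):
        total = sum(row)
        if total:                      # rows with no observations stay all-zero
            prob[i] = [c / total for c in row]
    return freq, prob

freq, prob = byte_probability_table(b"\x00\x01\x00\x02")
print(freq[0][1], freq[1][0], prob[0][1])  # → 1 1 0.5
```

Each probability row then supplies the pixel values of the corresponding image row, matching the row-normalization shown in tables 1 and 2.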
(2) Entropy diagram
The memory binary files A3 and B3 are decompiled into assembly instructions through a decompilation tool, and the decompiled assembly instructions are dictionary-encoded to construct numerical vectors.
The 16 instructions are divided into a group, and entropy values are calculated, so that a line graph of the entropy values is constructed.
The entropy line graph is converted into a gray image of uniform size through a third-party python library; the invention converts it into a 256×256 gray image.
(3) GIST feature image
Dictionary encoding the API sequence of C1, D1, and converting the API sequence into a numerical vector.
The numerical vector is normalized to 0-255, the vector width is fixed to 256 and converted into a gray image, feature processing is performed through GIST, and the image is finally scaled into a gray image with a size of 256×256.
In step 4, since the three gray-scale images obtained are 256×256 images, this embodiment superimposes the three gray-scale images and converts the superimposed images into a three-channel color image, which is denoted as M.
In step 5, the preprocessed sample M is divided into a training set M1 and a test set M2 in a ratio of 8:2.
M1 is input into the CSNN deep learning model; the left-half and right-half CNN-based neural network architectures in FIG. 2 share weights through the twin neural network, and output classification is finally performed through a softmax function.
The model is continuously trained, the best training model is stored, and testing with the test set continues until the test set reaches an accuracy rate of more than 85%.
And saving the trained model for classifying and detecting malicious codes of the data to be detected.
Embodiment two:
the second embodiment of the invention provides a malicious code detection system based on multi-source collaboration and behavior analysis, which comprises:
a data acquisition module configured to acquire statically executable benign samples and malicious samples, respectively; labeling benign samples and malicious samples according to types;
the sandbox operation module is configured to put the marked benign samples and malicious samples into the sandbox for operation, so as to obtain the operated sample data and API sequences of the benign samples and the malicious samples;
the image conversion module is configured to convert the operated sample data into three gray images through preprocessing operation;
the feature fusion module is configured to extract features of the three gray images respectively, fuse the extracted features and convert the fused images into color images;
the model training module is configured to build a neural network model, train the neural network model by utilizing the color image, select the model weight with the highest evaluation index to generate a malicious code detection model, and detect and classify the malicious code by adopting the malicious code detection model.
In the sandbox operation module, the marked benign sample and malicious sample are put into the sandbox to be operated, and the specific steps of obtaining the operated sample data and API sequences of the benign sample and the malicious sample are as follows:
a) An isolated sandbox environment is created. In this embodiment, a plurality of virtual systems including Windows 7, Windows 10, Ubuntu and CentOS are installed in the sandbox environment; files with the suffix .exe are placed under Windows 7 and 10, .elf files are placed under the Ubuntu and CentOS virtual systems, and a snapshot of each virtual system is then saved.
b) The executable files of the samples are run in batches using the sandbox, and the running sample data is obtained by dumping at the same time. Because the existing plug-in procdump.exe cannot acquire the process memory data of a sample set in batches, the invention embeds written code to acquire process memory data in batches. Specifically:
the executable file name of the sample is saved in a list.
Multithreading is created to run cmd instructions through os.system so as to run the collected data samples in batches; multithreading is needed because, upon starting a sample run, the program of the present invention enters a suspended state until the started sample is shut down.
And acquiring a process list of the system by calling the psutil library.
The process list acquired from the system is compared with the stored file name list to obtain the pid list of the system processes whose names are consistent with the file names.
The acquired pids are traversed and each matching process is dumped using procdump.
Because the executable file does not load all of its data into memory at run time, and considering the memory space of the server, the invention starts ten threads at a time during batch dumping and dumps once every 12 minutes, three times in total. Efficient batch dumping of process memory data is thereby realized, and the running sample data is obtained.
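The name-matching step between the running process list and the saved sample file names can be sketched as follows. The helper name, the process tuples, and the dump invocation are illustrative; in the real flow the process list would come from psutil and the dump from procdump:

```python
def match_pids(processes, sample_names):
    """Return the pids of running processes whose name matches one of the
    saved sample file names (the comparison step before dumping)."""
    wanted = {name.lower() for name in sample_names}
    return [pid for pid, name in processes if name.lower() in wanted]

# A fake psutil-style process list; in the real flow this would come from
# [(p.pid, p.name()) for p in psutil.process_iter()]
processes = [(100, "explorer.exe"), (200, "sample1.exe"), (300, "sample2.exe")]
pids = match_pids(processes, ["sample1.exe", "sample2.exe"])
print(pids)  # → [200, 300]

# Each pid would then be dumped (Windows-only, shown for illustration), e.g.:
#   subprocess.run(["procdump.exe", "-ma", str(pid), out_path])
```

Keeping the matching logic as a pure function makes it easy to test independently of the sandbox host.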
c) API sequences of benign samples and malicious samples are extracted through sandboxed analysis reports.
A Cuckoo sandbox host is configured to realize link access to Windows 7, 10, Ubuntu and CentOS. Executable files are submitted through the Cuckoo sandbox to obtain analysis reports, and the called API behavior sequences are extracted.
The steps involved in the second embodiment correspond to those of the first embodiment of the method, and the detailed description of the second embodiment can be found in the related description section of the first embodiment.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.
Claims (10)
1. The malicious code detection method based on multi-source collaboration and behavior analysis is characterized by comprising the following steps of:
respectively acquiring a static executable benign sample and a malicious sample; labeling benign samples and malicious samples according to types;
placing the marked benign samples and malicious samples into a sandbox for operation to obtain operated sample data and API sequences of the benign samples and the malicious samples;
converting the operated sample data into three gray images through preprocessing operation;
respectively extracting the features of the three gray images, fusing the extracted features, and converting the fused images into color images;
building a neural network model, training the neural network model by utilizing a color image, selecting the model weight with the highest evaluation index to generate a malicious code detection model, and detecting and classifying the malicious code by adopting the malicious code detection model.
2. The malicious code detection method based on multi-source collaboration and behavior analysis according to claim 1, wherein the specific steps of placing the marked benign samples and malicious samples into a sandbox for operation are as follows:
creating an isolated sandboxed environment;
the method comprises the steps of running executable files of samples in batches by utilizing a sandbox, and simultaneously dumping to obtain running sample data;
the API sequence is extracted by sandboxed analysis of the report.
3. The malicious code detection method based on multi-source collaboration and behavior analysis according to claim 1, wherein the running sample data is converted into three gray scale images of markov images, entropy diagrams and GIST feature images through preprocessing operation.
4. The malicious code detection method based on multi-source collaboration and behavior analysis according to claim 3, wherein the specific step of converting the running sample data into a markov image through a preprocessing operation is as follows:
converting the sample data after operation into decimal data;
constructing a byte frequency table in a matrix form by using decimal data;
converting the byte frequency table into a byte probability table according to the matrix coordinate frequency;
a markov image is constructed from the byte frequency table and the byte probability table.
5. The malicious code detection method based on multi-source collaboration and behavior analysis of claim 3, wherein the specific steps of converting the running sample data into an entropy diagram through preprocessing operation are as follows:
decompiling the operated sample data into an operation behavior instruction;
dictionary coding is carried out on the operation behavior instruction;
calculating a shannon entropy value according to the encoded operation behavior instruction;
and constructing a line graph according to the shannon entropy value, and converting the line graph into a gray image through normalization.
6. The malicious code detection method based on multi-source collaboration and behavior analysis according to claim 5, wherein the step of decompiling the running sample data into operational behavior instructions is as follows: the decompilation tool decompiles the binary file of the memory dump, filters other data, and only retains the assembly instructions.
7. The malicious code detection method based on multi-source collaboration and behavior analysis according to claim 3, wherein the specific steps of converting the running sample data into GIST feature images through preprocessing operation are as follows:
normalizing API sequences of the benign samples and the malicious samples to obtain normalized feature vectors;
and converting the normalized feature vector into a gray level image, and then carrying out feature extraction of the GIST to obtain the GIST feature image.
8. The method for detecting malicious code based on multi-source collaboration and behavior analysis according to claim 7, wherein the specific step of normalizing the API sequences of the benign samples and the malicious samples is:
scanning the whole API sequences of the benign samples and the malicious samples, constructing dictionary coding formats of the API sequences of the benign samples and the malicious samples, and converting the API sequences of the benign samples and the malicious samples into numerical vectors; the transformed sequences were then normalized.
9. The malicious code detection method based on multi-source collaboration and behavior analysis according to claim 1, wherein the specific steps of constructing a neural network model, training the neural network model by using a color image, and selecting the model weight with the highest evaluation index to generate the malicious code detection model are as follows:
dividing the color image into a training set and a testing set;
inputting the training set into a CSNN deep learning model, sharing weights through a twin neural network, and finally carrying out output classification through a softmax function;
and continuously training the model, storing the best training model, and testing by using the test set until the accuracy of the test set reaches a threshold value to obtain the malicious code detection model.
10. A malicious code detection system based on multi-source collaboration and behavioral analysis, comprising:
a data acquisition module configured to acquire statically executable benign samples and malicious samples, respectively; labeling benign samples and malicious samples according to types;
the sandbox operation module is configured to put the marked benign samples and malicious samples into the sandbox for operation, so as to obtain the operated sample data and API sequences of the benign samples and the malicious samples;
the image conversion module is configured to convert the operated sample data into three gray images through preprocessing operation;
the feature fusion module is configured to extract features of the three gray images respectively, fuse the extracted features and convert the fused images into color images;
the model training module is configured to build a neural network model, train the neural network model by utilizing the color image, select the model weight with the highest evaluation index to generate a malicious code detection model, and detect and classify the malicious code by adopting the malicious code detection model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310331389.7A CN116361797A (en) | 2023-03-28 | 2023-03-28 | Malicious code detection method and system based on multi-source collaboration and behavior analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116361797A true CN116361797A (en) | 2023-06-30 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116663019A (en) * | 2023-07-06 | 2023-08-29 | 华中科技大学 | Source code vulnerability detection method, device and system |
CN116663019B (en) * | 2023-07-06 | 2023-10-24 | 华中科技大学 | Source code vulnerability detection method, device and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||