CN116361797A - Malicious code detection method and system based on multi-source collaboration and behavior analysis


Publication number
CN116361797A
Authority
CN
China
Prior art keywords
malicious
samples
malicious code
code detection
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310331389.7A
Other languages
Chinese (zh)
Inventor
张淑慧
胡长栋
王连海
徐淑奖
周瑞瑶
兰田
于菲菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Qilu University of Technology
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology, Shandong Computer Science Center National Super Computing Center in Jinan filed Critical Qilu University of Technology
Priority to CN202310331389.7A priority Critical patent/CN116361797A/en
Publication of CN116361797A publication Critical patent/CN116361797A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/145 Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/40 Network security protocols
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software

Abstract

The invention discloses a malicious code detection method and system based on multi-source collaboration and behavior analysis, and relates to the technical field of malicious software detection. The method comprises the following steps: acquiring static executable benign samples and static executable malicious samples and labeling them; running the labeled benign and malicious samples in a sandbox; converting the post-execution sample data into three grayscale images through preprocessing; extracting features from each of the three grayscale images, fusing the extracted features, and converting the result into a color image; and training a neural network model on the color images to generate a malicious code detection model, which is then used to detect and classify malicious code. Because the method works on memory data, detection can be performed without stopping the target system, the authenticity of the data is ensured, and obfuscated or encrypted data no longer goes unrecognized, so that malicious code is classified and detected accurately.

Description

Malicious code detection method and system based on multi-source collaboration and behavior analysis
Technical Field
The invention relates to the technical field of malicious software detection, in particular to a malicious code detection method and system based on multi-source collaboration and behavior analysis.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In the network age, protecting computers and keeping daily work running smoothly is indispensable. In traditional malicious code detection based on static files, an encrypted or packed malicious program does not expose its real malicious data, so it cannot be effectively identified as malicious. Dynamic analysis can solve the encryption and packing problem to some extent, but it is easily deceived by kernel rootkits. Statistically, most newly appearing malicious code reuses core segments of existing malicious code, which are modified to produce more threatening variants. From the perspective of family characteristics, analyzing the behavioral features of malicious code families provides a research approach for behavior-based detection of the ever-growing volume of malicious code.
However, given the wide variety of attack techniques and the huge number of malicious code variants on the network, classifying and identifying malicious code families has become difficult. Existing malicious code detection techniques need to stop the target system while identifying malicious code, so detection efficiency is low, and obfuscated or encrypted data often cannot be identified. Accurately and efficiently identifying and classifying malicious code families has therefore become an important line of defense for protecting computers from malicious intrusion.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a malicious code detection method and system based on multi-source collaboration and behavior analysis that detects memory data without stopping the target system. Because the data is taken from memory, its authenticity is ensured and obfuscated or encrypted data no longer goes unrecognized, so that malicious code is classified and detected accurately.
In order to achieve the above object, the present invention is realized by the following technical scheme:
the first aspect of the invention provides a malicious code detection method based on multi-source collaboration and behavior analysis, which comprises the following steps:
acquiring static executable benign samples and malicious samples, and labeling the benign samples and malicious samples by type;
running the labeled benign samples and malicious samples in a sandbox to obtain the post-execution sample data and the API sequences of the benign samples and the malicious samples;
converting the post-execution sample data into three grayscale images through preprocessing;
extracting features from each of the three grayscale images, fusing the extracted features, and converting the fused result into a color image;
building a neural network model, training it on the color images, selecting the model weights with the highest evaluation index to generate a malicious code detection model, and using the malicious code detection model to detect and classify malicious code.
Further, the specific steps of running the labeled benign samples and malicious samples in the sandbox are as follows:
creating an isolated sandbox environment;
running the executable files of the samples in batches in the sandbox while dumping memory to obtain the post-execution sample data;
extracting the API sequences from the sandbox analysis reports.
Further, the post-execution sample data is converted through preprocessing into three grayscale images: a Markov image, an entropy map and a GIST feature image.
Further, the specific steps of converting the running sample data into the Markov image through the preprocessing operation are as follows:
converting the sample data after operation into decimal data;
constructing a byte frequency table in a matrix form by using decimal data;
converting the byte frequency table into a byte probability table according to the matrix coordinate frequency;
a markov image is constructed from the byte frequency table and the byte probability table.
Further, the specific steps of converting the post-execution sample data into an entropy map through preprocessing are as follows:
decompiling the operated sample data into an operation behavior instruction;
dictionary coding is carried out on the operation behavior instruction;
calculating a shannon entropy value according to the encoded operation behavior instruction;
and constructing a line graph according to the shannon entropy value, and converting the line graph into a gray image through normalization.
Further, decompiling the post-execution sample data into operation behavior instructions specifically comprises: using a decompilation tool to decompile the binary file of the memory dump, filtering out other data, and retaining only the assembly instructions.
Further, the specific steps of converting the running sample data into GIST feature images through preprocessing operation are as follows:
normalizing API sequences of the benign samples and the malicious samples to obtain normalized feature vectors;
and converting the normalized feature vector into a gray level image, and then carrying out feature extraction of the GIST to obtain the GIST feature image.
Further, the specific steps of normalizing the API sequences of the benign samples and the malicious samples are as follows:
scanning the complete API sequences of the benign samples and the malicious samples, constructing a dictionary coding of the API calls, and converting the API sequences into numerical vectors; and then normalizing the converted sequences.
Further, the specific steps of building a neural network model, training it on the color images, and selecting the model weights with the highest evaluation index to generate the malicious code detection model are as follows:
dividing the color images into a training set and a test set;
inputting the training set into the CSNN deep learning model, in which the branches of the twin (Siamese) neural network share weights, and finally performing output classification through a softmax function;
training the model continuously, saving the best trained model, and testing it on the test set until the test set accuracy reaches a threshold, thereby obtaining the malicious code detection model.
A second aspect of the present invention provides a malicious code detection system based on multi-source collaboration and behavior analysis, comprising:
a data acquisition module configured to acquire statically executable benign samples and malicious samples, respectively; labeling benign samples and malicious samples according to types;
the sandbox operation module is configured to put the marked benign samples and malicious samples into the sandbox for operation, so as to obtain the operated sample data and API sequences of the benign samples and the malicious samples;
the image conversion module is configured to convert the operated sample data into three gray images through preprocessing operation;
the feature fusion module is configured to extract features of the three gray images respectively, fuse the extracted features and convert the fused images into color images;
the model training module is configured to build a neural network model, train the neural network model by utilizing the color image, select the model weight with the highest evaluation index to generate a malicious code detection model, and detect and classify the malicious code by adopting the malicious code detection model.
One or more of the above technical solutions have the following beneficial effects:
The invention discloses a malicious code detection method based on multi-source collaboration and behavior analysis. The process data is decompiled into behavior instructions, the data information is filtered out so that only the instruction information is retained, and a grayscale image is reconstructed from it. The three grayscale images are superimposed to form a color image, the data is preprocessed and used to train the MEG-CSNN model, the training result is output, and the weight parameters of the trained model are saved for detecting and classifying malicious code, so that malicious code is detected and classified efficiently and accurately. The invention detects memory data without stopping the target system; because the data is taken from memory, its authenticity is ensured and obfuscated or encrypted data no longer goes unrecognized, so that malicious code is classified and detected accurately.
The invention also discloses a malicious code detection system based on multi-source collaboration and behavior analysis. Malicious datasets from multiple platforms and commonly used benign software are collected and executed in a virtual sandbox, a running snapshot, that is, an image of the running system, is obtained, and the process data and behavior data of the running system are extracted by memory forensics. The behavior data covers API calls such as starting and exiting processes, loading library files, calling system functions, running threads, registering and starting services, file operations, registry operations and network connections. Because behaviors are implemented differently on different platforms, a semantic mapping between behavior sequences and behavior functions is constructed, and the memory behavior data of heterogeneous platforms is converted into functional semantics, forming a cross-platform semantic mapping model. In addition, a computer on a cloud platform is not just one physical host but also comprises a plurality of virtual hosts; the data extracted by the method can be analyzed across these virtual hosts, so that virtual machine escape behavior can be effectively detected.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flowchart of a malicious code detection method based on multi-source collaboration and behavior analysis in a first embodiment of the present invention;
FIG. 2 is a structure diagram of the CSNN, a CNN-based twin (Siamese) neural network, in an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It should be noted that the embodiments of the present invention involve data such as static executable benign samples and malicious samples. When the embodiments are applied to specific products or technologies, user permission or consent must be obtained, and the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiment One:
The first embodiment of the invention provides a malicious code detection method based on multi-source collaboration and behavior analysis which, as shown in FIG. 1, comprises the following steps:
Step 1, acquiring static executable benign samples and malicious samples, and labeling them by type. In this embodiment, malicious static samples for multiple platforms are collected from the malware sample websites MalwareBazaar, VirusShare and VX Heaven, and the hash values of the samples are stored. Each hash value is submitted through the VirusTotal API (Application Programming Interface) and commonly used antivirus detection platforms to obtain the malicious code type of the sample; if the hash value is not found, the sample itself is submitted to the VirusTotal API to obtain the malicious code type (such as worm viruses, levoviruses, spam advertisements and the like). The sample is then labeled accordingly. Finally, commonly used legitimate software for multiple platforms is downloaded as benign static samples, which are uniformly labeled as benign.
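The hash bookkeeping of step 1 can be illustrated with a short sketch. It is not the code of the embodiment: the choice of SHA-256 is an assumption (the text does not say which hash function is stored), the names sha256_of_file, build_label_table and labels_by_hash are hypothetical, and the lookup against VirusTotal is not shown.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_label_table(sample_dir, labels_by_hash, default="unlabeled"):
    """Map each sample file name to a (hash, label) pair.

    labels_by_hash is a hypothetical dict of hash -> type label, for example
    filled from an antivirus lookup; benign software would carry the label
    "benign". Hashes without an entry receive the default label.
    """
    table = {}
    for path in sorted(Path(sample_dir).iterdir()):
        if path.is_file():
            file_hash = sha256_of_file(path)
            table[path.name] = (file_hash, labels_by_hash.get(file_hash, default))
    return table
```

Keeping the hash next to the label makes it possible to look a sample up again later without re-uploading the file.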
Step 2, running the labeled benign samples and malicious samples in a sandbox to obtain the post-execution sample data and the API sequences of the benign samples and the malicious samples.
Step 2.1, creating an isolated sandbox environment.
A plurality of virtual systems are installed in the sandbox environment, the labeled benign samples and malicious samples are placed into the virtual systems, and a snapshot of each virtual system is then saved.
Step 2.2, running the executable files of the samples in batches in the sandbox while dumping memory to obtain the post-execution sample data. Because the existing tool procdump.exe cannot by itself acquire the process memory data of a sample set in batches, the invention calls it from custom code to acquire the process memory data in batches.
The method for dumping the processes in batches comprises the following steps:
Step 2.2.1, saving the executable file names of the samples in a list.
Step 2.2.2, creating multiple threads that run cmd instructions through os.system to launch the collected samples in batches. Multiple threads are needed because, when a sample is launched, the calling program is suspended until the launched sample exits.
Step 2.2.3, acquiring the process list of the system by calling the psutil library.
Step 2.2.4, comparing the process list of the system with the stored file name list to obtain the list of pids of the system processes whose names match the file names.
Step 2.2.5, traversing the obtained pid list and dumping each process with procdump.exe.
Because an executable file does not load all of its data into memory at once when it runs, the invention dumps each batch three times, once every 12 minutes. Considering the memory space of the server, ten threads are started at a time. Efficient batch dumping of process memory data is thus achieved, and the post-execution sample data is obtained.
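Steps 2.2.1 to 2.2.5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the embodiment: the helper names are hypothetical, subprocess.run is used in place of os.system, the process table is passed in as plain (pid, name) pairs (in practice it could be collected with psutil.process_iter), and the argument layout for procdump.exe (-ma for a full dump, then the pid and an output file) is an assumption about how the tool would be called. The schedule of three dumps and the limit of ten threads are not modeled.

```python
import subprocess
import threading


def match_pids(process_table, sample_names):
    """Return the pids whose process name matches a stored sample file name.

    process_table is an iterable of (pid, name) pairs; names are compared
    case-insensitively because Windows file names are case-insensitive.
    """
    wanted = {name.lower() for name in sample_names}
    return [pid for pid, name in process_table if name.lower() in wanted]


def launch_in_threads(commands, runner=subprocess.run):
    """Start each command in its own thread and return the started threads.

    A separate thread per sample keeps the controller from blocking until
    the launched sample exits.
    """
    threads = []
    for command in commands:
        thread = threading.Thread(target=runner, args=(command,), daemon=True)
        thread.start()
        threads.append(thread)
    return threads


def dump_commands(pids, out_dir, tool="procdump.exe"):
    """Build one memory-dump command per pid (argument layout is assumed)."""
    return [[tool, "-ma", str(pid), f"{out_dir}\\{pid}.dmp"] for pid in pids]
```

Passing the runner as a parameter lets the launch logic be exercised without starting real processes, which is useful when the samples are live malware.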
Step 2.3, extracting the API sequences of the benign samples and the malicious samples from the sandbox analysis reports. In this embodiment, each executable file is submitted to a Cuckoo sandbox to obtain an analysis report, and the sequence of API calls recorded in the report is extracted.
Step 3, converting the post-execution sample data into three grayscale images through preprocessing. The post-execution dataset is converted through data preprocessing into three grayscale images, namely a Markov image, an entropy map and a GIST feature image, which are then combined into one RGB three-channel color image. One reason for converting to an RGB three-channel color image is that color images are better suited to deep learning and transfer learning.
Step 3.1, converting the post-execution sample data into a Markov image through preprocessing.
Step 3.1.1, converting the post-execution sample data into decimal data.
The process data in the dataset of the invention is binary data, that is, byte stream data. The invention takes 8 binary bits as one byte, so each byte has a value in the range 0x00-0xff (0-255 in decimal). A 256×256 matrix is therefore constructed, and every coordinate point in the matrix is initialized to 0.
Step 3.1.2, constructing a byte frequency table in matrix form from the decimal data.
The invention sets the sliding window to 2, that is, two adjacent bytes give the coordinates of one point in the matrix. For example, the window 0x00, 0x01 corresponds to the point in the first row and second column; each time this pair occurs, the value at that point is increased by 1. The window keeps sliding until the whole process data sample has been traversed.
Step 3.1.3, converting the byte frequency table into a byte probability table according to the frequency at each matrix coordinate.
The invention converts the byte frequency table into a byte probability table, where the probability at each matrix coordinate is:
$$BPT_{ij} = \frac{BFT_{ij}}{S_i}, \qquad S_i = \sum_{j=0}^{255} BFT_{ij}$$
where $BPT_{ij}$ represents the probability in the i-th row and j-th column, $BFT_{ij}$ represents the frequency in the i-th row and j-th column, and $S_i$ represents the sum of the frequencies in the i-th row. The byte probability table is constructed by traversing the byte frequency table.
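The construction of the byte frequency table (steps 3.1.1 and 3.1.2) and its row-wise conversion into a byte probability table (step 3.1.3) can be sketched in a few lines. The function names are hypothetical, a window stride of one byte is assumed, and rows that never occur are left at zero to avoid dividing by a zero row sum; the embodiment does not state how such rows are handled.

```python
def byte_frequency_table(data):
    """Count adjacent byte pairs with a sliding window of 2 (stride 1 assumed).

    data is a bytes object; entry [i][j] counts how often byte value i is
    immediately followed by byte value j.
    """
    table = [[0] * 256 for _ in range(256)]
    for first, second in zip(data, data[1:]):
        table[first][second] += 1
    return table


def byte_probability_table(bft):
    """Divide every row of the frequency table by its row sum S_i."""
    bpt = []
    for row in bft:
        total = sum(row)
        bpt.append([count / total if total else 0.0 for count in row])
    return bpt
```

Each non-empty row of the resulting table sums to 1, so row i can be read as the distribution of the byte that follows byte value i.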
Step 3.1.4, constructing the Markov image from the byte frequency table and the byte probability table.
After the byte frequency table and the byte probability table are calculated, the invention constructs the Markov image, which is the first grayscale image. The formula for constructing the Markov image is:
[Formula image BDA0004155065510000101 not reproduced]
where MI_ij represents the image value in the i-th row and j-th column. The Markov image is built by traversing the byte frequency table and the byte probability table.
Step 3.2, converting the post-execution sample data into an entropy map through preprocessing.
Step 3.2.1, decompiling the post-execution sample data into operation behavior instructions. Specifically, a decompilation tool decompiles the binary file of the memory dump, filters out other data, and retains only the assembly instructions.
Step 3.2.2, performing dictionary coding on the operation behavior instructions.
All assembly instructions in the dataset are scanned and dictionary-coded; there are 111 distinct instructions in total (for example, MOV: 0, ADD: 1, and so on). The instruction sequence of each sample is then scanned and converted using this dictionary.
Step 3.2.3, calculating the Shannon entropy value from the encoded operation behavior instructions.
In this embodiment, every 16 instructions form a group, and the Shannon entropy value of each group is calculated; on the resulting graph the abscissa is the group index and the ordinate is the entropy value. For the instruction sequence of this embodiment, the entropy value is calculated as:
$$H(B_i) = -\sum_{k} P(b_{ki}) \log_2 P(b_{ki})$$
where $B_i$ represents the i-th group of instructions, $b_{ki}$ represents the k-th instruction of the i-th group, $H(B_i)$ represents the entropy value of the i-th group, and $P(b_{ki})$ represents the probability of the k-th instruction of the i-th group occurring within the i-th group.
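The dictionary coding of step 3.2.2 and the per-group entropy of step 3.2.3 can be sketched as below. This is an illustration under assumptions: codes are assigned in order of first appearance, the probability of an instruction is taken as its relative frequency within its group, the sum runs over the distinct instructions of the group, and a base-2 logarithm is used. The function names are hypothetical.

```python
import math
from collections import Counter


def encode_instructions(mnemonics):
    """Dictionary-encode mnemonics, assigning codes in order of first appearance."""
    codebook = {}
    codes = [codebook.setdefault(name, len(codebook)) for name in mnemonics]
    return codes, codebook


def group_entropies(codes, group_size=16):
    """Return the Shannon entropy (in bits) of each consecutive group of codes.

    The final group may be shorter than group_size if the sequence length is
    not a multiple of it.
    """
    entropies = []
    for start in range(0, len(codes), group_size):
        group = codes[start:start + group_size]
        probabilities = [count / len(group) for count in Counter(group).values()]
        entropies.append(sum(-p * math.log2(p) for p in probabilities))
    return entropies
```

A group that repeats one instruction has entropy 0, while a group of 16 different instructions reaches the maximum of 4 bits, so the entropy sequence traces how varied the code is along the dump.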
Step 3.2.4, constructing a line graph from the Shannon entropy values and converting the line graph into a grayscale image through normalization.
In this embodiment, a line graph is constructed from the Shannon entropy values of the 16-instruction groups and converted into a grayscale image; a third-party Python library is then used to resize the line graphs, which differ in size, into 256×256 grayscale images.
Step 3.3, converting the post-execution sample data into a GIST feature image through preprocessing.
Step 3.3.1, normalizing the API sequences of the benign samples and the malicious samples to obtain normalized feature vectors.
The complete API sequences of the benign samples and the malicious samples are scanned, a dictionary coding of the API calls is constructed, and the API sequences are converted into numerical vectors. The converted sequences are then normalized to the range 0-255 according to:
$$S(x) = \frac{x}{API_{size}} \times 255$$
where $S(x)$ represents the normalized value, $x$ represents the dictionary-coded value, and $API_{size}$ represents the total number of different APIs.
Step 3.3.2, converting the normalized feature vector into a grayscale image with a fixed width of 256, extracting GIST features from it, and scaling the result to a 256×256 grayscale GIST feature image by image scaling. GIST is a global feature descriptor that captures the overall characteristics of an image well.
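The first half of step 3.3 (dictionary coding, scaling to 0-255, and wrapping into rows of width 256) can be sketched as follows. The scaling used here, code × 255 divided by the number of distinct APIs with integer truncation, is an assumption about the normalization, as is zero-padding of the last row; the function names and the API names in the usage note are illustrative. GIST feature extraction and the final resize are not shown.

```python
def build_codebook(sequences):
    """Assign an integer code to every distinct API name across all samples."""
    codebook = {}
    for sequence in sequences:
        for name in sequence:
            codebook.setdefault(name, len(codebook))
    return codebook


def normalize_sequence(sequence, codebook):
    """Scale dictionary codes into 0-255 (assumed: code * 255 // API count)."""
    api_size = len(codebook)
    return [codebook[name] * 255 // api_size for name in sequence]


def to_rows(values, width=256, pad=0):
    """Wrap a flat vector into rows of fixed width, padding the last row."""
    padded = values + [pad] * (-len(values) % width)
    return [padded[i:i + width] for i in range(0, len(padded), width)]
```

For example, with three distinct APIs the codes 0, 1 and 2 map to the gray levels 0, 85 and 170, and a sequence of 300 calls becomes a 2×256 image whose second row is padded with 212 zeros.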
And 4, respectively extracting the features of the three gray images, fusing the extracted features, and converting the fused images into color images.
Because of the step 3.1, the step 3.2, the image size of the step 3.3 is 256 x 256, the invention superimposes the images, extracts single channel data of each gray level image, and then uses three gray level images as three channels of a color image through a merge function of openCV in python, thereby generating the color image. The three-channel color image is used for training and testing a neural network model.
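The channel fusion can be illustrated without any image library. The sketch below stacks three equally sized grayscale images, represented as lists of rows, into one three-channel image; the function name `merge_channels` and the channel order (Markov, entropy, GIST) are assumptions. On real image arrays, the OpenCV merge function named in the text would play this role.

```python
def merge_channels(markov, entropy, gist):
    """Stack three equally sized grayscale images into one 3-channel image.

    Each input is a list of rows of 0-255 values. Each output pixel is a
    (markov, entropy, gist) tuple; this channel order is an assumption.
    """
    if not (len(markov) == len(entropy) == len(gist)):
        raise ValueError("images must have the same height")
    merged = []
    for row_m, row_e, row_g in zip(markov, entropy, gist):
        if not (len(row_m) == len(row_e) == len(row_g)):
            raise ValueError("images must have the same width")
        merged.append(list(zip(row_m, row_e, row_g)))
    return merged
```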
Step 5: Build a neural network model, train it with the color images, select the model weights with the highest evaluation index to generate the malicious code detection model, and use that model to detect and classify malicious code. This embodiment builds a CNN-based twin (Siamese) neural network model which, as shown in Fig. 2, comprises convolution layers, pooling layers, the ReLU nonlinear activation function, and a fully connected layer, and finally performs classification prediction through a softmax function.
Step 5.1: Build the neural network model.
Step 5.1.1: Set the specific parameters.
Since the color image size in this embodiment is fixed at 256×256, the parameters of the CSNN model of the invention are also fixed: the two convolution kernels of the first CNN layer are set to 4, the convolution kernel of the second layer is set to 5, and the pooling-layer parameters are all set to 8.
Step 5.1.2: Verify the classification indexes.
For the detection of malicious code, this embodiment uses four classification evaluation indexes: accuracy, precision, the composite score F-measure, and recall. The F-measure is a single index that reflects both precision and recall.
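$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$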
where TP is the number of truly malicious samples predicted as malicious; FP is the number of truly benign samples predicted as malicious; TN is the number of truly benign samples predicted as benign; and FN is the number of truly malicious samples predicted as benign.
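As an illustration, the four indexes can be computed from the confusion counts using their standard definitions. The function name and the choice to return 0.0 when a denominator is zero are assumptions of this sketch.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F-measure from confusion counts.

    A ratio whose denominator is zero is reported as 0.0 (assumption).
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f_measure = ratio(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

For instance, with hypothetical counts TP = 40, FP = 10, TN = 45, and FN = 5, the accuracy is 0.85, the precision is 0.8, the recall is about 0.889, and the F-measure is about 0.842.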
Step 5.2: Train the neural network model with the color images, and select the model weights with the highest evaluation index to generate the malicious code detection model.
Step 5.2.1: Divide the color images into a training set and a test set.
Step 5.2.2: Input the training set into the CSNN deep learning model, share weights through the twin neural network, and finally perform output classification through a softmax function.
Step 5.2.3: Train the model continuously, save the best trained model, and test it with the test set until the accuracy on the test set reaches a threshold, thereby obtaining the malicious code detection model.
In this embodiment, steps 1 to 5 are described in detail using as an example a malicious sample A1 collected from a malicious-sample website (which stores the hash value of the sample) and a benign sample B1 obtained by downloading genuine editions of common software, such as office software, video and audio software, and game software, through a browser.
In step 1, the hash value of the malicious sample is submitted to the VirusTotal API, which returns the type of malicious code the sample contains. If no malicious code type is returned, the malicious sample itself can be uploaded to the VirusTotal platform for detection, and the malicious type in the detection report is used as the label. The file name of the malicious sample is changed to the malicious label in the detection report; the labeled malicious sample is denoted A2. The file names of the benign samples are labeled benign1, benign2, …, benignN, and the labeled benign sample is denoted B2.
In step 2, A2 and B2 are run in a sandbox, and their in-memory data is dumped.
The collected executable malicious samples are placed into the corresponding operating system; for example, .exe files are placed under a Windows operating system and .elf files under a Linux operating system.
The memory-data acquisition plug-in improved by the invention is called, the malicious samples A2 are executed in batches, and the memory data of the executed samples is then dumped; the result is denoted A3.
The dump operation is repeated at 12-minute intervals, so that the memory data is dumped three times in total. This is because an executable file does not load all of its data into memory at run time, and multiple dumps extract as much useful information as possible.
Sample B2 is run and its memory data dumped in the same manner; the result is denoted B3.
A report for each executed sample file is obtained through the Cuckoo sandbox host, and the sequence of called APIs is extracted; the API sequences executed by the malicious samples and by the benign samples are denoted C1 and D1, respectively.
In step 3, A3, B3, C1, and D1 are preprocessed and converted into an RGB three-channel color image.
(1) Markov image
A3 and B3 are dumped memory data in binary-file format. The invention converts each eight-bit binary number into a decimal number; eight bits are chosen because their value range, 0x00-0xff, matches the value range of image color codes.
The converted decimal numbers are grouped in pairs through a sliding window to form two-dimensional coordinates. A 256×256 matrix is constructed and initialized so that all of its values are 0; the two-dimensional coordinate data points are then traversed, and each time a coordinate occurs the value at the corresponding matrix position is increased by 1, until all coordinates have been traversed, thereby constructing the byte frequency table, as shown in Table 1:
Table 1 Byte frequency table
       0x00   0x01   0x02   0x03   0xff
0x00   2      1      2      0      0
0x01   25     5      0      50     20
0x02   10     12     24     0      4
0x03   0      2      0      0      3
0xff   0      0      0      0      0
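The construction of the byte frequency table can be sketched as follows. The sketch assumes a sliding window of two bytes that advances one byte at a time, with the first byte of each pair used as the row index and the second as the column index; the text does not state the stride or the axis order, and the function name is hypothetical.

```python
def byte_frequency_table(data):
    """Count how often each (byte, next byte) pair occurs in `data`.

    `data` is a bytes object, so every element is already an integer in
    0-255. The window of two bytes slides one byte at a time (assumption).
    """
    table = [[0] * 256 for _ in range(256)]
    for first, second in zip(data, data[1:]):
        table[first][second] += 1
    return table
```

For the byte string `00 01 01 03 00 01`, the pairs are (0x00, 0x01), (0x01, 0x01), (0x01, 0x03), (0x03, 0x00), and (0x00, 0x01), so the entry at row 0x00, column 0x01 would be 2.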
The byte frequency table is converted into a byte probability table: first the sum of the frequencies in each row is calculated, and then each value in the row is divided by that row sum to give its proportion, which yields the byte probability for that point, as shown in Table 2.
Table 2 Byte probability table
       0x00   0x01   0x02   0x03   0xff
0x00   0.4    0.2    0.4    0      0
0x01   0.25   0.05   0      0.5    0.2
0x02   0.2    0.24   0.48   0      0.08
0x03   0      0.4    0      0      0.6
0xff   0      0      0      0      0
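The row-wise conversion from frequencies to probabilities can be sketched as below. The function name is hypothetical, and leaving an all-zero row as zeros is an assumption that matches the 0xff row shown in the tables.

```python
def byte_probability_table(frequency_table):
    """Divide every entry by the sum of its row; all-zero rows stay zero."""
    probability_table = []
    for row in frequency_table:
        row_sum = sum(row)
        if row_sum == 0:
            probability_table.append([0.0] * len(row))
        else:
            probability_table.append([value / row_sum for value in row])
    return probability_table
```

Applied to the 0x02 row of the frequency table (10, 12, 24, 0, 4; row sum 50), this gives 0.2, 0.24, 0.48, 0, and 0.08, which agrees with the corresponding row of the probability table.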
The Markov image is then obtained from the byte frequency table and the byte probability table according to the Markov image calculation formula.
(2) Entropy diagram
A3 and B3, the memory binary files, are decompiled into assembly instructions with a decompilation tool, and the decompiled assembly instructions are dictionary-encoded to construct numerical vectors.
Every 16 instructions form one group, and an entropy value is calculated for each group, from which a line graph of the entropy values is constructed.
The entropy diagram is converted into a grayscale image of uniform size using a third-party Python library; the invention converts it into a 256×256 grayscale image.
(3) GIST feature image
The API sequences C1 and D1 are dictionary-encoded and converted into numerical vectors.
Each numerical vector is normalized to the range 0-255 and, with the width fixed at 256, converted into a grayscale image; feature processing is then performed with GIST, and the image is finally scaled to a 256×256 grayscale image.
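The step of laying out a normalized vector as a grayscale image of fixed width 256 can be sketched as follows. Padding the last row with zeros is an assumption, since the text does not say how a vector whose length is not a multiple of 256 is handled; the GIST extraction and the final scaling to 256×256 are not shown.

```python
WIDTH = 256  # fixed image width stated in the embodiment


def vector_to_rows(values, width=WIDTH, pad_value=0):
    """Cut a flat vector of 0-255 values into image rows of `width` pixels.

    The last row is padded with `pad_value` (assumption) so that every row
    has the same length.
    """
    rows = [list(values[i:i + width]) for i in range(0, len(values), width)]
    if rows and len(rows[-1]) < width:
        rows[-1].extend([pad_value] * (width - len(rows[-1])))
    return rows
```

A vector of 300 values would therefore produce an image of two rows, the second of which holds the remaining 44 values followed by padding.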
In step 4, since the three grayscale images obtained are all 256×256, this embodiment superimposes them and converts the result into a three-channel color image, denoted M.
In step 5, the preprocessed samples M are divided into a training set M1 and a test set M2 at a ratio of 8:2.
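An 8:2 split of the sample set can be sketched as below. Shuffling before the split and the fixed seed are assumptions added for reproducibility; the text only gives the ratio, and the function name is hypothetical.

```python
import random


def split_samples(samples, train_ratio=0.8, seed=0):
    """Shuffle the samples and split them into a training and a test set.

    `seed` fixes the shuffle so the split is repeatable (assumption).
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

With 10 samples, this yields 8 training samples (M1) and 2 test samples (M2), with no sample appearing in both sets.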
M1 is input into the CSNN deep learning model; the left and right CNN-based neural network architectures in Fig. 2 share weights through the twin neural network, and output classification is finally performed through a softmax function.
The model is trained continuously, the best trained model is saved, and testing is performed with the test set until the accuracy on the test set exceeds 85%.
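The train, save-best, and test cycle can be outlined independently of any deep learning framework. In the sketch below, `train_one_epoch` and `evaluate` are hypothetical callables standing in for the CSNN training step and the test-set evaluation, and `max_epochs` is an added safety limit that the text does not mention.

```python
def train_until_threshold(train_one_epoch, evaluate, threshold=0.85, max_epochs=100):
    """Keep the best weights seen so far and stop once accuracy exceeds `threshold`.

    `train_one_epoch()` returns the current model weights and
    `evaluate(weights)` returns the test-set accuracy; both are hypothetical
    placeholders for the real model code.
    """
    best_accuracy = -1.0
    best_weights = None
    for _ in range(max_epochs):
        weights = train_one_epoch()
        accuracy = evaluate(weights)
        if accuracy > best_accuracy:
            best_accuracy, best_weights = accuracy, weights
        if best_accuracy > threshold:
            break
    return best_weights, best_accuracy
```

Keeping the best weights separately from the latest ones means that a later, worse epoch does not overwrite the model that is eventually saved.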
The trained model is saved and used to classify and detect malicious code in the data to be tested.
Embodiment two:
the second embodiment of the invention provides a malicious code detection system based on multi-source collaboration and behavior analysis, which comprises:
a data acquisition module configured to acquire statically executable benign samples and malicious samples, respectively; labeling benign samples and malicious samples according to types;
the sandbox operation module is configured to put the marked benign samples and malicious samples into the sandbox for operation, so as to obtain the operated sample data and API sequences of the benign samples and the malicious samples;
the image conversion module is configured to convert the operated sample data into three gray images through preprocessing operation;
the feature fusion module is configured to extract features of the three gray images respectively, fuse the extracted features and convert the fused images into color images;
the model training module is configured to build a neural network model, train the neural network model by utilizing the color image, select the model weight with the highest evaluation index to generate a malicious code detection model, and detect and classify the malicious code by adopting the malicious code detection model.
In the sandbox operation module, the specific steps of placing the labeled benign and malicious samples into the sandbox for running and obtaining the post-run sample data and the API sequences of the benign and malicious samples are as follows:
a) Create an isolated sandbox environment. In this embodiment, several virtual systems, including Windows 7, Windows 10, Ubuntu, and CentOS, are installed in the sandbox environment; files with the .exe suffix are placed under Windows 7 and Windows 10, .elf files are placed under the Ubuntu and CentOS virtual systems, and a snapshot of each virtual system is then saved.
b) Run the executable files of the samples in batches in the sandbox while dumping memory to obtain the post-run sample data. Because the existing tool procdump.exe cannot acquire the process memory data of a sample set in batches, the invention embeds it in purpose-written code to acquire process memory data in batches. Specifically:
The executable file names of the samples are saved in a list.
Multiple threads are created to run cmd commands through os.system so that the collected samples are run in batches; multiple threads are needed because, once a sample is started, the calling program of the invention is suspended until the started sample exits.
A list of the system's processes is acquired by calling the psutil library.
The acquired process list is compared with the saved file-name list to obtain the list of pids of the system processes whose names match the file names.
The pid list is traversed, and procdump.exe is used to dump the memory of each process.
Because an executable file does not load all of its data into memory at run time, batch dumping is performed once every 12 minutes, three times in total; considering the memory space of the server, the invention starts ten threads at a time. This achieves efficient batch dumping of process memory data and yields the post-run sample data.
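The matching of running processes against the sample list, and the grouping of samples into batches of ten, can be sketched with plain data structures. In the sketch, the process list is passed in as (pid, name) pairs, which in the embodiment would come from the psutil library; the function names and the case-insensitive name comparison are assumptions, and the actual dump call is deliberately left out.

```python
def matching_pids(processes, sample_names):
    """Return the pids of processes whose name is in the saved sample list.

    `processes` is an iterable of (pid, name) pairs. Names are compared
    case-insensitively (assumption).
    """
    wanted = {name.lower() for name in sample_names}
    return [pid for pid, name in processes if name.lower() in wanted]


def batches(items, size=10):
    """Split the sample list into batches; ten per batch follows the text."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Each batch would then be started in its own threads, and the pids returned by `matching_pids` would be handed to the dump tool once every 12 minutes, three times in total.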
c) Extract the API sequences of the benign and malicious samples from the sandbox analysis reports.
A Cuckoo sandbox host is configured with network access to the Windows 7, Windows 10, Ubuntu, and CentOS systems. Executable files are submitted through the Cuckoo sandbox to obtain analysis reports, from which the called API behavior sequences are extracted.
The steps involved in the second embodiment correspond to those of the first embodiment of the method, and the detailed description of the second embodiment can be found in the related description section of the first embodiment.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims (10)

1. The malicious code detection method based on multi-source collaboration and behavior analysis is characterized by comprising the following steps of:
respectively acquiring a static executable benign sample and a malicious sample; labeling benign samples and malicious samples according to types;
placing the marked benign samples and malicious samples into a sandbox for operation to obtain operated sample data and API sequences of the benign samples and the malicious samples;
converting the operated sample data into three gray images through preprocessing operation;
respectively extracting the features of the three gray images, fusing the extracted features, and converting the fused images into color images;
building a neural network model, training the neural network model by utilizing a color image, selecting the model weight with the highest evaluation index to generate a malicious code detection model, and detecting and classifying the malicious code by adopting the malicious code detection model.
2. The malicious code detection method based on multi-source collaboration and behavior analysis according to claim 1, wherein the specific steps of placing the marked benign samples and malicious samples into a sandbox for operation are as follows:
creating an isolated sandboxed environment;
the method comprises the steps of running executable files of samples in batches by utilizing a sandbox, and simultaneously dumping to obtain running sample data;
the API sequence is extracted by sandboxed analysis of the report.
3. The malicious code detection method based on multi-source collaboration and behavior analysis according to claim 1, wherein the running sample data is converted through preprocessing operation into three gray scale images: a Markov image, an entropy diagram and a GIST feature image.
4. The malicious code detection method based on multi-source collaboration and behavior analysis according to claim 3, wherein the specific steps of converting the running sample data into a Markov image through a preprocessing operation are as follows:
converting the sample data after operation into decimal data;
constructing a byte frequency table in a matrix form by using decimal data;
converting the byte frequency table into a byte probability table according to the matrix coordinate frequency;
a Markov image is constructed from the byte frequency table and the byte probability table.
5. The malicious code detection method based on multi-source collaboration and behavior analysis of claim 3, wherein the specific steps of converting the running sample data into an entropy diagram through preprocessing operation are as follows:
decompiling the operated sample data into an operation behavior instruction;
dictionary coding is carried out on the operation behavior instruction;
calculating a Shannon entropy value according to the encoded operation behavior instruction;
and constructing a line graph according to the Shannon entropy value, and converting the line graph into a gray image through normalization.
6. The malicious code detection method based on multi-source collaboration and behavior analysis according to claim 5, wherein the step of decompiling the running sample data into operational behavior instructions is as follows: the decompilation tool decompiles the binary file of the memory dump, filters other data, and only retains the assembly instructions.
7. The malicious code detection method based on multi-source collaboration and behavior analysis according to claim 3, wherein the specific steps of converting the running sample data into GIST feature images through preprocessing operation are as follows:
normalizing API sequences of the benign samples and the malicious samples to obtain normalized feature vectors;
and converting the normalized feature vector into a gray level image, and then carrying out feature extraction of the GIST to obtain the GIST feature image.
8. The method for detecting malicious code based on multi-source collaboration and behavior analysis according to claim 7, wherein the specific step of normalizing the API sequences of the benign samples and the malicious samples is:
scanning the whole API sequences of the benign samples and the malicious samples, constructing dictionary coding formats of the API sequences of the benign samples and the malicious samples, and converting the API sequences of the benign samples and the malicious samples into numerical vectors; and then normalizing the converted sequences.
9. The malicious code detection method based on multi-source collaboration and behavior analysis according to claim 1, wherein the specific steps of constructing a neural network model, training the neural network model by using a color image, and selecting the model weight with the highest evaluation index to generate the malicious code detection model are as follows:
dividing the color image into a training set and a testing set;
inputting the training set into a CSNN deep learning model, sharing weights through a twin neural network, and finally carrying out output classification through a softmax function;
and continuously training the model, storing the best training model, and testing by using the test set until the accuracy of the test set reaches a threshold value to obtain the malicious code detection model.
10. A malicious code detection system based on multi-source collaboration and behavioral analysis, comprising:
a data acquisition module configured to acquire statically executable benign samples and malicious samples, respectively; labeling benign samples and malicious samples according to types;
the sandbox operation module is configured to put the marked benign samples and malicious samples into the sandbox for operation, so as to obtain the operated sample data and API sequences of the benign samples and the malicious samples;
the image conversion module is configured to convert the operated sample data into three gray images through preprocessing operation;
the feature fusion module is configured to extract features of the three gray images respectively, fuse the extracted features and convert the fused images into color images;
the model training module is configured to build a neural network model, train the neural network model by utilizing the color image, select the model weight with the highest evaluation index to generate a malicious code detection model, and detect and classify the malicious code by adopting the malicious code detection model.
CN202310331389.7A 2023-03-28 2023-03-28 Malicious code detection method and system based on multi-source collaboration and behavior analysis Pending CN116361797A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310331389.7A CN116361797A (en) 2023-03-28 2023-03-28 Malicious code detection method and system based on multi-source collaboration and behavior analysis

Publications (1)

Publication Number Publication Date
CN116361797A true CN116361797A (en) 2023-06-30

Family

ID=86919294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310331389.7A Pending CN116361797A (en) 2023-03-28 2023-03-28 Malicious code detection method and system based on multi-source collaboration and behavior analysis

Country Status (1)

Country Link
CN (1) CN116361797A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116663019A (en) * 2023-07-06 2023-08-29 华中科技大学 Source code vulnerability detection method, device and system
CN116663019B (en) * 2023-07-06 2023-10-24 华中科技大学 Source code vulnerability detection method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination