CN114692156A

CN114692156A - Memory segment malicious code intrusion detection method, system, storage medium and equipment

Info

Publication number: CN114692156A
Application number: CN202210603899.0A
Authority: CN
Inventors: 张淑慧; 胡长栋; 王连海; 王金鹏; 匡瑞雪
Original assignee: Shandong Computer Science Center National Super Computing Center in Jinan
Current assignee: Shandong Computer Science Center National Super Computing Center in Jinan
Priority date: 2022-05-31
Filing date: 2022-05-31
Publication date: 2022-07-01
Anticipated expiration: 2042-05-31
Also published as: CN114692156B

Abstract

The invention belongs to the technical field of computer malware detection and provides a method, a system, a storage medium and a device for detecting malicious code intrusion in memory segments. The method comprises the following steps: acquiring a memory file to be detected; sequentially performing binary conversion and word segmentation preprocessing on the memory file to be detected, and then intercepting a fragment based on the optimal combination of fragment position and length to obtain a prediction fragment; and inputting the prediction fragment into an optimal neural network model, which examines it to determine whether malicious code has been implanted in the memory file to be detected. The neural network model uses an embedding layer to raise the dimension of the input prediction fragment, applies convolution followed by pooling through convolutional layers with different convolution kernel sizes, and finally passes the result through a flattening layer and a fully connected layer into a classifier. By learning the latent patterns and characteristics of malicious code, the method can detect both viruses that have not yet been discovered and existing viruses.

Description

Memory segment malicious code intrusion detection method, system, storage medium and equipment

Technical Field

The invention belongs to the technical field of computer malicious software detection, and particularly relates to a method, a system, a storage medium and equipment for detecting malicious code intrusion of a memory segment.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

With the rapid development of computer and internet technologies, the amount of malware is growing exponentially. Malicious programs now come in many varieties, and their anti-detection techniques are updated ever faster; fileless malware attacks in particular break through the detection of security protection systems and pose serious threats and challenges to enterprise security defenders. A fileless malware attack is a method of inducing a victim organization to execute code from memory without placing malicious files or file fragments on the computer's disk, so that the malware hides itself and the traces of its attack. However, such attacks cannot completely erase their traces in memory. Therefore, memory analysis is one of the best methods for systematically analyzing programs whose malicious characteristics are unknown and whose source code is unavailable.

In addition, the paging and replacement mechanisms of memory leave most of the information in memory incomplete: during execution a program does not load all of its information into memory, but only loads part of it first. A complete file therefore cannot be obtained, and it is difficult to determine by professional analysis methods whether the obtained file is a malicious program or file.

Existing antivirus software essentially judges whether a file is malicious by comparing it against features already stored in a virus library. The obvious advantages of this approach are high accuracy, convenience and a low false alarm rate, but it has difficulty detecting incomplete files in memory and cannot detect newly generated viruses.

Disclosure of Invention

In order to solve the technical problems described in the background, the invention provides a method, a system, a storage medium and a device for detecting malicious code intrusion in memory segments, which use a neural network model to learn the latent patterns and characteristics of malicious code and thereby detect both undiscovered and existing viruses.

In order to achieve the purpose, the invention adopts the following technical scheme:

The first aspect of the present invention provides a method for detecting malicious code intrusion in memory segments, which includes:

acquiring a memory file to be detected;

sequentially performing binary conversion and word segmentation preprocessing on the memory file to be detected, and then intercepting a fragment based on the optimal combination of fragment position and length to obtain a prediction fragment;

inputting the prediction fragment into an optimal neural network model, which examines the prediction fragment to determine whether malicious code has been implanted in the memory file to be detected;

wherein the neural network model uses an embedding layer to raise the dimension of the input prediction fragment, applies convolution followed by pooling through convolutional layers with different convolution kernel sizes, and finally passes the result through a flattening layer and a fully connected layer into a classifier.

Further, the word segmentation preprocessing comprises the following specific steps:

converting the binary file obtained after binary conversion into decimal to obtain a decimal file;

and judging whether the decimal file reaches the preset length; if not, adding 1 to every data value in the decimal file and then padding the file with 0.

Further, the fragment position and length combination is:

taking data with the length of integral multiple of 1024 from the head of the memory file as a prediction fragment;

or taking the data with the length of the integral multiple of 1024 from the tail of the memory file as a prediction fragment;

or selecting a plurality of discontinuous sub-segments from the memory file, and combining the plurality of discontinuous sub-segments to obtain the prediction segment.

Further, the step of obtaining the optimal neural network model and the optimal combination of the position and the length of the segment is as follows:

acquiring a malicious sample set and a benign sample set, and performing binary conversion and word segmentation preprocessing on each memory file in the malicious sample set and the benign sample set to obtain an initial training test set;

performing fragment interception on each memory file in the initial training test set based on a plurality of fragment position and length combinations to obtain a plurality of training test sets;

and training and testing the neural network model with each training test set in turn, taking the neural network model with the highest accuracy as the optimal neural network model, and taking the fragment position and length combination that was used to produce its training test set as the optimal fragment position and length combination.

Further, the step of training and testing the neural network model by adopting the training and testing set comprises:

(a) dividing a training test set into a training set and a test set;

(b) updating the weight of the neural network model based on the training set, and outputting the neural network model after a plurality of iterations until the loss function reaches the minimum;

(c) classifying the samples in the test set with the output neural network model, and, when the classification accuracy is less than the threshold value, returning to step (b) to continue training the parameters of the neural network model, until the classification accuracy of the output neural network model on the samples in the test set is greater than or equal to the threshold value.

A second aspect of the present invention provides a memory segment malicious code intrusion detection system, including:

a file acquisition module configured to: acquiring a memory file to be detected;

a fragment interception module configured to: sequentially perform binary conversion and word segmentation preprocessing on the memory file to be detected, and then intercept a fragment based on the optimal combination of fragment position and length to obtain a prediction fragment;

a prediction module configured to: input the prediction fragment into an optimal neural network model, which examines the prediction fragment to determine whether malicious code has been implanted in the memory file to be detected;

wherein the neural network model uses an embedding layer to raise the dimension of the input prediction fragment, applies convolution followed by pooling through convolutional layers with different convolution kernel sizes, and finally passes the result through a flattening layer and a fully connected layer into a classifier.

Further, the fragment position and length combination is:

Further, a training module is included that is configured to:

A third aspect of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps in the memory segment malicious code intrusion detection method as described above.

A fourth aspect of the present invention provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the steps in the memory segment malicious code intrusion detection method described above.

Compared with the prior art, the invention has the beneficial effects that:

The invention provides a method for detecting malicious code intrusion in memory segments, which uses a neural network model to learn the latent patterns and characteristics of malicious code and thereby detects both undiscovered and existing viruses.

The invention provides a method for detecting malicious code intrusion in memory segments that examines dynamic files obtained by memory analysis. On the one hand, it can detect malicious code that cannot be detected in static files and can be detected only at run time; on the other hand, it allows memory forensics personnel to secure memory evidence more efficiently.

The invention provides a method for detecting malicious code intrusion in memory segments that is effective in detecting running memory files dumped to disk and offers memory forensics personnel strong reference value and practicality.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it.

FIG. 1 is a flowchart of an optimal neural network model obtaining method according to a first embodiment of the present invention;

FIG. 2 is a schematic diagram of binary and decimal translation according to a first embodiment of the present invention;

FIG. 3 is a diagram of a neural network model according to a first embodiment of the present invention;

FIG. 4 is a hidden layer structure diagram of the neural network model according to the first embodiment of the present invention.

Detailed Description

The invention is further described with reference to the following figures and examples.

It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; it should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.

Interpretation of terms:

Convolutional Neural Networks (CNNs) are a class of feed-forward neural networks that involve convolution computations and have a deep structure.

Example one

The embodiment provides a method for detecting intrusion of malicious codes of memory segments, which specifically comprises the following steps:

Step 1: obtain a memory file to be detected. The process of obtaining the memory file to be detected can be divided into three stages: sandbox operation, dumping and extraction.

Sandbox operation: an untrusted program is run in a virtual machine, which effectively keeps the real device from being damaged. Dumping: the memory of the virtual machine is dumped in the form of a snapshot. Extraction: a forensics tool is used to extract the memory file to be detected from the dumped memory file, where the memory file to be detected is an executable file (.exe), a dynamic link library file (.dll) or a system file (.sys).

Step 2: sequentially perform binary conversion and word segmentation preprocessing on the memory file to be detected, and then intercept a fragment based on the optimal combination of fragment position and length (size) to obtain a prediction fragment.

Step 3: input the prediction fragment into the optimal neural network model, which examines it to determine whether malicious code has been implanted in the memory file to be detected. If the memory file to be detected is classified as invaded by malicious code, it can be judged that malicious code has been implanted in it; otherwise, it is a memory file in which no malicious code has been implanted. In this way, files whose malicious code goes undetected in static form but is detectable in the dynamic (in-memory) file can be found.

Specifically, as shown in fig. 1, the steps of obtaining the optimal neural network model and the optimal combination of the position and the length of the segment are as follows:

(1) A malicious sample set and a benign sample set are obtained; binary conversion is performed on each memory file in the malicious sample set and the benign sample set to obtain a binary file data set, and word segmentation preprocessing is performed on each binary file in the binary file data set to obtain an initial training test set.

In this embodiment, the label of each sample (i.e., each memory file) in the malicious sample set is "invaded by malicious code", and the label of each sample in the benign sample set is "no malicious code implanted". The malicious and benign sample sets are obtained as follows: static malicious samples are downloaded (in this embodiment, 600 static malicious samples from the VirusShare website, https://virusshare.com/); the untrusted programs are run in a virtual machine, a memory image is acquired, and the malicious sample set is extracted with a memory forensics tool; a memory image of a Windows system is acquired, and the benign sample set, which contains 300 samples in this embodiment, is extracted with a memory forensics tool.

The word segmentation preprocessing comprises the following specific steps. The binary file obtained after binary conversion is converted into decimal to obtain a decimal file whose values lie in the range 0-255. This brings the memory file closer to an image: since the pixels of an image also lie in the interval 0-255, the binary file can be treated as a grayscale map, which better suits the neural network model. To keep the lengths of the memory files input into the neural network model consistent, it is judged whether the decimal file reaches the preset length; if not, 1 is added to every data value in the decimal file and the file is then padded with 0. The 1 is added so that a 0 in the original data remains meaningful and can be distinguished from the padding, as shown in FIG. 2.
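The preprocessing described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names `to_bits`, `bits_to_decimal` and `preprocess` and the parameter `target_len` are hypothetical, and two behaviours are assumptions that the text does not state, namely that the +1 shift is applied to every file (so the encoding is the same whether or not padding is needed) and that files longer than the preset length are truncated.

```python
def to_bits(raw):
    """Binary conversion: render each byte as an 8-character bit string."""
    return [format(b, "08b") for b in raw]


def bits_to_decimal(bit_strings):
    """Convert each 8-bit string to its decimal value in the range 0-255."""
    return [int(s, 2) for s in bit_strings]


def preprocess(raw, target_len):
    """Shift every value by +1 and pad with 0 up to target_len.

    After the shift, real data occupies 1-256 and 0 is reserved for padding.
    Assumptions (not in the source): the shift is always applied, and input
    longer than target_len is truncated.
    """
    values = [v + 1 for v in bits_to_decimal(to_bits(raw))[:target_len]]
    values.extend([0] * (target_len - len(values)))
    return values


sample = bytes([0x00, 0x4D, 0xFF])
print(to_bits(sample))        # ['00000000', '01001101', '11111111']
print(preprocess(sample, 6))  # [1, 78, 256, 0, 0, 0]
```

Under this encoding an embedding layer would need 257 distinct indices: 0 for padding and 1-256 for data.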

(2) Fragment interception is performed on each memory file in the initial training test set based on a plurality of fragment position and length combinations to obtain a plurality of training test sets. Each training test set is the result of intercepting fragments from the initial training test set with one fragment position and length combination.

The fragment position and length combinations are: taking data whose length is an integral multiple of 1024 (1024 bytes, 2048 bytes, and so on) from the head of the memory file as the prediction fragment; or taking data whose length is an integral multiple of 1024 from the tail of the memory file as the prediction fragment; or selecting a plurality of non-contiguous sub-segments from the memory file and combining them into the prediction fragment. The sub-segments can be combined either in the order in which they occur in the memory file or after their order has been shuffled.

The number and the length of the sub-segments are calculated by the following method:

data_len = file_len / k;

if train_len < data_len, data_len = 256;

NN = train_len / data_len;

where k is a parameter set according to the average length of the samples and can be taken as 60; file_len is the sample length; data_len is the length of the data taken at each position (i.e., the length of one sub-segment); train_len is the preset length of the prediction fragment; and NN is the number of sub-segments.
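The three interception strategies and the sub-segment calculation can be sketched as follows. The function names are hypothetical, integer division is assumed for both quotients, and the choice of evenly spaced start offsets for the sub-segments is an assumption, since the text does not say where in the file the sub-segments are taken from.

```python
import random


def head_fragment(data, length):
    """Take `length` values from the head of the file."""
    return data[:length]


def tail_fragment(data, length):
    """Take `length` values from the tail of the file."""
    return data[-length:]


def subsegment_plan(file_len, train_len, k=60):
    """Return (data_len, NN) following the three formulas in the text."""
    data_len = file_len // k          # assumption: integer division
    if train_len < data_len:
        data_len = 256
    if data_len <= 0 or train_len < data_len:
        raise ValueError("file or prediction fragment too short for this k")
    nn = train_len // data_len
    return data_len, nn


def scattered_fragment(data, train_len, k=60, shuffle=False, seed=0):
    """Join NN sub-segments of length data_len into one prediction fragment."""
    data_len, nn = subsegment_plan(len(data), train_len, k)
    spacing = len(data) // nn         # assumption: evenly spaced start offsets
    pieces = [data[i * spacing:i * spacing + data_len] for i in range(nn)]
    if shuffle:                       # optional: combine in shuffled order
        random.Random(seed).shuffle(pieces)
    return [value for piece in pieces for value in piece]


data = list(range(61440))
print(subsegment_plan(len(data), 4096))     # (1024, 4)
print(subsegment_plan(1_000_000, 4096))     # (256, 16)
print(len(scattered_fragment(data, 4096)))  # 4096
```

With these spacings the sub-segments stay non-contiguous as long as NN is smaller than k, which holds whenever the prediction fragment is much shorter than the file.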

(3) Each training test set is used in turn to train and test the neural network model; the neural network model with the highest accuracy is taken as the optimal neural network model, and the fragment position and length combination that was used to produce its training test set is taken as the optimal fragment position and length combination.

A neural network model is trained and tested with a training test set as follows:

(a) dividing the training test set into a training set and a test set at a ratio of 2:1;

(b) updating the weights of the neural network model based on the training set, and outputting the neural network model after a number of iterations, once the loss function has reached its minimum and the training result is stable;

(c) testing the output neural network model with the test set, i.e., classifying the samples in the test set with the output neural network model; when the classification accuracy is below the threshold of 80%, returning to step (b) to continue parameter-tuning training of the neural network model, until the classification accuracy of the output neural network model on the samples in the test set is greater than or equal to the 80% threshold.
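The control flow of steps (a) to (c), together with the selection of the best combination in step (3), can be sketched independently of any deep learning framework. All names here are hypothetical: `train_step` and `evaluate` stand in for the actual training and testing routines, shuffling before the split is an assumption, and `max_rounds` is an added safeguard against an endless loop that the text does not mention.

```python
import random


def split_2_to_1(samples, seed=0):
    """Step (a): divide the samples into training and test sets at 2:1."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # assumption: shuffle before splitting
    cut = 2 * len(items) // 3
    return items[:cut], items[cut:]


def train_until_threshold(train_step, evaluate, threshold=0.8, max_rounds=20):
    """Repeat steps (b) and (c) until test accuracy reaches the threshold."""
    for round_no in range(1, max_rounds + 1):
        train_step()           # step (b): update the weights on the training set
        accuracy = evaluate()  # step (c): classify the test set
        if accuracy >= threshold:
            return accuracy, round_no
    raise RuntimeError("accuracy threshold not reached within max_rounds")


def pick_best(accuracy_by_combination):
    """Step (3): return the (combination, accuracy) pair with top accuracy."""
    return max(accuracy_by_combination.items(), key=lambda item: item[1])


train_set, test_set = split_2_to_1(range(900))
print(len(train_set), len(test_set))  # 600 300
```

The 900 samples in the usage line correspond to the 600 malicious and 300 benign samples of this embodiment.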

Specifically, the neural network model is a CNN model. As shown in FIG. 3, it includes an input layer, a hidden layer and an output layer. The number of neurons in the input layer equals the number of data values in the input prediction fragment; in this embodiment the input layer has 256 neurons, each neuron x_i represents one data value in the prediction fragment, and the input layer has 1 channel. The output layer has 2 neurons, representing the two categories: invaded by malicious code, or no malicious code implanted. The neural network model uses an embedding layer to raise the dimension of the input prediction fragment, applies convolution followed by pooling through convolutional layers with different convolution kernel sizes, and finally passes the result through a flattening layer and a fully connected layer into a classifier. Specifically, as shown in FIG. 4, the hidden layer has 12 layers in total, connected in sequence: an embedding layer, a first one-dimensional convolutional layer, a first pooling layer, a second one-dimensional convolutional layer, a first Dropout layer, a third one-dimensional convolutional layer, a second pooling layer, a second Dropout layer, a third pooling layer, a flattening layer (flatten), a fully connected layer, and a softmax layer. To capture more data features, the convolutional layers of the model use smaller convolution kernels and more convolutional layers, rather than the larger convolution kernels (generally set to more than 10) typical of natural language processing classification tasks. Although this increases the number of convolution computations, the training time does not increase much: memory data is sparse, with zeros making up at least one quarter of the data, which simplifies the convolution computation.

The embedding layer raises the dimension of the input, converting the value of each neuron of the input layer into a multidimensional (100-dimensional) vector so that the differences between values can be learned better. The dimension-raised (100-dimensional) data is then spliced a second time and used as the input of the first one-dimensional convolutional layer, whose number of input channels is 200.

For a given receptive field, stacking small convolution kernels is preferable to using one large convolution kernel, because multiple non-linear layers increase the depth of the network. Therefore, several one-dimensional convolutional layers with smaller convolution kernels are used for training: the convolution kernel sizes of the first, second and third one-dimensional convolutional layers are set to 3, 4 and 5 respectively, and their strides are all 1. The first one-dimensional convolutional layer has 200 input channels and 100 output channels; the second has 100 input channels and 50 output channels; and the third has 50 input channels and 25 output channels.

The output features of the first, second and third one-dimensional convolutional layers can all be expressed as:

$$\mathrm{out}(N_i, C_{\mathrm{out}_j}) = \mathrm{bias}(C_{\mathrm{out}_j}) + \sum_{k=0}^{C_{\mathrm{in}}-1} \mathrm{weight}(C_{\mathrm{out}_j}, k) \star \mathrm{input}(N_i, k)$$

where N is the batch size, C is the number of channels, L is the sequence length, and bias is the bias value in the neural network (the bias defaults to 1). The batch size is the number of samples processed together: all samples in the data set (which can be the training set, the test set, or a single memory file to be detected at inference time) are divided into groups, and the number of samples in each group is the batch size. i denotes the group; j denotes the j-th sample in the i-th group; k is the index of the input channel, i.e., the k-th neuron node input for the j-th sample of the i-th group; $\mathrm{out}_j$ denotes the output channel of the j-th sample; $C_{\mathrm{in}}$ denotes the total number of input channels; weight denotes the weight vector of the first, second or third one-dimensional convolutional layer; input denotes the input features of that layer; and $\star$ denotes the one-dimensional sliding-window (cross-correlation) operation.

The lengths of the output sequences of the first, second and third one-dimensional convolutional layers are all calculated by:

$$L_{\mathrm{out}} = \left\lfloor \frac{L_{\mathrm{in}} + 2 \times \mathrm{padding} - \mathrm{dilation} \times (\mathrm{kernel\_size} - 1) - 1}{\mathrm{stride}} + 1 \right\rfloor$$

where $L_{\mathrm{out}}$ is the length of the output sequence, $L_{\mathrm{in}}$ is the length of the input sequence, padding is the padding length, dilation is the size of the dilated (hole) convolution, here set to 0, kernel_size is the size of the convolution kernel, and stride is the step size.

The first and second Dropout layers prevent overfitting to the training data without reducing training accuracy. The mask ratio (the proportion of neuron nodes randomly hidden) of both Dropout layers is set to 0.5. Dropout is a regularization technique based on random hiding: a Dropout layer randomly hides real neuron nodes, by default filling them with zeros, so that the convolutional neural network treats the result as new data, learns again and updates its weights, which suppresses overfitting during training.

Because memory data is sparse, average pooling would weaken the extracted features; the first, second and third pooling layers therefore use max pooling (maxpool), with which the model performs better than with average pooling. The sizes of the first, second and third pooling layers are all set to 4, a value determined by the sparsity of the binary data of dynamic files.
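A small worked example illustrates why average pooling weakens features on sparse data while max pooling preserves them. The helper names `pool1d` and `mean` and the toy sequence are hypothetical, and a stride equal to the window size is assumed.

```python
def pool1d(seq, size, reduce):
    """Apply `reduce` to consecutive non-overlapping windows of length `size`."""
    return [reduce(seq[i:i + size]) for i in range(0, len(seq) - size + 1, size)]


def mean(window):
    return sum(window) / len(window)


# A sparse toy sequence: mostly zeros with two isolated activations.
sparse = [0, 0, 8, 0, 0, 0, 0, 0, 0, 4, 0, 0]

print(pool1d(sparse, 4, max))   # [8, 0, 4]
print(pool1d(sparse, 4, mean))  # [2.0, 0.0, 1.0]
```

With a window of 4, max pooling keeps the activations 8 and 4 at full strength, whereas average pooling dilutes them to 2.0 and 1.0 because each window is dominated by zeros.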

The data features output by the third pooling layer are flattened by the flattening layer; the input size of the flattening layer is obtained by taking into account the length of the output sequence of the third one-dimensional convolutional layer.

The flattening layer flattens the sequence, the fully connected layer converts it into two neuron nodes, and the softmax layer finally performs the classification.
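To see how the sequence length shrinks on its way to the flattening layer, the layer sizes given in this embodiment can be traced with a few lines of arithmetic. This is a sketch under stated assumptions, none of which are confirmed by the text: the commonly used one-dimensional convolution length formula with no padding, a dilation of 1 (the usual framework convention for an ordinary convolution, although the text states 0), and pooling with a stride equal to the pooling size. The names `conv1d_out_len`, `pool_out_len`, `LAYERS` and `trace_shapes` are hypothetical.

```python
def conv1d_out_len(l_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution (standard formula, assumed here)."""
    return (l_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def pool_out_len(l_in, size):
    """Output length of pooling whose stride equals its window size."""
    return l_in // size


# (name, kind, kernel or pool size, output channels), in the order listed in
# the text. Dropout layers are omitted because they do not change the shape.
LAYERS = [
    ("conv1", "conv", 3, 100),
    ("pool1", "pool", 4, 100),
    ("conv2", "conv", 4, 50),
    ("conv3", "conv", 5, 25),
    ("pool2", "pool", 4, 25),
    ("pool3", "pool", 4, 25),
]


def trace_shapes(l_in=256):
    """Return [(layer name, channels, sequence length), ...]."""
    shapes, length = [], l_in
    for name, kind, size, channels in LAYERS:
        if kind == "conv":
            length = conv1d_out_len(length, size)
        else:
            length = pool_out_len(length, size)
        shapes.append((name, channels, length))
    return shapes


shapes = trace_shapes(256)
for name, channels, length in shapes:
    print(name, channels, length)
_, last_channels, last_length = shapes[-1]
print("flattened features:", last_channels * last_length)  # flattened features: 75
```

Under these assumptions the lengths run 254, 63, 60, 56, 14 and 3, so the fully connected layer would map 25 x 3 = 75 features to the two output neurons; different padding or pooling settings would change these numbers.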

The last layer of the hidden layer, the softmax layer, serves as the classifier and uses the normalized exponential function:

$$y_k = \frac{\exp(a_k)}{\sum_{i=1}^{n} \exp(a_i)}$$

where exp(x) denotes the exponential function $e^x$ (e is Napier's constant, 2.7182...), n is the number of neurons in the output layer, $y_k$ is the output of the k-th neuron of the output layer, and $a_k$ is the input of the k-th neuron of the output layer. The numerator is the exponential function of the input signal $a_k$ of the k-th neuron, and the denominator is the sum of the exponential functions of all input signals.
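The normalized exponential function can be written in a few lines. This is an illustrative sketch with a hypothetical function name; subtracting the maximum input before exponentiating is a standard numerical-stability step that does not change the result and is not taken from the text.

```python
import math


def softmax(a):
    """Return y_k = exp(a_k) / sum_i exp(a_i) for every input a_k."""
    shift = max(a)  # stability trick (not in the source); cancels out in the ratio
    exps = [math.exp(x - shift) for x in a]
    total = sum(exps)
    return [e / total for e in exps]


print(softmax([0.0, 0.0]))  # [0.5, 0.5]
```

For the two output neurons of this model, the larger of the two softmax outputs indicates the predicted category.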

The optimizer of the neural network model is chosen as the one that proves most suitable for the model when validated on a large amount of data.

The loss function of the neural network model is the mini-batch cross-entropy loss:

$$E = -\frac{1}{M} \sum_{m} \sum_{k} t_{mk} \log y_{mk}$$

where M is the number of prediction fragments in the training set, $t_{mk}$ is the supervisory data, i.e., the value of the k-th element (data) of the m-th prediction fragment, and $y_{mk}$ is the output of the neural network for the k-th element (data) of the m-th prediction fragment. The loss function for a single datum is extended to M data and then divided by M, which gives the average loss per prediction fragment. This averaging yields a uniform index that does not depend on the amount of training data: whether there are 1000 or 10000 training data (prediction fragments in the training set), the average loss of a single prediction fragment can be obtained.
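The averaged cross-entropy can be sketched as below, assuming one-hot supervisory vectors and network outputs that are already softmax probabilities. The function name is hypothetical, and the small constant `eps`, which guards against taking the logarithm of zero, is an addition that the text does not mention.

```python
import math


def mini_batch_cross_entropy(y, t, eps=1e-7):
    """Average cross-entropy over M fragments.

    y[m][k] is the network output and t[m][k] the supervisory value for the
    k-th element of the m-th prediction fragment.
    """
    m_count = len(y)
    total = 0.0
    for y_m, t_m in zip(y, t):
        for y_mk, t_mk in zip(y_m, t_m):
            total += t_mk * math.log(y_mk + eps)  # eps: assumption, avoids log(0)
    return -total / m_count


outputs = [[0.5, 0.5], [0.5, 0.5]]
labels = [[1, 0], [0, 1]]
print(mini_batch_cross_entropy(outputs, labels))  # close to ln 2, about 0.693
```

Because the sum is divided by M, doubling the number of fragments with the same per-fragment error leaves the loss unchanged.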

Based on the above analysis, the invention uses machine learning to perform binary classification of files extracted from memory; by examining files extracted from memory in this way, it can effectively determine whether malicious code has been implanted in memory.

Example two

The embodiment provides a system for detecting intrusion of malicious codes in a memory segment, which specifically comprises the following modules:

Wherein the combination of fragment position and length is:

A training module configured to:

It should be noted that the modules in this embodiment correspond one-to-one to the steps in the first embodiment, and the specific implementation process is the same, which is not described again here.

Example three

The present embodiment provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the memory segment malicious code intrusion detection method according to the first embodiment.

Example four

The embodiment provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor executes the program to implement the steps in the intrusion detection method for malicious codes in memory segments according to the first embodiment.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. The intrusion detection method for the malicious codes of the memory segments is characterized by comprising the following steps:

acquiring a memory file to be detected;

2. The intrusion detection method for the malicious codes in the memory segment according to claim 1, wherein the word segmentation preprocessing comprises the following specific steps:

3. The memory segment malicious code intrusion detection method according to claim 1, wherein the combination of the segment position and the length is:

4. The intrusion detection method for the malicious codes in the memory segment according to claim 1, wherein the steps of obtaining the optimal neural network model and the optimal combination of the position and the length of the segment are as follows:

5. The memory segment malicious code intrusion detection method according to claim 4, wherein the step of training and testing the neural network model by adopting the training test set comprises the following steps:

(a) dividing a training test set into a training set and a test set;

6. The intrusion detection system for the malicious codes of the memory segments is characterized by comprising the following steps:

7. The memory segment malicious code intrusion detection system according to claim 6, wherein the segment location and length combination is:

8. The memory segment malicious code intrusion detection system of claim 6, further comprising a training module configured to:

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the memory segment malicious code intrusion detection method according to any one of claims 1 to 5.

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps in the memory segment malicious code intrusion detection method according to any one of claims 1 to 5 when executing the program.