CN111881447A

CN111881447A - Intelligent evidence obtaining method and system for malicious code fragments

Info

Publication number: CN111881447A
Application number: CN202010594720.0A
Authority: CN
Inventors: 李炳龙; 张宇; 李媛芳; 佟金龙; 孙怡峰; 常朝稳; 王清贤
Original assignee: Henan Yunyan Technology Co ltd; Information Engineering University of PLA Strategic Support Force
Current assignee: Henan Yunyan Technology Co ltd; Information Engineering University of PLA Strategic Support Force
Priority date: 2020-06-28
Filing date: 2020-06-28
Publication date: 2020-11-03
Anticipated expiration: 2040-06-28
Also published as: CN111881447B

Abstract

The invention belongs to the technical field of digital forensics, and in particular relates to an intelligent forensics method and system for malicious code fragments. A code fragment training set and a code fragment test set are constructed by extracting the underlying data features of a storage medium. A preset fully-connected neural network model is trained with the data in the training set, where the input is a feature vector and the output is a prediction of normal or malicious. The trained model is then run on the test set to judge whether each model input is a malicious code fragment. Finally, features are extracted from a target code fragment and input into the fully-connected neural network model produced by training and testing, yielding an intelligent malicious code recognition result. The method can identify malicious code fragments in storage media such as computers, mobile phones and tablets and in evidence containers such as RAW, E01 and AFF, and has good application prospects in digital forensics, for example in the automatic analysis of the underlying evidence data of criminal incidents.

Description

Intelligent evidence obtaining method and system for malicious code fragments

Technical Field

The invention belongs to the technical field of digital forensics, and in particular relates to an intelligent forensics method and system for malicious code fragments. It is suitable for detecting and acquiring malicious code fragments in the storage media of computers (including notebook computers), such as magnetic disks, USB flash drives and portable hard disks, in devices such as iPads and smartphones, and in forensic containers such as RAW, E01 and AFF.

Background

With the rapid development of mobile internet technology, digital crime incidents occur frequently. Because the capacity of disk media and the number of digital devices that store crime-related information both keep growing, the volume of digital evidence that judicial agencies such as public security organs must process during an investigation has increased greatly. According to a 2019 digital forensics capability analysis report of the Texas department of justice, the Federal Bureau of Investigation (FBI) in the United States has the best forensic laboratories, yet its digital evidence backlog exceeds nine months, and the number of cases finally handled has had to be reduced because a large amount of digital evidence cannot be analyzed effectively. In addition, crime evidence comes from different types of devices such as computers, smartphones, tablet computers, and even Internet of Things and wearable devices, so this massive body of evidence carries different metadata, such as different operating systems and file systems, which leads to great differences in how the evidence must be analyzed. Furthermore, to ensure the integrity and repeatability of digital evidence analysis, evidence from the different devices must be stored, through storage-medium imaging, in evidence containers such as AFF, E01 and RAW. The evidence data in these containers are stored in a low-level binary format, which makes evidence analysis increasingly complex. Therefore, to cope with the large data volume, heterogeneity and complexity of digital evidence analysis in digital criminal incidents, automated forensic analysis has become a key research problem in the digital forensics field.
To identify malicious code fragments in the complex, heterogeneous, low-level and massive evidence data encountered in digital criminal investigations, using deep learning theory and deep learning models to study the automatic forensic detection of malicious code fragments, starting from the underlying characteristics of evidence data storage, has become a hot research direction in digital forensic detection.

Disclosure of Invention

To this end, the invention provides an intelligent forensics method and system for malicious code fragments. By extracting the underlying data features of a storage medium, they can identify malicious code fragments in storage media such as computers (including notebook computers) and Android smartphones (and tablets) and in forensic containers such as RAW, E01 and AFF, improve the recognition of malicious code fragments, and have good application prospects.

According to the design scheme provided by the invention, the intelligent malicious code fragment evidence obtaining method comprises the following contents:

extracting the bottom data characteristics of the storage medium, and constructing a code segment training set and a code segment testing set for training and testing, wherein the code segment training set and the code segment testing set both comprise normal code segments and malicious code segment data;

training the set fully-connected neural network model by using data in the code segment training set to adjust network model parameters, wherein the input is a feature vector, and the output is a normal or malicious prediction result; aiming at the code fragment test set, carrying out test output by using the trained fully-connected neural network model so as to judge whether the model input is a malicious code fragment;

and after features are extracted from the target code segment, inputting those features into the fully-connected neural network model generated through training and testing to obtain an intelligent malicious code recognition result.

Further, in the intelligent forensics method for malicious code fragments, evidence source data from a plurality of storage media are collected and parsed, and malicious code fragment features are extracted and normalized, wherein the storage media come from different devices and/or use different file system types.

Further, in the intelligent forensics method for malicious code fragments, when the evidence source data are parsed, the file system type and the evidence storage container type are identified from the original evidence source, so as to determine the file system type or evidence storage container type of the original evidence source; the start/end position of file data storage in the storage medium and the cluster size of the file data storage are parsed, the start/end position being recorded as the start/end position of the malicious code fragment and the cluster size being taken as the size of the malicious code fragment; starting from the start position of the malicious code fragment, the malicious code data are read in hexadecimal format with the fragment size as the reading unit, and the hexadecimal data of the malicious code fragment are taken as its features.

Further, in the intelligent forensics method for malicious code fragments, the different devices comprise a disk and/or a portable device with a storage function; the different file system types comprise an Android file system and/or a Linux file system and/or a Windows file system.

Further, in the intelligent forensics method for malicious code fragments, when the training set and the test set are constructed, labels are added to the data sample code fragments in batches to distinguish normal code fragment data from malicious code fragment data, and the labeled fragments are shuffled with a pseudorandom method to obtain randomly ordered code fragment data for constructing the training set and the test set.

Further, in the intelligent forensics method for malicious code fragments, a fully-connected neural network model structure in the open-source deep learning framework TensorFlow is used, in which each neuron is connected to every neuron in the adjacent preceding and following layers.

Further, in the intelligent forensics method for malicious code fragments, a back propagation training algorithm is used to train the fully-connected neural network model; a number of training rounds per cycle is set, and each time that number of rounds is reached the model parameters are saved and the current loss value is determined, thereby adjusting the network model parameters.

Further, in the intelligent forensics method for malicious code fragments, random parameter initialization is adopted when the network model parameters are adjusted, so that the parameters follow a normal or uniform distribution and different neurons in a network layer produce different outputs for different inputs; a cross entropy loss function is used to search for the optimal solution of the model, the loss being obtained from the model's predicted values and the actual values, and the model parameters are adjusted by computing the gradient of the loss function.

Further, in the intelligent forensics method for malicious code fragments, a model complexity index is introduced into the loss function, with a weight applied to each weight parameter, so as to suppress noise in the training data; an exponentially decaying learning rate is selected, the learning rate is adjusted dynamically during training and updated at a set interval of rounds, and the new learning rate is obtained from the initial learning rate and the learning-rate decay rate.

Further, the present invention also provides an intelligent malicious code fragment forensics system, comprising: a data preprocessing module, a training test module and a target identification module, wherein,

the data preprocessing module is used for extracting the underlying data features of a storage medium or an evidence container, constructing a code segment training set and a code segment testing set for training and testing, and performing feature preprocessing on a target code segment to be analyzed, wherein the code segment training set and the code segment testing set both comprise normal code segments and malicious code segment data;

the training test module is used for training the set fully-connected neural network model by using the data in the code segment training set so as to adjust the parameters of the network model, wherein the input is a feature vector, and the output is a normal or malicious prediction result; aiming at the code fragment test set, carrying out test output by using the trained fully-connected neural network model so as to judge whether the model input is a malicious code fragment;

and the target identification module is used for performing feature extraction on the target code segment and inputting the target code segment into the fully-connected neural network model generated through training test to obtain an intelligent identification result of the malicious code.

The invention has the beneficial effects that:

By extracting the underlying data features of a storage medium, the invention can identify malicious code fragments in storage media such as computers (including notebook computers) and Android smartphones (and tablets) and in forensic containers such as RAW, E01 and AFF. The extracted code fragment data are converted into feature vectors compatible with existing deep learning models, and a TensorFlow deep learning network structure is then trained and tuned to obtain a deep learning model suited to code fragment processing. The method therefore has good application prospects in digital forensics, for example in the automatic analysis of criminal incident evidence.

Description of the drawings:

FIG. 1 is a schematic flow chart of an intelligent malicious code fragment evidence obtaining method in an embodiment;

FIG. 2 is an illustration of an automatic evidence-taking detection framework for malicious code fragments in an embodiment;

FIG. 3 is a schematic diagram of malicious code fragment feature preprocessing in an embodiment;

FIG. 4 is a schematic diagram of the back propagation algorithm in the embodiment;

FIG. 5 is a schematic diagram of data set processing in an embodiment;

FIG. 6 is a schematic diagram of a TFRecords file in an embodiment;

FIG. 7 is a schematic diagram of a data flow graph automatically generated by TensorBoard in an embodiment;

FIG. 8 is a graph of the loss over the course of training in an embodiment;

FIG. 9 is a graph of the accuracy over the course of training in an embodiment;

FIG. 10 is a schematic diagram of the forensic test results in an embodiment.

The specific implementation mode is as follows:

In order to make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and technical solutions.

Research on automated forensics has already produced initial results. Scholars have explored the necessity and importance of highly automated digital forensics and analyzed its advantages. In addition, to raise the degree of automation of forensic analysis, push-button automatic forensic functions have been added to classic full-featured forensic suites such as EnCase, Forensic Toolkit and Autopsy Forensic Browser, allowing a forensic investigator to perform preliminary, and even some complex, investigation and analysis tasks simply by knowing which button to press. These popular tools try to make investigators' work easier and to advance automated forensic capability. Single-function forensic software such as TraceHunter can also provide automatic forensic functions for correlation, interpretation and Windows registry analysis. Evidence classification in digital forensics is likewise a fast-growing and highly automated direction, and many works support automatic evidence classification for computers and mobile phones, because practitioners recognize the benefits of rapid, automated, on-site intelligence acquisition. According to an analysis of the 2019 Internet crime investigation report published by the U.S. FBI, automated forensics facilitates rapid, automatic analysis of criminal incidents and has become a key technology for reducing the amount of in-depth digital evidence analysis needed in digital investigations.
However, scholars have compared manual investigation with automatic evidence classification, and the results show that in more complicated network-attack investigations, for example when a criminal stores malicious code as fragments on a network hard disk or in a peer-to-peer network storage system, automatic evidence classification misses potential evidence because it lacks overall knowledge of the malicious code. In addition, malware threats grow day by day and have become a difficult problem for digital forensic detection. According to a McAfee Labs report, more than 65 million new malware samples were recorded by the laboratory in the first quarter of 2019. Traditional malware detection relies on extracting signature features from malware samples and storing them in a database. However, extracting features from a sample requires a great deal of manual analysis, and signature-based detection struggles to keep pace with the rapid growth in malware; the fundamental reason is that signature scanning is effective only against known samples and not against new, unknown malware. Another classical approach is detection based on run-time behavior, which involves executing malware samples and observing what they do. Although this approach improves the detection of unknown malware, it is vulnerable to malware that uses virtual machine escape techniques. Moreover, executing suspicious code requires considerable time and computational resources.
Because of the limitations of these two detection techniques, and because criminals increasingly apply anti-forensic measures such as fragmentation and encryption on large-capacity storage media, malicious code has become harder to detect. Researchers have therefore turned to heuristic methods, using machine learning models to learn the characteristics of malware so as to improve detection accuracy and speed. An embodiment of the invention, as shown in FIG. 1, provides an intelligent forensics method for malicious code fragments, comprising the following:

S101, extracting the underlying data features of a storage medium, and constructing a code segment training set and a code segment testing set for training and testing, wherein the code segment training set and the code segment testing set both comprise normal code segments and malicious code segment data;

S102, training the preset fully-connected neural network model by using data in the code segment training set to adjust network model parameters, wherein the input is a feature vector and the output is a normal or malicious prediction result; for the code segment testing set, producing test output with the trained fully-connected neural network model so as to judge whether the model input is a malicious code fragment;

S103, after features are extracted from the target code segment, inputting them into the trained and tested fully-connected neural network model to obtain an intelligent malicious code recognition result.

Deep learning is a branch of machine learning: a family of algorithms that attempt to form high-level abstractions of data using multiple processing layers that contain complex structures or consist of multiple nonlinear transformations. Deep learning has achieved breakthroughs in major fields such as image processing, speech and machine translation, and has produced a large body of research results. However, obtaining a good deep learning model requires studying a deep learning architecture for each specific problem (such as image classification) and tuning it over a long period (that is, optimizing the model parameters through training), which limits the application of deep neural network methods. Kaiser et al. therefore explored a unified deep learning model, in which a single model adaptively solves multiple tasks of different types, from different fields and with different data modalities, such as speech recognition, image classification and machine translation, with performance on each task that is not noticeably degraded or is close to that of existing mainstream methods. That model has so far been applied mainly to image classification, speech recognition, machine translation and similar problems; malicious code fragment recognition has not yet been addressed. TensorFlow is an open-source software library for numerical computation based on data flow graphs. It is a deep learning framework developed by Google and one of the current mainstream deep learning frameworks; it can implement classical models such as convolutional neural networks (CNN), recurrent neural networks (RNN) and deep neural networks (DNN), and is applied in speech recognition, natural language processing, computer vision and other areas.
With the TensorFlow platform, a neural network need not be designed from scratch; the desired network can be generated by calling its interfaces directly. TensorFlow is widely used in industry and academia, and many deep learning papers that provide source code implement their models in it. Therefore, for the problem of identifying malicious code fragments in the complex, heterogeneous, low-level and massive evidence data of digital criminal investigations, studying the automatic forensic detection of malicious code fragments from the underlying characteristics of evidence data storage has become a hot direction in digital forensic detection. Further, in the embodiment of the invention, a fully-connected neural network model structure in the open-source deep learning framework TensorFlow is used, in which each neuron is connected to every neuron in the adjacent preceding and following layers. Referring to the TensorFlow-based framework shown in FIG. 2, the first module is a TensorFlow-based automatic identification framework for malicious code fragments. This module uses a fully-connected network (FCN) as the deep learning network for identifying sensitive disk sectors, that is, each neuron is connected to every neuron of the adjacent layers; the input is a 4096-dimensional feature vector and the output is a prediction of normal or malicious. The second module trains the deep learning model with the malicious code fragment training set, fine-tunes its parameters, and learns the abstract features of malicious code fragments.
The third module uses the trained hierarchical deep learning model to detect and classify the code fragments under examination. The TensorFlow model adopted in this embodiment is a fully-connected neural network with 4096 input nodes and 2 output nodes for the classification output.
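The layer layout described above (4096 inputs, every neuron connected to every neuron of the adjacent layers, 2 outputs) can be illustrated with a minimal forward-pass sketch in plain Python. This is not the TensorFlow model of the embodiment: the single hidden layer of 32 neurons, the ReLU activation, the softmax output and the class order are all assumptions made for illustration, since the text gives only the input and output sizes.

```python
import math
import random

INPUT_DIM = 4096   # one feature per byte of a 4 KB fragment, as in the description
HIDDEN_DIM = 32    # assumption: the text does not give hidden layer sizes
NUM_CLASSES = 2    # assumed order: index 0 = normal, index 1 = malicious


def init_layer(n_in, n_out, rng):
    """Random normal weights and zero biases for one dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, weights, biases):
    """Fully-connected layer: every output neuron sees every input value."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(logits):
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, layers):
    """Forward pass: hidden layers use ReLU (assumed), the last layer uses softmax."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(x, weights, biases))


rng = random.Random(0)
layers = [init_layer(INPUT_DIM, HIDDEN_DIM, rng),
          init_layer(HIDDEN_DIM, NUM_CLASSES, rng)]
sample = [rng.random() for _ in range(INPUT_DIM)]   # stand-in for a normalized fragment
probs = predict(sample, layers)                     # two class probabilities
```

With untrained random weights the two probabilities carry no meaning; the sketch only shows how a 4096-dimensional vector is reduced to a two-class output.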

Evidence sources in digital incident investigations originate from different devices and from storage media with different file system types. The code fragment training and test data sets are essential for building and evaluating the deep learning network model. After media such as a disk are processed by the malicious code feature preprocessing algorithm, a large number of data sample files are obtained; each is a 4 KB binary data file containing code fragments of one class. No code fragment data set based on storage media such as disks currently exists. Further, in the intelligent forensics method for malicious code fragments of the embodiment of the invention, evidence source data from a plurality of storage media are collected and parsed, and malicious code fragment features are extracted and normalized, wherein the storage media come from different devices and/or use different file system types. Further, the different devices comprise magnetic disks and/or portable devices with a storage function; the different file system types comprise an Android file system and/or a Linux file system and/or a Windows file system.

Further, in the intelligent forensics method for malicious code fragments of the embodiment of the invention, when the evidence source data are parsed, the file system type and the evidence storage container type are identified from the original evidence source, so as to determine the file system type or evidence storage container type of the original evidence source; according to the file system characteristics of the storage medium or the principles of the forensic storage container, the start/end position of file data storage in the storage medium and the cluster size of the file data storage are parsed, the start/end position being recorded as the start/end position of the malicious code fragment and the cluster size being taken as the size of the malicious code fragment; starting from the start position of the malicious code fragment, the malicious code data are read in hexadecimal format with the fragment size as the reading unit, and the hexadecimal data of the malicious code fragment are taken as its features.

Referring to FIG. 3, the file system type and the evidence storage container type are identified from the original evidence, and the file system type or evidence storage container type of the original evidence is determined. According to the file system characteristics of the storage medium, or the principles of storage containers such as AFF and E01, the start/end position of file data storage in the storage medium and the cluster size of the file data storage are parsed; a cluster consists of several sectors, each 512 bytes in size. Starting from the start position of the malicious code fragment, the data are read in hexadecimal format with the fragment size as the reading unit; the hexadecimal data of the fragment are called the preprocessed feature of the malicious code and serve as the direct input feature of the deep learning framework. Because the fragment size corresponds to a "file cluster" storage unit of the medium and is an integral multiple of the sector size, a fragment never contains too little or too much data during feature preprocessing, so padding or truncation of the feature data need not be considered. The fragment size does, however, have some influence on the training of the deep learning network model and on actual detection.
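The preprocessing step above can be sketched as follows, assuming the start and end offsets of the data region have already been parsed from the file system or container metadata. The function names, the 8-sectors-per-cluster setting (chosen to match the 4 KB fragments and 4096-dimensional input mentioned elsewhere in the description) and the divide-by-255 normalization are illustrative assumptions, not details given in the text.

```python
SECTOR_SIZE = 512          # bytes per sector, as stated in the description
SECTORS_PER_CLUSTER = 8    # assumption: 8 sectors give the 4 KB fragments used here
CLUSTER_SIZE = SECTOR_SIZE * SECTORS_PER_CLUSTER


def read_fragments(image, start, end, cluster_size=CLUSTER_SIZE):
    """Yield cluster-sized fragments of `image` between byte offsets start and end.

    `end` is treated as exclusive; the region must be a whole number of clusters,
    so no padding or truncation is ever needed.
    """
    if (end - start) % cluster_size != 0:
        raise ValueError("data region must be a whole number of clusters")
    for offset in range(start, end, cluster_size):
        yield image[offset:offset + cluster_size]


def to_hex(fragment):
    """Hexadecimal text form of a fragment (two hex digits per byte)."""
    return fragment.hex()


def to_feature_vector(fragment):
    """Scale each byte to [0, 1]; dividing by 255 is an assumed normalization."""
    return [b / 255.0 for b in fragment]


# Hypothetical in-memory "image": one unused cluster, then two data clusters.
image = (bytes(CLUSTER_SIZE)
         + bytes([0x4D, 0x5A] * (CLUSTER_SIZE // 2))
         + bytes([0xFF] * CLUSTER_SIZE))
fragments = list(read_fragments(image, CLUSTER_SIZE, 3 * CLUSTER_SIZE))
features = [to_feature_vector(f) for f in fragments]
```

For a real RAW image the same functions could be applied to bytes read from the image file; E01 and AFF containers would first need to be decoded by a suitable forensic library, which is outside this sketch.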
Further, in the intelligent forensics method for malicious code fragments of the embodiment of the invention, when the training set and the test set are constructed, labels are added to the data sample code fragments in batches to distinguish normal code fragment data from malicious code fragment data, and the labeled fragments are shuffled with a pseudorandom method to obtain randomly ordered code fragment data for constructing the training set and the test set.
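Batch labeling followed by a pseudorandom shuffle might look like the sketch below. The label encoding (0 for normal, 1 for malicious), the helper names, the sample file names and the simple fraction-based split are assumptions for illustration only.

```python
import random

NORMAL, MALICIOUS = 0, 1   # assumed label encoding


def label_samples(normal_items, malicious_items):
    """Attach a class label to every sample of both classes in one batch."""
    labeled = [(item, NORMAL) for item in normal_items]
    labeled += [(item, MALICIOUS) for item in malicious_items]
    return labeled


def shuffle_and_split(labeled, test_fraction, seed):
    """Shuffle with a seeded pseudorandom generator, then split off a test set."""
    shuffled = list(labeled)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names standing in for 4 KB code fragment files.
normal_files = ["normal_%03d.bin" % i for i in range(6)]
malware_files = ["malware_%03d.bin" % i for i in range(6)]
labeled = label_samples(normal_files, malware_files)
train_set, test_set = shuffle_and_split(labeled, test_fraction=0.25, seed=42)
```

Seeding the generator keeps the shuffle reproducible, which is useful when a training run has to be repeated on the same ordering.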

Further, in the intelligent forensics method for malicious code fragments of the embodiment of the invention, a back propagation training algorithm is used to train the fully-connected neural network model; a number of training rounds per cycle is set, and each time that number of rounds is reached the model parameters are saved and the current loss value is determined, thereby adjusting the network model parameters. Further, random parameter initialization is adopted when the network model parameters are adjusted, so that the parameters follow a normal or uniform distribution and different neurons in a network layer produce different outputs for different inputs; a cross entropy loss function is used to search for the optimal solution of the model, the loss being obtained from the model's predicted values and the actual values, and the model parameters are adjusted by computing the gradient of the loss function. Further, a model complexity index is introduced into the loss function, with a weight applied to each weight parameter, so as to suppress noise in the training data; an exponentially decaying learning rate is selected, the learning rate is adjusted dynamically during training and updated at a set interval of rounds, and the new learning rate is obtained from the initial learning rate and the learning-rate decay rate.

The back propagation training algorithm is the main module of the malicious code fragment recognition framework. As shown in FIG. 4, the algorithm flow may specifically be as follows: after the training process starts, the model is restored if a saved model exists; otherwise the training loop is entered directly. The model parameters are saved once every 1000 training steps, and the current loss value is computed and printed. The model-saving function is added so that training can resume from a checkpoint and so that the number of training rounds can be fixed once the loss value has stabilized. Two points need to be considered in implementing the back propagation algorithm. First, random parameter initialization is adopted so that the parameters follow a normal or uniform distribution, which ensures that different neurons in a network layer produce different outputs for different inputs and that the training process converges well. Second, for training optimization, the deep learning model uses a cross entropy loss function to find the optimal solution of the model: TensorFlow computes the loss from the model's predicted values and the actual values, computes the gradient of the loss, and adjusts the model parameters according to the gradient. In addition, a regularization mechanism is introduced to improve the generalization ability of the automatic malicious code fragment identification framework: a model complexity index is added to the loss function, with a weight applied to each weight parameter, to suppress noise in the training data; the bias parameters of the model are generally not included in this term.
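The loss described above, cross entropy plus a complexity penalty on the weights but not the biases, can be sketched for a single softmax layer in plain Python. The L2 form of the penalty, the hand-derived gradient, the toy three-feature sample and all hyperparameter values are assumptions; the embodiment itself relies on TensorFlow to compute gradients through every layer, and the loop below records the loss every 50 steps where the text saves the model every 1000.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


def l2_penalty(weights, lam):
    """Model-complexity term: lam * sum of squared weights (biases excluded)."""
    return lam * sum(w * w for row in weights for w in row)


def train_step(weights, biases, x, label, lr, lam):
    """One gradient-descent update of a single softmax layer; returns the loss
    measured before the update."""
    logits = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    loss = cross_entropy(probs, label) + l2_penalty(weights, lam)
    for k in range(len(weights)):
        # d(cross entropy)/d(logit_k) = p_k - 1[k == label]
        delta = probs[k] - (1.0 if k == label else 0.0)
        for j in range(len(x)):
            weights[k][j] -= lr * (delta * x[j] + 2.0 * lam * weights[k][j])
        biases[k] -= lr * delta          # biases receive no regularization term
    return loss


# Toy data: one 3-feature sample of class 1 (a real fragment would have 4096 features).
weights = [[0.1, -0.2, 0.05], [-0.1, 0.2, 0.0]]
biases = [0.0, 0.0]
losses = []
for step in range(1, 201):
    loss = train_step(weights, biases, [0.2, 0.9, 0.4], 1, lr=0.5, lam=0.001)
    if step % 50 == 0:                   # stand-in for the periodic save-and-report
        losses.append(loss)
```

On this toy sample the recorded loss falls as training proceeds, which is the behavior the periodic loss report is meant to make visible.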
In addition, the setting of the learning rate has a great influence on training. An exponentially decaying learning rate can be selected so that the learning rate is adjusted dynamically during training: at a set interval of rounds the decay is applied and the learning rate is updated, the new learning rate being obtained from the initial learning rate and the learning-rate decay rate. A moving average records the average value of each parameter over a period of time; this average follows the parameter slowly, like a shadow, which can improve the generalization of the model. The moving average is applied to all parameters.
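Both mechanisms can be sketched in a few lines. The decay formula below, initial rate times decay rate raised to the power (step / decay interval), is the conventional exponential-decay schedule and is assumed here, because the text names only the initial value and the decay rate; the shadow-value update is likewise the usual exponential moving average, and the class and parameter names are hypothetical.

```python
def decayed_learning_rate(initial_rate, decay_rate, step, decay_steps, staircase=True):
    """Assumed schedule: initial_rate * decay_rate ** (step / decay_steps).

    With staircase=True the exponent is an integer, so the rate changes only
    once every `decay_steps` steps.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_rate * decay_rate ** exponent


class MovingAverage:
    """Keeps a slowly changing "shadow" copy of each named parameter."""

    def __init__(self, decay):
        self.decay = decay
        self.shadow = {}

    def update(self, name, value):
        if name not in self.shadow:
            self.shadow[name] = value        # first observation initializes the shadow
        else:
            self.shadow[name] = (self.decay * self.shadow[name]
                                 + (1.0 - self.decay) * value)
        return self.shadow[name]


# Learning rate at a few steps for hypothetical settings.
rates = [decayed_learning_rate(0.1, 0.5, step, decay_steps=100) for step in (0, 99, 100, 250)]

# The shadow value lags behind an abrupt change in the parameter.
ema = MovingAverage(decay=0.9)
first = ema.update("w", 1.0)
second = ema.update("w", 0.0)
```

A decay close to 1 makes the shadow value change more slowly, which is why the averaged parameters are less sensitive to noise in individual updates.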

Further, an embodiment of the present invention further provides an intelligent malicious code fragment forensics system, including: a data preprocessing module, a training test module and a target identification module, wherein,

and the target identification module is used for extracting features from the target code fragment, inputting the features into the trained and tested fully-connected neural network model, and obtaining the intelligent malicious code identification result.

In order to verify the effectiveness of the technical scheme in the embodiment of the invention, the following further explanation is made through specific experimental data:

At present, no code fragment data set based on storage media such as disks exists. In the preprocessing stage, therefore, about 150 MB each of normal code and malicious code is selected for training, the normal code and the malicious code each come from the Android, Linux and Windows platforms, and the quantities are kept evenly balanced. The code fragment files are stored in two folders (normal and malware). After the data set is made, it is adjusted so that it finally contains 39944 normal code fragment files and 40056 malicious code fragment files, 80000 sample files in total. The test set is used to evaluate the effect of the trained model; in this experiment the test set is used to test the algorithm, and no separate validation set is designed. To make the test set, 2500 samples of each of the two types are randomly selected from the training data, 5000 in total, and 5000 new code fragment files are made again by the same method, giving 10000 code fragment files in total.

On the basis of the above processing, the data sample code fragment files are labeled. After the classified data sample code fragment files are obtained, labels are added to them in a batch processing mode. The specific algorithm process comprises the following steps:

(1) Traversing all file names in the normal directory, one file name per line, adding the label '0' after each file name, and saving the result to label0.txt; this file represents the normal code fragment labels.

(2) Traversing all file names in the malware directory, one file name per line, adding the label '1' after each file name, and saving the result to label1.txt; this file represents the malicious code fragment labels.

(3) Merging the contents of the normal code fragment label file and the malicious code fragment label file, and shuffling the merged label contents with a pseudorandom method so that the entries labeled '0' and '1' are randomly distributed, giving 80000 lines of randomly ordered text in total, which are stored as train_tags. The test set label file test_labels is made in the same way.
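The three labeling steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the space separator between file name and label, the fixed random seed and all function names are hypothetical, and the intermediate label0.txt and label1.txt files are kept in memory rather than written to disk.

```python
import os
import random


def label_lines(file_names, label):
    """Steps (1)/(2): one line per file name, with the class label appended."""
    return [f"{name} {label}" for name in file_names]


def merge_and_shuffle(normal_lines, malware_lines, seed=None):
    """Step (3): merge both label lists and shuffle them pseudorandomly."""
    merged = list(normal_lines) + list(malware_lines)
    random.Random(seed).shuffle(merged)
    return merged


def build_label_file(normal_dir, malware_dir, out_path, seed=None):
    """Label every file in the two directories and write the shuffled list."""
    normal = label_lines(sorted(os.listdir(normal_dir)), 0)
    malware = label_lines(sorted(os.listdir(malware_dir)), 1)
    merged = merge_and_shuffle(normal, malware, seed)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(merged) + "\n")
    return merged
```

Passing a seed makes the shuffle reproducible, which helps when the same label order has to be regenerated for a later training run.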

In addition, to improve the operating efficiency of automatic malicious code fragment forensics and to reduce the time spent reading files during training, TFRecord files are used to store the training data set and its labels (and likewise the test set and its labels); the specific algorithm process is shown in fig. 5. Applying the data set processing algorithm of fig. 5 yields the two TFRecord files shown in fig. 6.

According to the FAT32 file system format in the Windows system, the malicious code fragment size can be selected as 4 KB, which is also the default basic unit of file storage on current large-capacity disks (4 KB corresponds to 8 sectors). Combining the malicious code fragment preprocessing algorithm and the data set making algorithm, the process specifically comprises the following steps:

(1) Preparing normal code and malicious code in equivalent quantities (total size).

(2) Writing the two types of code separately onto a clean disk, the disk having first been filled entirely with zeros using the WinHex tool.

(3) Locating each individual program according to the directory table of the disk file system.

(4) Reading the program data (normal program or malicious program) in units of 4 KB and storing each unit as a 4 KB file, the file name being the unique ID of the program plus a serial number.

(5) Reading the data of each program one by one and storing the normal code and the malicious code separately, which facilitates making the data set.

(6) Generating files in 4 directories: normal code fragment training (train_positive), malicious code fragment training (train_virus), normal code fragment testing (test_positive) and malicious code fragment testing (test_virus).
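The 4 KB reading step can be sketched as below. Several details are assumptions made for illustration only: the final partial fragment is padded with zero bytes (consistent with a disk that was zero-filled beforehand), the file name joins the program ID and a zero-padded serial number with an underscore, and each byte is scaled to the range 0 to 1 to form a 4096-element feature vector.

```python
FRAGMENT_SIZE = 4096  # 4 KB, i.e. 8 sectors of 512 bytes


def split_into_fragments(data, size=FRAGMENT_SIZE):
    """Cut program data into fixed-size fragments, zero-padding the last one."""
    fragments = []
    for offset in range(0, len(data), size):
        chunk = data[offset:offset + size]
        fragments.append(chunk.ljust(size, b"\x00"))
    return fragments


def fragment_name(program_id, index):
    """Hypothetical naming scheme: unique program ID plus a serial number."""
    return f"{program_id}_{index:06d}"


def to_feature_vector(fragment):
    """Scale each byte to [0, 1]; one input value per byte of the fragment."""
    return [byte / 255.0 for byte in fragment]
```

A 5000-byte program, for example, yields two fragments: the first holds bytes 0 to 4095 and the second holds the remaining 904 bytes followed by 3192 zero bytes.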

The quality of the data set directly affects how well the fully-connected deep learning network performs, because the network is trained entirely on the training data. If, after long training, the loss falls slowly or does not converge, or the accuracy stays low, and the network implementation itself has no errors, the cause may lie in the data set. Possible problems include but are not limited to: an unbalanced ratio of malicious code fragments to normal code fragments, with one class making up too large a share; data set labels that have not been shuffled, for example the first half all '0' (normal code fragments) and the second half all '1' (malicious code fragments); and invalid collected data or errors in label making. To improve the training effect of the fully-connected deep learning network, problems such as the imbalance between the numbers of malicious and normal code fragments in the data set and unshuffled code fragment labels need to be resolved. In the embodiment of the invention, when the fully-connected network was first implemented, the collected malicious programs were limited and no attention was paid to balance, so the ratio of normal program cluster files to malicious program cluster files was about 8:1. The direct consequence was that the training data randomly selected during training contained little malicious sensitive information, the model constructed by the neural network could not adequately represent that information, and the expected function could not be realized.
Similarly, for a binary classification problem, if the labels are not shuffled, training may become stuck on a single class. In fact, machine learning usually assumes that each class has the same number of training samples, that is, that the classes are balanced. With 'unbalanced' training samples, the trained model may 'focus' on the class with more samples and 'overlook' the class with fewer samples, which harms the generalization capability of the model. As an extreme example, if the ratio of normal samples to malicious samples in the training set is 99:1, then simply classifying every sample as 'normal' yields 99% accuracy during learning, so the classification algorithm may give up predicting malicious samples. It follows that if such a model is then used to classify data in which the ratio of normal to malicious program samples is 1:99, the network achieves only 1% accuracy and in effect has no predictive ability at all.
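The arithmetic behind the 99:1 example can be checked with a short sketch. The function name and its parameters are illustrative only; it models a degenerate classifier that always outputs the same class.

```python
def constant_classifier_accuracy(normal_count, malicious_count, always_predict="normal"):
    """Accuracy of a classifier that outputs the same class for every sample."""
    total = normal_count + malicious_count
    correct = normal_count if always_predict == "normal" else malicious_count
    return correct / total


# Training set at 99:1, everything classified as "normal".
train_accuracy = constant_classifier_accuracy(99, 1)
# The same degenerate model applied to data at 1:99.
deployed_accuracy = constant_classifier_accuracy(1, 99)
```

The same model scores 0.99 on the first set and 0.01 on the second, so a high accuracy on an unbalanced set says little about whether malicious fragments are actually recognized.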

TABLE 1 Effect of unbalanced sample proportions on training

Number of positive and negative samples	Loss trend	Accuracy after 50000 rounds	Recognition result
80000:10000	Decreases only slowly	About 50%	Random
40000:40000	Decreases and gradually converges	>95%	Basically correct

As in the case of about 50% accuracy in table 1, an accuracy of 50% is meaningless for a binary classification problem, since a uniformly random guess theoretically also yields 50%. This situation arose unexpectedly during engineering implementation, but the analysis above revealed the effect of the unbalanced samples, and a relatively balanced data set was created anew: the numbers of normal and malicious program samples are equivalent, about 40000 each, and the label file is shuffled into random order. The problem did not recur with the new data sample set.

Through testing and adjustment, it was found that a three-layer fully-connected neural network (FCN) achieves the best forensic detection result. The back propagation hyper-parameters of the fully-connected neural network, set according to experience, are as follows: the initial learning rate is 0.1, the learning rate decay rate is 0.99, the regularization coefficient is 0.0001, and the moving average decay of the parameters is 0.99. Training is carried out on the training data set with 200 groups of data fed in per round, the training result is saved every 1000 rounds, and 80000 rounds of training are performed. The resulting data flow diagram of the back propagation training algorithm is shown in fig. 7. The loss and accuracy during training (recorded once every 1000 rounds) are shown in fig. 8 and fig. 9. The figures show that, as training continues, the loss and accuracy gradually stabilize, and the accuracy exceeds 99% once stable.
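To show how the regularization coefficient of 0.0001 enters the training objective, the sketch below combines a softmax cross entropy with a penalty on the weights only. It is a plain-Python illustration under assumptions, not the TensorFlow code of the embodiment: the penalty is taken as the coefficient times the sum of squared weights, and all function names are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross entropy for one sample, computed via log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


def l2_penalty(weight_matrices, coefficient=0.0001):
    """Coefficient times the sum of squared weights; biases are not passed in."""
    return coefficient * sum(
        w * w for matrix in weight_matrices for row in matrix for w in row
    )


def total_loss(batch_logits, batch_labels, weight_matrices, coefficient=0.0001):
    """Mean cross entropy over the batch plus the weight penalty."""
    data_loss = sum(
        cross_entropy(z, y) for z, y in zip(batch_logits, batch_labels)
    ) / len(batch_labels)
    return data_loss + l2_penalty(weight_matrices, coefficient)
```

Subtracting the largest logit before exponentiating keeps the loss finite even for very large logits, which is one common way to avoid invalid loss values during training.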

In addition, an experimental five-layer fully-connected neural network (with three hidden layers) was constructed and trained on the same data set. A small loss value was calculated only the first time after training started; every loss value obtained afterwards was 'nan', that is, the loss value was too small to be calculated. The test results bear out the theory: when a fully-connected neural network reaches five layers, the vanishing gradient problem appears during training. Although training proceeds normally after the activation function is changed to relu(), the training time increases greatly because the larger network structure has far more parameters, and the detection result of the resulting network model is not as good as that of the three-layer fully-connected neural network. A possible reason is that the number of network inputs is too large: the 4096 inputs in a sense inherently limit the size of the network.

The trained automatic malicious code fragment forensic detection algorithm can identify not only code fragments but also, directly, target files and underlying storage fragment data in forensic containers such as RAW and AFF. The result of one forensic detection test run is shown in fig. 10, where 'full' denotes a fragment filled entirely with the same content, 'malware' denotes a malicious program fragment, and 'normal' denotes a normal code fragment. It can be seen that the algorithm identifies the data fairly accurately.

In addition, VirusTotal is used to scan the malicious code fragments in the test set. VirusTotal is a service developed by the independent IT security laboratory Hispasec Sistemas that uses a variety of antivirus engines. The code fragment sample files of the test set were analyzed with VirusTotal, with the results shown in table 2:

TABLE 2 data forensics test results based on VirusTotal website test set

As the data in table 2 show, the combined detection results of the different antivirus engines on the malicious code fragments in the test set are poor. This indicates that most existing antivirus engines, although able to detect complete malicious code, are not suited to analyzing malicious code fragments. It also shows that new tools need to be developed for malicious-code analysis of underlying data in storage media, including forensic containers such as RAW and AFF, during digital forensic investigation; the deep-learning-based automatic forensic detection scheme for malicious code fragments in the embodiment of the present invention can better solve this problem.

Unless specifically stated otherwise, the relative steps, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present invention.

Based on the foregoing system, an embodiment of the present invention further provides a server, including: one or more processors; a storage device to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the system as described above.

Based on the above system, the embodiment of the present invention further provides a computer readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the above system.

The device provided by the embodiment of the present invention has the same implementation principle and technical effect as the system embodiment, and for the sake of brief description, reference may be made to the corresponding content in the system embodiment for the part where the device embodiment is not mentioned.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing system embodiments, and are not described herein again.

In all examples shown and described herein, any particular value should be construed as merely exemplary, and not as a limitation, and thus other examples of example embodiments may have different values.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a logical division, and other divisions are possible in actual implementation; as further examples, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be electrical, mechanical or of another form.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the system according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and they shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. An intelligent malicious code fragment evidence obtaining method is characterized by comprising the following contents:

2. The intelligent malicious code fragment evidence obtaining method according to claim 1, wherein evidence source data from a plurality of storage media are collected and analyzed, malicious code fragment features are extracted, and the malicious code fragment features are normalized, the plurality of storage media coming from different devices and/or adopting different file system types.

3. The intelligent malicious code segment evidence obtaining method according to claim 2, wherein in the analysis of the evidence source data, the identification of the file type and the evidence storage container type is performed according to an original evidence source, and the original evidence source file system type or the evidence storage container type is determined; analyzing the initial/end position of file data storage in the storage medium and the cluster size of the file data storage, and recording the initial/end position of the file data storage as the initial/end position of the malicious code segment, wherein the cluster size of the file data storage is the size of the malicious code segment; starting from the initial position of the malicious code segment, reading the malicious code data in a hexadecimal format by taking the size of the malicious code segment as a reading unit, and taking the hexadecimal data of the malicious code segment as the characteristics of the malicious code segment.

4. The intelligent malicious code fragment evidence obtaining method according to claim 2, wherein the different devices comprise a disk and/or a portable device with a storage function; the different file system types comprise an Android file system and/or a Linux file system and/or a Windows file system.

5. The intelligent malicious code fragment evidence obtaining method according to claim 1, wherein a training set and a test set are constructed, tags are added to data sample code fragments in a batch processing mode to distinguish normal code fragment data from malicious code fragment data, and the tagged data sample code fragments are scrambled by a pseudorandom method to obtain code fragment data which are randomly sequenced and used for constructing the training set and the test set.

6. The intelligent malicious code fragment evidence obtaining method according to claim 1, wherein a fully-connected neural network model structure in a deep learning open source framework TensorFlow is utilized, and each neuron in the fully-connected neural network model structure has a connection relation with each neuron in a front and back adjacent connection layer.

7. The intelligent malicious code fragment evidence obtaining method according to claim 1 or 6, wherein a back propagation training algorithm is used for training the fully-connected neural network model, a number of training rounds per period is set, and each time that number of rounds is completed the model parameters are saved and the current loss value is determined, so as to adjust the network model parameters.

8. The intelligent malicious code fragment evidence obtaining method according to claim 7, wherein random parameter initialization is adopted in adjusting network model parameters, so that the parameters obey normal distribution or uniform distribution, and different neurons in a network layer in the model are ensured to have different outputs for different inputs; in the adjustment of the network model parameters, a cross entropy loss function is used for searching for the optimal solution of the model, a loss function is obtained according to the predicted value and the actual value of the input model, and the model parameters are adjusted by calculating the gradient of the loss function.

9. The intelligent malicious code fragment evidence obtaining method according to claim 7, wherein a model complexity index is introduced into the loss function, and a regularization weight is set for each weight parameter to suppress noise in the training data; and an exponentially decaying learning rate is selected, the learning rate is dynamically adjusted in the training process and updated at set intervals of training rounds, the updating formula being: new learning rate = initial learning rate × (learning rate decay rate)^(number of decay intervals completed).

10. An intelligent malicious code fragment forensics system, comprising: a data preprocessing module, a training test module and a target identification module, wherein,