WO2021098620A1 - File fragment classification method and system - Google Patents

File fragment classification method and system

Info

Publication number
WO2021098620A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
data set
neural network
file fragment
convolutional neural
Prior art date
Application number
PCT/CN2020/128860
Other languages
English (en)
French (fr)
Inventor
尹凌
奚桂锴
Original Assignee
中国科学院深圳先进技术研究院
Priority date
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2021098620A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162 Delete operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • The invention relates to a file fragment classification method and system.
  • One existing class of file fragment classification methods uses magic numbers and similar signatures to identify files of different types. Magic numbers generally appear in the file header and the file trailer, and files of different types carry different magic-number values at different positions. Because files on a disk are often stored in fragmented form, and the fragments of one file are not always stored contiguously, it is usually difficult to use header and trailer information to identify the file type of an arbitrary fragment.
  • Another class is content-based file fragment classification, which predicts the file type of a fragment by analyzing the fragment's content directly. This approach does not rely on file signatures or magic numbers.
  • Existing content-based methods mainly take a statistical approach: they extract statistical features from each fragment, such as unigram and bigram frequency distributions and entropy, build a traditional machine learning model such as LDA, SVM, or KNN on those features, and then identify the type of each fragment.
  • Extracting statistical features and then building a traditional machine learning model depends heavily on feature design, which is time-consuming and requires considerable expertise. Moreover, such methods do not currently achieve good classification results.
  • Existing deep-learning-based file fragment classification methods are not yet mature, and their classification performance is lower than that of methods based on traditional machine learning models.
  • Existing deep-learning-based work also needs a different neural network architecture for each fragment size, which limits the applicability of these methods.
  • The present invention provides a method for classifying file fragments.
  • The method includes the following steps: a. use a file data set to construct a file fragment data set comprising a training set and a test set; b. preprocess the constructed file fragment data set; c. construct a deep convolutional neural network model; d. use the preprocessed training set and test set to train and evaluate the constructed model; e. use the deep convolutional neural network model to predict the file type to which a file fragment belongs.
  • Step a specifically includes: decompressing all zip archives in the public file data set govdocs1 and grouping the extracted files by file type; dividing the files selected for each file type under study into two groups, used to generate file fragments for the training set and the test set respectively; and slicing each file at the selected fragment size, deleting the first fragment of each file and any final fragment smaller than the specified fragment size.
  • Step b specifically includes: converting each file fragment in the training set and test set from a one-dimensional byte sequence into a two-dimensional grayscale image by a simple reshape; and normalizing each image by computing the maximum and minimum pixel value at each position over the training set and scaling the corresponding pixels of the training and test images so that their gray values fall between -1 and 1.
  • The deep convolutional neural network model includes L convolutional blocks, a global average pooling layer, and two fully connected layers.
  • Each convolutional block has three parts: a convolutional layer, a residual unit, and a max pooling layer.
  • The number of convolutional blocks L is limited by the size of the converted grayscale image: $L_{\max} = \min(\log_2 \max(w,h) - 1,\ \log_2 \min(w,h))$
  • Here $L_{\max}$ is the maximum number of convolutional blocks that may be stacked in the model, and w and h are the width and height of the converted two-dimensional grayscale image.
  • The convolutional layer uses d 1x1 convolution kernels; given C input feature maps of size IxJ, it increases the number of channels of the input feature maps.
  • The residual unit contains two convolutional layers and uses a skip connection in the manner of residual learning.
  • The max pooling layer spatially down-samples each input feature map to a fraction of its original size; the fraction is given by formula images that are not part of this text.
  • Step d specifically includes: evaluating the deep convolutional neural network on the preprocessed test set.
  • The evaluation metrics are the average classification accuracy over the file fragment classes, the macro-averaged F1 score, and the micro-averaged F1 score.
  • The present invention also provides a file fragment classification system.
  • The system includes a fragment data set construction module, a preprocessing module, a model construction module, a training and evaluation module, and a file type prediction module.
  • The fragment data set construction module uses a file data set to construct a file fragment data set comprising a training set and a test set.
  • The preprocessing module preprocesses the constructed file fragment data set.
  • The model construction module constructs the deep convolutional neural network model.
  • The training and evaluation module uses the preprocessed training set and test set to train and evaluate the constructed model.
  • The file type prediction module uses the deep convolutional neural network model to predict the file type to which a file fragment belongs.
  • With the method and system provided by the present application, an input file fragment only needs to be converted into a two-dimensional grayscale image and fed to the model for prediction.
  • Converting a file fragment into a two-dimensional grayscale image requires no additional computation.
  • The prediction is based entirely on the content of the file fragment and needs no other prior knowledge.
  • The invention learns features automatically from the input file fragments, so features do not have to be extracted by hand before modeling.
  • The deep convolutional neural network uses a residual design, which allows a deeper network to be built, makes it applicable to classifying file fragments of different sizes, and, according to the application, improves the classification accuracy of file fragments.
  • FIG. 1 is a flowchart of the file fragment classification method of the present invention.
  • FIG. 2 is a schematic diagram of the process of converting file fragments into grayscale images according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of the deep convolutional neural network model according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a convolutional block in the deep convolutional neural network model according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a residual unit in the deep convolutional neural network model according to an embodiment of the present invention.
  • FIG. 6 is a hardware architecture diagram of the file fragment classification system of the present invention.
  • FIG. 1 shows the flow of a preferred embodiment of the file fragment classification method of the present invention.
  • Step S1: use the file data set to construct a file fragment data set.
  • The file fragment data set includes a training set and a test set. Specifically:
  • In this embodiment, the public file data set govdocs1 is used to generate the file fragment data set.
  • The file data set contains 1000 zip archives. All of them are decompressed, and the extracted files are grouped by file type.
  • For each file fragment type under study, a certain number of files is selected for the experiment.
  • The files selected for each file type are split 6:4 into two groups, which are used to generate file fragments for the training set and the test set respectively.
  • Each file is sliced at the selected fragment size to generate a large number of file fragments.
  • To keep file signatures in the header from revealing the file type, the first fragment of each file is deleted; the last fragment of each file is also deleted if it is smaller than the specified fragment size.
  • For both the training set and the test set, the number of fragments per file type is capped by random sampling so that the data set is as balanced as possible, yielding a large number of fragments of each file type for training and testing.
  • Step S2: preprocess the constructed file fragment data set, that is, the training set and the test set. Specifically:
  • Each file fragment in the training set and test set is converted from a one-dimensional byte sequence into a two-dimensional grayscale image by a simple reshape.
  • A file fragment is a sequence of bytes, and each byte corresponds to one pixel of the two-dimensional grayscale image.
  • The grayscale image should be as close to square as possible, so that a sufficiently deep model can be built to classify the fragments.
  • Step S3: build a deep convolutional neural network model. Specifically:
  • As shown in FIG. 3, the model includes L convolutional blocks, a global average pooling layer, and two fully connected layers.
  • ReLU in FIG. 3 refers to the rectified linear unit, an activation function.
  • As shown in FIG. 4, each convolutional block has three parts: a convolutional layer, a residual unit, and a max pooling layer.
  • The convolutional layer uses d 1x1 convolution kernels; given C input feature maps of size IxJ, it increases the number of channels of the input feature maps from C to d.
  • The residual unit performs feature learning, and the max pooling layer spatially down-samples each input feature map to a fraction of its original size (given by formula images that are not part of this text), while the number of feature maps stays unchanged.
  • The number L of convolutional blocks is limited by the size of the converted grayscale image: $L_{\max} = \min(\log_2 \max(w,h) - 1,\ \log_2 \min(w,h))$
  • Here $L_{\max}$ is the maximum number of convolutional blocks that may be stacked in the model, and w and h are the width and height of the converted two-dimensional grayscale image.
  • The structure of the residual unit is shown in FIG. 5: it contains two convolutional layers with a residual skip connection.
  • Both convolutional layers use d 3x3 convolution kernels to learn features of the input feature map. The input to each of the two convolutional layers first passes through a ReLU activation.
  • The two fully connected layers of the model each have 2048 neurons.
  • Step S4: use the preprocessed training set and test set to train and evaluate the constructed deep convolutional neural network model.
  • The evaluation metrics are the average classification accuracy over the file fragment classes, the macro-averaged F1 score, and the micro-averaged F1 score. Specifically:
  • In this embodiment, the deep convolutional neural network is trained with Adam-based gradient descent.
  • The initial learning rate is 0.001, the learning rate is reduced to 0.2 times its previous value every 5 epochs, and training runs for at most 40 epochs.
  • Early stopping is also used: when the evaluation metric on the test set has not improved for 5 consecutive epochs, training stops early and the current model parameters are taken as the best parameters of the network.
  • Step S5: use the deep convolutional neural network model to predict the file type to which a file fragment belongs. Specifically:
  • Given a file fragment to be predicted, the fragment is first converted into a two-dimensional grayscale image as in step S2, and the image is then normalized.
  • Using the per-position maximum and minimum pixel values from the training set, the gray value of each pixel is scaled to between -1 and 1, and the normalized two-dimensional grayscale image is fed to the deep convolutional neural network model to predict the file type of the fragment.
  • FIG. 6 is a hardware architecture diagram of the file fragment classification system 10 of the present invention.
  • The system includes a fragment data set construction module 101, a preprocessing module 102, a model construction module 103, a training and evaluation module 104, and a file type prediction module 105.
  • The fragment data set construction module 101 uses a file data set to construct a file fragment data set.
  • The file fragment data set includes a training set and a test set. Specifically:
  • In this embodiment, the fragment data set construction module 101 uses the public file data set govdocs1 to generate the file fragment data set.
  • The file data set contains 1000 zip archives. All of them are decompressed, and the extracted files are grouped by file type.
  • The files selected for each file type under study are split 6:4 into two groups, which are used to generate file fragments for the training set and the test set respectively.
  • The fragment data set construction module 101 slices each file at the selected fragment size to generate a large number of file fragments.
  • To keep file signatures in the header from revealing the file type, the first fragment of each file is deleted; the last fragment of each file is also deleted if it is smaller than the specified fragment size.
  • For both the training set and the test set, the number of fragments per file type is capped by random sampling so that the data set is as balanced as possible, yielding a large number of fragments of each file type for training and testing.
  • The preprocessing module 102 preprocesses the constructed file fragment data set, that is, the training set and the test set. Specifically:
  • The preprocessing module 102 converts each file fragment in the training set and test set from a one-dimensional byte sequence into a two-dimensional grayscale image by a simple reshape.
  • A file fragment is a sequence of bytes, and each byte corresponds to one pixel of the two-dimensional grayscale image.
  • The grayscale image should be as close to square as possible, so that a sufficiently deep model can be built to classify the fragments.
  • Finally, the preprocessing module 102 normalizes each two-dimensional grayscale image: it computes the maximum and minimum pixel value at each position over the training set, and scales the corresponding pixels of the images in the training set and the test set by those values so that the gray value of each pixel falls between -1 and 1.
  • The model construction module 103 builds a deep convolutional neural network model. Specifically:
  • As shown in FIG. 3, the model includes L convolutional blocks, a global average pooling layer, and two fully connected layers.
  • ReLU in FIG. 3 refers to the rectified linear unit, an activation function.
  • As shown in FIG. 4, each convolutional block has three parts: a convolutional layer, a residual unit, and a max pooling layer.
  • The convolutional layer uses d 1x1 convolution kernels; given C input feature maps of size IxJ, it increases the number of channels of the input feature maps from C to d.
  • The residual unit performs feature learning, and the max pooling layer spatially down-samples each input feature map to a fraction of its original size (given by formula images that are not part of this text), while the number of feature maps stays unchanged.
  • The number L of convolutional blocks is limited by the size of the converted grayscale image: $L_{\max} = \min(\log_2 \max(w,h) - 1,\ \log_2 \min(w,h))$
  • Here $L_{\max}$ is the maximum number of convolutional blocks that may be stacked in the model, and w and h are the width and height of the converted two-dimensional grayscale image.
  • The structure of the residual unit is shown in FIG. 5: it contains two convolutional layers with a residual skip connection.
  • Both convolutional layers use d 3x3 convolution kernels to learn features of the input feature map. The input to each of the two convolutional layers first passes through a ReLU activation.
  • The two fully connected layers of the model each have 2048 neurons.
  • The training and evaluation module 104 uses the preprocessed training set and test set to train and evaluate the constructed deep convolutional neural network model.
  • The evaluation metrics are the average classification accuracy over the file fragment classes, the macro-averaged F1 score, and the micro-averaged F1 score. Specifically:
  • The training and evaluation module 104 trains the deep convolutional neural network with Adam-based gradient descent. The initial learning rate is 0.001, the learning rate is reduced to 0.2 times its previous value every 5 epochs, and training runs for at most 40 epochs. Early stopping is also used: when the evaluation metric on the test set has not improved for 5 consecutive epochs, training stops early and the current model parameters are taken as the best parameters of the network.
  • The file type prediction module 105 uses the deep convolutional neural network model to predict the file type to which a file fragment belongs. Specifically:
  • Given a file fragment to be predicted, the file type prediction module 105 first converts the fragment into a two-dimensional grayscale image and then normalizes the image.
  • Using the per-position maximum and minimum pixel values of the grayscale images in the training set, the file type prediction module 105 scales the gray value of each pixel to between -1 and 1.
  • The normalized two-dimensional grayscale image is then fed to the deep convolutional neural network model, which predicts the file type of the fragment.

Abstract

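A file fragment classification method and a file fragment classification system. The method includes: constructing a file fragment data set from a file data set (S1), the file fragment data set comprising a training set and a test set; preprocessing the constructed file fragment data set (S2); constructing a deep convolutional neural network model (S3); training and evaluating the constructed deep convolutional neural network model with the preprocessed training set and test set (S4); and using the deep convolutional neural network model to predict the file type to which a file fragment belongs (S5). The method and system require no manual feature design and no other prior knowledge and learn the features of the input file fragments automatically; the designed deep convolutional neural network is applicable to classifying file fragments of different sizes and achieves a better classification result.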

Description

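File fragment classification method and system
Technical Field
The present invention relates to a file fragment classification method and system.
Background
After a criminal suspect deletes files stored on a disk, remnants of the file content often remain on the disk. If forensic investigators want to find evidence in the file fragments on the disk, they need to reassemble those fragments into files.
Trying to join a large number of file fragments pair by pair requires an enormous amount of computation. If the file type of the file each fragment belongs to (that is, the type of the fragment) is known in advance, the number of combinations that have to be tried can be greatly reduced.
One existing class of file fragment classification methods uses magic numbers and similar signatures to identify files of different types. Magic numbers generally appear in the file header and the file trailer, and files of different types carry different magic-number values at different positions. Because files on a disk are often stored in fragmented form, and the fragments of one file are not always stored contiguously, it is usually difficult to use header and trailer information to identify the file type of an arbitrary fragment.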
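To illustrate why signature matching works for the first fragment of a file but not for an arbitrary fragment, here is a minimal Python sketch. The signature table lists a few widely documented magic numbers; the function name `identify_by_magic` and the sample bytes are illustrative assumptions and do not come from the application.

```python
# A few well-known file signatures (magic numbers) found at the start of a file.
MAGIC_NUMBERS = {
    b"%PDF": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
    b"PK\x03\x04": "zip",
}


def identify_by_magic(data: bytes):
    """Return the file type if `data` starts with a known signature, else None."""
    for signature, file_type in MAGIC_NUMBERS.items():
        if data.startswith(signature):
            return file_type
    return None


# Hypothetical file: a PDF header followed by filler body bytes.
whole_file = b"%PDF-1.4\n" + bytes(range(256)) * 8
first_fragment = whole_file[:512]      # contains the header signature
later_fragment = whole_file[512:1024]  # body bytes only

print(identify_by_magic(first_fragment))
print(identify_by_magic(later_fragment))
```

Only the fragment that happens to contain the header is recognized; every other fragment of the same file carries no signature, which is the limitation described above.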
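Another class is content-based file fragment classification, which predicts the file type of a fragment by analyzing the fragment's content directly. This approach does not rely on file signatures or magic numbers. Existing content-based methods mainly take a statistical approach: they extract statistical features from each fragment, such as unigram and bigram frequency distributions and entropy, build a traditional machine learning model such as LDA, SVM, or KNN on those features, and then identify the type of each fragment. Among content-based methods, this approach depends heavily on feature design, which is time-consuming and requires considerable expertise. Moreover, such methods do not currently achieve good classification results.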
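The hand-crafted statistical features mentioned above can be sketched in a few lines of Python. This is only an illustration of the kind of feature that prior methods feed to LDA, SVM, or KNN models; the function names are assumptions, and fragments are assumed to be non-empty.

```python
import math
from collections import Counter


def unigram_frequencies(fragment: bytes):
    """Relative frequency of each of the 256 possible byte values."""
    counts = Counter(fragment)
    total = len(fragment)
    return [counts.get(value, 0) / total for value in range(256)]


def bigram_counts(fragment: bytes):
    """Counts of adjacent byte pairs, keyed by (first_byte, second_byte)."""
    return Counter(zip(fragment, fragment[1:]))


def shannon_entropy(fragment: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    return -sum(p * math.log2(p) for p in unigram_frequencies(fragment) if p > 0)


uniform = bytes(range(256)) * 2   # every byte value equally likely
constant = b"\x00" * 512          # a single repeated byte value
print(shannon_entropy(uniform), shannon_entropy(constant))
```

Compressed or encrypted content tends toward the high-entropy end and plain text toward the low end, which is why entropy is a common feature; deciding which such features to compute is the manual design effort the application seeks to avoid.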
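Among content-based methods, existing deep-learning-based file fragment classification methods are not yet mature, and their classification performance is lower than that of methods based on traditional machine learning models. Existing deep-learning-based work also needs a different neural network architecture for each fragment size, which limits the applicability of these methods.
Summary of the Invention
In view of this, it is necessary to provide a file fragment classification method and system.
The present invention provides a file fragment classification method comprising the following steps: a. using a file data set to construct a file fragment data set, the file fragment data set comprising a training set and a test set; b. preprocessing the constructed file fragment data set; c. constructing a deep convolutional neural network model; d. using the preprocessed training set and test set to train and evaluate the constructed deep convolutional neural network model; e. using the deep convolutional neural network model to predict the file type to which a file fragment belongs.
Step a specifically includes:
decompressing all zip archives contained in the public file data set govdocs1 and grouping the files in the extracted folders by file type;
dividing the files selected for each file type under study into two groups, used to generate file fragments for the training set and the test set respectively;
slicing each file at the selected fragment size to generate a large number of file fragments, and deleting the first fragment of each file and the last fragment of each file if it is smaller than the specified fragment size.
Step b specifically includes:
converting each file fragment in the generated training set and test set, turning the one-dimensional file fragment into a two-dimensional grayscale image by a simple reshape;
normalizing each two-dimensional grayscale image: computing the maximum and minimum pixel value at each position over the training set, and scaling the corresponding pixels of the two-dimensional grayscale images in the training set and the test set by those values so that the gray value of each pixel falls between -1 and 1.
The deep convolutional neural network model includes L convolutional blocks, a global average pooling layer, and two fully connected layers.
Each convolutional block has three parts: a convolutional layer, a residual unit, and a max pooling layer;
the number of convolutional blocks L is limited by the size of the converted grayscale image:
$L_{\max} = \min(\log_2 \max(w,h) - 1,\ \log_2 \min(w,h))$
In this formula, $L_{\max}$ is the maximum number of convolutional blocks that may be stacked in the model, and w and h are the width and height of the converted two-dimensional grayscale image.
The convolutional layer uses d 1x1 convolution kernels; given C input feature maps of size IxJ, the convolutional layer increases the number of channels of the input feature maps.
The residual unit contains two convolutional layers and uses a skip connection in the manner of residual learning.
The max pooling layer spatially down-samples each input feature map, reducing it to the following fraction of its original size:
[Formula images PCTCN2020128860-appb-000001 and PCTCN2020128860-appb-000002]
Step d specifically includes:
evaluating the deep convolutional neural network on the preprocessed test set; the evaluation metrics are the average classification accuracy over the file fragment classes, the macro-averaged F1 score, and the micro-averaged F1 score.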
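The macro- and micro-averaged F1 scores named above can be computed as in the following sketch. It assumes single-label multi-class predictions; the function name `f1_scores` and the sample labels are illustrative and not taken from the application.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


y_true = ["pdf", "pdf", "jpg", "jpg", "txt", "txt"]
y_pred = ["pdf", "jpg", "jpg", "jpg", "txt", "pdf"]
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
print(macro_f1, micro_f1)
```

In the single-label setting the micro-averaged F1 equals the overall accuracy, while the macro average is more sensitive to classes that are classified poorly.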
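The present invention also provides a file fragment classification system comprising a fragment data set construction module, a preprocessing module, a model construction module, a training and evaluation module, and a file type prediction module, wherein: the fragment data set construction module uses a file data set to construct a file fragment data set comprising a training set and a test set; the preprocessing module preprocesses the constructed file fragment data set; the model construction module constructs the deep convolutional neural network model; the training and evaluation module uses the preprocessed training set and test set to train and evaluate the constructed deep convolutional neural network model; and the file type prediction module uses the deep convolutional neural network model to predict the file type to which a file fragment belongs.
The present application provides a file fragment classification method and system in which an input file fragment only needs to be converted into a two-dimensional grayscale image and then fed to the model for prediction. Converting a file fragment into a two-dimensional grayscale image requires no additional computation. When predicting the type of a file fragment, the invention decides entirely on the basis of the fragment's content and needs no other prior knowledge. The invention learns features automatically and directly from the input file fragments, so features do not have to be extracted by hand before modeling. In addition, the deep convolutional neural network designed in the invention is applicable to classifying file fragments of different sizes. The network uses a residual design, which allows a deeper model to be built, suits fragment classification tasks of different sizes, effectively improves the classification accuracy of file fragments, and gives a better classification result.
Brief Description of the Drawings
FIG. 1 is a flowchart of the file fragment classification method of the present invention;
FIG. 2 is a schematic diagram of the process of converting file fragments into grayscale images according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the deep convolutional neural network model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a convolutional block in the deep convolutional neural network model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a residual unit in the deep convolutional neural network model according to an embodiment of the present invention;
FIG. 6 is a hardware architecture diagram of the file fragment classification system of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the drawings and specific embodiments.
FIG. 1 shows the flow of a preferred embodiment of the file fragment classification method of the present invention.
Step S1: use the file data set to construct a file fragment data set. The file fragment data set includes a training set and a test set. Specifically:
In this embodiment, the public file data set govdocs1 is used to generate the file fragment data set. The file data set contains 1000 zip archives. All zip archives in the data set are decompressed, and the files in the extracted folders are grouped by file type.
For each file fragment type under study, a certain number of files is selected for the experiment. The files selected for each file type are split 6:4 into two groups, which are used to generate file fragments for the training set and the test set respectively.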
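A minimal sketch of the 6:4 split at the file level follows. Shuffling before the split, the fixed seed, and the name `split_files` are assumptions; the text above only states the ratio.

```python
import random


def split_files(paths, train_ratio=0.6, seed=0):
    """Shuffle the files of one file type and split them into train/test groups."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)   # assumption: random assignment
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names for one file type.
pdf_files = [f"doc_{i}.pdf" for i in range(10)]
train_files, test_files = split_files(pdf_files)
print(len(train_files), len(test_files))
```

Because the split is made on whole files before any slicing, fragments of one file never end up in both the training set and the test set.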
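Each file is sliced at the selected fragment size to generate a large number of file fragments. To keep file signatures in the header from revealing the file type, the first fragment of each file is deleted; the last fragment of each file is also deleted if it is smaller than the specified fragment size. For both the training set and the test set, the number of fragments per file type is capped by random sampling so that the data set is as balanced as possible, yielding a large number of fragments of each file type for training and testing.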
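The slicing and balancing steps can be sketched as follows. The function names, the `per_type` cap, and the seed are illustrative assumptions; the text does not say how the cap is chosen.

```python
import random


def slice_file(data: bytes, fragment_size: int):
    """Cut a file into fixed-size fragments, dropping the first fragment
    (it may contain the file signature) and a short final fragment."""
    fragments = [data[i:i + fragment_size] for i in range(0, len(data), fragment_size)]
    fragments = fragments[1:]                       # drop the header fragment
    if fragments and len(fragments[-1]) < fragment_size:
        fragments.pop()                             # drop the incomplete tail
    return fragments


def balance(fragments_by_type, per_type, seed=0):
    """Randomly sample at most `per_type` fragments for each file type."""
    rng = random.Random(seed)
    return {
        file_type: rng.sample(frags, min(per_type, len(frags)))
        for file_type, frags in fragments_by_type.items()
    }


fragments = slice_file(bytes(2000), 512)   # 2000 bytes -> 3 full fragments + 464-byte tail
balanced = balance({"pdf": [b"a"] * 10, "jpg": [b"b"] * 3}, per_type=5)
print(len(fragments), {k: len(v) for k, v in balanced.items()})
```

A file shorter than two fragments contributes nothing, since its only full fragment is the header fragment that gets removed.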
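Step S2: preprocess the constructed file fragment data set, that is, the training set and the test set. Specifically:
Each file fragment in the generated training set and test set is converted: a simple reshape turns the one-dimensional file fragment into a two-dimensional grayscale image, as shown in FIG. 2. A file fragment is a sequence of bytes, and each byte corresponds to one pixel of the two-dimensional grayscale image. When converting a file fragment (a one-dimensional byte sequence) into a two-dimensional grayscale image, the image should be as close to square as possible, so that a sufficiently deep model can be built to classify the fragments.
In this embodiment, a 512-byte file fragment is converted into a 16x32 (16x32=512) two-dimensional grayscale image, and a 4096-byte file fragment is converted into a 64x64 (64x64=4096) two-dimensional grayscale image.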
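The reshape can be sketched in plain Python as below. Row-by-row (row-major) filling, reading 16x32 as 16 rows by 32 columns, and the helper `near_square_shape` are assumptions; the text only states that the image should be as close to square as possible.

```python
import math


def near_square_shape(n_bytes: int):
    """Pick (height, width) with height * width == n_bytes and height as close
    to the square root as possible (hypothetical helper)."""
    height = math.isqrt(n_bytes)
    while n_bytes % height:
        height -= 1
    return height, n_bytes // height


def fragment_to_image(fragment: bytes, height: int, width: int):
    """Reshape a byte sequence row by row into a height x width grid of pixels (0-255)."""
    if len(fragment) != height * width:
        raise ValueError("fragment length must equal height * width")
    return [list(fragment[r * width:(r + 1) * width]) for r in range(height)]


print(near_square_shape(512), near_square_shape(4096))
image = fragment_to_image(bytes(range(256)) * 2, 16, 32)
```

Each byte value becomes one gray level directly, so the conversion is only a change of indexing and involves no arithmetic on the data.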
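Finally, each two-dimensional grayscale image is normalized: the maximum and minimum pixel value at each position are computed over the training set, and the corresponding pixels of the two-dimensional grayscale images in the training set and the test set are scaled by those values so that the gray value of each pixel falls between -1 and 1.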
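A minimal sketch of this per-position min-max scaling follows. The linear mapping 2(x - min)/(max - min) - 1 is the usual way to reach the range [-1, 1] and is assumed here; mapping a position that is constant across the training set to 0 is likewise an assumption, since the text does not cover that case.

```python
def fit_min_max(train_images):
    """Per-position minimum and maximum over all training images."""
    h, w = len(train_images[0]), len(train_images[0][0])
    lo = [[min(img[r][c] for img in train_images) for c in range(w)] for r in range(h)]
    hi = [[max(img[r][c] for img in train_images) for c in range(w)] for r in range(h)]
    return lo, hi


def normalize(image, lo, hi):
    """Scale each pixel to [-1, 1] using the training-set min and max at its position."""
    out = []
    for r, row in enumerate(image):
        out_row = []
        for c, value in enumerate(row):
            span = hi[r][c] - lo[r][c]
            # Assumption: a position that never varies in training maps to 0.
            out_row.append(2 * (value - lo[r][c]) / span - 1 if span else 0.0)
        out.append(out_row)
    return out


train = [[[0, 10]], [[255, 20]]]          # two tiny 1x2 "images"
lo, hi = fit_min_max(train)
print(normalize([[0, 15]], lo, hi))
```

The statistics come from the training set only and are reused unchanged for test images and for fragments at prediction time, so a value outside the training range at some position would fall slightly outside [-1, 1].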
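Step S3: build a deep convolutional neural network model. Specifically:
As shown in FIG. 3, the deep convolutional neural network model includes L convolutional blocks, a global average pooling layer, and two fully connected layers. ReLU (Rectified Linear Unit) in FIG. 3 refers to the rectified linear unit, an activation function.
The structure of each convolutional block is shown in FIG. 4 and has three parts: a convolutional layer, a residual unit, and a max pooling layer. The convolutional layer uses d 1x1 convolution kernels; given C input feature maps of size IxJ, it increases the number of channels of the input feature maps from C to d. The residual unit performs feature learning, and the max pooling layer spatially down-samples each input feature map, reducing it to the following fraction of its original size:
[Formula images PCTCN2020128860-appb-000003 and PCTCN2020128860-appb-000004]
The number of feature maps stays unchanged.
The number L of convolutional blocks is limited by the size of the converted grayscale image, as follows:
$L_{\max} = \min(\log_2 \max(w,h) - 1,\ \log_2 \min(w,h))$
In this formula, $L_{\max}$ is the maximum number of convolutional blocks that may be stacked in the model, and w and h are the width and height of the converted two-dimensional grayscale image.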
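As a worked example, the bound can be evaluated for the two image sizes used in this embodiment. The function name is illustrative, and flooring the logarithms for sides that are not powers of two is an assumption, since the text only uses power-of-two sizes.

```python
import math


def max_conv_blocks(width: int, height: int) -> int:
    """L_max = min(log2(max(w, h)) - 1, log2(min(w, h)))."""
    longer, shorter = max(width, height), min(width, height)
    # Assumption: floor the logarithms when a side is not a power of two.
    return min(math.floor(math.log2(longer)) - 1, math.floor(math.log2(shorter)))


print(max_conv_blocks(32, 16))  # 512-byte fragments as 16x32 images -> 4
print(max_conv_blocks(64, 64))  # 4096-byte fragments as 64x64 images -> 5
```

For the 16x32 image both terms equal 4; for the 64x64 image the first term (6 - 1 = 5) is the binding one.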
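The structure of the residual unit is shown in FIG. 5: it contains two convolutional layers and uses a skip connection in the manner of residual learning. Both convolutional layers use d 3x3 convolution kernels to learn features of the input feature map. The input to each of the two convolutional layers first passes through a ReLU activation.
The two fully connected layers of the model each have 2048 neurons.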
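To make the architecture easier to follow, the sketch below traces the feature-map shapes through the model without building any network. It rests on assumptions that the surviving text does not confirm: each max pooling layer halves both spatial dimensions (the actual fraction is in formula images that are missing here), the 3x3 convolutions are padded so the residual unit preserves the spatial size, and the per-block channel counts are hypothetical values chosen only for illustration.

```python
def trace_shapes(height, width, block_channels, pool=2):
    """Follow the (channels, height, width) of the feature maps through the model.

    Assumptions: pooling divides each spatial dimension by `pool`, and the
    residual unit keeps both the channel count d and the spatial size.
    """
    shape = (1, height, width)                      # one grayscale channel
    trace = [("input", shape)]
    for i, d in enumerate(block_channels, start=1):
        _, h, w = shape
        shape = (d, h // pool, w // pool)           # 1x1 conv -> d channels, then pooling
        trace.append((f"block {i}", shape))
    trace.append(("global average pooling", (shape[0],)))
    trace.append(("fully connected 1", (2048,)))
    trace.append(("fully connected 2", (2048,)))
    return trace


# Hypothetical channel widths; 5 blocks is L_max for a 64x64 image.
for name, shape in trace_shapes(64, 64, [32, 64, 128, 256, 512]):
    print(name, shape)
```

Under the halving assumption, stacking L_max blocks leaves a 2x2 map for the 64x64 image and a 1x2 map for the 16x32 image, which matches the reason the block count is bounded by the image size. Global average pooling then removes the spatial dimensions, so the fully connected layers see the same input width regardless of fragment size.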
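Although the present application, on the basis of practical experience, builds the model structure shown in FIG. 3, FIG. 4, and FIG. 5 and gives parameters for the relevant parts of the model, the model structure of the invention is not limited to this structure or to the stated structural parameters.
Step S4: use the preprocessed training set and test set to train and evaluate the constructed deep convolutional neural network model. The evaluation metrics are the average classification accuracy over the file fragment classes, the macro-averaged F1 score, and the micro-averaged F1 score. Specifically:
In this embodiment:
The deep convolutional neural network is trained with Adam-based gradient descent. The initial learning rate is 0.001, the learning rate is reduced to 0.2 times its previous value every 5 epochs, and training runs for at most 40 epochs. Early stopping is also used: when the evaluation metric of the network on the test set has not improved for 5 consecutive epochs, training stops early and the current model parameters are taken as the best parameters of the network.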
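The learning-rate schedule and the early-stopping rule can be sketched independently of any deep learning framework. `run_epoch` is a hypothetical callback standing in for one epoch of Adam training plus evaluation; treating the metric as "higher is better" and counting epochs from 0 are assumptions.

```python
def learning_rate(epoch, initial=0.001, factor=0.2, step=5):
    """Step decay: multiply the rate by `factor` after every `step` epochs (0-based)."""
    return initial * factor ** (epoch // step)


def train(run_epoch, max_epochs=40, patience=5):
    """Run up to `max_epochs`, stopping once the metric has failed to improve
    for `patience` consecutive epochs. Returns (epochs_run, best_metric)."""
    best_metric, best_epoch, epochs_run = float("-inf"), -1, 0
    for epoch in range(max_epochs):
        metric = run_epoch(epoch, learning_rate(epoch))   # hypothetical callback
        epochs_run = epoch + 1
        if metric > best_metric:
            best_metric, best_epoch = metric, epoch
        elif epoch - best_epoch >= patience:
            break                                         # early stop
    return epochs_run, best_metric


# Simulated metric that improves for three epochs and then plateaus.
metrics = [0.5, 0.6, 0.7] + [0.7] * 37
epochs_run, best = train(lambda epoch, lr: metrics[epoch])
print(epochs_run, best)
```

With this simulated curve the last improvement is at epoch index 2, so training stops after the fifth non-improving epoch instead of running all 40.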
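Step S5: use the deep convolutional neural network model to predict the file type to which a file fragment belongs. Specifically:
Given a file fragment to be predicted, the fragment is first converted into a two-dimensional grayscale image as in step S2, and the image is then normalized.
Specifically, using the per-position maximum and minimum pixel values of the grayscale images in the training set, the gray value of each pixel of the image is scaled to between -1 and 1, and the normalized two-dimensional grayscale image is fed to the deep convolutional neural network model to predict the file type of the fragment.
FIG. 6 is a hardware architecture diagram of the file fragment classification system 10 of the present invention. The system includes a fragment data set construction module 101, a preprocessing module 102, a model construction module 103, a training and evaluation module 104, and a file type prediction module 105.
The fragment data set construction module 101 uses a file data set to construct a file fragment data set. The file fragment data set includes a training set and a test set. Specifically:
In this embodiment, the fragment data set construction module 101 uses the public file data set govdocs1 to generate the file fragment data set. The file data set contains 1000 zip archives. All zip archives in the data set are decompressed, and the files in the extracted folders are grouped by file type.
For each file fragment type under study, a certain number of files is selected for the experiment. The files selected for each file type are split 6:4 into two groups, which are used to generate file fragments for the training set and the test set respectively.
The fragment data set construction module 101 slices each file at the selected fragment size to generate a large number of file fragments. To keep file signatures in the header from revealing the file type, the first fragment of each file is deleted; the last fragment of each file is also deleted if it is smaller than the specified fragment size. For both the training set and the test set, the number of fragments per file type is capped by random sampling so that the data set is as balanced as possible, yielding a large number of fragments of each file type for training and testing.
The preprocessing module 102 preprocesses the constructed file fragment data set, that is, the training set and the test set. Specifically:
The preprocessing module 102 converts each file fragment in the generated training set and test set: a simple reshape turns the one-dimensional file fragment into a two-dimensional grayscale image, as shown in FIG. 2. A file fragment is a sequence of bytes, and each byte corresponds to one pixel of the two-dimensional grayscale image. When converting a file fragment (a one-dimensional byte sequence) into a two-dimensional grayscale image, the image should be as close to square as possible, so that a sufficiently deep model can be built to classify the fragments.
In this embodiment, the preprocessing module 102 converts a 512-byte file fragment into a 16x32 (16x32=512) two-dimensional grayscale image and a 4096-byte file fragment into a 64x64 (64x64=4096) two-dimensional grayscale image.
Finally, the preprocessing module 102 normalizes each two-dimensional grayscale image: it computes the maximum and minimum pixel value at each position over the training set, and scales the corresponding pixels of the two-dimensional grayscale images in the training set and the test set by those values so that the gray value of each pixel falls between -1 and 1.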
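The model construction module 103 builds a deep convolutional neural network model. Specifically:
As shown in FIG. 3, the deep convolutional neural network model includes L convolutional blocks, a global average pooling layer, and two fully connected layers. ReLU (Rectified Linear Unit) in FIG. 3 refers to the rectified linear unit, an activation function.
The structure of each convolutional block is shown in FIG. 4 and has three parts: a convolutional layer, a residual unit, and a max pooling layer. The convolutional layer uses d 1x1 convolution kernels; given C input feature maps of size IxJ, it increases the number of channels of the input feature maps from C to d. The residual unit performs feature learning, and the max pooling layer spatially down-samples each input feature map, reducing it to the following fraction of its original size:
[Formula images PCTCN2020128860-appb-000005 and PCTCN2020128860-appb-000006]
The number of feature maps stays unchanged.
The number L of convolutional blocks is limited by the size of the converted grayscale image, as follows:
$L_{\max} = \min(\log_2 \max(w,h) - 1,\ \log_2 \min(w,h))$
In this formula, $L_{\max}$ is the maximum number of convolutional blocks that may be stacked in the model, and w and h are the width and height of the converted two-dimensional grayscale image.
The structure of the residual unit is shown in FIG. 5: it contains two convolutional layers and uses a skip connection in the manner of residual learning. Both convolutional layers use d 3x3 convolution kernels to learn features of the input feature map. The input to each of the two convolutional layers first passes through a ReLU activation.
The two fully connected layers of the model each have 2048 neurons.
Although the present application, on the basis of practical experience, builds the model structure shown in FIG. 3, FIG. 4, and FIG. 5 and gives parameters for the relevant parts of the model, the model structure of the invention is not limited to this structure or to the stated structural parameters.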
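The training and evaluation module 104 uses the preprocessed training set and test set to train and evaluate the constructed deep convolutional neural network model. The evaluation metrics are the average classification accuracy over the file fragment classes, the macro-averaged F1 score, and the micro-averaged F1 score. Specifically:
In this embodiment:
The training and evaluation module 104 trains the deep convolutional neural network with Adam-based gradient descent. The initial learning rate is 0.001, the learning rate is reduced to 0.2 times its previous value every 5 epochs, and training runs for at most 40 epochs. Early stopping is also used: when the evaluation metric of the network on the test set has not improved for 5 consecutive epochs, training stops early and the current model parameters are taken as the best parameters of the network.
The file type prediction module 105 uses the deep convolutional neural network model to predict the file type to which a file fragment belongs. Specifically:
Given a file fragment to be predicted, the file type prediction module 105 first converts the fragment into a two-dimensional grayscale image and then normalizes the image.
Specifically, using the per-position maximum and minimum pixel values of the grayscale images in the training set, the file type prediction module 105 scales the gray value of each pixel of the image to between -1 and 1, and then feeds the normalized two-dimensional grayscale image to the deep convolutional neural network model to predict the file type of the fragment.
Although the present invention has been described with reference to the currently preferred embodiments, those skilled in the art will understand that the preferred embodiments above serve only to illustrate the invention and do not limit its scope of protection; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.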

Claims (10)

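  1. A file fragment classification method, characterized in that the method comprises the following steps:
    a. using a file data set to construct a file fragment data set, the file fragment data set comprising a training set and a test set;
    b. preprocessing the constructed file fragment data set;
    c. constructing a deep convolutional neural network model;
    d. using the preprocessed training set and test set to train and evaluate the constructed deep convolutional neural network model;
    e. using the deep convolutional neural network model to predict the file type to which a file fragment belongs.
  2. The method of claim 1, characterized in that step a specifically includes:
    decompressing all zip archives contained in the public file data set govdocs1 and grouping the files in the extracted folders by file type;
    dividing the files selected for each file type under study into two groups, used to generate file fragments for the training set and the test set respectively;
    slicing each file at the selected fragment size to generate a large number of file fragments, and deleting the first fragment of each file and the last fragment if it is smaller than the specified fragment size.
  3. The method of claim 2, characterized in that step b specifically includes:
    converting each file fragment in the generated training set and test set, turning the one-dimensional file fragment into a two-dimensional grayscale image by a simple reshape;
    normalizing each two-dimensional grayscale image: computing the maximum and minimum pixel value at each position over the training set, and scaling the corresponding pixels of the two-dimensional grayscale images in the training set and the test set by those values so that the gray value of each pixel falls between -1 and 1.
  4. The method of claim 3, characterized in that the deep convolutional neural network model includes L convolutional blocks, a global average pooling layer, and two fully connected layers.
  5. The method of claim 4, characterized in that each convolutional block has three parts: a convolutional layer, a residual unit, and a max pooling layer;
    the number of convolutional blocks L is limited by the size of the converted grayscale image:
    $L_{\max} = \min(\log_2 \max(w,h) - 1,\ \log_2 \min(w,h))$
    in this formula, $L_{\max}$ is the maximum number of convolutional blocks that may be stacked in the model, and w and h are the width and height of the converted two-dimensional grayscale image.
  6. The method of claim 5, characterized in that the convolutional layer uses d 1x1 convolution kernels, and, given C input feature maps of size IxJ, the convolutional layer increases the number of channels of the input feature maps.
  7. The method of claim 6, characterized in that the residual unit contains two convolutional layers and uses a skip connection in the manner of residual learning.
  8. The method of claim 7, characterized in that the max pooling layer spatially down-samples each input feature map, reducing it to the following fraction of its original size:
    [Formula images PCTCN2020128860-appb-100001 and PCTCN2020128860-appb-100002]
  9. The method of claim 8, characterized in that step d specifically includes:
    evaluating the deep convolutional neural network on the preprocessed test set, the evaluation metrics being the average classification accuracy over the file fragment classes, the macro-averaged F1 score, and the micro-averaged F1 score.
  10. A file fragment classification system, characterized in that the system comprises a fragment data set construction module, a preprocessing module, a model construction module, a training and evaluation module, and a file type prediction module, wherein:
    the fragment data set construction module uses a file data set to construct a file fragment data set, the file fragment data set comprising a training set and a test set;
    the preprocessing module preprocesses the constructed file fragment data set;
    the model construction module constructs a deep convolutional neural network model;
    the training and evaluation module uses the preprocessed training set and test set to train and evaluate the constructed deep convolutional neural network model;
    the file type prediction module uses the deep convolutional neural network model to predict the file type to which a file fragment belongs.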
PCT/CN2020/128860 2019-11-21 2020-11-13 File fragment classification method and system WO2021098620A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911146348.0A CN110928848A (zh) 2019-11-21 2019-11-21 File fragment classification method and system
CN201911146348.0 2019-11-21

Publications (1)

Publication Number Publication Date
WO2021098620A1 true WO2021098620A1 (zh) 2021-05-27

Family

ID=69851521

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/128860 WO2021098620A1 (zh) 2019-11-21 2020-11-13 File fragment classification method and system

Country Status (2)

Country Link
CN (1) CN110928848A (zh)
WO (1) WO2021098620A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
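CN116055174A (zh) * 2023-01-10 2023-05-02 吉林大学 Internet of Vehicles intrusion detection method based on improved MobileNetV2
CN116975863A (zh) * 2023-07-10 2023-10-31 福州大学 Malicious code detection method based on convolutional neural network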

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110928848A (zh) * 2019-11-21 2020-03-27 中国科学院深圳先进技术研究院 File fragment classification method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682024A (zh) * 2011-03-11 2012-09-19 中国科学院高能物理研究所 Method for reassembling fragments of undamaged JPEG files
US20160071010A1 (en) * 2014-05-31 2016-03-10 Huawei Technologies Co., Ltd. Data Category Identification Method and Apparatus Based on Deep Neural Network
CN108694414A (zh) * 2018-05-11 2018-10-23 哈尔滨工业大学深圳研究生院 Digital forensics file fragment classification method based on digital image conversion and deep learning
CN109359090A (zh) * 2018-08-27 2019-02-19 中国科学院信息工程研究所 File fragment classification method and system based on a convolutional neural network
CN110928848A (zh) * 2019-11-21 2020-03-27 中国科学院深圳先进技术研究院 File fragment classification method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IL299565B1 (en) * 2017-10-16 2024-03-01 Illumina Inc Classifies pathogenic variants using a recurrent neural network
CN108319518B (zh) * 2017-12-08 2023-04-07 中国电子科技集团公司电子科学研究院 File fragment classification method and apparatus based on a recurrent neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN, QIAN ET AL.: "File Fragment Classification Using Grayscale Image Conversion and Deep Learning in Digital Forensics", 2018 IEEE SYMPOSIUM ON SECURITY AND PRIVACY WORKSHOPS DOI 10.1109/SPW.2018.00029, 31 May 2018 (2018-05-31), XP033379545 *

Also Published As

Publication number Publication date
CN110928848A (zh) 2020-03-27

Similar Documents

Publication Publication Date Title
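WO2021098620A1 (zh) File fragment classification method and system
CN108427920B (zh) Deep-learning-based target detection method for border and coastal defense
US10692218B2 (en) Method and system of detecting image tampering, electronic device and storage medium
EP3333768A1 (en) Method and apparatus for detecting target
WO2021000678A1 (zh) Enterprise credit review method, apparatus, device, and computer-readable storage medium
CN102413328B (zh) Method and system for detecting double compression of JPEG images
EP3754548A1 (en) A method for recognizing an object in an image using features vectors of an encoding neural network
JP6192271B2 (ja) Image processing apparatus, image processing method, and program
CN102938054B (zh) Compressed-domain sensitive image recognition method based on a visual attention model
CN110569814B (zh) Video category recognition method and apparatus, computer device, and computer storage medium
CN112686331A (zh) Forged image recognition model training method and forged image recognition method
JP2014232533A (ja) OCR output verification system and method
CN104661037B (zh) Method and system for detecting tampering with compressed-image quantization tables
CN103927531A (zh) Face recognition method based on local binary patterns and a particle-swarm-optimized BP neural network
WO2019109793A1 (zh) Human head region recognition method, apparatus, and device
CN108717512A (zh) Malicious code classification method based on a convolutional neural network
CN110879982A (zh) Crowd counting system and method
CN107679572A (zh) Image discrimination method, storage device, and mobile terminal
JP6945253B2 (ja) Classification device, classification method, program, and information recording medium
CN113077444A (zh) CNN-based defect classification method for ultrasonic non-destructive testing images
CN110322418A (zh) Method and apparatus for training a super-resolution image generative adversarial network
KR102177247B1 (ko) Apparatus and method for identifying manipulated images
CN111222545A (zh) Image classification method based on linear-programming incremental learning
CN109508639B (zh) Road scene semantic segmentation method based on a multi-scale atrous convolutional neural network
CN115292538A (zh) Deep-learning-based method for extracting line features from maps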

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20890512

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20890512

Country of ref document: EP

Kind code of ref document: A1

32PN EP: public notification in the EP bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 110123)

122 EP: PCT application non-entry in European phase

Ref document number: 20890512

Country of ref document: EP

Kind code of ref document: A1