CN117113163A - Malicious code classification method based on bidirectional time domain convolution network and feature fusion - Google Patents

Malicious code classification method based on bidirectional time domain convolution network and feature fusion

Info

Publication number
CN117113163A
Authority
CN
China
Prior art keywords
malicious code
time domain
fusion
malicious
bidirectional time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310711779.7A
Other languages
Chinese (zh)
Inventor
李思聪
王坚
黄玮
史松昊
李乐民
王科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Air Force Engineering University of PLA
Original Assignee
Air Force Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Air Force Engineering University of PLA filed Critical Air Force Engineering University of PLA
Priority to CN202310711779.7A priority Critical patent/CN117113163A/en
Publication of CN117113163A publication Critical patent/CN117113163A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/54Extraction of image or video features relating to texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Hardware Design (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Virology (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application belongs to the field of malicious code classification and provides a malicious code classification method based on a bidirectional time domain convolution network and feature fusion, comprising the following steps: step 11, acquiring an original malicious code file; step 12, preprocessing the original file to obtain a malicious code image, so that the model extracts malicious code features more comprehensively and the classification and identification accuracy is further improved; step 13, inputting the preprocessed malicious code image into a bidirectional time domain convolution network for training, where the network processes the image into data of uniform size. The application combines two different features by a pooling fusion method; the features complement each other and can be fully learned, so that malicious code features are better retained, the feature-learning capability of the model is effectively improved, and a good classification effect is obtained.

Description

Malicious code classification method based on bidirectional time domain convolution network and feature fusion
Technical Field
The application belongs to the field of malicious code classification, and particularly relates to a malicious code classification method based on a bidirectional time domain convolution network and feature fusion.
Background
Malicious code is code or a web-page script that harms a computer system; its goal is to create vulnerabilities in a target computer, steal data and information, and pose a potential hazard to systems and files. Malicious code analysis techniques are divided into dynamic and static analysis according to whether the file is executed. Dynamic analysis runs executable files in sandboxes, simulators and virtual machines and monitors and analyzes application behavior through system calls. Static analysis extracts static features of the malicious code to identify illegal behavior of samples, and can capture information related to structural characteristics, such as API calls and operation codes (opcodes).
The time domain convolutional network (temporal convolutional network, TCN) is a newer member of the convolutional neural network (convolutional neural networks, CNN) family. It adopts dilated convolution to enlarge the receptive field of the model while reducing computation; causal convolution preserves the temporal order of the data; and residual connections effectively alleviate the vanishing- and exploding-gradient problems. A TCN can therefore process data with large-scale parallelism while preventing future data from influencing the past. A time domain convolutional network mainly consists of causal convolution, dilated convolution and residual connections, and its convolutional layers support parallel computation, which effectively addresses excessive training time. However, a single TCN cannot encode information from back to front, so it cannot learn the association between the current feature item and later feature items.
In order to mine the bidirectional feature information contained in the malicious code sequence and exploit the advantages of the TCN in processing time-series feature information, the application, inspired by the bidirectional recurrent neural network, provides a malicious code classification method based on a bidirectional time domain convolution network and feature fusion.
Disclosure of Invention
To solve the above technical problems, the application provides a malicious code classification method based on a bidirectional time domain convolution network and feature fusion, which aims to solve problems of the prior art such as that a single TCN cannot encode information from back to front and therefore cannot learn the association between the current feature item and later feature items.
A malicious code classification method based on a bidirectional time domain convolution network and feature fusion comprises the following steps:
step 11, acquiring an original file of malicious codes;
step 12, preprocessing the malicious code original file to obtain a malicious code image, so that the feature extraction of the malicious code by the model is more comprehensive, and the classification and identification accuracy of the malicious code is further improved;
step 13, inputting the preprocessed malicious code image into a bidirectional time domain convolution network for training, wherein the bidirectional time domain convolution network processes the malicious code image into data with uniform size;
step 14, fusing the bidirectional features with uniform size through the bidirectional time domain convolution network, so as to acquire the dependency relationship of data between the forward and reverse propagation directions;
step 15, extracting and compressing further features of the feature map after the bidirectional time domain convolution;
and step 16, obtaining a classification result of the malicious codes.
Preferably, the method for preprocessing data in step 12 includes:
s121, performing batch disassembly on PE files of the malicious code original files to obtain. Asm files and. Bytes files;
s122, extracting N-Gram operation code subsequence features based on the bytes file, and extracting gray image texture features based on the asm file;
s123, fusing the N-Gram operation code sub-sequence features and the gray image texture features to obtain a malicious code image.
Preferably, the pre-classified malicious code image in step 12 is divided into a training set and a verification set, the training set is used for training a model, a training feature vector is generated, and the verification set is used for observing and evaluating the performance of the model, and a verification feature vector is generated.
Preferably, the proportion of the training set is 70% and the proportion of the verification set is 30%.
Preferably, the training method of the bidirectional time domain convolution network in the step 13 is as follows:
step 131: performing convolution calculation on the sequence from left to right to realize forward feature extraction;
step 132: and carrying out convolution calculation on the sequence from right to left to realize backward feature extraction.
Preferably, the fusion method in step 14 is pooling fusion: deep features are extracted by a fusion pooling layer obtained by connecting max pooling and average pooling in parallel, thereby capturing the dependency relationships in the data.
Preferably, the pooling fusion is formulated as:
h_max = MaxPool(h), h_ave = AvgPool(h), h_fuse = h_max ⊕ h_ave
where h is the output of the bidirectional time domain convolution network, h_max is the output of the max-pooling layer, h_ave is the output of the average-pooling layer, ⊕ denotes concatenation of the two pooled outputs, and h_fuse is the output after fusion pooling.
Compared with the prior art, the application has the following beneficial effects:
1. The gray-scale image texture features are extracted from the .bytes file, and a one-dimensional image rather than a two-dimensional image is used to represent the malicious code features, which avoids the spurious local correlation between pixels that image folding introduces. The gray-scale image features and the operation code features reflect the similarity of malicious code of the same family at the global and local level respectively; fusing the global and local features exploits the feature information of the malicious code from multiple angles. Using the fused features as the input of the bidirectional time domain convolution network model for training and classification increases the accuracy of malicious code detection, and the network can fully utilize data information in both the forward and backward directions.
2. The application combines two different features by a pooling fusion method; the features complement each other and can be fully learned, so that malicious code features are better retained, the feature-learning capability of the model is effectively improved, and a good classification effect is obtained.
Drawings
FIG. 1 is a flow chart of a malicious code classification method according to the present application;
FIG. 2 is a flow chart of a method for preprocessing data according to the present application;
FIG. 3 is a flow chart of a training method for a two-way time domain convolutional network according to the present application;
FIG. 4 is a process diagram of malicious code byte feature extraction to generate a grayscale image in accordance with the present application;
FIG. 5 is a diagram of a malicious code classification model based on BiTCN-DLP according to the present application;
FIG. 6 is a graph of the performance of the training set and the test set as a function of training batch during the model training process of the present application;
FIG. 7 is a graph of loss rate as a function of training batch for the present application;
FIG. 8 is a graph showing the effect of the N-gram value n on the model;
FIG. 9 is a comparison of the sequence features of the operational code, grayscale image, and hybrid features of the present application;
FIG. 10 is a graph showing the results of ablation experiments performed by BiTCN, forward TCN and reverse TCN of the present application;
FIG. 11 is a diagram of features extracted by four different schemes (no pooling, mean pooling, max pooling and fusion pooling) according to the present application;
FIG. 12 is a comparison of the model of the present application with other malicious code classification models of recent years.
Detailed Description
Embodiments of the present application are described in further detail below with reference to the accompanying drawings and examples. The following examples are illustrative of the application but are not intended to limit the scope of the application.
Embodiment one: as shown in fig. 1 to 12: the application provides a malicious code classification method based on a bidirectional time domain convolution network and feature fusion, which comprises the following steps:
step 11, acquiring an original file of malicious codes;
step 12, preprocessing the malicious code original file to obtain a malicious code image, so that the feature extraction of the malicious code by the model is more comprehensive, and the classification and identification accuracy of the malicious code is further improved;
the malicious codes are files composed of a series of bytes, binary files of the malicious codes are converted into gray images according to the similarity of the value ranges of the bytes and pixels in the gray images, and classification of malicious code families is achieved according to the concept that the malicious codes in the same family have similar textures of the gray images and the characteristics that the malicious codes in different families have different textures due to different structures of the malicious codes.
FIG. 4 shows the process of extracting byte features of malicious code to generate a gray-scale image: the malicious code file is converted into a binary stream; vectors of 8-bit binary numbers are read from the binary data, each vector corresponding to one pixel; and the binary value of each vector is converted into a decimal value in the interval [0, 255], where 0 is black and 255 is white. In this way the malicious code can be converted into a gray-scale image.
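The byte-to-pixel mapping described above can be sketched in a few lines of Python (an illustrative sketch, not the patent's implementation; the function name is an assumption):

```python
def bytes_to_gray(data):
    """Map every 8-bit value of the binary stream to one gray pixel:
    0b00000000 -> 0 (black), 0b11111111 -> 255 (white). The sequence is
    kept one-dimensional rather than folded into a 2-D image."""
    return [b for b in data]  # iterating over bytes yields ints in [0, 255]

pixels = bytes_to_gray(b"\x00\x80\xff")
# -> [0, 128, 255]
```

The one-dimensional form matches the method's stated choice of a 1-D representation, which avoids introducing artificial adjacency between rows of a folded 2-D image.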
Step 13, inputting the preprocessed malicious code image into a bidirectional time domain convolution network for training, wherein the bidirectional time domain convolution network processes the malicious code image into data with uniform size;
step 14, fusing the bidirectional features with uniform size through the bidirectional time domain convolution network, so as to acquire the dependency relationship of data between the forward and reverse propagation directions;
step 15, extracting and compressing further features of the feature map after the bidirectional time domain convolution;
And step 16, obtaining a classification result of the malicious code, the classification being performed through a softmax layer.
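The softmax layer of step 16 turns the final feature vector into a probability distribution over the malicious code families; a minimal sketch follows (the logits are made-up illustrative values, not outputs of the model):

```python
import math

def softmax(logits):
    """Convert a vector of class scores into a probability distribution
    over the malicious code families (the role of the softmax layer)."""
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
predicted_family = probs.index(max(probs))   # index 0 has the largest logit
```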
The method for preprocessing the data in the step 12 comprises the following steps:
s121, performing batch disassembly on PE files of the malicious code original files to obtain. Asm files and. Bytes files;
s122, extracting N-Gram operation code subsequence features based on the bytes file, and extracting gray image texture features based on the asm file;
s123, fusing the N-Gram operation code sub-sequence features and the gray image texture features to obtain a malicious code image.
Dividing the pre-classified malicious code image in the step 12 into a training set and a verification set, wherein the training set is used for training a model to generate training feature vectors, and the verification set is used for observing and evaluating the performance of the model to generate verification feature vectors.
The proportion of the training set is 70% and the proportion of the verification set is 30%.
The training method of the bidirectional time domain convolution network in the step 13 is as follows:
step 131: performing convolution calculation on the sequence from left to right to realize forward feature extraction;
step 132: and carrying out convolution calculation on the sequence from right to left to realize backward feature extraction.
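Steps 131 and 132 can be illustrated with a minimal causal convolution in Python (an explanatory sketch only, not the patented implementation; a real BiTCN stacks dilated causal convolutions with residual connections):

```python
def causal_conv(x, w):
    """1-D causal convolution with left zero-padding, so the output at
    position t depends only on inputs at positions <= t."""
    k = len(w)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]

def bitcn_features(x, w):
    """Step 131: left-to-right (forward) causal convolution.
    Step 132: right-to-left pass, done by convolving the reversed
    sequence and flipping the result back to the original order."""
    fwd = causal_conv(x, w)
    bwd = causal_conv(x[::-1], w)[::-1]
    return fwd, bwd

fwd, bwd = bitcn_features([1.0, 2.0, 3.0], [0.5, 0.5])
# fwd -> [0.5, 1.5, 2.5]; bwd -> [1.5, 2.5, 1.5]
```

Reversing the input and then reversing the output is a standard way to obtain the backward feature stream from the same causal operator.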
The fusion method in step 14 is pooling fusion: deep features are extracted by a fusion pooling layer obtained by connecting max pooling and average pooling in parallel, thereby capturing the dependency relationships in the data.
The pooling fusion is formulated as:
h_max = MaxPool(h), h_ave = AvgPool(h), h_fuse = h_max ⊕ h_ave
where h is the output of the bidirectional time domain convolution network, h_max is the output of the max-pooling layer, h_ave is the output of the average-pooling layer, ⊕ denotes concatenation of the two pooled outputs, and h_fuse is the output after fusion pooling.
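A minimal sketch of fusion pooling, with concatenation assumed as the fusion operator (the window size and function name are assumptions made for illustration):

```python
def pool_fuse(h, window=2):
    """Fusion pooling: run max pooling and average pooling in parallel
    over non-overlapping windows, then concatenate the two outputs."""
    chunks = [h[i:i + window] for i in range(0, len(h) - window + 1, window)]
    h_max = [max(c) for c in chunks]            # output of the max-pooling layer
    h_ave = [sum(c) / len(c) for c in chunks]   # output of the average-pooling layer
    return h_max + h_ave                        # h_fuse = concat(h_max, h_ave)

h = [1.0, 3.0, 2.0, 2.0]
h_fuse = pool_fuse(h)
# -> [3.0, 2.0, 2.0, 2.0]
```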
As can be seen from FIG. 5, the application extracts gray-scale image texture features from the .bytes file and uses a one-dimensional image rather than a two-dimensional image to represent the malicious code features, avoiding the spurious local correlation between pixels that image folding introduces. The gray-scale image features and the operation code features reflect the similarity of malicious code of the same family at the global and local level respectively, so fusing the global and local features exploits the feature information of the malicious code from multiple angles. Using the fused features as the input of the bidirectional time domain convolution network model for training and classification increases the accuracy of malicious code detection, and the network can fully utilize data information in both the forward and backward directions.
To verify the performance of the above method, the application uses the training dataset from the public dataset provided by the Malware Classification Challenge (BIG 2015) to evaluate the model. The dataset contains 10868 labeled malicious code samples in total, divided into 9 malicious code families. Each sample has been unpacked and comprises two files: a hexadecimal-represented .bytes file and an .asm file disassembled from the malicious code binary. Each malicious code file has an Id, a 20-character hash value uniquely identifying the file, and a Class.
The dataset comprises roughly 200 GB of data: about 50 GB of .bytes files and about 150 GB of .asm files. It is divided into two parts, a training set used to train the model and a validation set used to observe and evaluate model performance. In the experiment, 70% of the dataset was assigned to the training set and 30% to the validation set, giving 7608 training samples and 3260 test samples.
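The 70/30 split can be reproduced numerically (an illustrative sketch; the shuffling seed is an arbitrary assumption): with 10868 samples, a 70% cut yields exactly the 7608/3260 division reported above.

```python
import random

def split_dataset(samples, train_ratio=0.7, seed=0):
    """Shuffle the sample indices and split them 70/30 into a training
    set (model fitting) and a validation set (performance evaluation)."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = round(len(samples) * train_ratio)
    train = [samples[i] for i in idx[:cut]]
    val = [samples[i] for i in idx[cut:]]
    return train, val

train, val = split_dataset(list(range(10868)))
# 10868 samples split 70/30 -> 7608 training and 3260 validation samples.
```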
The performance of the model is evaluated using four indexes widely applied in related research: accuracy (Accuracy), precision (Precision), recall (Recall) and F1 score (F1-score), with the following formulas:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × Precision × Recall / (Precision + Recall)
Here TP indicates that an actual malicious code sample is correctly predicted as malicious code, and FP indicates that an actual normal code sample is incorrectly predicted as malicious code. Similarly, TN indicates that an actual normal code sample is correctly predicted as normal code, and FN indicates that an actual malicious code sample is incorrectly predicted as normal code.
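The four formulas can be checked with a small helper (illustrative only; the confusion-matrix counts below are made-up numbers, not results from the experiments):

```python
def metrics(tp, fp, tn, fn):
    """Compute the four evaluation indexes from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up counts for illustration only.
acc, prec, rec, f1 = metrics(tp=90, fp=10, tn=85, fn=15)
# acc -> 0.875, prec -> 0.9
```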
To fully verify the effectiveness of the BiTCN-DLP based malicious code classification method presented herein, the following experiments were set up:
Experiment 1: BiTCN-DLP performance analysis experiment:
the selection of the hyper-parameters has a critical influence on the training effect of the model, so that the performance of the model is reflected to the greatest extent, and parameter adjustment of the model is needed. In the BiTCN-DLP model, only opcode single feature optimization model parameters are used.
Among the optimization-class parameters, the number of convolution kernels is selected as the hyper-parameter; among the model-class parameters, the number of BiTCN layers and the number of neurons in each layer are selected as variables.
The number of model iterations is set to 200; the dilation factor in the TCN grows exponentially with base 2 and is set to (1, 2, 4, 8); Adam [37] is selected as the optimization algorithm; and the learning rate is set to 0.002.
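The dilation schedule (1, 2, 4, 8) determines how far back the network can see. The receptive field can be computed as follows (a sketch assuming the standard TCN residual block with two dilated convolutions per level and a kernel size of 3, neither of which is stated in the text):

```python
def tcn_receptive_field(kernel_size, dilations, convs_per_block=2):
    """Receptive field of stacked dilated causal convolutions: every
    conv with dilation d widens the field by (kernel_size - 1) * d."""
    return 1 + sum(convs_per_block * (kernel_size - 1) * d for d in dilations)

rf = tcn_receptive_field(kernel_size=3, dilations=(1, 2, 4, 8))
# 1 + 2 * 2 * (1 + 2 + 4 + 8) = 61 input steps visible to the top layer
```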
To avoid over-fitting, a dropout layer with a rate of 0.2 is added. To make the experimental data more accurate and reliable, five-fold cross-validation is adopted.
A grid search over the number of convolution kernels (2, 3, 4, 5, 6) is performed for parameter optimization, and the optimal parameter settings of the model are finally determined as shown in the following table.
According to these parameter values, the two features extracted from the operation codes and the byte codes are fused, and experiments are carried out with the BiTCN-DLP classification model.
FIG. 6 shows the accuracy of the training set and the test set as a function of training batch during model training.
Fig. 7 is a graph of loss rate as a function of training batch. Light grey represents the test set and dark grey represents the training set. It can be seen that the model can converge quickly. Through training and testing, the accuracy of the model reaches 99.54%, and the loss rate is 0.0292.
Experiment 2: n-gram feature selection experiment:
the model data extracts the characteristics of the operation codes in the malicious codes by using an N-gram algorithm in a data processing part, wherein the value of N has direct influence on the model effect. In order to obtain the optimal value of n, the rest conditions are the same, and four different values of n=2, 3,4,5 are compared, and the experimental result is shown in fig. 8:
as can be seen from the graph, when n=3, the accuracy of the model is as high as 99.54% compared with other N-gram values, which is higher than other values. When the value of n is higher than 3, the accuracy gradually decreases. Experimental results show that n=3 is the optimal value of N-gram.
Experiment 3: single-feature and multi-feature fusion contrast analysis experiments:
To further improve the extraction of data information, during data processing the model extracts operation code sequence features with the N-gram method and extracts gray-scale image features of the malicious code from the byte code, then fuses the two. To verify the effectiveness of this approach, a comparison experiment is set up comparing the operation code sequence features, the gray-scale image features and the hybrid features; the experimental results are shown in FIG. 9.
as can be seen from the graph, the four evaluation indexes of the accuracy, the precision, the recall and the F1 value of the mixed feature are respectively improved by 1.35%, 2.04%, 6.32% and 4.85% compared with the single operation code feature, and are respectively improved by 10.13%, 8.77%, 9.60% and 1.12% compared with the single byte code feature.
The experimental results show that the hybrid of opcode features and gray-scale image features performs significantly better than either single feature, verifying the effectiveness of the method. The reason is that the opcode sequence and the gray-scale image each reflect the essence of the malicious code at a different scale; combining the features extracted from both enriches the feature information of the malicious code, produces a complementary effect, and resists the influence of obfuscation and packing of the malicious code, thereby achieving a better result.
Experiment 4: two-way TCN validity verification experiment:
A bidirectional time domain convolutional network (BiTCN) can extract more comprehensive and robust features than a unidirectional network. Comparison experiments show that the proposed BiTCN offers high accuracy and fast convergence, with advantages in parameter count and detection speed. To verify the effectiveness of BiTCN relative to a unidirectional TCN, an ablation experiment was performed with BiTCN, forward TCN and reverse TCN; the results are shown in FIG. 10.
As can be seen from FIG. 10, the bidirectional model outperforms the unidirectional models in accuracy, recall, precision and F1 score for malicious code classification. This indicates that the bidirectionally fused features are more effective: they comprehensively exploit features in both directions, relieve the limitation that unidirectional features impose on classification, and make full use of the bidirectional correlation between distant units, so the feature representation is more balanced. The forward and reverse features complement each other when fused, which verifies the rationality and effectiveness of BiTCN-DLP for malicious code recognition and helps further improve the detection effect.
Experiment 5: pooling fusion validity verification experiment:
A fusion of max pooling and mean pooling is adopted to address the model's insufficient feature-extraction capability. To verify the effectiveness of the proposed fusion pooling method, this section sets up a comparison of how different pooling methods affect malicious code classification and detection: under the same experimental conditions, the model extracts features with four different schemes (no pooling, mean pooling, max pooling and fusion pooling), and the results of the four schemes are shown in FIG. 11.
As can be seen from FIG. 11, fusion pooling achieves higher detection accuracy than mean pooling or max pooling alone. The reason is that mean pooling extracts features of global significance while max pooling extracts features of local significance, so the features obtained by the two modes differ greatly.
The pooling fusion method combines these two different kinds of features; because they complement each other, the features can be fully learned, malicious code features are better retained, the model's feature-learning capability is effectively improved, and a good classification effect is obtained.
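A minimal sketch of the parallel max-pool/mean-pool fusion discussed above, in numpy. The window size and the element-wise addition used as the fusion operator are assumptions for illustration; the patent text does not reproduce the exact operator here.

```python
import numpy as np

def fusion_pool(h, window=2):
    """Apply max pooling and mean pooling in parallel over non-overlapping
    windows, then fuse the two outputs by element-wise addition."""
    n = len(h) // window
    blocks = h[:n * window].reshape(n, window)
    h_max = blocks.max(axis=1)   # locally significant (salient) features
    h_ave = blocks.mean(axis=1)  # globally significant (smoothed) features
    return h_max + h_ave         # fused output h_fuse

h = np.array([1.0, 3.0, 2.0, 2.0])
h_fuse = fusion_pool(h)  # windows [1, 3] and [2, 2]
```

Because the two branches see the same input but summarize it differently, the fused output retains both the strongest activation in each window and its average level, which is the complementarity the experiment above measures.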
Experiment 6: model contrast analysis experiment:
To further verify the performance of the BiTCN-DLP based malicious code classification model, a comparison experiment was set up against other malicious code classification models from recent years; the results are shown in figure 12.
As can be seen from the figure, the BiTCN-DLP based malicious code classification model proposed herein achieves an accuracy of 99.54% and outperforms all the other compared methods in classification accuracy.
While embodiments of the present application have been shown and described above for purposes of illustration and description, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the application, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (7)

1. The malicious code classification method based on the bidirectional time domain convolution network and feature fusion, characterized by comprising the following steps:
step 11, acquiring an original file of malicious codes;
step 12, preprocessing the malicious code original file to obtain a malicious code image, so that the feature extraction of the malicious code by the model is more comprehensive, and the classification and identification accuracy of the malicious code is further improved;
step 13, inputting the preprocessed malicious code image into a bidirectional time domain convolution network for training, wherein the bidirectional time domain convolution network processes the malicious code image into data with uniform size;
step 14, fusing the bidirectional features with uniform size through the bidirectional time domain convolution network, so as to acquire the dependency relationship of data between the forward and reverse propagation directions;
step 15, extracting and compressing further features of the feature map after the bidirectional time domain convolution;
and step 16, obtaining a classification result of the malicious codes.
2. The malicious code classification method based on the bidirectional time domain convolutional network and feature fusion as recited in claim 1, wherein: the method for preprocessing the data in the step 12 comprises the following steps:
S121, performing batch disassembly on the PE files of the malicious code original files to obtain .asm files and .bytes files;
S122, extracting N-Gram operation code subsequence features based on the .bytes file, and extracting gray image texture features based on the .asm file;
S123, fusing the N-Gram operation code subsequence features and the gray image texture features to obtain a malicious code image.
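The two feature-extraction steps in this claim can be sketched in a few lines of numpy. Both helpers below are illustrative assumptions, not the patented preprocessing: the n-gram window, the fixed image width, and the byte-per-pixel mapping are common choices in malware-visualisation work and are used here only to show the shape of the data.

```python
import numpy as np

def opcode_ngrams(opcodes, n=3):
    """N-Gram operation code subsequences: a sliding window of n opcodes."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]

def bytes_to_gray(data, width=8):
    """Map raw bytes to a grayscale image: one byte per pixel in [0, 255],
    rows of a fixed width."""
    n_rows = len(data) // width
    raw = bytes(data[:n_rows * width])
    return np.frombuffer(raw, dtype=np.uint8).reshape(n_rows, width)

ops = ["push", "mov", "call", "mov", "ret"]
grams = opcode_ngrams(ops, n=3)       # three opcode trigrams
img = bytes_to_gray(list(range(16)))  # a 2 x 8 grayscale image
```

In the claimed method the two feature sets produced this way are then fused into a single malicious code image before being fed to the network.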
3. The malicious code classification method based on the bidirectional time domain convolutional network and feature fusion as recited in claim 1, wherein: dividing the pre-classified malicious code image in the step 12 into a training set and a verification set, wherein the training set is used for training a model to generate training feature vectors, and the verification set is used for observing and evaluating the performance of the model to generate verification feature vectors.
4. The malicious code classification method based on the bidirectional time domain convolutional network and feature fusion of claim 3, wherein: the proportion of the training set is 70% and the proportion of the verification set is 30%.
5. The malicious code classification method based on the bidirectional time domain convolutional network and feature fusion as recited in claim 1, wherein: the training method of the bidirectional time domain convolution network in the step 13 is as follows:
Step 131: performing convolution calculation on the sequence from left to right to realize forward feature extraction;
Step 132: performing convolution calculation on the sequence from right to left to realize backward feature extraction.
6. The malicious code classification method based on the bidirectional time domain convolutional network and feature fusion as recited in claim 1, wherein: the fusion method in the step 14 is pooling fusion: extracting deep features through a fusion pooling layer formed by connecting maximum pooling and average pooling in parallel, and further capturing the dependency relationships in the data.
7. The malicious code classification method based on the bidirectional time domain convolutional network and feature fusion as recited in claim 6, characterized in that: the formula of the pooling fusion is
h_fuse = h_max + h_ave
wherein h is the output of the bidirectional time domain convolution network, h_max is the output of the maximum pooling layer, h_ave is the output of the average pooling layer, and h_fuse is the output result after fusion pooling.
CN202310711779.7A 2023-06-15 2023-06-15 Malicious code classification method based on bidirectional time domain convolution network and feature fusion Pending CN117113163A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310711779.7A CN117113163A (en) 2023-06-15 2023-06-15 Malicious code classification method based on bidirectional time domain convolution network and feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310711779.7A CN117113163A (en) 2023-06-15 2023-06-15 Malicious code classification method based on bidirectional time domain convolution network and feature fusion

Publications (1)

Publication Number Publication Date
CN117113163A true CN117113163A (en) 2023-11-24

Family

ID=88799051

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310711779.7A Pending CN117113163A (en) 2023-06-15 2023-06-15 Malicious code classification method based on bidirectional time domain convolution network and feature fusion

Country Status (1)

Country Link
CN (1) CN117113163A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117574370A (en) * 2023-11-28 2024-02-20 中华人民共和国新疆出入境边防检查总站(新疆维吾尔自治区公安厅边境管理总队) Malicious code detection system
CN117574370B (en) * 2023-11-28 2024-05-31 中华人民共和国新疆出入境边防检查总站(新疆维吾尔自治区公安厅边境管理总队) Malicious code detection system
CN117909956A (en) * 2024-03-20 2024-04-19 山东科技大学 Hardware-assisted embedded system program control flow security authentication method
CN117972702A (en) * 2024-04-01 2024-05-03 山东省计算中心(国家超级计算济南中心) API call heterogeneous parameter enhancement-based malicious software detection method and system

Similar Documents

Publication Publication Date Title
CN117113163A (en) Malicious code classification method based on bidirectional time domain convolution network and feature fusion
Chen et al. Adversarial examples for cnn-based malware detectors
CN111027069B (en) Malicious software family detection method, storage medium and computing device
CN113076994B (en) Open-set domain self-adaptive image classification method and system
CN111753881A (en) Defense method for quantitatively identifying anti-attack based on concept sensitivity
CN113297572B (en) Deep learning sample-level anti-attack defense method and device based on neuron activation mode
Zhu et al. Multi-loss siamese neural network with batch normalization layer for malware detection
CN111915437A (en) RNN-based anti-money laundering model training method, device, equipment and medium
CN111428557A (en) Method and device for automatically checking handwritten signature based on neural network model
CN112217787B (en) Method and system for generating mock domain name training data based on ED-GAN
CN113806746A (en) Malicious code detection method based on improved CNN network
CN111914254A (en) Weak coupling SGAN-based malicious software family classifier generation method and device and readable storage medium
CN111400713B (en) Malicious software population classification method based on operation code adjacency graph characteristics
CN105243327B (en) A kind of secure file processing method
CN109697240A (en) A kind of image search method and device based on feature
CN105468972B (en) A kind of mobile terminal document detection method
CN113033305B (en) Living body detection method, living body detection device, terminal equipment and storage medium
CN112733140A (en) Detection method and system for model tilt attack
Liu et al. Defend Against Adversarial Samples by Using Perceptual Hash.
CN116188439A (en) False face-changing image detection method and device based on identity recognition probability distribution
CN111209567B (en) Method and device for judging perceptibility of improving robustness of detection model
Khan et al. Detection of data scarce malware using one-shot learning with relation network
CN115828248B (en) Malicious code detection method and device based on interpretive deep learning
Chyou et al. Unsupervised Adversarial Detection without Extra Model: Training Loss Should Change
Spratling Comprehensive assessment of the performance of deep learning classifiers reveals a surprising lack of robustness

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination