CN117332419A - Malicious code classification method and device based on pre-training - Google Patents

Malicious code classification method and device based on pre-training Download PDF

Info

Publication number
CN117332419A
CN117332419A
Authority
CN
China
Prior art keywords
training
layer
vector
model
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311610887.1A
Other languages
Chinese (zh)
Other versions
CN117332419B (en)
Inventor
蔡波
袁正明
罗剑
于耀翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202311610887.1A priority Critical patent/CN117332419B/en
Publication of CN117332419A publication Critical patent/CN117332419A/en
Application granted granted Critical
Publication of CN117332419B publication Critical patent/CN117332419B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F18/15 Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a malicious code classification method and device based on pre-training. The method first extracts shallow features of the malicious code and the operation codes in its subroutines to construct a feature set; it then builds an improved pre-training model, carries out a pre-training task, and obtains a final model through training; finally, the code to be tested is input into the final model to obtain a category probability distribution, and the category with the highest probability is selected as the final prediction result. The invention first extracts the operation-code sequence of each subroutine in the malicious code and then extracts the TF-IDF and Asm2Vec shallow features. Using the subroutines as the input samples for pre-training the model improves the generalization capability of the model as well as its training speed and effect. Using the shallow features as the prefix reduces the parameter scale that needs to be trained during model training, improves the universality of the pre-training model, and achieves performance comparable to the pre-training-fine-tuning paradigm.

Description

Malicious code classification method and device based on pre-training
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for classifying malicious codes based on pre-training.
Background
Malicious code refers to programs or scripts that are specifically used to infringe on computer systems, networks, and data security. To effectively combat the threat of malicious code, researchers have developed a variety of classification techniques to identify and classify different types of malicious code.
Among them, static analysis and dynamic analysis are the two basic classification techniques. Static analysis detects and identifies malicious behavior by statically scanning and analyzing the executable files of malicious code. Its main techniques include disassembly, decompilation, code analysis, and the like. Disassembly translates the machine code in an executable file into human-readable assembly code, making it easier to analyze the execution logic and determine whether there is malicious activity. Unlike static analysis, dynamic analysis determines whether a sample is malicious by running it in a virtual environment and observing its runtime behavior and characteristics.
In addition to static and dynamic analysis, there are other classification techniques, such as machine-learning-based classification. This technique collects and analyzes a large number of malicious code samples and their features, trains a machine learning model, and then automatically identifies and classifies new malicious code. Common machine learning algorithms include support vector machines, decision trees, random forests, neural networks, and the like. These algorithms can automatically learn and extract features of malicious code in order to classify and identify new malicious code. In addition, there are hybrid classification techniques, such as combining static and dynamic analysis, or combining machine learning with static analysis.
However, in the existing machine-learning-based classification technology, the model has difficulty handling large-scale malicious samples, and its performance is poor.
Disclosure of Invention
The invention provides a pre-training-based malicious code classification method and device, which are used for solving or at least partially solving the technical problems that a model is difficult to process large-scale malicious samples and the performance of the model is poor in the prior art.
In order to solve the technical problem, a first aspect of the present invention provides a malicious code classification method based on pre-training, including:
extracting shallow features and the operation codes in subroutines from the malicious codes contained in a preset data set, and constructing a feature set according to the shallow features and the operation codes, wherein the shallow features comprise TF-IDF features and Asm2Vec features, the TF-IDF features comprise readable character string sequence features, and the Asm2Vec features are semantic information features related to the code execution logic in the assembly file;
pre-training an improved pre-training model by taking the malicious codes contained in the preset data set and the constructed feature set as a training data set to obtain a final model, wherein the improved pre-training model comprises a prefix tuning structure, an embedding layer, a one-dimensional convolutional neural network, a position encoder and a Transformer encoding layer; the prefix tuning structure is used for segmenting its output vector into a plurality of segments that are input to the Transformer encoding layer, the embedding layer is used for embedding the three-dimensional tensor of the operation-code features in the subroutines to obtain a four-dimensional tensor, the one-dimensional convolutional neural network is used for obtaining a new feature sequence vector from the four-dimensional tensor produced by the embedding layer, the position encoder is used for encoding the position information of each word or token in the input sequence into the new feature sequence vector, and the Transformer encoding layer consists of a plurality of stacked encoders, each encoder comprising a multi-head attention layer and a feedforward network layer, the multi-head attention layer being used for performing attention calculation on the input vector to obtain an output vector, and the feedforward network layer being used for obtaining the encoded vector from the input vector and the output vector;
and inputting the code to be tested into a final model to obtain category probability distribution, and selecting the category with the highest probability according to the category probability distribution as a final prediction result.
In one embodiment, the prefix tuning structure is a prefix-tuning model structure formed by a multi-layer fully-connected neural network: the input layer nodes are set to the sequence length of the prefix, length conversion is performed through a hidden layer, and the output layer node length is set to a node length suitable for deep-prefix processing, where deep-prefix processing splits the vector output by the output layer into a plurality of segments that are input to the multi-head self-attention layers in the Transformer encoding layer of the pre-training model and spliced with the key vectors K and value vectors V used in those multi-head attention layers.
In one embodiment, pre-training the improved pre-training model with malicious code contained in a preset data set and the constructed feature set as training data sets to obtain a final model, including:
inputting the operation codes in the training data set into the improved pre-training model for training;
freezing parameters of the pre-training model, and inputting the extracted shallow features as prefixes into a prefix tuning structure for training.
In one embodiment, the embedding layer is specifically configured to look up, from an embedding matrix of size V×D, an embedding vector for each element in the given input, where V represents the number of rows of the embedding matrix and D represents the dimension of each embedding vector; the input of the embedding layer is a three-dimensional tensor of the operation-code features in the subroutines, a tensor of shape (B, S, L), where B is the number of samples processed in the batch, S is the number of subroutines contained in the batch of samples, L is the number of operation codes contained in a subroutine, and the three-dimensional tensor contains an index for each word to be retrieved from the embedding matrix; the output of the embedding layer is a four-dimensional tensor of shape (B, S, L, D), containing an embedding vector for each input word.
In one embodiment, the calculation formula of the position encoder is:
PE(pos, m) = sin(pos / 10000^(m/d)) for even dimension indices m, and PE(pos, m) = cos(pos / 10000^((m-1)/d)) for odd dimension indices m,
where pos denotes the position of the word or token in the sequence, m denotes the dimension index of the output vector of the position encoder, and d denotes the dimension of the input vector. The formula means that the position encoder encodes each position as a vector, each element of the vector is a sine or cosine function, and the coefficients of the function differ depending on the position and the dimension.
In one embodiment, the formulas of the Transformer encoding layer are:
h = LN(x + MultiHeadAttention(x)),  y = LN(h + FFN(h))
where x represents the input vector, LN represents the normalization layer, and FFN is the feedforward network layer. The formulas mean that each encoder first processes the input vector x through the multi-head self-attention layer to obtain an output vector; the input vector x and this output vector are then added and normalized; finally, the normalized vector h is taken as the input of the feedforward network layer, and the calculation is repeated through the stacked encoders until the final output of the Transformer encoding layer is obtained.
In one embodiment, the pre-training task in the pre-training process adopts an MLM task, wherein the MLM task refers to randomly masking some words or marks in an input sequence, and then enabling a model to predict the masked words or marks;
the training target of the model is that the relation between the operation code representations is learned through the MLM task, and the calculation formula of the pre-training task is as follows:
where n represents the number of training samples, l represents the number of covered words or tokens in each sample,indicate->Sample No.)>Actual value of individual covered words or marks, < >>Representing the probability that the model predicts the word or token, < +.>Is the loss function of the MLM task.
Based on the same inventive concept, a second aspect of the present invention provides a malicious code classification device based on pre-training, comprising:
the feature extraction module is used for extracting shallow features and operation codes in subroutines from malicious codes contained in a preset data set, and constructing a feature set according to the shallow features and the operation codes, wherein the shallow features comprise TF-IDF features and Asm2Vec features, the TF-IDF features comprise readable character string sequence features, and the Asm2Vec features are semantic information features related to code execution logic in an assembly file;
the device comprises a pre-training module, a pre-training module and a position encoder, wherein the pre-training module is used for pre-training an improved pre-training model by taking malicious codes contained in a preset data set and a constructed feature set as training data sets to obtain a final model, the improved pre-training model comprises a prefix tuning structure, an embedding layer, a one-dimensional convolutional neural network, the position encoder and a transducer encoding layer, the prefix tuning structure is used for segmenting an output vector into a plurality of sections of input transducer encoding layers, the embedding layer is used for embedding three-dimensional tensors of operation features in a subprogram to obtain four-dimensional tensors, the one-dimensional convolutional neural network is used for obtaining a new feature sequence vector according to the four-dimensional tensors obtained by the embedding layer, the position encoder is used for encoding position information of each word or mark in an input sequence into the new feature sequence vector, the transducer encoding layer comprises a plurality of stacked encoders, each encoder comprises a multi-head attention layer and a feedforward network layer, and the feedforward network layer is used for performing attention calculation according to the input vector to obtain an output vector;
and the classification module is used for inputting the codes to be detected into the final model to obtain category probability distribution, and selecting the category with the highest probability according to the category probability distribution as a final prediction result.
Based on the same inventive concept, a third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed, implements the method of the first aspect.
Based on the same inventive concept, a fourth aspect of the present invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, said processor implementing the method according to the first aspect when executing said program.
Compared with the prior art, the invention has the following advantages and beneficial technical effects:
the invention provides a malicious code classification method and device based on pre-training, which comprises the steps of firstly extracting an operation code and shallow features of a subroutine (sub-program) in a malicious code: TF-IDF features and Asm2Vec features. The subtutine is used for pre-training the input samples of the pre-training model, so that the generalization capability of the model can be improved, and the training speed and effect of the model can be improved. The three-dimensional input (batch, seq, emudding) form of the pre-training model is changed into four-dimensional input (batch, sub, seq, emudding), so that the input scale of the model can be enlarged without enlarging the parameter number of the model, and the dilemma that the input sequence length is insufficient, and thus large-scale malicious codes are difficult to input the pre-training model is solved. Meanwhile, the improved pre-training model comprises a prefix optimizing structure, the output vector is segmented into a plurality of segments of input transform coding layers, a pre-training method is adopted, shallow layer characteristics can be used as pre-fix input, and pre-fix is trained based on the pre-training model, so that the parameter scale required to be trained in the model training process can be reduced, the universality of the pre-training model can be improved, and the performance equivalent to that of a pre-training-fine tuning paradigm can be realized.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a pre-training based malicious code classification method in an embodiment of the invention;
FIG. 2 is a schematic diagram of the process of extracting the shallow features and the opcodes in subroutines according to an embodiment of the present invention.
Detailed Description
The invention provides a malicious code classification method and device based on pre-training. Shallow features of the malicious code and the operation codes in its subroutines are first extracted to construct a feature set; an improved pre-training model is then built, a pre-training task is carried out, and a final model is obtained through training; finally, the code to be tested is input into the final model to obtain a category probability distribution, and the category with the highest probability is selected as the final prediction result. The invention first extracts the operation-code sequence of each subroutine in the malicious code and then extracts the TF-IDF and Asm2Vec shallow features. Using the subroutines as the input samples for pre-training the model improves the generalization capability of the model as well as its training speed and effect. Using the shallow features as the prefix reduces the parameter scale that needs to be trained during model training, improves the universality of the pre-training model, and achieves performance comparable to the pre-training-fine-tuning paradigm.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
The embodiment of the invention provides a malicious code classification method based on pre-training, referring to fig. 1, the method comprises the following steps:
s1: extracting shallow features and operation codes in subroutines from malicious codes contained in a preset data set, and constructing a feature set according to the shallow features and the operation codes, wherein the shallow features comprise TF-IDF features and Asm2Vec features, the TF-IDF features comprise readable character string sequence features, and the Asm2Vec features are semantic information features logically related to code execution in an assembly file;
s2: pre-training an improved pre-training model by taking a malicious code contained in a preset data set and a constructed feature set as a training data set to obtain a final model, wherein the improved pre-training model comprises a prefix tuning structure, an embedding layer, a one-dimensional convolutional neural network, a position encoder and a transducer encoding layer, the prefix tuning structure is used for segmenting an output vector into a plurality of sections of input transducer encoding layers, the embedding layer is used for embedding three-dimensional tensors of operation features in a subprogram to obtain four-dimensional tensors, the one-dimensional convolutional neural network is used for obtaining a new feature sequence vector according to the four-dimensional tensors obtained by the embedding layer, the position encoder is used for encoding position information of each word or mark in an input sequence into the new feature sequence vector, the transducer encoding layer consists of a plurality of stacked encoders, each encoder comprises a multi-head attention layer and a feedforward network layer, the multi-head attention layer is used for performing attention calculation according to the input vector to obtain an output vector, and the feedforward network layer is used for obtaining the encoded vector according to the input vector and the output vector;
s3: and inputting the code to be tested into a final model to obtain category probability distribution, and selecting the category with the highest probability according to the category probability distribution as a final prediction result.
Specifically, regarding feature extraction in S1, it can be achieved by:
for the operation code sequence, the sub-program in the code segment is selected as a mode of dividing the code segment, and because the program is generally divided into a plurality of sub-programs, each sub-program is responsible for completing a specific task, the program structure is clear, the readability is high, and the organization and the maintenance of the code are convenient. And extracting the operation code of each sub-grouping section, arranging the operation codes into an operation code sequence according to the sequence of the operation codes, and then juxtaposing the operation code sequences of all sub-grouping sections of the sample as a two-dimensional sequence.
For TF-IDF characteristics, extracting an operation code, a register name of a first operand and annotation content of an assembly file from a row with the operation code in an assembly file 'Segment type: pure code' Segment, and adding a semicolon sign before each function or main program; making word segmentation;
for Asm2Vec features, extracting the operation code semantics of 'reduced' operation code in the row with operation code in the section of the assembly file 'Segment type: pure code', including the operation code, the register of the first operand and the annotation content of the assembly file, abstracting each sub-function into a sentence as a corpus, and the extraction process is shown in FIG. 2.
In fig. 2, malicious code is shown on the left, and sub-program is used as a unit for dividing code segments, and extracted operation codes (sequences) representing the extracted operation codes are shown on the right, and TF-IDF features and Asm2Vec features are shown on the right.
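As an illustration of this extraction step, the following sketch (not the patent's implementation; the ".text:<address> <opcode>" line format, the "proc near" subroutine marker, and the helper names are assumptions about the disassembler output) splits a disassembly listing into subroutines, collects each subroutine's opcode sequence, and computes a TF-IDF shallow feature over the resulting tokens:

```python
# Illustrative sketch: split a disassembly into subroutines, collect each
# subroutine's opcode sequence, and build a TF-IDF shallow feature.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

OPCODE_RE = re.compile(r"^\.text:[0-9A-F]+\s+([a-z]+)\b", re.IGNORECASE)

def extract_subroutine_opcodes(asm_lines):
    """Return a two-dimensional sequence: one opcode list per subroutine."""
    subroutines, current = [], []
    for line in asm_lines:
        if "proc near" in line:            # assumed marker for the start of a subroutine
            if current:
                subroutines.append(current)
            current = []
        match = OPCODE_RE.match(line)
        if match:
            current.append(match.group(1).lower())
    if current:
        subroutines.append(current)
    return subroutines

def tfidf_features(sample_documents):
    """sample_documents: one whitespace-separated token string per sample."""
    vectorizer = TfidfVectorizer(token_pattern=r"[^\s;]+")
    return vectorizer.fit_transform(sample_documents), vectorizer
```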
In one embodiment, the prefix tuning structure is a prefix-tuning model structure formed by a multi-layer fully-connected neural network: the input layer nodes are set to the sequence length of the prefix, length conversion is performed through a hidden layer, and the output layer node length is set to a node length suitable for deep-prefix processing, where deep-prefix processing splits the vector output by the output layer into a plurality of segments that are input to the multi-head self-attention layers in the Transformer encoding layer of the pre-training model and spliced with the key vectors K and value vectors V used in those multi-head attention layers.
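The prefix tuning structure described above can be sketched, for example, as a small fully-connected network that maps the prefix sequence to per-layer key/value prefixes; the PyTorch module below is only a hedged illustration of such a structure, and its layer sizes and shapes are assumptions rather than the patent's exact configuration:

```python
# Hedged PyTorch sketch of the prefix tuning structure: a fully-connected network
# maps a prefix sequence to per-layer key/value prefixes ("deep prefix") that are
# later spliced into the encoder's multi-head attention.
import torch
import torch.nn as nn

class PrefixTuning(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_layers, n_heads, head_dim):
        super().__init__()
        self.n_layers, self.n_heads, self.head_dim = n_layers, n_heads, head_dim
        # per prefix position the output holds a K and a V slice for every layer and head
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 2 * n_layers * n_heads * head_dim),
        )

    def forward(self, prefix):                       # prefix: (batch, prefix_len, in_dim)
        b, p, _ = prefix.shape
        out = self.mlp(prefix)
        out = out.view(b, p, 2, self.n_layers, self.n_heads, self.head_dim)
        # one (K_prefix, V_prefix) pair per encoder layer
        return [(out[:, :, 0, layer], out[:, :, 1, layer]) for layer in range(self.n_layers)]
```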
In one embodiment, pre-training the improved pre-training model with malicious code contained in a preset data set and the constructed feature set as training data sets to obtain a final model, including:
inputting the operation codes in the training data set into the improved pre-training model for training;
freezing parameters of the pre-training model, and inputting the extracted shallow features as prefixes into a prefix tuning structure for training.
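A hedged sketch of this two-stage procedure is shown below; the optimizer, learning rate, and the signature by which the prefix key/value vectors are passed to the pre-trained model are all assumptions for illustration:

```python
# Hedged sketch of the second training stage: the backbone has been pre-trained on the
# opcode inputs; its parameters are frozen and only the prefix-tuning module is trained
# on the shallow features.
import torch
import torch.nn.functional as F

def train_prefix_stage(pretrained_model, prefix_module, loader, epochs=3, lr=1e-3):
    for param in pretrained_model.parameters():
        param.requires_grad = False                    # freeze the pre-trained backbone
    optimizer = torch.optim.Adam(prefix_module.parameters(), lr=lr)
    for _ in range(epochs):
        for shallow_feats, opcode_ids, labels in loader:
            prefix_kv = prefix_module(shallow_feats)             # deep-prefix K/V per layer
            logits = pretrained_model(opcode_ids, prefix_kv)     # assumed call signature
            loss = F.cross_entropy(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```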
Specifically, in the training process, the operation-code features extracted from the subroutines are first embedded by the Embedding layer, and the resulting four-dimensional embedded tensor is then sent into the one-dimensional convolutional neural network. The convolutional layer performs a convolution operation on the input sequence by sliding a convolution kernel to obtain a new feature sequence; the output of the convolutional layer can be regarded as the similarity between different positions of the input sequence and the convolution kernel, and is expressed by the following formula:
y_i = f( Σ_{j=1}^{k} w_j · x_{i+j-1} + b )
where x is the input sequence, y is the output sequence of the convolutional layer, w is the weight parameter of the convolution kernel, b is a bias term, k is the length of the convolution kernel, and f is an activation function. The formula means that, starting at position i of the input sequence x, the convolution kernel weights and sums the values of the i-th to (i+k-1)-th positions, adds the bias term b, and applies a nonlinear transformation through the activation function f to obtain the output y_i.
The tensor obtained from the one-dimensional convolutional layer is then augmented with position information by the position encoder, which encodes the position information of each word or token in the input sequence into a vector representation. Finally, the tensor augmented with position information is input to the Transformer encoding layer, which consists of a plurality of stacked encoders, each encoder comprising two sub-layers: a multi-head self-attention layer and a feedforward network layer.
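A condensed forward-pass sketch of such an improved model is given below. It assumes (this is not stated verbatim in the patent) that the subroutine dimension is folded into the batch for the embedding and convolution stages and that each subroutine is pooled to a single vector before entering the Transformer encoder; a learned position embedding stands in for the sinusoidal encoder, whose formula and sketch appear further below:

```python
# Condensed forward-pass sketch of the improved model (assumptions noted in the lead-in).
import torch
import torch.nn as nn

class ImprovedPretrainModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, n_layers=4, n_heads=8, max_subs=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1)
        self.pos = nn.Embedding(max_subs, emb_dim)   # learned positions; sinusoidal variant below
        enc_layer = nn.TransformerEncoderLayer(emb_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)

    def forward(self, opcode_ids):                   # opcode_ids: (B, S, L) integer indices
        B, S, L = opcode_ids.shape
        x = self.embed(opcode_ids)                   # (B, S, L, D): the four-dimensional tensor
        x = x.view(B * S, L, -1).transpose(1, 2)     # fold subroutines into the batch for Conv1d
        x = self.conv(x).transpose(1, 2)             # (B*S, L, D): new feature sequence
        x = x.mean(dim=1).view(B, S, -1)             # one vector per subroutine (assumed pooling)
        positions = torch.arange(S, device=x.device)
        x = x + self.pos(positions)                  # add position information
        return self.encoder(x)                       # (B, S, D) encoded subroutine sequence
```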
In one embodiment, the embedding layer is specifically configured to look up, from an embedding matrix of size V×D, an embedding vector for each element in the given input, where V represents the number of rows of the embedding matrix and D represents the dimension of each embedding vector; the input of the embedding layer is a three-dimensional tensor of the operation-code features in the subroutines, a tensor of shape (B, S, L), where B is the number of samples processed in the batch, S is the number of subroutines contained in the batch of samples, L is the number of operation codes contained in a subroutine, and the three-dimensional tensor contains an index for each word to be retrieved from the embedding matrix; the output of the embedding layer is a four-dimensional tensor of shape (B, S, L, D), containing an embedding vector for each input word.
In one embodiment, the calculation formula of the position encoder is:
PE(pos, m) = sin(pos / 10000^(m/d)) for even dimension indices m, and PE(pos, m) = cos(pos / 10000^((m-1)/d)) for odd dimension indices m,
where pos denotes the position of the word or token in the sequence, m denotes the dimension index of the output vector of the position encoder, and d denotes the dimension of the input vector. The formula means that the position encoder encodes each position as a vector, each element of the vector is a sine or cosine function, and the coefficients of the function differ depending on the position and the dimension.
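A standard sinusoidal position encoder matching this formula can be sketched as follows (the constant 10000 and the maximum length are the usual Transformer defaults and may differ from the patent's implementation):

```python
# Sinusoidal position encoder: even dimensions use sine, odd dimensions use cosine.
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)     # even dimensions: sine
        pe[:, 1::2] = torch.cos(pos * div)     # odd dimensions: cosine
        self.register_buffer("pe", pe)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]
```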
In one embodiment, the formulas of the Transformer encoding layer are:
h = LN(x + MultiHeadAttention(x)),  y = LN(h + FFN(h))
where x represents the input vector, LN represents the normalization layer, and FFN is the feedforward network layer. The formulas mean that each encoder first processes the input vector x through the multi-head self-attention layer to obtain an output vector; the input vector x and this output vector are then added and normalized; finally, the normalized vector h is taken as the input of the feedforward network layer, and the calculation is repeated through the stacked encoders until the final output of the Transformer encoding layer is obtained.
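The two formulas above correspond to a post-norm encoder layer; a minimal sketch, with dropout and other production details omitted, is:

```python
# Minimal post-norm Transformer encoder layer following the two formulas above.
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        a, _ = self.attn(x, x, x)             # multi-head self-attention output
        h = self.norm1(x + a)                 # add & normalize with the input vector
        return self.norm2(h + self.ffn(h))    # feedforward, then add & normalize again
```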
In one embodiment, the pre-training task in the pre-training process adopts an MLM task, wherein the MLM task refers to randomly masking some words or marks in an input sequence, and then enabling a model to predict the masked words or marks;
the training target of the model is to learn the relationships between the operation-code representations through the MLM task, and the calculation formula of the pre-training task is as follows:
L_MLM = -(1/n) Σ_{i=1}^{n} Σ_{j=1}^{l} log p(y_{i,j})
where n represents the number of training samples, l represents the number of masked words or tokens in each sample, y_{i,j} denotes the actual value of the j-th masked word or token in the i-th sample, p(y_{i,j}) represents the probability that the model predicts that word or token, and L_MLM is the loss function of the MLM task.
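For illustration, an MLM training step consistent with this loss can be sketched as follows; the masking probability, the [MASK] token id, and the shape of the model output are assumptions:

```python
# Hedged MLM step: randomly mask opcode tokens and compute the negative
# log-likelihood of the masked positions only, as in the formula above.
import torch
import torch.nn.functional as F

MASK_ID = 1          # assumed id of the [MASK] token in the opcode vocabulary
IGNORE_INDEX = -100  # unmasked positions are excluded from the loss

def mlm_step(model, opcode_ids, mask_prob=0.15):
    labels = opcode_ids.clone()
    masked = torch.rand(opcode_ids.shape, device=opcode_ids.device) < mask_prob
    labels[~masked] = IGNORE_INDEX                    # predict only the masked tokens
    inputs = opcode_ids.masked_fill(masked, MASK_ID)
    logits = model(inputs)                            # assumed shape (..., vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1), ignore_index=IGNORE_INDEX)
```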
The method according to the invention is described below by way of specific examples.
A malicious code classification method comprising the steps of:
1. Shallow features in the malicious code and the operation codes in its subroutines are extracted, and a data set for training is constructed.
In 2015, Microsoft published a data set named the Malware Classification Dataset for research and evaluation of malware (malicious code) classification and detection tasks. The embodiment of the invention adopts this data set, which is widely used in malware classification and detection tasks and provides researchers with a standard data set for algorithm research and performance evaluation. It can help researchers develop effective malware detection algorithms and improve network security and the security of computer systems.
Shallow features of malicious code include term frequency-inverse document frequency (TF-IDF) and Asm2Vec.
TF-IDF is computed over the readable character strings and the operation-code sequences: the more often a word or piece of assembly code appears in one sample and the fewer samples it appears in overall, the more representative it is of that sample.
For the operation-code sequence feature, the operation codes of each subroutine segment are extracted and arranged into one operation-code sequence in their original order, and the operation-code sequences of all subroutines of the sample are then juxtaposed as a two-dimensional sequence.
For the readable-string TF-IDF feature, the operation code, the register name of the first operand, and the annotation content of the assembly file are extracted from each row that contains an operation code in the 'Segment type: Pure code' segment of the assembly file. A semicolon is added before each function or main program, and these symbols are used for word segmentation.
For the Asm2Vec feature: the operation-code semantics (operation code, register, annotation content) of the 'reduced' operation codes are extracted from the rows that contain an operation code in the 'Segment type: Pure code' segment of the assembly file, and each sub-function is abstracted into a sentence to serve as the corpus.
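A rough sketch of producing Asm2Vec-style shallow features from this corpus is given below; it trains a plain Word2Vec model over the subroutine 'sentences' and mean-pools token vectors into a fixed-length feature, which is only an approximation of the actual Asm2Vec algorithm (the averaging step in particular is an assumption):

```python
# Approximate Asm2Vec-style features: each subroutine is one "sentence" of opcode
# semantics tokens; a Word2Vec model is trained on the corpus and token vectors are
# mean-pooled into one fixed-length vector per sample.
import numpy as np
from gensim.models import Word2Vec

def asm2vec_like_features(samples, dim=100):
    """samples: list of samples, each a list of sentences, each a list of tokens."""
    corpus = [sentence for sample in samples for sentence in sample]
    w2v = Word2Vec(corpus, vector_size=dim, window=5, min_count=1, workers=4)
    features = []
    for sample in samples:
        tokens = [t for sentence in sample for t in sentence if t in w2v.wv]
        features.append(np.mean([w2v.wv[t] for t in tokens], axis=0)
                        if tokens else np.zeros(dim))
    return np.stack(features), w2v
```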
2. And constructing an improved pre-training model, performing a pre-training task, and obtaining a final model through training.
The improved pre-training model adds a prefix-tuning model structure on top of the original pre-training model. The prefix-tuning model structure is composed of a multi-layer fully-connected neural network: the input layer nodes are set to the sequence length of the prefix, the length is transformed through a hidden layer, and the output layer node length is set to a node length suitable for deep-prefix processing, where deep-prefix processing splits the vector output by the output layer into multiple segments that are input to the multi-head self-attention layers of the pre-training model.
The training specifically comprises the following steps:
(1) Inputting the operation codes of the subroutines in the data set into the improved pre-training model for training
The operation-code features extracted from the subroutines are first embedded by the Embedding layer.
The Embedding layer looks up, from an embedding matrix of size V×D, an embedding vector for each element in the given input, where V is the number of rows of the embedding matrix (i.e., the vocabulary size) and D is the dimension of each embedding vector.
The embedded four-dimensional tensor obtained in the previous step (Embedding layer) is then fed into a one-dimensional convolutional neural network.
The tensor obtained from the one-dimensional convolutional layer is then added with its position information by a position encoder. The position encoder is used to encode each word or tag position information in the input sequence into a vector representation.
Finally, the tensor augmented with the position information is input to the Transformer encoding layer, which consists of a plurality of stacked encoders. Each encoder includes two sub-layers: a multi-head self-attention layer and a feedforward network layer.
In this embodiment, the three-dimensional input of the conventional Transformer is changed to a four-dimensional input, and the number of tokens (a token is generally a basic unit of source code; in the present invention it is a 'mark') processed by the model is increased without increasing the number of model parameters.
The pre-training task uses the Masked Language Model (MLM) task, which randomly masks some words or tokens in the input sequence and then lets the model predict the masked words or tokens. The training goal of the BERT model is to learn the relationships between the operation-code representations through this task.
(2) Freezing the parameters of the pre-training model, and inputting the extracted shallow features as the prefix into the prefix-tuning model structure for training.
Freezing the parameters of the pre-training model reduces the scale of the parameters to be trained and allows the pre-training model to be applied to different prefix-tuning tasks. In the specific implementation, the TF-IDF and Asm2Vec features are converted into fixed-length vectors through a TF-IDF vectorizer and Word2Vec, the vectors are input into the prefix-tuning model structure of the improved pre-training model for training, the vectors output by the prefix-tuning model structure are split by the deep-prefix processing and input into the multi-head self-attention layers of the pre-training model, and the split vectors are spliced with the key vectors K and value vectors V used in the self-attention sub-layers of the pre-training model.
The split vector is spliced with a key vector K and a value vector V used in a multi-head self-attention sub-layer in the pre-training model. The key vector K is a vector used by the self-attention layer to calculate the attention weight, and represents the representation of the currently input representation in the key space. Similar to the query vector, the key vector K is obtained by multiplying the input vector by a key matrix (key matrix), which is also a trainable parameter of the model. The value vector V is a vector used by the self-attention layer to calculate the output vector, which represents the representation of the current input in the value space. Similar to the query vector and the key vector, the value vector is also obtained by multiplying the input vector by a value matrix (value matrix).
In the self-attention layer, the computation of the query vector, key vector and value vector are all independent and they are all obtained by multiplying the input vector by different matrices. Then, the attention weight calculated by the query vector and the key vector is multiplied by the value vector, and the result is weighted and summed to obtain an output vector. The dimension of the output vector is the same as the dimension of the value vector.
The deep prefix is the vector obtained by splicing the prefix vector with the key vector K and the value vector V in all attention sub-layers of the pre-training model, which improves the performance of the model.
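The splicing of the deep-prefix keys and values with the attention layer's own K and V can be sketched, in single-head simplified form, as follows (the projection matrices and shapes follow the usual attention convention and are not the patent's code):

```python
# Single-head sketch of prefix-augmented attention: the deep-prefix keys and values
# are concatenated in front of the keys and values computed from the input itself.
import torch
import torch.nn.functional as F

def prefix_attention(x, Wq, Wk, Wv, prefix_k, prefix_v):
    """x: (batch, seq, d); prefix_k, prefix_v: (batch, prefix_len, d); W*: (d, d)."""
    q = x @ Wq                                    # queries come from the input only
    k = torch.cat([prefix_k, x @ Wk], dim=1)      # splice prefix keys before the input keys
    v = torch.cat([prefix_v, x @ Wv], dim=1)      # splice prefix values before the input values
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v          # (batch, seq, d) output vectors
```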
3. And inputting the code to be tested into a final model to obtain category probability distribution, and selecting the category with the highest probability as a final prediction result.
Model training
After training, the output vectors of malicious code samples of the same category should be similar feature vectors. Model training minimizes the cross entropy between the model predictions and the real labels through a cross-entropy loss function, so that the model can better fit the data. The formula of the cross-entropy loss function is as follows:
L = - Σ_{c=1}^{C} y_c log(p_c)
where C represents the number of categories, y_c is the one-hot encoding of the true label, and p_c is the probability of the c-th class in the probability distribution output by the model.
Model prediction
After the model is trained, a vector is obtained by inputting the subroutine operation codes and shallow features of a malicious code sample, and the vector is converted into a probability distribution through a softmax function, so that the probability of each category is between 0 and 1 and the probabilities sum to 1. Based on the resulting class probability distribution, the class with the highest probability can be selected as the final prediction result. A threshold can also be set to decide the predicted category.
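A small sketch of this prediction step, with an optional confidence threshold (the threshold value itself is an assumption), is:

```python
# Prediction sketch: softmax over the classification logits, then argmax or an
# optional confidence threshold to decide the predicted malware family.
import torch
import torch.nn.functional as F

def predict_family(logits, threshold=None):
    """logits: (num_classes,) raw scores from the classification head."""
    probs = F.softmax(logits, dim=-1)             # probabilities in [0, 1] summing to 1
    best = int(torch.argmax(probs))
    if threshold is not None and float(probs[best]) < threshold:
        return None, probs                        # below threshold: no confident prediction
    return best, probs
```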
The invention first extracts the operation-code sequence of each subroutine in the malicious code and then extracts the TF-IDF and Asm2Vec shallow features. Using the subroutines as the input samples for pre-training the model improves the generalization capability of the model as well as its training speed and effect. Using the shallow features as the prefix reduces the parameter scale that needs to be trained during model training, improves the universality of the pre-training model, and achieves performance comparable to the pre-training-fine-tuning paradigm.
Example two
Based on the same inventive concept, the embodiment discloses a malicious code classification device based on pre-training, which comprises:
the feature extraction module is used for extracting shallow features and operation codes in subroutines from malicious codes contained in a preset data set, and constructing a feature set according to the shallow features and the operation codes, wherein the shallow features comprise TF-IDF features and Asm2Vec features, the TF-IDF features comprise readable character string sequence features, and the Asm2Vec features are semantic information features related to code execution logic in an assembly file;
the device comprises a pre-training module, a pre-training module and a position encoder, wherein the pre-training module is used for pre-training an improved pre-training model by taking malicious codes contained in a preset data set and a constructed feature set as training data sets to obtain a final model, the improved pre-training model comprises a prefix tuning structure, an embedding layer, a one-dimensional convolutional neural network, the position encoder and a transducer encoding layer, the prefix tuning structure is used for segmenting an output vector into a plurality of sections of input transducer encoding layers, the embedding layer is used for embedding three-dimensional tensors of operation features in a subprogram to obtain four-dimensional tensors, the one-dimensional convolutional neural network is used for obtaining a new feature sequence vector according to the four-dimensional tensors obtained by the embedding layer, the position encoder is used for encoding position information of each word or mark in an input sequence into the new feature sequence vector, the transducer encoding layer comprises a plurality of stacked encoders, each encoder comprises a multi-head attention layer and a feedforward network layer, and the feedforward network layer is used for performing attention calculation according to the input vector to obtain an output vector;
and the classification module is used for inputting the codes to be detected into the final model to obtain category probability distribution, and selecting the category with the highest probability according to the category probability distribution as a final prediction result.
Since the device described in the second embodiment of the present invention is a device for implementing the pretrained malicious code classification method in the first embodiment of the present invention, based on the method described in the first embodiment of the present invention, a person skilled in the art can understand the specific structure and the deformation of the device, and therefore, the description thereof is omitted herein. All devices used in the method of the first embodiment of the present invention are within the scope of the present invention.
Example III
Based on the same inventive concept, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed, implements the method as described in embodiment one.
Because the computer readable storage medium described in the third embodiment of the present invention is a computer readable storage medium used for implementing the pretrained malicious code classification method in the first embodiment of the present invention, based on the method described in the first embodiment of the present invention, a person skilled in the art can understand the specific structure and the modification of the computer readable storage medium, and therefore, the description thereof is omitted here. All computer readable storage media used in the method according to the first embodiment of the present invention are included in the scope of protection.
Example IV
Based on the same inventive concept, the present application also provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method in the first embodiment when executing the program.
Because the computer device described in the fourth embodiment of the present invention is a computer device used for implementing the pretrained malicious code classification method in the first embodiment of the present invention, based on the method described in the first embodiment of the present invention, a person skilled in the art can understand the specific structure and the deformation of the computer device, and therefore, the description thereof is omitted herein. All computer devices used in the method of the first embodiment of the present invention are within the scope of the present invention.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various modifications and variations can be made to the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims and the equivalents thereof, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A pre-training based malicious code classification method, comprising:
extracting shallow features and the operation codes in subroutines from the malicious codes contained in a preset data set, and constructing a feature set according to the shallow features and the operation codes, wherein the shallow features comprise TF-IDF features and Asm2Vec features, the TF-IDF features comprise readable character string sequence features, and the Asm2Vec features are semantic information features related to the code execution logic in the assembly file;
pre-training an improved pre-training model by taking the malicious codes contained in the preset data set and the constructed feature set as a training data set to obtain a final model, wherein the improved pre-training model comprises a prefix tuning structure, an embedding layer, a one-dimensional convolutional neural network, a position encoder and a Transformer encoding layer; the prefix tuning structure is used for segmenting its output vector into a plurality of segments that are input to the Transformer encoding layer, the embedding layer is used for embedding the three-dimensional tensor of the operation-code features in the subroutines to obtain a four-dimensional tensor, the one-dimensional convolutional neural network is used for obtaining a new feature sequence vector from the four-dimensional tensor produced by the embedding layer, the position encoder is used for encoding the position information of each word or token in the input sequence into the new feature sequence vector, and the Transformer encoding layer consists of a plurality of stacked encoders, each encoder comprising a multi-head attention layer and a feedforward network layer, the multi-head attention layer being used for performing attention calculation on the input vector to obtain an output vector, and the feedforward network layer being used for obtaining the encoded vector from the input vector and the output vector;
and inputting the code to be tested into a final model to obtain category probability distribution, and selecting the category with the highest probability according to the category probability distribution as a final prediction result.
2. The method for classifying malicious codes based on pre-training according to claim 1, wherein the prefix tuning structure is a prefix-tuning model structure formed by a multi-layer fully-connected neural network: the input layer nodes are set to the sequence length of the prefix, length conversion is performed through a hidden layer, and the output layer node length is set to a node length suitable for deep-prefix processing, the deep-prefix processing being to split the vector output by the output layer into a plurality of segments that are input to the multi-head self-attention layers in the Transformer encoding layer of the pre-training model and spliced with the key vectors K and value vectors V used in those multi-head attention layers.
3. The pretrained malicious code classification method according to claim 1, wherein pretraining the improved pretrained model with the malicious code contained in the preset data set and the constructed feature set as the training data set to obtain a final model comprises:
inputting the operation codes in the training data set into the improved pre-training model for training;
freezing parameters of the pre-training model, and inputting the extracted shallow features as prefixes into a prefix tuning structure for training.
4. The pretrained malicious code classification method according to claim 1, wherein the embedding layer is specifically configured to look up, from an embedding matrix of size V×D, an embedding vector for each element in the given input, where V represents the number of rows of the embedding matrix and D represents the dimension of each embedding vector; the input of the embedding layer is a three-dimensional tensor of the operation-code features in the subroutines, a tensor of shape (B, S, L), where B is the number of samples processed in the batch, S is the number of subroutines contained in the batch of samples, L is the number of operation codes contained in a subroutine, and the three-dimensional tensor contains an index for each word to be retrieved from the embedding matrix; the output of the embedding layer is a four-dimensional tensor of shape (B, S, L, D), containing an embedding vector for each input word.
5. The pretrained malicious code classification method according to claim 1, wherein the calculation formula of the position encoder is:
PE(pos, m) = sin(pos / 10000^(m/d)) for even dimension indices m, and PE(pos, m) = cos(pos / 10000^((m-1)/d)) for odd dimension indices m,
where pos denotes the position of the word or token in the sequence, m denotes the dimension index of the output vector of the position encoder, and d denotes the dimension of the input vector. The formula means that the position encoder encodes each position as a vector, each element of the vector is a sine or cosine function, and the coefficients of the function differ depending on the position and the dimension.
6. The pretrained malicious code classification method according to claim 1, wherein the formulas of the Transformer encoding layer are:
h = LN(x + MultiHeadAttention(x)),  y = LN(h + FFN(h))
where x represents the input vector, LN represents the normalization layer, and FFN is the feedforward network layer; each encoder first processes the input vector x through the multi-head self-attention layer to obtain an output vector, the input vector x and this output vector are then added and normalized, and finally the normalized vector h is taken as the input of the feedforward network layer, the calculation being repeated through the stacked encoders until the final output of the Transformer encoding layer is obtained.
7. The pretraining-based malicious code classification method according to claim 1, wherein the pretraining task adopts an MLM task, the MLM task is to randomly cover some words or marks in an input sequence, and then a model predicts the covered words or marks;
the training target of the model is to learn the relationships between the operation-code representations through the MLM task, and the calculation formula of the pre-training task is as follows:
L_MLM = -(1/n) Σ_{i=1}^{n} Σ_{j=1}^{l} log p(y_{i,j})
where n represents the number of training samples, l represents the number of masked words or tokens in each sample, y_{i,j} denotes the actual value of the j-th masked word or token in the i-th sample, p(y_{i,j}) represents the probability that the model predicts that word or token, and L_MLM is the loss function of the MLM task.
8. A pretrained malicious code classification apparatus, comprising:
the feature extraction module is used for extracting shallow features and operation codes in subroutines from malicious codes contained in a preset data set, and constructing a feature set according to the shallow features and the operation codes, wherein the shallow features comprise TF-IDF features and Asm2Vec features, the TF-IDF features comprise readable character string sequence features, and the Asm2Vec features are semantic information features related to code execution logic in an assembly file;
the device comprises a pre-training module, a pre-training module and a position encoder, wherein the pre-training module is used for pre-training an improved pre-training model by taking malicious codes contained in a preset data set and a constructed feature set as training data sets to obtain a final model, the improved pre-training model comprises a prefix tuning structure, an embedding layer, a one-dimensional convolutional neural network, the position encoder and a transducer encoding layer, the prefix tuning structure is used for segmenting an output vector into a plurality of sections of input transducer encoding layers, the embedding layer is used for embedding three-dimensional tensors of operation features in a subprogram to obtain four-dimensional tensors, the one-dimensional convolutional neural network is used for obtaining a new feature sequence vector according to the four-dimensional tensors obtained by the embedding layer, the position encoder is used for encoding position information of each word or mark in an input sequence into the new feature sequence vector, the transducer encoding layer comprises a plurality of stacked encoders, each encoder comprises a multi-head attention layer and a feedforward network layer, and the feedforward network layer is used for performing attention calculation according to the input vector to obtain an output vector;
and the classification module is used for inputting the codes to be detected into the final model to obtain category probability distribution, and selecting the category with the highest probability according to the category probability distribution as a final prediction result.
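Below is a hedged sketch of the end-to-end pipeline described in claim 8, reusing the EncoderBlock and sinusoidal_position_encoding sketches above: opcode ids are embedded, convolved by a one-dimensional CNN into a new feature sequence vector, combined with position encodings, passed through stacked Transformer encoders, and mapped to a category probability distribution from which the highest-probability class is selected. Every dimension, layer count, and class count is an illustrative assumption, and the prefix-tuning structure and the TF-IDF/Asm2Vec feature branches are omitted for brevity.

# Illustrative sketch only; dimensions, depth and class count are assumptions, and the
# prefix-tuning structure and TF-IDF/Asm2Vec branches described in claim 8 are omitted.
import torch
import torch.nn as nn

class MalwareClassifier(nn.Module):
    def __init__(self, vocab_size: int = 4096, d_model: int = 256,
                 n_classes: int = 9, n_layers: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)                      # embedding layer
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)   # one-dimensional CNN
        self.encoder = nn.Sequential(*[EncoderBlock(d_model) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, n_classes)                           # classification head

    def forward(self, opcode_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(opcode_ids)                         # (batch, seq_len, d_model)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)   # new feature sequence vector
        x = x + sinusoidal_position_encoding(x.size(1), x.size(2)).to(x.device)
        x = self.encoder(x)                                # stacked Transformer encoders
        probs = self.head(x.mean(dim=1)).softmax(dim=-1)   # category probability distribution
        return probs

# Prediction: predicted_class = MalwareClassifier()(opcode_ids).argmax(dim=-1)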
9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when executed, implements the method of any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 7 when the program is executed.
CN202311610887.1A 2023-11-29 2023-11-29 Malicious code classification method and device based on pre-training Active CN117332419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311610887.1A CN117332419B (en) 2023-11-29 2023-11-29 Malicious code classification method and device based on pre-training

Publications (2)

Publication Number Publication Date
CN117332419A true CN117332419A (en) 2024-01-02
CN117332419B CN117332419B (en) 2024-02-20

Family

ID=89293778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311610887.1A Active CN117332419B (en) 2023-11-29 2023-11-29 Malicious code classification method and device based on pre-training

Country Status (1)

Country Link
CN (1) CN117332419B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515742A (en) * 2020-04-12 2021-10-19 南京理工大学 Internet of things malicious code detection method based on behavior semantic fusion extraction
US20220398462A1 (en) * 2021-06-14 2022-12-15 Microsoft Technology Licensing, Llc. Automated fine-tuning and deployment of pre-trained deep learning models
CN113987209A (en) * 2021-11-04 2022-01-28 浙江大学 Natural language processing method and device based on knowledge-guided prefix fine tuning, computing equipment and storage medium
CN114065199A (en) * 2021-11-18 2022-02-18 山东省计算中心(国家超级计算济南中心) Cross-platform malicious code detection method and system
US20230161567A1 (en) * 2021-11-24 2023-05-25 Microsoft Technology Licensing, Llc. Custom models for source code generation via prefix-tuning
CN114386511A (en) * 2022-01-11 2022-04-22 广州大学 Malicious software family classification method based on multi-dimensional feature fusion and model integration
CN114647723A (en) * 2022-04-18 2022-06-21 北京理工大学 Few-sample abstract generation method based on pre-training soft prompt
CN116720184A (en) * 2023-04-27 2023-09-08 厦门农芯数字科技有限公司 Malicious code analysis method and system based on generation type AI
CN117113349A (en) * 2023-08-25 2023-11-24 杭州电子科技大学 Malicious software detection method based on malicious behavior enhancement pre-training model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WENHAO MA ET AL.: "Pre-trained Model Based Feature Envy Detection", 2023 IEEE/ACM 20TH INTERNATIONAL CONFERENCE ON MINING SOFTWARE REPOSITORIES (MSR) *
XIAOMING RUAN ET AL.: "Prompt Learning for Developing Software Exploits", INTERNETWARE '23: PROCEEDINGS OF THE 14TH ASIA-PACIFIC SYMPOSIUM ON INTERNETWARE *
LIU HENGXUN; AI ZHONGLIANG: "A Malicious Code Classification Model Based on Word Vectors", Electronic Design Engineering, no. 06 *

Also Published As

Publication number Publication date
CN117332419B (en) 2024-02-20

Similar Documents

Publication Publication Date Title
Wang et al. Learning to extract attribute value from product via question answering: A multi-task approach
Meng et al. Research on denoising sparse autoencoder
CN113596007B (en) Vulnerability attack detection method and device based on deep learning
Yu et al. LSTM-based end-to-end framework for biomedical event extraction
CN115982403B (en) Multi-mode hash retrieval method and device
CN111931935A (en) Network security knowledge extraction method and device based on One-shot learning
CN113282714A (en) Event detection method based on differential word vector representation
Yin et al. Intrusion detection for capsule networks based on dual routing mechanism
EP4004827A1 (en) A computer-implemented method, a system and a computer program for identifying a malicious file
Chen et al. Survey on ai sustainability: Emerging trends on learning algorithms and research challenges
CN110969015A (en) Automatic label identification method and equipment based on operation and maintenance script
Şahin Malware detection using transformers-based model GPT-2
Pei et al. Combining multi-features with a neural joint model for Android malware detection
CN117332419B (en) Malicious code classification method and device based on pre-training
CN116361788A (en) Binary software vulnerability prediction method based on machine learning
CN113326371B (en) Event extraction method integrating pre-training language model and anti-noise interference remote supervision information
Al-Jamal et al. Image captioning techniques: A review
Meng et al. A survey on machine learning-based detection and classification technology of malware
Otsubo et al. Compiler provenance recovery for multi-cpu architectures using a centrifuge mechanism
CN111199170B (en) Formula file identification method and device, electronic equipment and storage medium
Sharma et al. Optical Character Recognition Using Hybrid CRNN Based Lexicon-Free Approach with Grey Wolf Hyperparameter Optimization
Li et al. Prior knowledge integrated with self-attention for event detection
Rastogi et al. Dimensionality Reduction Approach for High Dimensional Data using HGA based Bio Inspired Algorithm
Vadavalli et al. Deep Learning based truth discovery algorithm for research the genuineness of given text corpus
Jiang et al. Multi-label Detection Method for Smart Contract Vulnerabilities Based on Expert Knowledge and Pre-training Technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant