CN115879109A - Malware identification method based on visual Transformer
Publication number: CN115879109A; Application number: CN202310063452.3A (China). Legal status: Granted.
Abstract
A malware identification method based on a visual Transformer belongs to the technical field of software security protection. It comprises: visualizing the executable files of benign/malicious software as RGB images and constructing a malware image dataset; pre-training a visual Transformer with the ImageNet-21K image dataset and fine-tuning it with the malware image dataset; constructing a lightweight visual Transformer for actual deployment on lightweight devices; migrating the knowledge of the well-trained visual Transformer to the lightweight visual Transformer through knowledge distillation to reduce the performance gap between the two models; and performing malware detection and family classification with the lightweight visual Transformer. The method guarantees the detection efficiency of the model, low hardware resource occupation, and high detection and family classification accuracy.
Description
Technical Field
The invention relates to the technical field of software security protection, in particular to a malicious software identification method based on a visual Transformer.
Background
With the rise of the internet of things, the types and number of internet of things devices have grown exponentially. The embedded systems carried by internet of things devices often lack consideration of security factors and present a wider attack surface than mature Windows and Linux systems, so the attention of malware authors is gradually shifting to internet of things devices. Therefore, a faster and more effective malware detection method is needed to protect internet of things devices from malware. Currently, most antivirus vendors employ signature-based or rule-based techniques to detect malware, relying on constant updates to malicious signature and rule libraries to detect more malware; but their low generalization performance leaves them unable to cope with the growing number of new network threats. Machine-learning-based malware identification has become one of the research hotspots of recent years: features are extracted from software, and machine learning algorithms automatically perform malware detection or classification. At present, visualizing software as a gray-scale image and then automatically extracting features end-to-end with a Convolutional Neural Network (CNN) has proven to be one of the most effective methods. However, the inherent inductive biases of CNNs, such as locality and translation invariance, are natural advantages only for processing natural images; when software is visualized as a gray-scale image, a 1D byte sequence is forcibly converted into a 2D gray-scale image whose vertically adjacent pixel points have no correlation. Processing software gray-scale images with a CNN is therefore somewhat unreasonable, and its results may be suboptimal.
Given sufficient training data, a high-complexity model has stronger pattern recognition capability than a low-complexity model. However, the hardware resources and time cost it demands, such as the large amount of memory and computing power consumed by a high-complexity model, make it ill-suited for deployment on lightweight devices. Most internet of things devices are lightweight devices with extremely limited hardware resources, and low resource occupation is one of the necessary conditions for a security protection model deployed in internet of things devices. Therefore, how to quickly, accurately, and effectively detect malware and judge the family to which it belongs, so that different countermeasures can be adopted, while keeping the model's resource occupation low enough, is one of the problems to be solved urgently.
Disclosure of Invention
In order to overcome the defects of the technologies, the invention provides a method which can accurately detect malicious software and judge the family to which the malicious software belongs while ensuring that the hardware occupation of the model is low.
The technical scheme adopted by the invention for overcoming the technical problems is as follows:
A malware identification method based on a visual Transformer comprises the following steps:
(a) Acquiring an ImageNet-21K image dataset and an executable file dataset of application software, wherein the executable file dataset comprises executable files of benign software and malware executable files with family labels, and visualizing all samples in the executable file dataset as RGB images to construct a malware image dataset;
(b) Building a visual Transformer model containing an X-layer encoder, performing classification pre-training on the visual Transformer model with the ImageNet-21K image dataset, changing the fully connected layer in the visual Transformer model after classification pre-training into an ordered dual-task classifier for malware detection and family classification, and fine-tuning the visual Transformer model with the malware image dataset;
(c) Constructing a lightweight visual Transformer model for actual deployment;
(d) Taking the fine-tuned visual Transformer model as a teacher model and the lightweight visual Transformer model as a student model, and performing distillation training on the student model by taking the self-attention matrices and hidden-layer states of the teacher model and the predicted logits of the dual-task classifier as the supervision information of the student model;
(e) Using the distillation-trained lightweight visual Transformer model to discriminate whether unknown software is benign or malicious and to judge the family label of the malware.
Further, the step of visualizing all samples in the executable file data set as RGB images in step (a) comprises:
(a-1) reading the executable file of the application software in hexadecimal, and converting the hexadecimal numbers into decimal numbers to represent the executable file of the application software as a decimal number sequence with the value range [0, 255];
(a-2) denoting the length of the decimal value sequence by $n$, the visualized image is given width $W=\lfloor\sqrt{n/3}\rfloor$ and height $H=\lfloor n/(3W)\rfloor$, where $\lfloor\cdot\rfloor$ denotes rounding down;
(a-3) three adjacent decimal numbers in the decimal value sequence are taken, in order, as the R channel value, G channel value, and B channel value of a single pixel, obtaining the visualized RGB image $I\in\mathbb{R}^{H\times W\times 3}$ of the executable file, where $\mathbb{R}$ is the real number space, $H$ is the height of the image, $W$ is the width, and 3 is the number of channels of the image; the visualized RGB images of all executable files constitute the malware image dataset.
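For illustration, a minimal Python sketch of steps (a-1) to (a-3) follows; the square-width rule $W=\lfloor\sqrt{n/3}\rfloor$, the file name, and the 224x224 target size are assumptions, not values fixed by the patent text.

```python
# Minimal sketch of steps (a-1)-(a-3); width rule, file name, and target size
# are assumptions.
import numpy as np
from PIL import Image

def executable_to_rgb(path: str) -> Image.Image:
    data = np.fromfile(path, dtype=np.uint8)       # (a-1) bytes read as decimals in [0, 255]
    n = len(data)
    w = int(np.sqrt(n / 3))                        # (a-2) assumed width rule
    h = n // (3 * w)                               # (a-2) height, rounded down
    pixels = data[: h * w * 3].reshape(h, w, 3)    # (a-3) 3 adjacent values -> one RGB pixel
    return Image.fromarray(pixels, mode="RGB")

img = executable_to_rgb("sample.exe").resize((224, 224))   # hypothetical sample, scaled input
```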
Further, the step (b) comprises the steps of:
(b-1) the visual Transformer model sequentially comprises 12 encoder layers and a multi-layer perceptron MLP, and each encoder sequentially comprises a first normalization layer LayerNorm, a Multi-Head self-Attention mechanism, a first residual connection layer, a second normalization layer LayerNorm, a multi-layer perceptron MLP, and a second residual connection layer;
(b-2) the visualized RGB image $I$ is scaled to obtain the scaled visualized RGB image $I'\in\mathbb{R}^{H'\times W'\times 3}$, where $H'$ is the height and $W'$ the width of the scaled visualized RGB image; based on the Flatten function in the torch library, the $i$-th row of pixel values in $I'$ is flattened into $x_i\in\mathbb{R}^{1\times 3W'}$, converting the 3D visualized RGB image $I'$ into a 2D row sequence $X=[x_1;x_2;\dots;x_{H'}]\in\mathbb{R}^{H'\times 3W'}$;
(b-3) each element in the 2D row sequence $X$ is mapped to dimension $D$ via a linear layer, obtaining the row embedding $E\in\mathbb{R}^{H'\times D}$; using the cat function in the torch library, the learnable classification token tensor $x_{cls}\in\mathbb{R}^{1\times D}$ is spliced with the row embedding $E$ to obtain the spliced tensor, and the spliced tensor is added to the learnable absolute position embedding $E_{pos}\in\mathbb{R}^{(H'+1)\times D}$ to obtain the tensor $Z_0\in\mathbb{R}^{(H'+1)\times D}$;
(b-4) the tensor $Z_0$ is input into the first normalization layer LayerNorm of the layer-1 encoder of the visual Transformer model for normalization, obtaining the tensor $\hat Z_0$; the Multi-Head self-Attention mechanism of the layer-1 encoder comprises $h$ attention heads; the tensor $\hat Z_0$ is input into the Multi-Head self-Attention mechanism, and the $i$-th attention head performs linear mappings on the tensor $\hat Z_0$ to obtain the query matrix $Q_i=\hat Z_0W_i^Q+b_i^Q$, the key matrix $K_i=\hat Z_0W_i^K+b_i^K$, and the value matrix $V_i=\hat Z_0W_i^V+b_i^V$, $Q_i,K_i,V_i\in\mathbb{R}^{(H'+1)\times d_k}$, $d_k=D/h$, where $W_i^Q$, $W_i^K$, $W_i^V$ are the weight matrices of the linear transformations and $b_i^Q$, $b_i^K$, $b_i^V$ are all bias vectors; the embedding fused with global attention is calculated by the formula $head_i=A_iV_i$, where the attention score $A_i=\mathrm{Softmax}(Q_iK_i^{\top}/\sqrt{d_k})$, $\top$ denotes transposition, and Softmax is the Softmax activation function; the embeddings $head_1,\dots,head_h$ fused with global attention output by the $h$ attention heads are spliced by the cat function in the torch library; the splicing result and the tensor $Z_0$ are sequentially input into the first residual connection layer and the second normalization layer LayerNorm of the layer-1 encoder, outputting the tensor $\hat Z_0'\in\mathbb{R}^{(H'+1)\times D}$; the tensor $\hat Z_0'$ is input into the multi-layer perceptron MLP of the layer-1 encoder, and the tensor $M_1=\mathrm{GELU}(\hat Z_0'W_1+b_1)W_2+b_2$ is calculated, where GELU is the GELU activation function, $W_1\in\mathbb{R}^{D\times D_m}$ is the weight matrix of the first-layer neurons in the MLP, $W_2\in\mathbb{R}^{D_m\times D}$ is the weight matrix of the second-layer neurons, $b_1$ and $b_2$ are the bias vectors of the first-layer and second-layer neurons, and $D_m$ is the embedding dimension of the first-layer neurons; the tensor $M_1$ is input into the second residual connection layer of the layer-1 encoder, outputting the output tensor $Z_1\in\mathbb{R}^{(H'+1)\times D}$ of the layer-1 encoder;
(b-5) replacing the tensor $Z_0$ in step (b-4) with the tensor $Z_1$ and repeating step (b-4) gives the output tensor $Z_2$ of the layer-2 encoder;
(b-6) replacing the tensor $Z_1$ in step (b-5) with the tensor $Z_2$ and repeating step (b-5) gives the output tensor $Z_3$ of the layer-3 encoder;
(b-7) the output of the $(l-1)$-th encoder is taken as the input of the $l$-th encoder, $l=4,5,\dots,12$; repeating step (b-6) gives the tensor $Z_{12}\in\mathbb{R}^{(H'+1)\times D}$ output by the layer-12 encoder;
(b-8) the vector at position 0 of the tensor $Z_{12}$, i.e. the embedding vector $z_{cls}\in\mathbb{R}^{1\times D}$ corresponding to the learnable classification token tensor $x_{cls}$, is input into the multi-layer perceptron MLP of the visual Transformer model, outputting the tensor $t\in\mathbb{R}^{1\times D}$; the tensor $t$ is input into the fully connected layer FC to obtain the classification result output by the visual Transformer model;
(b-9) carrying out classification pre-training on the visual Transformer model by adopting the ImageNet-21K image dataset.
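A minimal PyTorch sketch of the row-sequence visual Transformer of steps (b-1) to (b-8) follows; all sizes (D=768, 12 heads, MLP dimension 3072, 224 rows) and the class names RowViT and EncoderLayer are illustrative assumptions, not values fixed by the patent text.

```python
# Sketch of steps (b-1)-(b-8): rows of the visualized image are the sequence
# elements. Sizes and names are assumptions.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d=768, heads=12, d_mlp=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)                       # first LayerNorm
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)                       # second LayerNorm
        self.mlp = nn.Sequential(nn.Linear(d, d_mlp), nn.GELU(), nn.Linear(d_mlp, d))

    def forward(self, z):
        zn = self.ln1(z)
        a, _ = self.attn(zn, zn, zn)                     # multi-head self-attention
        z = self.ln2(z + a)                              # first residual, then LayerNorm
        return z + self.mlp(z)                           # MLP, then second residual

class RowViT(nn.Module):
    def __init__(self, rows=224, row_dim=3 * 224, d=768, depth=12, heads=12, d_mlp=3072):
        super().__init__()
        self.embed = nn.Linear(row_dim, d)                    # (b-3) row embedding
        self.cls = nn.Parameter(torch.zeros(1, 1, d))         # learnable class token
        self.pos = nn.Parameter(torch.zeros(1, rows + 1, d))  # learnable position embedding
        self.layers = nn.ModuleList(EncoderLayer(d, heads, d_mlp) for _ in range(depth))
        self.mlp_head = nn.Linear(d, d)                       # final MLP before the classifier

    def forward(self, img):                              # img: (B, H', W', 3), H' == rows
        x = img.flatten(2)                               # (b-2) each row -> length-3W' vector
        z = torch.cat([self.cls.expand(len(x), -1, -1), self.embed(x)], dim=1) + self.pos
        for layer in self.layers:                        # (b-4)-(b-7): 12 encoder layers
            z = layer(z)
        return self.mlp_head(z[:, 0])                    # (b-8) class-token embedding t
```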
Further, in step (b-3), the 2D row sequence $X$ is input into a linear layer, and the row embedding is calculated by the formula $E=XW_E+b_E$, where $W_E\in\mathbb{R}^{3W'\times D}$ is the weight matrix of the linear mapping layer and $b_E$ is a bias vector.
Further, in step (b), the step of changing the fully connected layer in the visual Transformer model after classification pre-training into an ordered dual-task classifier for malware detection and family classification comprises the following steps:
(b-10) the fully connected layer FC in step (b-8) is changed into an ordered dual-task classifier, which comprises a detection task for detecting malware and a family classification task for judging the family to which the malware belongs; the detection task and the family classification task are each composed of two fully connected layers FC;
(b-11) the tensor $t$ is input into the detection task, and the predicted logits of the detection task are calculated by the formulas $h_d=tW_1^d+b_1^d$ and $\hat y_d=h_dW_2^d+b_2^d$, $\hat y_d\in\mathbb{R}^{1\times 1}$, where $W_1^d\in\mathbb{R}^{D\times D_c}$ is the weight matrix of the first fully connected layer FC of the detection task, $W_2^d\in\mathbb{R}^{D_c\times 1}$ is the weight matrix of the second fully connected layer FC of the detection task, $b_1^d$ and $b_2^d$ are the bias vectors of the first and second fully connected layers FC of the detection task, and $D_c$ is the hidden dimension of the classifier;
(b-12) the tensor $t$ is input into the family classification task, and the predicted logits of the family classification task are calculated by the formula $\hat y_f=[\,tW_1^f+b_1^f\,\|\,h_d\,]W_2^f+b_2^f$, $\hat y_f\in\mathbb{R}^{1\times F}$, where $\|$ denotes splicing, $h_d$ is the output of the first fully connected layer FC of the detection task, $W_1^f\in\mathbb{R}^{D\times D_c}$ is the weight matrix of the first fully connected layer FC of the family classification task, $W_2^f\in\mathbb{R}^{2D_c\times F}$ is the weight matrix of the second fully connected layer FC of the family classification task, $b_1^f$ and $b_2^f$ are the bias vectors of the first and second fully connected layers FC of the family classification task, and $F$ is the number of families.
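The ordered dual-task classifier of steps (b-10) to (b-12) can be sketched as follows; the hidden width 256, the family count 20, and coupling by concatenating the detection branch's first-FC output into the family branch are assumptions consistent with the embodiment's description, and OrderedDualTaskHead is a name introduced here.

```python
# Sketch of the ordered dual-task classifier; widths and coupling are assumptions.
import torch
import torch.nn as nn

class OrderedDualTaskHead(nn.Module):
    def __init__(self, d=768, hidden=256, num_families=20):
        super().__init__()
        self.det_fc1 = nn.Linear(d, hidden)                 # detection task, first FC
        self.det_fc2 = nn.Linear(hidden, 1)                 # detection task, second FC
        self.fam_fc1 = nn.Linear(d, hidden)                 # family task, first FC
        self.fam_fc2 = nn.Linear(2 * hidden, num_families)  # family task, second FC

    def forward(self, t):                                   # t: (B, D) class-token feature
        h_d = self.det_fc1(t)                               # detection hidden state
        det_logit = self.det_fc2(h_d)                       # detection logit y_d
        fam_logits = self.fam_fc2(torch.cat([self.fam_fc1(t), h_d], dim=-1))
        return det_logit, fam_logits                        # family logits y_f
```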
Further, the step of fine-tuning the visual Transformer model by using the malware image dataset in the step (b) comprises the following steps:
(b-13) the loss is calculated by the formula $L=L_{BCE}(\mathrm{Sigmoid}(\hat y_d),\,y_d)+y_d\,L_{CE}(\hat y_f,\,y_f)$, where Sigmoid is the Sigmoid activation function, $L_{BCE}$ is the binary cross-entropy loss, $L_{CE}$ is the cross-entropy loss, $y_d$ is the detection task label, with 0 meaning benign and 1 meaning malicious, and $y_f$ is the one-hot family label of the malicious sample.
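A sketch of this fine-tuning loss follows, assuming the family cross-entropy term is masked out for benign samples (the reading implied by the ordered design); finetune_loss is a name introduced here.

```python
# Sketch of the fine-tuning loss of (b-13); masking for benign samples is assumed.
import torch
import torch.nn.functional as F

def finetune_loss(det_logit, fam_logits, y_det, y_fam):
    # det_logit: (B, 1); fam_logits: (B, F); y_det: (B,) float in {0., 1.};
    # y_fam: (B,) long family index (benign rows may carry any placeholder index).
    l_bce = F.binary_cross_entropy_with_logits(det_logit.squeeze(-1), y_det)
    l_ce = F.cross_entropy(fam_logits, y_fam, reduction="none")
    return l_bce + (y_det * l_ce).mean()                 # family term only for malware
```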
Further, the step (c) comprises the steps of:
(c-1) the lightweight visual Transformer model sequentially comprises 3 encoder layers and a multi-layer perceptron MLP; each encoder sequentially comprises a first normalization layer LayerNorm, a Multi-Head self-Attention mechanism, a first residual connection layer, a second normalization layer LayerNorm, a multi-layer perceptron MLP, and a second residual connection layer; the number of attention heads of the Multi-Head self-Attention mechanism is $h_S$, $h_S<h$, and the internal embedding dimension of the lightweight visual Transformer model is $D_S$, $D_S<D$;
(c-2) each element in the 2D row sequence $X$ is mapped to dimension $D_S$ via a linear layer, obtaining the row embedding $E^S\in\mathbb{R}^{H'\times D_S}$; using the cat function in the torch library, the learnable classification token tensor $x_{cls}^S\in\mathbb{R}^{1\times D_S}$ is spliced with the row embedding $E^S$ to obtain the spliced tensor, and the spliced tensor is added to the learnable absolute position embedding $E_{pos}^S\in\mathbb{R}^{(H'+1)\times D_S}$ to obtain the tensor $Z_0^S\in\mathbb{R}^{(H'+1)\times D_S}$;
(c-3) the tensor $Z_0^S$ is input into the first normalization layer LayerNorm of the layer-1 encoder of the lightweight visual Transformer model for normalization, obtaining the tensor $\hat Z_0^S$; the Multi-Head self-Attention mechanism of the layer-1 encoder comprises $h_S$ attention heads; the tensor $\hat Z_0^S$ is input into the Multi-Head self-Attention mechanism, and the $i$-th attention head performs linear mappings on the tensor $\hat Z_0^S$ to obtain the query matrix $Q_i^S=\hat Z_0^SW_i^{Q,S}+b_i^{Q,S}$, the key matrix $K_i^S=\hat Z_0^SW_i^{K,S}+b_i^{K,S}$, and the value matrix $V_i^S=\hat Z_0^SW_i^{V,S}+b_i^{V,S}$, $Q_i^S,K_i^S,V_i^S\in\mathbb{R}^{(H'+1)\times d_S}$, $d_S=D_S/h_S$, where $W_i^{Q,S}$, $W_i^{K,S}$, $W_i^{V,S}$ are the weight matrices of the linear transformations and $b_i^{Q,S}$, $b_i^{K,S}$, $b_i^{V,S}$ are all bias vectors; the embedding fused with global attention is calculated by the formula $head_i^S=A_i^SV_i^S$, where the attention score $A_i^S=\mathrm{Softmax}(Q_i^S(K_i^S)^{\top}/\sqrt{d_S})$; the embeddings fused with global attention output by the $h_S$ attention heads are spliced by the cat function in the torch library; the splicing result and the tensor $Z_0^S$ are sequentially input into the first residual connection layer and the second normalization layer LayerNorm of the layer-1 encoder, outputting the tensor $\hat Z_0^{S\prime}\in\mathbb{R}^{(H'+1)\times D_S}$; the tensor $\hat Z_0^{S\prime}$ is input into the multi-layer perceptron MLP of the layer-1 encoder, and the tensor $M_1^S=\mathrm{GELU}(\hat Z_0^{S\prime}W_1^S+b_1^S)W_2^S+b_2^S$ is calculated, where $W_1^S\in\mathbb{R}^{D_S\times D_m^S}$ is the weight matrix of the first-layer neurons in the MLP, $W_2^S\in\mathbb{R}^{D_m^S\times D_S}$ is the weight matrix of the second-layer neurons, $b_1^S$ and $b_2^S$ are the bias vectors of the first-layer and second-layer neurons, and $D_m^S$ is the embedding dimension of the first-layer neurons; the tensor $M_1^S$ is input into the second residual connection layer of the layer-1 encoder, outputting the output tensor $Z_1^S\in\mathbb{R}^{(H'+1)\times D_S}$ of the layer-1 encoder;
(c-4) replacing the tensor $Z_0^S$ in step (c-3) with the tensor $Z_1^S$ and repeating step (c-3) gives the output tensor $Z_2^S$ of the layer-2 encoder;
(c-5) replacing the tensor $Z_1^S$ in step (c-4) with the tensor $Z_2^S$ and repeating step (c-4) gives the output tensor $Z_3^S\in\mathbb{R}^{(H'+1)\times D_S}$ of the layer-3 encoder;
(c-6) the vector at position 0 of the tensor $Z_3^S$, i.e. the embedding vector $z_{cls}^S\in\mathbb{R}^{1\times D_S}$ corresponding to the learnable classification token tensor $x_{cls}^S$, is input into the multi-layer perceptron MLP of the lightweight visual Transformer model, outputting the tensor $t_S\in\mathbb{R}^{1\times D_S}$;
(c-7) the tensor $t_S$ is input into the detection task, and the predicted logits of the detection task are calculated by the formulas $h_d^S=t_SW_1^{d,S}+b_1^{d,S}$ and $\hat y_d^S=h_d^SW_2^{d,S}+b_2^{d,S}$, $\hat y_d^S\in\mathbb{R}^{1\times 1}$, where $W_1^{d,S}$ and $W_2^{d,S}$ are the weight matrices and $b_1^{d,S}$ and $b_2^{d,S}$ the bias vectors of the first and second fully connected layers FC of the detection task;
(c-8) the tensor $t_S$ is input into the family classification task, and the predicted logits of the family classification task are calculated by the formula $\hat y_f^S=[\,t_SW_1^{f,S}+b_1^{f,S}\,\|\,h_d^S\,]W_2^{f,S}+b_2^{f,S}$, $\hat y_f^S\in\mathbb{R}^{1\times F}$, where $W_1^{f,S}$ and $W_2^{f,S}$ are the weight matrices and $b_1^{f,S}$ and $b_2^{f,S}$ the bias vectors of the first and second fully connected layers FC of the family classification task.
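Under the assumption that the student shares the teacher's architecture at reduced scale, the two models of steps (b) and (c) can be instantiated from the RowViT and OrderedDualTaskHead sketches above; all sizes here are illustrative assumptions.

```python
# Sketch of step (c): the student reuses the teacher architecture at smaller scale.
teacher = RowViT(d=768, depth=12, heads=12, d_mlp=3072)   # 12-layer teacher of step (b)
student = RowViT(d=192, depth=3, heads=3, d_mlp=768)      # 3-layer lightweight student
teacher_head = OrderedDualTaskHead(d=768)
student_head = OrderedDualTaskHead(d=192)
```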
Further, the step (d) comprises the steps of:
(d-1) the predicted-logits distillation loss is calculated by the formula $L_{pred}=\beta\,L_2(\hat y_d^S/t_d,\,\hat y_d/t_d)+(1-\beta)\,L_2(\hat y_f^S/t_f,\,\hat y_f/t_f)$, where $\beta$ is the influence factor of the classification loss ratio, $L_2$ is the L2 loss, $t_d$ is the temperature hyperparameter for distilling the detection-task classifier, and $t_f$ is the temperature hyperparameter for distilling the family-classification-task classifier;
(d-2) the self-attention distillation loss is calculated by the formula $L_{att}=\frac{1}{H'+1}\sum_{i=1}^{H'+1}D_{KL}(r_i^T\,\|\,r_i^S)$, where $r_i^S$ is the $i$-th row of the correlation matrix $R^S$ between the self-attention matrices in the student model and $r_i^T$ is the $i$-th row of the correlation matrix $R^T$ between the self-attention matrices in the teacher model; here $Q^T$, $K^T$, and $V^T$ are the query, key, and value matrices spliced from the $h$ self-attention heads of the teacher model's Multi-Head self-Attention, $R^T=\mathrm{Softmax}(Q^T(K^T)^{\top}/\sqrt{D})$, and $Q^S$, $K^S$, and $V^S$ are the query, key, and value matrices spliced from the $h_S$ self-attention heads of the student model's Multi-Head self-Attention, $R^S=\mathrm{Softmax}(Q^S(K^S)^{\top}/\sqrt{D_S})$, with $\top$ denoting transposition;
(d-3) the hidden-layer-state distillation loss is calculated by the formula $L_{hidn}=\frac{1}{H'+1}\sum_{i=1}^{H'+1}D_{KL}(g_i^T\,\|\,g_i^S)$, where $g_i^S$ is the $i$-th row of the student model's hidden-layer-state correlation matrix $G^S=\mathrm{Softmax}(Z^S(Z^S)^{\top}/\sqrt{D_S})$ and $g_i^T$ is the $i$-th row of the teacher model's hidden-layer-state correlation matrix $G^T=\mathrm{Softmax}(Z^T(Z^T)^{\top}/\sqrt{D})$, with $Z^S$ and $Z^T$ the hidden-layer states (encoder outputs) of the student and teacher models;
(d-4) supervising the layer 1 encoder of the student model with the layer 4 encoder of the teacher model, supervising the layer 2 encoder of the student model with the layer 8 encoder of the teacher model, supervising the layer 3 encoder of the student model with the layer 12 encoder of the teacher model;
(d-5) the total loss of student model training is calculated by the formula $L_{total}=L_{pred}+\lambda_1L_{att}+\lambda_2L_{hidn}$, where $\lambda_1$ is the weight of the self-attention distillation loss and $\lambda_2$ is the weight of the hidden-layer-state distillation loss;
(d-6) the lightweight visual Transformer model is iteratively trained with the total loss $L_{total}$, obtaining the distillation-trained lightweight visual Transformer model.
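A sketch of the three distillation losses and the layer mapping of steps (d-1) to (d-5) follows. The relation matrices (row-softmax of $QK^{\top}$ and $ZZ^{\top}$, compared row-wise with KL divergence) are a MiniLM-style reading of (d-2)/(d-3), and every name, shape, and hyperparameter here is an assumption.

```python
# Sketch of the distillations of steps (d-1)-(d-5); formulation is an assumption.
import math
import torch
import torch.nn.functional as F

def relation(x, y):
    # Row-normalized relation matrix softmax(x y^T / sqrt(d)), returned as log-probs.
    return F.log_softmax(x @ y.transpose(-1, -2) / math.sqrt(x.size(-1)), dim=-1)

def kl_rows(teacher_log_rel, student_log_rel):
    return F.kl_div(student_log_rel, teacher_log_rel, reduction="batchmean", log_target=True)

def distill_loss(out_t, out_s, lam1=1.0, lam2=1.0, t_det=2.0, t_fam=2.0, beta=0.5):
    # out_*: dicts holding per-layer head-spliced "q"/"k", hidden states "z",
    # and classifier logits "y_det"/"y_fam". Teacher layers 4/8/12 supervise
    # student layers 1/2/3 per (d-4) (0-indexed pairs below).
    pairs = [(3, 0), (7, 1), (11, 2)]
    l_att = sum(kl_rows(relation(out_t["q"][i], out_t["k"][i]),
                        relation(out_s["q"][j], out_s["k"][j])) for i, j in pairs)
    l_hid = sum(kl_rows(relation(out_t["z"][i], out_t["z"][i]),
                        relation(out_s["z"][j], out_s["z"][j])) for i, j in pairs)
    l_pred = beta * F.mse_loss(out_s["y_det"] / t_det, out_t["y_det"] / t_det) \
        + (1 - beta) * F.mse_loss(out_s["y_fam"] / t_fam, out_t["y_fam"] / t_fam)
    return l_pred + lam1 * l_att + lam2 * l_hid          # total loss of (d-5)
```

Comparing relation matrices rather than raw attention maps sidesteps the mismatch in head count and embedding dimension between teacher and student noted in the embodiment.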
Further, the step (e) comprises the steps of:
(e-2) the visualized RGB image $I_u$ of the unknown software is scaled into the scaled visualized RGB image $I_u'\in\mathbb{R}^{H'\times W'\times 3}$; using the Flatten function in the torch library, the $i$-th row of pixel values in $I_u'$ is flattened into $x_i\in\mathbb{R}^{1\times 3W'}$, converting the 3D visualized RGB image into a 2D row sequence $X_u\in\mathbb{R}^{H'\times 3W'}$;
(e-3) the 2D row sequence $X_u$ is input into the distillation-trained lightweight visual Transformer model to obtain the predicted detection logits $\hat y_d^S$ and the predicted family-classification logits $\hat y_f^S$; if $\mathrm{Sigmoid}(\hat y_d^S)\geq 0.5$ the unknown software is judged to be malware, and if $\mathrm{Sigmoid}(\hat y_d^S)<0.5$ the unknown software is judged to be benign; when the distillation-trained lightweight visual Transformer model judges the input unknown software to be malware, the family to which the malware belongs is judged to be the family corresponding to the highest value in $\hat y_f^S$.
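Inference per step (e) can then be sketched as follows, assuming a 0.5 threshold on the Sigmoid-activated detection logit and argmax over the family logits; identify() is a name introduced here.

```python
# Sketch of step (e) inference; threshold and names are assumptions.
import torch

@torch.no_grad()
def identify(img, model, head, family_names):
    # img: one visualized, scaled sample with batch dimension (1, H', W', 3).
    t = model(img)                                   # distilled lightweight model
    det_logit, fam_logits = head(t)
    if torch.sigmoid(det_logit).item() < 0.5:
        return "benign", None
    return "malware", family_names[fam_logits.argmax(dim=-1).item()]
```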
The beneficial effects of the invention are: the method processes only the executable file of the software based on static analysis, avoiding the time cost introduced by disassembly, dynamic execution, or manual feature extraction, and is suitable for detection tasks with high timeliness requirements. A visual Transformer automatically extracts features from the RGB image of the visualized software, taking the pixel values of each row of the image as one sequence element of the model input, which effectively avoids the suboptimal results of CNN-based recognition caused by the lack of correlation between the vertical pixel points of the visualized image. Malware detection and classification are performed as an ordered combination of tasks, so malware can be classified while being detected, generating warnings of different levels for malware of different families so that corresponding measures can be taken. Furthermore, relative to a single task (treating benign software as one harmless malware family and jointly performing malware detection and classification), the ordered multitasking alleviates the negative impact of the relatively difficult malware family classification task on the cost-sensitive malware detection task. Three kinds of knowledge distillation transfer the knowledge of a large-scale teacher model to a small-scale student model, maximizing the performance gain of the student model.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is an exemplary illustration of an executable file of the present invention being visualized as an RGB image;
FIG. 3 is a schematic structural diagram of a visual Transformer according to the present invention;
FIG. 4 is a schematic structural diagram of the knowledge distillation of the present invention.
Detailed Description
The invention will be further described with reference to fig. 1 to 4.
As shown in fig. 1, a malware identification method based on visual Transformer includes the following steps:
(a) Acquiring an ImageNet-21K image dataset and an executable file dataset of application software, wherein the executable file dataset comprises executable files of benign software and malware executable files with family labels, and visualizing all samples in the executable file dataset as RGB images to construct a malware image dataset.
(b) Building a visual Transformer model containing an X-layer encoder, performing classification pre-training on the visual Transformer model with the ImageNet-21K image dataset, changing the fully connected layer in the visual Transformer model after classification pre-training into an ordered dual-task classifier for malware detection and family classification, and fine-tuning the visual Transformer model with the malware image dataset.
(c) Building a lightweight visual Transformer model for actual deployment.
(d) The fine-tuned visual Transformer model is taken as the teacher model and the lightweight visual Transformer model as the student model. In order to make the performance of the lightweight model comparable to that of the large-scale model and increase the feasibility of deploying the model on lightweight devices, knowledge distillation is introduced to substantially improve the performance of the lightweight model. Specifically, the self-attention matrices and hidden-layer states of the teacher model and the predicted logits of the dual-task classifier are used as the supervision information for distillation training of the student model.
(e) The distillation-trained lightweight visual Transformer model is used to discriminate whether unknown software is benign or malicious and to judge the family label of the malware.
The invention processes the executable file of the software based only on static analysis and adopts a lightweight model to perform inference, ensuring the detection efficiency of the model and low hardware resource occupation. A visual Transformer automatically extracts features from the visualized image of the executable file, avoiding the problem that the vertical pixel points of the visualized image have no correlation; knowledge distillation further improves the performance of the model, ensuring the detection and family classification accuracy of the model.
In one embodiment of the present invention, as shown in fig. 2, the step (a) of visualizing all samples in the executable file data set as RGB images comprises:
(a-1) The executable file of the application software, i.e. a binary file, is read in hexadecimal, and the hexadecimal numbers are converted into decimal numbers so that the executable file of the application software is represented as a decimal number sequence with the value range [0, 255].
(a-2) Denoting the length of the decimal value sequence by $n$, the visualized image is given width $W=\lfloor\sqrt{n/3}\rfloor$ and height $H=\lfloor n/(3W)\rfloor$, where $\lfloor\cdot\rfloor$ denotes rounding down.
(a-3) Three consecutive decimal numbers in the decimal value sequence are taken, in order, as the R channel value, G channel value, and B channel value of a single pixel, obtaining the visualized RGB image $I\in\mathbb{R}^{H\times W\times 3}$ of the executable file, where $\mathbb{R}$ is the real number space, $H$ is the height of the image, $W$ is the width, and 3 is the number of channels of the image; the visualized RGB images of all executable files constitute the malware image dataset.
In one embodiment of the present invention, as shown in fig. 3, the step (b) comprises the steps of:
(b-1) The visual Transformer model sequentially comprises 12 encoder layers and a multi-layer perceptron MLP; the multi-layer perceptron MLP of the visual Transformer model is used for classification, and each encoder sequentially comprises a first normalization layer LayerNorm, a Multi-Head self-Attention mechanism, a first residual connection layer, a second normalization layer LayerNorm, a multi-layer perceptron MLP, and a second residual connection layer.
(b-2) The visualized RGB image $I$ is scaled to obtain the scaled visualized RGB image $I'\in\mathbb{R}^{H'\times W'\times 3}$, where $H'$ is the height and $W'$ the width of the scaled visualized RGB image; based on the Flatten function in the torch library, the $i$-th row of pixel values in $I'$ is flattened into $x_i\in\mathbb{R}^{1\times 3W'}$, converting the 3D visualized RGB image $I'$ into a 2D row sequence $X=[x_1;x_2;\dots;x_{H'}]\in\mathbb{R}^{H'\times 3W'}$.
(b-3) Each element in the 2D row sequence $X$ is mapped to dimension $D$ via a linear layer, obtaining the row embedding $E\in\mathbb{R}^{H'\times D}$; using the cat function in the torch library, the learnable classification token tensor $x_{cls}\in\mathbb{R}^{1\times D}$ is spliced with the row embedding $E$ to obtain the spliced tensor, and the spliced tensor is added to the learnable absolute position embedding $E_{pos}\in\mathbb{R}^{(H'+1)\times D}$ to obtain the tensor $Z_0\in\mathbb{R}^{(H'+1)\times D}$. The learnable classification token tensor $x_{cls}$ and the learnable absolute position embedding $E_{pos}$ are prior art and are essentially learnable parameters.
(b-4) The tensor $Z_0$ is input into the first normalization layer LayerNorm of the layer-1 encoder of the visual Transformer model for normalization, obtaining the tensor $\hat Z_0$. The Multi-Head self-Attention mechanism of the layer-1 encoder comprises $h$ attention heads; each attention head operates on the tensor $\hat Z_0$ to extract features from a different perspective, and the results are spliced and fused after the operations. The tensor $\hat Z_0$ is input into the Multi-Head self-Attention mechanism, and the $i$-th attention head performs linear mappings on the tensor $\hat Z_0$ to obtain the query matrix $Q_i=\hat Z_0W_i^Q+b_i^Q$, the key matrix $K_i=\hat Z_0W_i^K+b_i^K$, and the value matrix $V_i=\hat Z_0W_i^V+b_i^V$, $Q_i,K_i,V_i\in\mathbb{R}^{(H'+1)\times d_k}$, $d_k=D/h$, where $W_i^Q$, $W_i^K$, $W_i^V$ are the weight matrices of the linear transformations and $b_i^Q$, $b_i^K$, $b_i^V$ are all bias vectors. The embedding fused with global attention, $head_i=A_iV_i$, is a weighted sum of the value matrix, the weights being the attention scores, where the attention score $A_i=\mathrm{Softmax}(Q_iK_i^{\top}/\sqrt{d_k})$ and $\top$ denotes transposition; the Softmax activation function maps the attention scores in each row of the matrix to the range [0, 1] so that each row sums to 1. The embeddings $head_1,\dots,head_h$ output by the $h$ attention heads are spliced by the cat function in the torch library; the splicing result and the tensor $Z_0$ are sequentially input into the first residual connection layer and the second normalization layer LayerNorm of the layer-1 encoder, outputting the tensor $\hat Z_0'\in\mathbb{R}^{(H'+1)\times D}$. The tensor $\hat Z_0'$ is input into the multi-layer perceptron MLP of the layer-1 encoder, and the tensor $M_1=\mathrm{GELU}(\hat Z_0'W_1+b_1)W_2+b_2$ is calculated, where GELU is the GELU activation function, $W_1\in\mathbb{R}^{D\times D_m}$ is the weight matrix of the first-layer neurons in the MLP, $W_2\in\mathbb{R}^{D_m\times D}$ is the weight matrix of the second-layer neurons, $b_1$ and $b_2$ are the bias vectors of the first-layer and second-layer neurons, and $D_m$ is the embedding dimension of the first-layer neurons. The tensor $M_1$ is input into the second residual connection layer of the layer-1 encoder, outputting the output tensor $Z_1\in\mathbb{R}^{(H'+1)\times D}$ of the layer-1 encoder.
(b-5) Replacing the tensor $Z_0$ in step (b-4) with the tensor $Z_1$ and repeating step (b-4) gives the output tensor $Z_2$ of the layer-2 encoder.
(b-6) Replacing the tensor $Z_1$ in step (b-5) with the tensor $Z_2$ and repeating step (b-5) gives the output tensor $Z_3$ of the layer-3 encoder.
(b-7) The output of the $(l-1)$-th encoder is taken as the input of the $l$-th encoder, $l=4,5,\dots,12$; repeating step (b-6) gives the tensor $Z_{12}\in\mathbb{R}^{(H'+1)\times D}$ output by the layer-12 encoder.
(b-8) The vector at position 0 of the tensor $Z_{12}$, i.e. the embedding vector $z_{cls}\in\mathbb{R}^{1\times D}$ corresponding to the learnable classification token tensor $x_{cls}$, is input into the multi-layer perceptron MLP of the visual Transformer model, outputting the tensor $t\in\mathbb{R}^{1\times D}$; the tensor $t$ is input into the fully connected layer FC to obtain the classification result output by the visual Transformer model.
(b-9) The visual Transformer model is pre-trained for classification on the ImageNet-21K image dataset. This compensates to a certain extent for the Transformer's lack of inductive biases such as locality and translation invariance.
In one embodiment of the present invention, in step (b-3), the 2D row sequence $X$ is input into a linear layer, and the row embedding is calculated by the formula $E=XW_E+b_E$, where $W_E\in\mathbb{R}^{3W'\times D}$ is the weight matrix of the linear mapping layer and $b_E$ is a bias vector.
In an embodiment of the present invention, the step in (b) of changing the fully connected layer in the visual Transformer model after classification pre-training into an ordered dual-task classifier for malware detection and family classification comprises:
(b-10) The fully connected layer FC in step (b-8) is changed into an ordered dual-task classifier, which comprises a detection task for detecting malware and a family classification task for judging the family to which the malware belongs; the detection task and the family classification task are each composed of two fully connected layers FC. In addition, because the malware family classification task is performed on the premise that the input is malicious, the output state of the first fully connected layer of the detection task is used as one of the inputs of the second fully connected layer of the family classification task.
(b-11) The tensor $t$ is input into the detection task, and the predicted logits of the detection task are calculated by the formulas $h_d=tW_1^d+b_1^d$ and $\hat y_d=h_dW_2^d+b_2^d$, $\hat y_d\in\mathbb{R}^{1\times 1}$, where $W_1^d\in\mathbb{R}^{D\times D_c}$ is the weight matrix of the first fully connected layer FC of the detection task, $W_2^d\in\mathbb{R}^{D_c\times 1}$ is the weight matrix of the second fully connected layer FC of the detection task, $b_1^d$ and $b_2^d$ are the bias vectors of the first and second fully connected layers FC of the detection task, and $D_c$ is the hidden dimension of the classifier.
(b-12) The tensor $t$ is input into the family classification task, and the predicted logits of the family classification task are calculated by the formula $\hat y_f=[\,tW_1^f+b_1^f\,\|\,h_d\,]W_2^f+b_2^f$, $\hat y_f\in\mathbb{R}^{1\times F}$, where $\|$ denotes splicing, $h_d$ is the output of the first fully connected layer FC of the detection task, $W_1^f\in\mathbb{R}^{D\times D_c}$ is the weight matrix of the first fully connected layer FC of the family classification task, $W_2^f\in\mathbb{R}^{2D_c\times F}$ is the weight matrix of the second fully connected layer FC of the family classification task, $b_1^f$ and $b_2^f$ are the bias vectors of the first and second fully connected layers FC of the family classification task, and $F$ is the number of families.
In an embodiment of the present invention, the step of fine-tuning the visual Transformer model by using the malware image dataset in the step (b) includes:
(b-13) The loss is calculated by the formula $L=L_{BCE}(\mathrm{Sigmoid}(\hat y_d),\,y_d)+y_d\,L_{CE}(\hat y_f,\,y_f)$, where Sigmoid is the Sigmoid activation function, $L_{BCE}$ is the binary cross-entropy loss, $L_{CE}$ is the cross-entropy loss, $y_d$ is the detection task label, with 0 meaning benign and 1 meaning malicious, and $y_f$ is the one-hot family label of the malicious sample.
In one embodiment of the present invention, step (c) comprises the steps of:
(c-1) The lightweight visual Transformer model sequentially comprises 3 encoder layers and a multi-layer perceptron MLP; each encoder sequentially comprises a first normalization layer LayerNorm, a Multi-Head self-Attention mechanism, a first residual connection layer, a second normalization layer LayerNorm, a multi-layer perceptron MLP, and a second residual connection layer; the number of attention heads of the Multi-Head self-Attention mechanism is $h_S$, $h_S<h$, and the internal embedding dimension of the lightweight visual Transformer model is $D_S$, $D_S<D$.
(c-2) Each element in the 2D row sequence $X$ is mapped to dimension $D_S$ via a linear layer, obtaining the row embedding $E^S\in\mathbb{R}^{H'\times D_S}$; using the cat function in the torch library, the learnable classification token tensor $x_{cls}^S\in\mathbb{R}^{1\times D_S}$ is spliced with the row embedding $E^S$, and the spliced tensor is added to the learnable absolute position embedding $E_{pos}^S\in\mathbb{R}^{(H'+1)\times D_S}$ to obtain the tensor $Z_0^S\in\mathbb{R}^{(H'+1)\times D_S}$.
(c-3) The tensor $Z_0^S$ is input into the first normalization layer LayerNorm of the layer-1 encoder of the lightweight visual Transformer model for normalization, obtaining the tensor $\hat Z_0^S$. The Multi-Head self-Attention mechanism of the layer-1 encoder comprises $h_S$ attention heads; the tensor $\hat Z_0^S$ is input into the Multi-Head self-Attention mechanism, and the $i$-th attention head performs linear mappings on the tensor $\hat Z_0^S$ to obtain the query matrix $Q_i^S=\hat Z_0^SW_i^{Q,S}+b_i^{Q,S}$, the key matrix $K_i^S=\hat Z_0^SW_i^{K,S}+b_i^{K,S}$, and the value matrix $V_i^S=\hat Z_0^SW_i^{V,S}+b_i^{V,S}$, $Q_i^S,K_i^S,V_i^S\in\mathbb{R}^{(H'+1)\times d_S}$, $d_S=D_S/h_S$, where $W_i^{Q,S}$, $W_i^{K,S}$, $W_i^{V,S}$ are the weight matrices of the linear transformations and $b_i^{Q,S}$, $b_i^{K,S}$, $b_i^{V,S}$ are all bias vectors. The embedding fused with global attention is calculated by the formula $head_i^S=A_i^SV_i^S$, where the attention score $A_i^S=\mathrm{Softmax}(Q_i^S(K_i^S)^{\top}/\sqrt{d_S})$. The embeddings output by the $h_S$ attention heads are spliced by the cat function in the torch library; the splicing result and the tensor $Z_0^S$ are sequentially input into the first residual connection layer and the second normalization layer LayerNorm of the layer-1 encoder, outputting the tensor $\hat Z_0^{S\prime}\in\mathbb{R}^{(H'+1)\times D_S}$. The tensor $\hat Z_0^{S\prime}$ is input into the multi-layer perceptron MLP of the layer-1 encoder, and the tensor $M_1^S=\mathrm{GELU}(\hat Z_0^{S\prime}W_1^S+b_1^S)W_2^S+b_2^S$ is calculated, where $W_1^S\in\mathbb{R}^{D_S\times D_m^S}$ is the weight matrix of the first-layer neurons in the MLP, $W_2^S\in\mathbb{R}^{D_m^S\times D_S}$ is the weight matrix of the second-layer neurons, $b_1^S$ and $b_2^S$ are the bias vectors of the first-layer and second-layer neurons, and $D_m^S$ is the embedding dimension of the first-layer neurons. The tensor $M_1^S$ is input into the second residual connection layer of the layer-1 encoder, outputting the output tensor $Z_1^S\in\mathbb{R}^{(H'+1)\times D_S}$ of the layer-1 encoder.
(c-4) Replacing the tensor $Z_0^S$ in step (c-3) with the tensor $Z_1^S$ and repeating step (c-3) gives the output tensor $Z_2^S$ of the layer-2 encoder.
(c-5) Replacing the tensor $Z_1^S$ in step (c-4) with the tensor $Z_2^S$ and repeating step (c-4) gives the output tensor $Z_3^S\in\mathbb{R}^{(H'+1)\times D_S}$ of the layer-3 encoder.
(c-6) The vector at position 0 of the tensor $Z_3^S$, i.e. the embedding vector $z_{cls}^S\in\mathbb{R}^{1\times D_S}$ corresponding to the learnable classification token tensor $x_{cls}^S$, is input into the multi-layer perceptron MLP of the lightweight visual Transformer model, outputting the tensor $t_S\in\mathbb{R}^{1\times D_S}$.
(c-7) The tensor $t_S$ is input into the detection task, and the predicted logits of the detection task are calculated by the formulas $h_d^S=t_SW_1^{d,S}+b_1^{d,S}$ and $\hat y_d^S=h_d^SW_2^{d,S}+b_2^{d,S}$, $\hat y_d^S\in\mathbb{R}^{1\times 1}$, where $W_1^{d,S}$ and $W_2^{d,S}$ are the weight matrices and $b_1^{d,S}$ and $b_2^{d,S}$ the bias vectors of the first and second fully connected layers FC of the detection task.
(c-8) The tensor $t_S$ is input into the family classification task, and the predicted logits of the family classification task are calculated by the formula $\hat y_f^S=[\,t_SW_1^{f,S}+b_1^{f,S}\,\|\,h_d^S\,]W_2^{f,S}+b_2^{f,S}$, $\hat y_f^S\in\mathbb{R}^{1\times F}$, where $W_1^{f,S}$ and $W_2^{f,S}$ are the weight matrices and $b_1^{f,S}$ and $b_2^{f,S}$ the bias vectors of the first and second fully connected layers FC of the family classification task.
The teacher model is used to supervise the training of the student model so that the student model mimics the teacher model's representations and achieves performance comparable to the teacher model. To make the representation capability of the student model approach that of the teacher model as closely as possible, three distillation methods are adopted: predicted-logits distillation, self-attention distillation, and hidden-layer-state distillation. Predicted-logits distillation supervises the student model with the predicted logits of the teacher model's two classification layers. Thus, in one embodiment of the present invention, as shown in FIG. 4, step (d) comprises the following steps:
(d-1) The predicted-logits distillation loss is calculated by the formula $L_{pred}=\beta\,L_2(\hat y_d^S/t_d,\,\hat y_d/t_d)+(1-\beta)\,L_2(\hat y_f^S/t_f,\,\hat y_f/t_f)$, where $\beta$ is the influence factor of the classification loss ratio, $L_2$ is the L2 loss, $t_d$ is the temperature hyperparameter for distilling the detection-task classifier, and $t_f$ is the temperature hyperparameter for distilling the family-classification-task classifier.
(d-2) Because the number of heads $h$ of the Multi-Head self-Attention in the teacher model's encoders and the embedding dimension $D$ are inconsistent with those of the student model's encoders, distillation is performed on the correlations between the attention heads. Specifically, the self-attention distillation loss is calculated by the formula $L_{att}=\frac{1}{H'+1}\sum_{i=1}^{H'+1}D_{KL}(r_i^T\,\|\,r_i^S)$, where $r_i^S$ is the $i$-th row of the correlation matrix $R^S$ between the self-attention matrices in the student model and $r_i^T$ is the $i$-th row of the correlation matrix $R^T$ between the self-attention matrices in the teacher model. Here $Q^T$, $K^T$, and $V^T$ are the query, key, and value matrices spliced from the $h$ self-attention heads of the teacher model's Multi-Head self-Attention, $R^T=\mathrm{Softmax}(Q^T(K^T)^{\top}/\sqrt{D})$, and $Q^S$, $K^S$, and $V^S$ are the query, key, and value matrices spliced from the $h_S$ self-attention heads of the student model's Multi-Head self-Attention, $R^S=\mathrm{Softmax}(Q^S(K^S)^{\top}/\sqrt{D_S})$, with $\top$ denoting transposition.
(d-3) The hidden-layer-state distillation loss is calculated by the formula $L_{hidn}=\frac{1}{H'+1}\sum_{i=1}^{H'+1}D_{KL}(g_i^T\,\|\,g_i^S)$, where $g_i^S$ is the $i$-th row of the student model's hidden-layer-state correlation matrix $G^S=\mathrm{Softmax}(Z^S(Z^S)^{\top}/\sqrt{D_S})$ and $g_i^T$ is the $i$-th row of the teacher model's hidden-layer-state correlation matrix $G^T=\mathrm{Softmax}(Z^T(Z^T)^{\top}/\sqrt{D})$, with $Z^S$ and $Z^T$ the hidden-layer states (encoder outputs) of the student and teacher models.
(d-4) because the teacher model and the student models have different numbers of encoders, the self-attention distillation and the hidden-layer state distillation cannot correspond to each other one by one, so that the 4 th layer of encoder of the teacher model supervises the 1 st layer of encoder of the student models, the 8 th layer of encoder of the teacher model supervises the 2 nd layer of encoder of the student models, and the 12 th layer of encoder of the teacher model supervises the 3 rd layer of encoder of the student models;
(d-5) The total loss of student model training is calculated by the formula $L_{total}=L_{pred}+\lambda_1L_{att}+\lambda_2L_{hidn}$, where $\lambda_1$ is the weight of the self-attention distillation loss and $\lambda_2$ is the weight of the hidden-layer-state distillation loss;
(d-6) The lightweight visual Transformer model is iteratively trained with the total loss $L_{total}$, obtaining the distillation-trained lightweight visual Transformer model.
In one embodiment of the present invention, step (e) comprises the steps of:
(e-2) The visualized RGB image $I_u$ of the unknown software is scaled into the scaled visualized RGB image $I_u'\in\mathbb{R}^{H'\times W'\times 3}$; using the Flatten function in the torch library, the $i$-th row of pixel values in $I_u'$ is flattened into $x_i\in\mathbb{R}^{1\times 3W'}$, converting the 3D visualized RGB image into a 2D row sequence $X_u\in\mathbb{R}^{H'\times 3W'}$.
(e-3) The 2D row sequence $X_u$ is input into the distillation-trained lightweight visual Transformer model to obtain the predicted detection logits $\hat y_d^S$ and the predicted family-classification logits $\hat y_f^S$; if $\mathrm{Sigmoid}(\hat y_d^S)\geq 0.5$ the unknown software is judged to be malware, and if $\mathrm{Sigmoid}(\hat y_d^S)<0.5$ the unknown software is judged to be benign; when the distillation-trained lightweight visual Transformer model judges the input unknown software to be malware, the family to which the malware belongs is judged to be the family corresponding to the highest value in $\hat y_f^S$.
The improvements of this patent over the prior art are illustrated by the following table:
data set: babuk, blackMatter, cerber, chaos, conti, darkSide, gandCrab, globeimpser, lockBit, locky, magniber, makop, medusa Locker, nemty, phobos, sodinkibe, teslaCrypt, thanos, 18 malicious families of Lesoxol software and BlackMoon, gafgyt botnet were crawled from the malware sharing platform, 20 malicious families and 11841 malicious samples were counted. Furthermore, 9833 benign executables were collected as benign categories in the Windows10 system. The experiment divided 80% of each class of samples into training sets and the remaining 20% of samples into test sets to evaluate model performance. Due to the fact that the malicious family samples are different in size and have certain data imbalance factors, the Macro-F1 value is added to the evaluation index besides the accuracy.
Table 1: performance comparison between the lightweight visual Transformer and classic CNN networks.
Table 2: performance comparison of the distillation methods.
Table 3: performance comparison between the ordered dual tasks and the single task.
As can be seen from Table 1, the performance of the lightweight visual Transformer is superior to that of classic CNN networks, with the Macro-F1 value and accuracy improved by at least 1.71% and 1.42%. As can be seen from Table 2, compared with no distillation, each of the three distillation methods brings a certain performance gain to the student model, but joint distillation with all three brings the largest improvement; the performance of the student model after joint distillation is extremely close to that of the teacher model, with gaps of only 0.30% in Macro-F1 value and 0.38% in accuracy. As can be seen from Table 3, the ordered dual tasks have certain advantages over the single task that treats benign software as a harmless malware family to jointly detect and classify malware, exceeding the single task by 0.71% in Macro-F1 value and 0.38% in accuracy.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described above, or equivalents may be substituted for elements thereof. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (9)
1. A malware identification method based on a visual Transformer, characterized by comprising the following steps:
(a) Acquiring an ImageNet-21K image dataset and an executable file dataset of application software, wherein the executable file dataset comprises executable files of benign software and malware executable files with family labels, and visualizing all samples in the executable file dataset as RGB images to construct a malware image dataset;
(b) Building a visual Transformer model containing an X-layer encoder, performing classification pre-training on the visual Transformer model with the ImageNet-21K image dataset, changing the fully connected layer in the visual Transformer model after classification pre-training into an ordered dual-task classifier for malware detection and family classification, and fine-tuning the visual Transformer model with the malware image dataset;
(c) Constructing a lightweight visual Transformer model for actual deployment;
(d) Taking the fine-tuned visual Transformer model as a teacher model and the lightweight visual Transformer model as a student model, and performing distillation training on the student model by taking the self-attention matrices and hidden-layer states of the teacher model and the predicted logits of the dual-task classifier as the supervision information of the student model;
(e) Using the distillation-trained lightweight visual Transformer model to discriminate whether unknown software is benign or malicious and to judge the family label of the malware.
2. The visual Transformer-based malware identification method of claim 1, wherein the step of visualizing all samples in the executable file dataset as RGB images in step (a) comprises:
(a-1) reading the executable file of the application software in hexadecimal, and converting the hexadecimal numbers into decimal numbers to represent the executable file of the application software as a decimal number sequence with the value range [0, 255];
(a-2) denoting the length of the decimal value sequence by $n$, the visualized image is given width $W=\lfloor\sqrt{n/3}\rfloor$ and height $H=\lfloor n/(3W)\rfloor$, where $\lfloor\cdot\rfloor$ denotes rounding down;
(a-3) sequentially using three adjacent decimal numbers in the decimal value sequence as the R channel value, G channel value, and B channel value of a single pixel to obtain the visualized RGB image $I\in\mathbb{R}^{H\times W\times 3}$ of the executable file, where $\mathbb{R}$ is the real number space, $H$ is the height of the image, $W$ is the width, and 3 is the number of channels of the image; the visualized RGB images of all executable files constitute the malware image dataset.
3. The visual Transformer-based malware identification method of claim 2, wherein step (b) comprises the steps of:
(b-1) the visual Transformer model sequentially comprises 12 encoder layers and a multi-layer perceptron MLP, and each encoder sequentially comprises a first normalization layer LayerNorm, a Multi-Head self-Attention mechanism, a first residual connection layer, a second normalization layer LayerNorm, a multi-layer perceptron MLP, and a second residual connection layer;
(b-2) the visualized RGB image $I$ is scaled to obtain the scaled visualized RGB image $I'\in\mathbb{R}^{H'\times W'\times 3}$, where $H'$ is the height and $W'$ the width of the scaled visualized RGB image; based on the Flatten function in the torch library, the $i$-th row of pixel values in $I'$ is flattened into $x_i\in\mathbb{R}^{1\times 3W'}$, converting the 3D visualized RGB image $I'$ into a 2D row sequence $X=[x_1;x_2;\dots;x_{H'}]\in\mathbb{R}^{H'\times 3W'}$;
(b-3) each element in the 2D row sequence $X$ is mapped to dimension $D$ via a linear layer, obtaining the row embedding $E\in\mathbb{R}^{H'\times D}$; using the cat function in the torch library, the learnable classification token tensor $x_{cls}\in\mathbb{R}^{1\times D}$ is spliced with the row embedding $E$ to obtain the spliced tensor, and the spliced tensor is added to the learnable absolute position embedding $E_{pos}\in\mathbb{R}^{(H'+1)\times D}$ to obtain the tensor $Z_0\in\mathbb{R}^{(H'+1)\times D}$;
(b-4) the tensor $Z_0$ is input into the first normalization layer LayerNorm of the layer-1 encoder of the visual Transformer model for normalization, obtaining the tensor $\hat Z_0$; the Multi-Head self-Attention mechanism of the layer-1 encoder comprises $h$ attention heads; the tensor $\hat Z_0$ is input into the Multi-Head self-Attention mechanism, and the $i$-th attention head performs linear mappings on the tensor $\hat Z_0$ to obtain the query matrix $Q_i=\hat Z_0W_i^Q+b_i^Q$, the key matrix $K_i=\hat Z_0W_i^K+b_i^K$, and the value matrix $V_i=\hat Z_0W_i^V+b_i^V$, $Q_i,K_i,V_i\in\mathbb{R}^{(H'+1)\times d_k}$, $d_k=D/h$, where $W_i^Q$, $W_i^K$, $W_i^V$ are the weight matrices of the linear transformations and $b_i^Q$, $b_i^K$, $b_i^V$ are all bias vectors; the embedding fused with global attention is calculated by the formula $head_i=A_iV_i$, where the attention score $A_i=\mathrm{Softmax}(Q_iK_i^{\top}/\sqrt{d_k})$, $\top$ denotes transposition, and Softmax is the Softmax activation function; the embeddings $head_1,\dots,head_h$ output by the $h$ attention heads are spliced by the cat function in the torch library; the splicing result and the tensor $Z_0$ are sequentially input into the first residual connection layer and the second normalization layer LayerNorm of the layer-1 encoder, outputting the tensor $\hat Z_0'\in\mathbb{R}^{(H'+1)\times D}$; the tensor $\hat Z_0'$ is input into the multi-layer perceptron MLP of the layer-1 encoder, and the tensor $M_1=\mathrm{GELU}(\hat Z_0'W_1+b_1)W_2+b_2$ is calculated, where GELU is the GELU activation function, $W_1\in\mathbb{R}^{D\times D_m}$ is the weight matrix of the first-layer neurons in the MLP, $W_2\in\mathbb{R}^{D_m\times D}$ is the weight matrix of the second-layer neurons, $b_1$ and $b_2$ are the bias vectors of the first-layer and second-layer neurons, and $D_m$ is the embedding dimension of the first-layer neurons; the tensor $M_1$ is input into the second residual connection layer of the layer-1 encoder, outputting the output tensor $Z_1\in\mathbb{R}^{(H'+1)\times D}$ of the layer-1 encoder;
(b-5) replacing the tensor $Z_0$ in step (b-4) with the tensor $Z_1$ and repeating step (b-4) gives the output tensor $Z_2$ of the layer-2 encoder;
(b-6) replacing the tensor $Z_1$ in step (b-5) with the tensor $Z_2$ and repeating step (b-5) gives the output tensor $Z_3$ of the layer-3 encoder;
(b-7) the output of the $(l-1)$-th encoder is taken as the input of the $l$-th encoder, $l=4,5,\dots,12$; repeating step (b-6) gives the tensor $Z_{12}\in\mathbb{R}^{(H'+1)\times D}$ output by the layer-12 encoder;
(b-8) the vector at position 0 of the tensor $Z_{12}$, i.e. the embedding vector $z_{cls}\in\mathbb{R}^{1\times D}$ corresponding to the learnable classification token tensor $x_{cls}$, is input into the multi-layer perceptron MLP of the visual Transformer model, outputting the tensor $t\in\mathbb{R}^{1\times D}$; the tensor $t$ is input into the fully connected layer FC to obtain the classification result output by the visual Transformer model;
(b-9) carrying out classification pre-training on the visual Transformer model by adopting the ImageNet-21K image dataset.
5. The visual Transformer-based malware identification method of claim 4, wherein the step in (b) of changing the fully connected layer in the visual Transformer model after classification pre-training into an ordered dual-task classifier for malware detection and family classification comprises the following steps:
(b-10) the fully connected layer FC in step (b-8) is changed into an ordered dual-task classifier, which comprises a detection task for detecting malware and a family classification task for judging the family to which the malware belongs; the detection task and the family classification task are each composed of two fully connected layers FC;
(b-11) the tensor $t$ is input into the detection task, and the predicted logits of the detection task are calculated by the formulas $h_d=tW_1^d+b_1^d$ and $\hat y_d=h_dW_2^d+b_2^d$, $\hat y_d\in\mathbb{R}^{1\times 1}$, where $W_1^d\in\mathbb{R}^{D\times D_c}$ is the weight matrix of the first fully connected layer FC of the detection task, $W_2^d\in\mathbb{R}^{D_c\times 1}$ is the weight matrix of the second fully connected layer FC of the detection task, $b_1^d$ and $b_2^d$ are the bias vectors of the first and second fully connected layers FC of the detection task, and $D_c$ is the hidden dimension of the classifier;
(b-12) the tensor $t$ is input into the family classification task, and the predicted logits of the family classification task are calculated by the formula $\hat y_f=[\,tW_1^f+b_1^f\,\|\,h_d\,]W_2^f+b_2^f$, $\hat y_f\in\mathbb{R}^{1\times F}$, where $\|$ denotes splicing, $h_d$ is the output of the first fully connected layer FC of the detection task, $W_1^f\in\mathbb{R}^{D\times D_c}$ is the weight matrix of the first fully connected layer FC of the family classification task, $W_2^f\in\mathbb{R}^{2D_c\times F}$ is the weight matrix of the second fully connected layer FC of the family classification task, $b_1^f$ and $b_2^f$ are the bias vectors of the first and second fully connected layers FC of the family classification task, and $F$ is the number of families.
6. The visual Transformer-based malware identification method of claim 5, wherein the step of fine-tuning the visual Transformer model by using the malware image dataset in the step (b) comprises the steps of:
(b-13) the loss is calculated by the formula $L=L_{BCE}(\mathrm{Sigmoid}(\hat y_d),\,y_d)+y_d\,L_{CE}(\hat y_f,\,y_f)$, where Sigmoid is the Sigmoid activation function, $L_{BCE}$ is the binary cross-entropy loss, $L_{CE}$ is the cross-entropy loss, $y_d$ is the detection task label, with 0 meaning benign and 1 meaning malicious, and $y_f$ is the one-hot family label of the malicious sample.
7. The visual Transformer-based malware identification method of claim 5, wherein step (c) comprises the steps of:
(c-1) the lightweight visual Transformer model sequentially comprises 3 encoder layers and a multi-layer perceptron MLP; each encoder sequentially comprises a first normalization layer LayerNorm, a Multi-Head self-Attention mechanism, a first residual connection layer, a second normalization layer LayerNorm, a multi-layer perceptron MLP, and a second residual connection layer; the number of attention heads of the Multi-Head self-Attention mechanism is $h_S$, $h_S<h$, and the internal embedding dimension of the lightweight visual Transformer model is $D_S$, $D_S<D$;
(c-2) each element of the 2D line sequence $X$ is mapped through a linear layer to obtain the $m$-dimensional row embedding $E$; the learnable classification mark tensor $x_{class}^{s}$ and the row embedding $E$ are spliced using the cat function in the torch library, and the spliced tensor is added to the learnable absolute position embedding $E_{pos}$ to obtain the tensor $Z_0^{s}$;
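Step (c-2) might look like the following sketch; the row width `w=672` (a 224x224 RGB image flattened row-wise) and the row count are assumed for illustration:

```python
import torch
import torch.nn as nn

class RowEmbedding(nn.Module):
    """Sketch of (c-2): per-row linear projection, classification mark
    prepended with torch.cat, learnable absolute position embedding added."""
    def __init__(self, w=672, m=192, n_rows=224):
        super().__init__()
        self.proj = nn.Linear(w, m)                             # per-row linear layer
        self.cls = nn.Parameter(torch.zeros(1, 1, m))           # classification mark tensor
        self.pos = nn.Parameter(torch.zeros(1, n_rows + 1, m))  # absolute position embedding

    def forward(self, x):                        # x: (B, n_rows, w) 2D line sequence
        e = self.proj(x)                         # row embedding, (B, n_rows, m)
        cls = self.cls.expand(x.size(0), -1, -1)
        return torch.cat([cls, e], dim=1) + self.pos  # splice with cat, add positions
```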
(c-3) the tensor $Z_0^{s}$ is input into the first normalization layer LayerNorm of the layer-1 encoder of the lightweight visual Transformer model, and normalization yields the tensor $\hat{Z}_0^{s}$; the multi-head self-attention mechanism Multi-Head Attention of the layer-1 encoder comprises $h_s$ attention heads; the tensor $\hat{Z}_0^{s}$ is input into the multi-head self-attention mechanism, and the $i$-th attention head linearly maps $\hat{Z}_0^{s}$ to obtain the query matrix $Q_i$, the key matrix $K_i$ and the value matrix $V_i$: $Q_i = \hat{Z}_0^{s}W_i^{Q} + b_i^{Q}$, $K_i = \hat{Z}_0^{s}W_i^{K} + b_i^{K}$, $V_i = \hat{Z}_0^{s}W_i^{V} + b_i^{V}$, where $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ are the weight matrices of the linear transformations and $b_i^{Q}$, $b_i^{K}$ and $b_i^{V}$ are bias vectors; the embedding fused with global attention is calculated by the formula $head_i = A_i V_i$, where the attention score is $A_i = \mathrm{softmax}\big(Q_i K_i^{\mathsf T}/\sqrt{d}\big)$ and $d$ is the dimension of a single attention head; the global-attention-fused embeddings $head_1, \dots, head_{h_s}$ output by the attention heads are spliced with the cat function in the torch library; the splicing result together with the tensor $Z_0^{s}$ is input sequentially into the first residual connection layer and the second normalization layer LayerNorm of the layer-1 encoder, and the output is the tensor $\tilde{Z}_1^{s}$; the tensor $\tilde{Z}_1^{s}$ is input into the multi-layer perceptron MLP of the layer-1 encoder, and the tensor $M_1$ is calculated by the formula $M_1 = W_2\,\mathrm{GELU}\big(W_1\tilde{Z}_1^{s} + b_1\big) + b_2$, where $W_1$ is the weight matrix of the neurons in the first layer of the MLP, $W_2$ is the weight matrix of the neurons in the second layer of the MLP, $b_1$ is the bias vector of the neurons in the first layer of the MLP, $b_2$ is the bias vector of the neurons in the second layer of the MLP, and $m_1$ is the embedding dimension of the first layer of neurons in the MLP; the tensor $M_1$ together with $\tilde{Z}_1^{s}$ is input into the second residual connection layer of the layer-1 encoder, and the output is the output tensor $Z_1^{s}$ of the layer-1 encoder;
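A compact sketch of one lightweight encoder layer as described in (c-3); the three per-head linear maps are fused into a single `qkv` projection for brevity (mathematically equivalent to per-head maps followed by cat), and the second LayerNorm is placed in the standard pre-norm position, which may differ slightly from the claim's exact ordering:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderLayer(nn.Module):
    """Sketch of (c-3): multi-head self-attention plus a two-layer MLP,
    each wrapped in a residual connection. m, heads, m1 are illustrative."""
    def __init__(self, m=192, heads=3, m1=768):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(m), nn.LayerNorm(m)
        self.qkv = nn.Linear(m, 3 * m)   # fused Q/K/V projection for all heads
        self.out = nn.Linear(m, m)
        self.heads, self.dk = heads, m // heads
        self.mlp = nn.Sequential(nn.Linear(m, m1), nn.GELU(), nn.Linear(m1, m))

    def forward(self, z):                # z: (B, N+1, m)
        b, n, m = z.shape
        q, k, v = self.qkv(self.ln1(z)).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.heads, self.dk).transpose(1, 2) for t in (q, k, v))
        a = F.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)  # attention scores
        h = (a @ v).transpose(1, 2).reshape(b, n, m)  # heads spliced back together
        z = z + self.out(h)              # first residual connection
        z = z + self.mlp(self.ln2(z))    # second residual connection
        return z
```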
(c-4) the tensor $Z_1^{s}$ replaces the tensor $Z_0^{s}$ in step (c-3), and step (c-3) is repeated to obtain the output tensor $Z_2^{s}$ of the layer-2 encoder;
(c-5) the tensor $Z_2^{s}$ replaces the tensor $Z_1^{s}$ in step (c-4), and step (c-4) is repeated to obtain the output tensor $Z_3^{s}$ of the layer-3 encoder;
(c-6) the vector at the 0th position of the tensor $Z_3^{s}$ is the embedded vector $z_3^{0}$ of the learnable classification mark tensor $x_{class}^{s}$; the embedded vector $z_3^{0}$ is input into the multi-layer perceptron MLP of the lightweight visual Transformer model, and the output is the tensor $c^{s}$;
(c-7) the tensor $c^{s}$ is input into the detection task, and the prediction logit of the detection task is calculated by the formula $p_{det}^{s} = W_2^{det}\big(W_1^{det}c^{s} + b_1^{det}\big) + b_2^{det}$, where $W_1^{det}$ and $W_2^{det}$ are the weight matrices of the first and second fully connected layers FC of the detection task and $b_1^{det}$, $b_2^{det}$ are the corresponding bias vectors;
(c-8) the tensor $c^{s}$ is input into the family classification task, and the prediction logits of the family classification task are calculated by the formula $p_{fam}^{s} = W_2^{fam}\big(W_1^{fam}c^{s} + b_1^{fam}\big) + b_2^{fam}$, $p_{fam}^{s}\in\mathbb{R}^{n_f}$, where $W_1^{fam}$ and $W_2^{fam}$ are the weight matrices of the first and second fully connected layers FC of the family classification task and $b_1^{fam}$, $b_2^{fam}$ are the corresponding bias vectors.
8. The visual Transformer-based malware identification method of claim 7, wherein step (d) comprises the steps of:
(d-1) the predicted-logits distillation loss $L_{logits}$ is calculated by the formula $L_{logits} = \alpha\, L_2\big(p_{det}^{s}/\tau_{det},\, p_{det}^{t}/\tau_{det}\big) + (1-\alpha)\, L_2\big(p_{fam}^{s}/\tau_{fam},\, p_{fam}^{t}/\tau_{fam}\big)$, where $\alpha$ is the influence factor of the classification loss ratio, $L_2$ is the L2 loss, $\tau_{det}$ is the temperature hyper-parameter for distilling the detection task classifier, and $\tau_{fam}$ is the temperature hyper-parameter for distilling the family classification task classifier;
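One plausible reading of the (d-1) loss, with L2 distance on temperature-scaled logits of both heads; the exact way the claim combines the two terms is not recoverable from the lost formula image, so the `alpha` mixing is an assumption:

```python
import torch
import torch.nn.functional as F

def logits_distill_loss(s_det, t_det, s_fam, t_fam, tau_d=1.0, tau_f=4.0, alpha=0.5):
    """Sketch of (d-1): L2 distance between temperature-scaled student and
    teacher logits for both tasks, balanced by the loss-ratio factor alpha."""
    l_det = F.mse_loss(s_det / tau_d, t_det / tau_d)   # detection-head distillation
    l_fam = F.mse_loss(s_fam / tau_f, t_fam / tau_f)   # family-head distillation
    return alpha * l_det + (1.0 - alpha) * l_fam
```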
(d-2) the self-attention distillation loss $L_{att}$ is calculated by the formula $L_{att} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{KL}\big(A_i^{T}\,\big\|\,A_i^{S}\big)$, where $A^{S}$ is the relation matrix between the self-attention matrices in the student model and $A_i^{S}$ is its $i$-th row, and $A^{T}$ is the relation matrix between the self-attention matrices in the teacher model and $A_i^{T}$ is its $i$-th row; $Q^{T}$, $K^{T}$ and $V^{T}$ are the query, key and value matrices spliced from the individual self-attention heads of the teacher model's multi-head self-attention, $Q^{S}$, $K^{S}$ and $V^{S}$ are the query, key and value matrices spliced from the individual self-attention heads of the student model's multi-head self-attention, $(\cdot)^{\mathsf T}$ denotes transposition, and the relation matrices are computed from the spliced matrices as $A^{T} = \mathrm{softmax}\big(Q^{T}(K^{T})^{\mathsf T}/\sqrt{d_t}\big)$ and $A^{S} = \mathrm{softmax}\big(Q^{S}(K^{S})^{\mathsf T}/\sqrt{d_s}\big)$;
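A sketch of (d-2) in the spirit of MiniLM-style relation distillation; the KL form and the scaling are assumptions, since the claim's formula image is lost:

```python
import torch
import torch.nn.functional as F

def attention_distill_loss(q_s, k_s, q_t, k_t):
    """Sketch of (d-2): each model's self-attention relation matrix is
    softmax(Q K^T / sqrt(d)) over the head-spliced Q and K; rows of the
    student relation are pushed toward the teacher's with KL divergence."""
    d_s, d_t = q_s.size(-1), k_t.size(-1)
    r_s = F.log_softmax(q_s @ k_s.transpose(-2, -1) / d_s ** 0.5, dim=-1)
    r_t = F.softmax(q_t @ k_t.transpose(-2, -1) / d_t ** 0.5, dim=-1)
    return F.kl_div(r_s, r_t, reduction="batchmean")   # mean KL over relation rows
```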
(d-3) the hidden-state distillation loss $L_{hid}$ is calculated by the formula $L_{hid} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{KL}\big(H_i^{T}\,\big\|\,H_i^{S}\big)$, where $H^{S}$ is the hidden-layer state relation matrix of the student model and $H_i^{S}$ is its $i$-th row, and $H^{T}$ is the hidden-layer state relation matrix of the teacher model and $H_i^{T}$ is its $i$-th row;
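The hidden-state term of (d-3) could be sketched the same way; turning each layer's hidden states into a relation matrix lets a 192-wide student be compared with a 768-wide teacher, which is presumably why the claim uses relation matrices rather than raw states (the KL form is again an assumed reading):

```python
import torch
import torch.nn.functional as F

def hidden_distill_loss(h_s, h_t):
    """Sketch of (d-3): hidden states of a paired student/teacher encoder are
    mapped to relation matrices softmax(H H^T / sqrt(d)), matched row-wise."""
    r_s = F.log_softmax(h_s @ h_s.transpose(-2, -1) / h_s.size(-1) ** 0.5, dim=-1)
    r_t = F.softmax(h_t @ h_t.transpose(-2, -1) / h_t.size(-1) ** 0.5, dim=-1)
    return F.kl_div(r_s, r_t, reduction="batchmean")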
(d-4) supervising the layer 1 encoder of the student model with the layer 4 encoder of the teacher model, supervising the layer 2 encoder of the student model with the layer 8 encoder of the teacher model, supervising the layer 3 encoder of the student model with the layer 12 encoder of the teacher model;
(d-5) the total loss of student model training $L_{total}$ is calculated by the formula $L_{total} = L_{logits} + \beta L_{att} + \gamma L_{hid}$, where $\beta$ is the weight of the self-attention distillation loss and $\gamma$ is the weight of the hidden-layer state distillation loss.
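Putting the distillation terms together per (d-4)-(d-5); the weight names `beta`/`gamma` and their defaults are placeholders:

```python
# Layer pairing of step (d-4): teacher layers 4, 8 and 12 supervise
# student layers 1, 2 and 3 respectively.
TEACHER_LAYER_FOR_STUDENT = {1: 4, 2: 8, 3: 12}

def total_distill_loss(l_logits, attn_losses, hidden_losses, beta=1.0, gamma=1.0):
    """Sketch of (d-5): logits loss plus the attention and hidden-state
    distillation losses summed over the supervised layer pairs."""
    return l_logits + beta * sum(attn_losses) + gamma * sum(hidden_losses)
```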
9. The visual Transformer-based malware identification method of claim 3, wherein step (e) comprises the steps of:
(e-2) the visualized RGB image $I$ is scaled to obtain the scaled visualized RGB image $I'$; using the Flatten function in the torch library, the pixel values of each row of $I'$ are flattened, so that the 3D visualized RGB image $I' \in \mathbb{R}^{H\times W\times 3}$ is converted into a 2D line sequence $X \in \mathbb{R}^{H\times 3W}$;
(e-3) the 2D line sequence $X$ is input into the distillation-trained lightweight visual Transformer model to obtain the predicted detection logit $p_{det}^{s}$ and the predicted family classification logits $p_{fam}^{s}$; if $\mathrm{Sigmoid}(p_{det}^{s}) > 0.5$ the unknown software is judged to be malware, otherwise the unknown software is judged to be benign software; when the distillation-trained lightweight visual Transformer model judges the input unknown software to be malware, the family to which the malware belongs is the family corresponding to the highest value in $p_{fam}^{s}$.
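End-to-end inference over an unknown sample, per (e-2)-(e-3); the 224x224 input size, the (3, H, W) float input layout, and the 0-threshold on the raw detection logit (equivalent to Sigmoid > 0.5) are assumptions:

```python
import torch

@torch.no_grad()
def classify_unknown(model, rgb_image, families):
    """Sketch of (e-2)-(e-3): scale the visualized RGB image, flatten each row
    into a 2D line sequence, and read the two heads of the distilled student."""
    img = torch.nn.functional.interpolate(        # scale to the model's input size
        rgb_image.unsqueeze(0), size=(224, 224), mode="bilinear")
    rows = torch.flatten(img.squeeze(0).permute(1, 0, 2), start_dim=1)  # (224, 3*224)
    p_det, p_fam = model(rows.unsqueeze(0))
    if p_det.item() > 0:                          # Sigmoid(p_det) > 0.5
        return "malware", families[p_fam.argmax(-1).item()]
    return "benign", None
```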
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202310063452.3A CN115879109B (en) | 2023-02-06 | 2023-02-06 | Malicious software identification method based on visual Transformer
Publications (2)

Publication Number | Publication Date
---|---
CN115879109A | 2023-03-31
CN115879109B | 2023-05-12

Family

ID=85758746

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202310063452.3A (Active) CN115879109B (en) | Malicious software identification method based on visual Transformer | 2023-02-06 | 2023-02-06

Country Status (1)

Country | Link
---|---
CN | CN115879109B (en)
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180007074A1 (en) * | 2015-01-14 | 2018-01-04 | Virta Laboratories, Inc. | Anomaly and malware detection using side channel analysis |
US20160358312A1 (en) * | 2015-06-05 | 2016-12-08 | Mindaptiv LLC | Digital quaternion logarithm signal processing system and method for images and other data types |
CN110633570A (en) * | 2019-07-24 | 2019-12-31 | 浙江工业大学 | Black box attack defense method for malicious software assembly format detection model |
CN114065199A (en) * | 2021-11-18 | 2022-02-18 | 山东省计算中心(国家超级计算济南中心) | Cross-platform malicious code detection method and system |
CN114462039A (en) * | 2022-01-27 | 2022-05-10 | 北京工业大学 | Android malicious software detection method based on Transformer structure |
CN114676769A (en) * | 2022-03-22 | 2022-06-28 | 南通大学 | Visual transform-based small sample insect image identification method |
CN114694220A (en) * | 2022-03-25 | 2022-07-01 | 上海大学 | Double-flow face counterfeiting detection method based on Swin transform |
CN114818826A (en) * | 2022-05-19 | 2022-07-29 | 石家庄铁道大学 | Fault diagnosis method based on lightweight Vision Transformer module |
CN114913162A (en) * | 2022-05-25 | 2022-08-16 | 广西大学 | Bridge concrete crack detection method and device based on lightweight transform |
CN114937016A (en) * | 2022-05-25 | 2022-08-23 | 广西大学 | Bridge concrete crack real-time detection method and device based on edge calculation and Transformer |
CN115563327A (en) * | 2022-08-30 | 2023-01-03 | 电子科技大学 | Zero sample cross-modal retrieval method based on Transformer network selective distillation |
Non-Patent Citations (3)

Title |
---|
SHI CHEN et al.: "Malicious Code Family Classification Method Based on Vision Transformer", 2022 IEEE 10th International Conference on Information, Communication and Networks (ICICN) * |
XU Zhifeng: "Research on Deep-Learning-Based Malware Detection for Windows Systems", Wanfang dissertation database (《万方学位论文》) * |
WANG Zhiwen et al.: "A Survey of Machine-Learning-Based Malware Identification", Journal of Chinese Computer Systems (《小型微型计算机系统》) * |
Also Published As

Publication number | Publication date
---|---
CN115879109B (en) | 2023-05-12
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |