CN115879109A - Malicious software identification method based on visual Transformer


Info

Publication number: CN115879109A (granted as CN115879109B)
Application number: CN202310063452.3A
Authority: CN (China)
Prior art keywords: layer, tensor, attention, model, encoder
Other languages: Chinese (zh)
Inventors: 刘广起, 王志文, 韩晓晖, 左文波
Applicant/Assignee: Qilu University of Technology; Shandong Computer Science Center (National Supercomputing Center in Jinan)
Legal status: Granted, active

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

A malware identification method based on a visual Transformer, belonging to the technical field of software security protection. Executable files of benign and malicious software are visualized as RGB images to construct a malware image dataset; a visual Transformer is pre-trained on the ImageNet-21K image dataset and fine-tuned on the malware image dataset; a lightweight visual Transformer is built for actual deployment on lightweight devices; the knowledge of the trained visual Transformer is migrated to the lightweight visual Transformer through knowledge distillation to reduce the performance gap between the two models; malware detection and family classification are then performed with the lightweight visual Transformer. The method guarantees detection efficiency and low hardware resource occupation while maintaining the detection and family-classification accuracy of the model.

Description

Malicious software identification method based on visual Transformer
Technical Field
The invention relates to the technical field of software security protection, in particular to a malicious software identification method based on a visual Transformer.
Background
With the rise of the Internet of Things (IoT), the variety and number of IoT devices have grown exponentially. The embedded systems carried by IoT devices often lack security considerations and expose a wider attack surface than mature Windows and Linux systems, so the attention of malware authors is gradually shifting toward IoT devices. Faster and more effective malware detection methods are therefore needed to protect IoT devices. Currently, most antivirus vendors rely on signature-based or rule-based techniques, which depend on constantly updated signature and rule libraries to detect new malware; their poor generalization makes them inadequate against the growing number of novel network threats. Machine-learning-based malware identification has become a research hotspot in recent years: features are extracted from software and machine learning algorithms automatically perform malware detection or classification. Visualizing software as a gray-scale image and then extracting features end-to-end with a Convolutional Neural Network (CNN) has proven to be one of the most effective approaches. However, the inherent inductive biases of CNNs, such as locality and translation equivariance, are naturally suited to natural images, whereas software visualization forcibly converts a 1D byte sequence into a 2D gray-scale image whose vertical (column-wise) pixels have no real correlation. Processing software gray-scale images with a CNN is therefore somewhat unreasonable, and its results may be suboptimal.
With sufficient training data, a high-complexity model has stronger pattern-recognition ability than a low-complexity one. However, the large amount of memory and computing power such a model consumes makes it unsuitable for deployment on lightweight devices. Most IoT devices are lightweight devices with extremely limited hardware resources, so low resource occupation is a prerequisite for any security-protection model deployed on them. Under the constraint of sufficiently low resource occupation, how to quickly and accurately detect malware and determine the family it belongs to, so that different countermeasures can be taken, is one of the problems to be solved urgently.
Disclosure of Invention
In order to overcome the defects of the above technologies, the invention provides a method that can accurately detect malware and determine the family it belongs to while keeping the hardware occupation of the model low.
The technical scheme adopted by the invention to overcome the above technical problems is as follows:
A malware identification method based on a visual Transformer comprises the following steps:
(a) Acquiring an ImageNet-21K image dataset and an executable file dataset of application software, wherein the executable file dataset comprises an executable file of benign software and a malicious software executable file comprising a family tag, and visualizing all samples in the executable file dataset into RGB images to construct a malicious software image dataset;
(b) Building a visual Transformer model containing an X-layer encoder, carrying out classification pre-training on the visual Transformer model by adopting an ImageNet-21K image data set, changing a full connection layer in the visual Transformer model after the classification pre-training into an ordered double-task classifier for carrying out malicious software detection and family classification, and carrying out fine tuning on the visual Transformer model by adopting a malicious software image data set;
(c) Constructing a lightweight visual Transformer model for actual deployment;
(d) Taking the fine-tuned visual Transformer model as the teacher model, taking the lightweight visual Transformer model as the student model, and performing distillation training on the student model using the self-attention matrices and hidden-layer states of the teacher model and the predicted logits of the double-task classifier as supervision information for the student model;
(e) Using the distillation-trained lightweight visual Transformer model to discriminate unknown software as benign or malicious and to determine the family label of the malware.
Further, the step of visualizing all samples in the executable file dataset as RGB images in step (a) comprises:
(a-1) reading the executable file of the application software in hexadecimal, and converting the hexadecimal numbers into decimal numbers, so that the executable file of the application software is represented as a decimal value sequence with values in the range [0, 255];
(a-2) denoting the length of the decimal value sequence by $L$, the width $W$ of the visualized image is determined from $L$ by a formula using the floor operation $\lfloor\cdot\rfloor$ (rounding down);
(a-3) taking three adjacent decimal numbers of the sequence, in order, as the R-channel, G-channel and B-channel values of a single pixel, thereby obtaining the visualized RGB image of the executable file $I \in \mathbb{R}^{H \times W \times 3}$, where $\mathbb{R}$ is the real-number space, $H$ is the image height and 3 is the number of channels; the visualized RGB images of all executable files constitute the malware image dataset.
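For illustration, the byte-to-image conversion can be sketched in a few lines of Python; this is not the patent's reference implementation, and the fixed `width` argument stands in for the width formula, which the patent derives from the sequence length.

```python
import numpy as np
from PIL import Image

def exe_to_rgb(path: str, width: int = 256) -> Image.Image:
    """Visualize an executable as an RGB image (illustrative sketch).

    Every byte is a decimal value in [0, 255]; each group of three
    consecutive bytes becomes the R, G and B channels of one pixel.
    `width` is an assumed parameter; the patent derives the width from
    the sequence length with a floor (round-down) operation.
    """
    data = np.fromfile(path, dtype=np.uint8)   # decimal value sequence
    n_pixels = len(data) // 3                  # three bytes per pixel
    height = n_pixels // width                 # keep only complete rows
    pixels = data[: height * width * 3].reshape(height, width, 3)
    return Image.fromarray(pixels, mode="RGB")
```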
Further, the step (b) comprises the steps of:
(b-1) the visual Transformer model consists, in order, of 12 encoder layers and a multilayer perceptron MLP; each encoder consists, in order, of a first normalization layer LayerNorm, a Multi-Head self-Attention mechanism, a first residual connection, a second normalization layer LayerNorm, a multilayer perceptron MLP, and a second residual connection;
(b-2) the visualized RGB image $I$ is scaled to obtain the scaled visualized RGB image $I' \in \mathbb{R}^{H' \times W' \times 3}$, where $H'$ is the height and $W'$ is the width of the scaled visualized RGB image; using the Flatten function in the torch library, the pixel values of the $i$-th row of $I'$, $I'_i \in \mathbb{R}^{W' \times 3}$, are flattened into a vector $x_i \in \mathbb{R}^{3W'}$, $i = 1, \dots, H'$, so that the 3D visualized RGB image $I'$ is converted into a 2D row sequence $X = [x_1; x_2; \dots; x_{H'}]$, $X \in \mathbb{R}^{H' \times 3W'}$;
(b-3) each element of the 2D row sequence $X$ is mapped to $D$ dimensions through a linear layer to obtain the row embedding $E \in \mathbb{R}^{H' \times D}$; using the cat function in the torch library, the learnable classification token tensor $x_{\mathrm{class}}$ is concatenated with the row embedding $E$ to obtain the concatenated tensor, which is then added to the learnable absolute position embedding $E_{pos}$ to obtain the tensor $Z_0 \in \mathbb{R}^{(H'+1) \times D}$;
(b-4) the tensor $Z_0$ is input into the first normalization layer LayerNorm of the layer-1 encoder of the visual Transformer model and normalized to obtain the tensor $Z'_0$; the Multi-Head self-Attention mechanism of the layer-1 encoder contains $h$ attention heads; the tensor $Z'_0$ is input into the Multi-Head self-Attention mechanism, and the $i$-th attention head linearly maps $Z'_0$ to obtain the query matrix $Q_i = Z'_0 W_i^Q + b_i^Q$, the key matrix $K_i = Z'_0 W_i^K + b_i^K$ and the value matrix $V_i = Z'_0 W_i^V + b_i^V$, where $W_i^Q$, $W_i^K$, $W_i^V$ are the weight matrices of the linear transformations and $b_i^Q$, $b_i^K$, $b_i^V$ are bias vectors; the embedding fused with global attention is calculated as $\mathrm{head}_i = A_i V_i$, where the attention score is $A_i = \mathrm{Softmax}\!\left(Q_i K_i^{\top} / \sqrt{d_k}\right)$, $d_k$ is the per-head dimension, $\top$ denotes transposition and $\mathrm{Softmax}(\cdot)$ is the Softmax activation function; using the cat function in the torch library, the globally attended embeddings $\mathrm{head}_1, \dots, \mathrm{head}_h$ output by the $h$ attention heads are concatenated, and the concatenation result together with the tensor $Z_0$ is passed, in order, through the first residual connection and the second normalization layer LayerNorm of the layer-1 encoder to output the tensor $U_1 \in \mathbb{R}^{(H'+1) \times D}$; the tensor $U_1$ is input into the multilayer perceptron MLP of the layer-1 encoder, and the tensor $M_1 = \mathrm{GELU}(U_1 W_1 + b_1) W_2 + b_2$ is computed, where $\mathrm{GELU}(\cdot)$ is the GELU activation function, $W_1 \in \mathbb{R}^{D \times D_m}$ is the weight matrix of the first-layer neurons of the MLP, $W_2 \in \mathbb{R}^{D_m \times D}$ is the weight matrix of the second-layer neurons of the MLP, $b_1$ and $b_2$ are the bias vectors of the first-layer and second-layer neurons respectively, and $D_m$ is the embedding dimension of the first-layer neurons of the MLP; the tensors $M_1$ and $U_1$ are input into the second residual connection of the layer-1 encoder, and the output is the output tensor of the layer-1 encoder $Z_1 \in \mathbb{R}^{(H'+1) \times D}$;
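A minimal PyTorch sketch of one encoder layer as described in steps (b-1) and (b-4) is given below; it follows the stated order LayerNorm, Multi-Head Attention, residual, LayerNorm, MLP, residual, and the default sizes (dim, heads, mlp_dim) are illustrative assumptions rather than values stated in the patent.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One visual-Transformer encoder layer, following step (b-4)."""

    def __init__(self, dim: int = 768, heads: int = 12, mlp_dim: int = 3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Multi-head self-attention on the normalized input, then the
        # first residual connection with the un-normalized input and the
        # second LayerNorm, as the step text describes.
        h = self.norm1(z)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        u = self.norm2(z + attn_out)
        # Two-layer MLP with GELU, followed by the second residual connection.
        return u + self.mlp(u)
```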
(b-5) the tensor $Z_1$ replaces the tensor $Z_0$ in step (b-4), and step (b-4) is repeated to obtain the output tensor $Z_2$ of the layer-2 encoder;
(b-6) the tensor $Z_2$ replaces the tensor $Z_1$ in step (b-5), and step (b-5) is repeated to obtain the output tensor $Z_3$ of the layer-3 encoder;
(b-7) the output of the $l$-th encoder is taken as the input of the $(l+1)$-th encoder, and step (b-6) is repeated accordingly to obtain the tensor $Z_{12}$ output by the layer-12 encoder, $Z_{12} \in \mathbb{R}^{(H'+1) \times D}$;
(b-8) the vector at position 0 of the tensor $Z_{12}$, i.e., the embedding vector corresponding to the learnable classification token tensor $x_{\mathrm{class}}$, is taken as $z \in \mathbb{R}^{D}$; the embedding vector $z$ is input into the multilayer perceptron MLP of the visual Transformer model, which outputs the tensor $y$; the tensor $y$ is input into the fully connected layer FC to obtain the classification result output by the visual Transformer model;
(b-9) the visual Transformer model is pre-trained for classification using the ImageNet-21K image dataset.
Further, in step (b-3), the 2D row sequence $X$ is input into the linear layer, and the row embedding is calculated as $E = X W_E + b_E$, where $W_E \in \mathbb{R}^{3W' \times D}$ is the weight matrix of the linear mapping layer and $b_E$ is the bias vector.
Further, in step (b), the step of changing the fully connected layer in the classification-pre-trained visual Transformer model into an ordered double-task classifier for malware detection and family classification comprises:
(b-10) the fully connected layer FC in step (b-8) is changed into an ordered double-task classifier, which comprises a detection task for detecting malware and a family classification task for determining the family of the malware, the detection task and the family classification task each being composed of two fully connected layers FC;
(b-11) the tensor $y$ is input into the detection task, and the predicted logits of the detection task $l_d$ are calculated by passing $y$ through its two fully connected layers FC, where $W_{d1}$ and $b_{d1}$ are the weight matrix and bias vector of the first fully connected layer FC of the detection task, and $W_{d2}$ and $b_{d2}$ are the weight matrix and bias vector of the second fully connected layer FC of the detection task;
(b-12) the tensor $y$ is input into the family classification task, and the predicted logits of the family classification task $l_f \in \mathbb{R}^{N_f}$ are calculated by passing $y$ through its two fully connected layers FC, where $W_{f1}$ and $b_{f1}$ are the weight matrix and bias vector of the first fully connected layer FC of the family classification task, $W_{f2}$ and $b_{f2}$ are the weight matrix and bias vector of the second fully connected layer FC of the family classification task, and $N_f$ is the number of families.
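The ordered double-task classifier of steps (b-10) to (b-12) can be sketched as below. The way the detection branch feeds the family branch (concatenating the detection branch's first-layer output into the family branch's second layer) follows the coupling described in the detailed embodiment further on; the concatenation itself and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class OrderedDualTaskHead(nn.Module):
    """Sketch of the ordered double-task classifier: a detection branch and
    a family-classification branch, each made of two fully connected layers.
    Coupling the branches by concatenation is an assumption."""

    def __init__(self, dim: int = 768, hidden: int = 256, n_families: int = 20):
        super().__init__()
        self.det_fc1 = nn.Linear(dim, hidden)
        self.det_fc2 = nn.Linear(hidden, 1)                # benign/malicious logit
        self.fam_fc1 = nn.Linear(dim, hidden)
        self.fam_fc2 = nn.Linear(hidden * 2, n_families)   # sees both branches

    def forward(self, y: torch.Tensor):
        d_hidden = self.det_fc1(y)
        det_logit = self.det_fc2(d_hidden)
        # The family branch's second layer also receives the detection
        # branch's first-layer output (the "ordered" coupling).
        fam_logits = self.fam_fc2(torch.cat([self.fam_fc1(y), d_hidden], dim=-1))
        return det_logit, fam_logits
```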
Further, the step of fine-tuning the visual Transformer model with the malware image dataset in step (b) comprises:
(b-13) the loss is calculated as the sum of a binary cross-entropy term between the Sigmoid-activated detection logits $\sigma(l_d)$ and the detection task label $y_d$, and a cross-entropy term between the family-classification logits $l_f$ and the one-hot family label $y_f$ of the malicious sample, where $\sigma(\cdot)$ is the Sigmoid activation function, $y_d \in \{0, 1\}$ with 0 denoting benign and 1 denoting malicious, and $y_f$ is the one-hot label of the malicious sample's family.
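A hedged sketch of the fine-tuning loss of step (b-13) follows; restricting the family cross-entropy term to samples labeled malicious is an assumption consistent with the ordered tasks, and `y_fam` is taken here as a family index rather than a one-hot vector.

```python
import torch
import torch.nn.functional as F

def finetune_loss(det_logit, fam_logits, y_det, y_fam):
    """Binary cross-entropy on the sigmoid-activated detection logit plus
    cross-entropy on the family logits (sketch of step (b-13))."""
    bce = F.binary_cross_entropy(torch.sigmoid(det_logit.squeeze(-1)), y_det.float())
    mask = y_det.bool()                     # family loss only on malware (assumed)
    if mask.any():
        ce = F.cross_entropy(fam_logits[mask], y_fam[mask])
    else:
        ce = fam_logits.sum() * 0.0         # keep the graph when a batch is all benign
    return bce + ce
```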
Further, the step (c) comprises the steps of:
(c-1) the lightweight visual Transformer model consists, in order, of 3 encoder layers and a multilayer perceptron MLP; each encoder consists, in order, of a first normalization layer LayerNorm, a Multi-Head self-Attention mechanism, a first residual connection, a second normalization layer LayerNorm, a multilayer perceptron MLP, and a second residual connection; the number of attention heads of its Multi-Head self-Attention mechanism is $h_s$, and the internal embedding dimension of the lightweight visual Transformer model is $D_s$;
(c-2) each element of the 2D row sequence $X$ is mapped to $D_s$ dimensions through a linear layer to obtain the row embedding $E^S$; using the cat function in the torch library, the learnable classification token tensor $x^S_{\mathrm{class}}$ is concatenated with the row embedding $E^S$ to obtain the concatenated tensor, which is added to the learnable absolute position embedding $E^S_{pos}$ to obtain the tensor $Z^S_0 \in \mathbb{R}^{(H'+1) \times D_s}$;
(c-3) the tensor $Z^S_0$ is input into the first normalization layer LayerNorm of the layer-1 encoder of the lightweight visual Transformer model and normalized to obtain the tensor $Z'^S_0$; the Multi-Head self-Attention mechanism of the layer-1 encoder contains $h_s$ attention heads; the tensor $Z'^S_0$ is input into the Multi-Head self-Attention mechanism, and the $i$-th attention head linearly maps $Z'^S_0$ to obtain the query matrix $Q^S_i$, key matrix $K^S_i$ and value matrix $V^S_i$, whose linear transformations have weight matrices $W_i^{Q,S}$, $W_i^{K,S}$, $W_i^{V,S}$ and bias vectors $b_i^{Q,S}$, $b_i^{K,S}$, $b_i^{V,S}$; the embedding fused with global attention is calculated as $\mathrm{head}^S_i = A^S_i V^S_i$ with the attention score $A^S_i = \mathrm{Softmax}\!\left(Q^S_i {K^S_i}^{\top} / \sqrt{d^S_k}\right)$, where $d^S_k$ is the per-head dimension and $\top$ denotes transposition; using the cat function in the torch library, the embeddings output by the $h_s$ attention heads are concatenated, and the concatenation result together with the tensor $Z^S_0$ is passed, in order, through the first residual connection and the second normalization layer LayerNorm of the layer-1 encoder to output the tensor $U^S_1$; the tensor $U^S_1$ is input into the multilayer perceptron MLP of the layer-1 encoder and the tensor $M^S_1 = \mathrm{GELU}(U^S_1 W^S_1 + b^S_1) W^S_2 + b^S_2$ is computed, where $W^S_1$ and $b^S_1$ are the weight matrix and bias vector of the first-layer MLP neurons, $W^S_2$ and $b^S_2$ are the weight matrix and bias vector of the second-layer MLP neurons, and $D^S_m$ is the embedding dimension of the first-layer MLP neurons; the tensors $M^S_1$ and $U^S_1$ are input into the second residual connection of the layer-1 encoder, and the output is the output tensor of the layer-1 encoder $Z^S_1$;
(c-4) the tensor $Z^S_1$ replaces the tensor $Z^S_0$ in step (c-3), and step (c-3) is repeated to obtain the output tensor $Z^S_2$ of the layer-2 encoder;
(c-5) the tensor $Z^S_2$ replaces the tensor $Z^S_1$ in step (c-4), and step (c-4) is repeated to obtain the output tensor $Z^S_3$ of the layer-3 encoder, $Z^S_3 \in \mathbb{R}^{(H'+1) \times D_s}$;
(c-6) the vector at position 0 of the tensor $Z^S_3$, i.e., the embedding vector corresponding to the learnable classification token tensor, is taken as $z^S \in \mathbb{R}^{D_s}$; the embedding vector $z^S$ is input into the multilayer perceptron MLP of the lightweight visual Transformer model, which outputs the tensor $y^S$;
(c-7) the tensor $y^S$ is input into the detection task, and the predicted logits of the detection task $l^S_d$ are calculated by passing $y^S$ through its two fully connected layers FC, whose weight matrices and bias vectors are $W^S_{d1}$, $b^S_{d1}$ and $W^S_{d2}$, $b^S_{d2}$ respectively;
(c-8) the tensor $y^S$ is input into the family classification task, and the predicted logits of the family classification task $l^S_f \in \mathbb{R}^{N_f}$ are calculated by passing $y^S$ through its two fully connected layers FC, whose weight matrices and bias vectors are $W^S_{f1}$, $b^S_{f1}$ and $W^S_{f2}$, $b^S_{f2}$ respectively.
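The teacher and student configurations can be contrasted with a small sketch; the student's head count, embedding dimension and MLP width below are placeholders, since the patent only fixes the encoder depths (12 versus 3). `EncoderBlock` and `OrderedDualTaskHead` refer to the sketches shown earlier.

```python
import torch.nn as nn

TEACHER_CFG = dict(depth=12, heads=12, dim=768, mlp_dim=3072)   # teacher (step b)
STUDENT_CFG = dict(depth=3,  heads=4,  dim=256, mlp_dim=1024)   # student (step c), placeholder sizes

def build_vit(cfg, n_families: int = 20) -> nn.ModuleDict:
    """Assemble the encoder stack and the double-task head for either configuration."""
    return nn.ModuleDict({
        "encoders": nn.ModuleList(
            [EncoderBlock(cfg["dim"], cfg["heads"], cfg["mlp_dim"])
             for _ in range(cfg["depth"])]),
        "head": OrderedDualTaskHead(cfg["dim"], n_families=n_families),
    })
```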
Further, the step (d) comprises the steps of:
(d-1) the predicted-logits distillation loss $\mathcal{L}_{pred}$ is calculated from the L2 loss between the temperature-scaled predicted logits of the teacher and student classifiers, where $\alpha$ is an influence factor controlling the ratio of the classification losses, $\mathrm{L2}(\cdot)$ is the L2 loss, $t_d$ is the temperature hyper-parameter for distilling the detection-task classifier, and $t_f$ is the temperature hyper-parameter for distilling the family-classification-task classifier;
(d-2) the self-attention distillation loss $\mathcal{L}_{attn}$ is calculated by comparing, row by row, the correlation matrices between the self-attention matrices of the student model and of the teacher model, where $R^S$ is the correlation matrix between the self-attention matrices of the student model with $i$-th row $R^S_i$, and $R^T$ is the correlation matrix between the self-attention matrices of the teacher model with $i$-th row $R^T_i$; $Q^T$, $K^T$ and $V^T$ are the query, key and value matrices spliced from the self-attention heads of the teacher model's Multi-Head self-Attention, $Q^S$, $K^S$ and $V^S$ are the query, key and value matrices spliced from the self-attention heads of the student model's Multi-Head self-Attention, and $\top$ denotes transposition; the correlation matrices are obtained from these spliced matrices and their transposes;
(d-3) the hidden-layer-state distillation loss $\mathcal{L}_{hid}$ is calculated by comparing, row by row, the hidden-layer-state correlation matrix of the student model $H^S$ (with $i$-th row $H^S_i$) and the hidden-layer-state correlation matrix of the teacher model $H^T$ (with $i$-th row $H^T_i$);
(d-4) the layer-4 encoder of the teacher model supervises the layer-1 encoder of the student model, the layer-8 encoder of the teacher model supervises the layer-2 encoder of the student model, and the layer-12 encoder of the teacher model supervises the layer-3 encoder of the student model;
(d-5) the total loss of the student model training is calculated as $\mathcal{L} = \mathcal{L}_{pred} + \beta_1 \mathcal{L}_{attn} + \beta_2 \mathcal{L}_{hid}$, where $\beta_1$ is the weight of the self-attention distillation loss and $\beta_2$ is the weight of the hidden-layer-state distillation loss;
(d-6) the lightweight visual Transformer model is iteratively trained with the total loss $\mathcal{L}$ to obtain the distillation-trained lightweight visual Transformer model.
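One possible reading of the three distillation signals in step (d) is sketched below. The relation-matrix form (a row-wise softmax of the scaled self-product) and the row-wise KL comparison are assumptions about the "correlation matrix" losses, and the dictionary keys for the teacher/student tensors are illustrative.

```python
import torch
import torch.nn.functional as F

def relation_matrix(m: torch.Tensor) -> torch.Tensor:
    # Assumed form of a correlation matrix: softmax of the scaled self-product.
    return F.softmax(m @ m.transpose(-1, -2) / m.size(-1) ** 0.5, dim=-1)

def row_kl(teacher_rel: torch.Tensor, student_rel: torch.Tensor) -> torch.Tensor:
    # Row-wise divergence between corresponding correlation-matrix rows (assumed KL).
    return F.kl_div(student_rel.clamp_min(1e-8).log(), teacher_rel,
                    reduction="batchmean")

def distill_loss(t_out, s_out, alpha=0.5, t_det=2.0, t_fam=2.0,
                 beta_attn=1.0, beta_hidden=1.0):
    """Total distillation loss sketch: (d-1) logits distillation with
    temperatures, (d-2) relation-matrix distillation of the spliced Q/K/V,
    (d-3) hidden-state relation distillation, (d-5) weighted sum."""
    l_pred = (F.mse_loss(s_out["det_logit"] / t_det, t_out["det_logit"] / t_det)
              + alpha * F.mse_loss(s_out["fam_logits"] / t_fam,
                                   t_out["fam_logits"] / t_fam))
    l_attn = sum(row_kl(relation_matrix(t_out[k]), relation_matrix(s_out[k]))
                 for k in ("Q", "K", "V"))
    l_hidden = row_kl(relation_matrix(t_out["hidden"]),
                      relation_matrix(s_out["hidden"]))
    return l_pred + beta_attn * l_attn + beta_hidden * l_hidden
```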
Further, the step (e) comprises the steps of:
(e-1) the unknown software is visualized as an RGB image $I$;
(e-2) the visualized RGB image $I$ is scaled into the scaled visualized RGB image $I'$; using the Flatten function in the torch library, the pixel values of the $i$-th row of $I'$ are flattened, so that the 3D visualized RGB image $I'$ is converted into a 2D row sequence $X \in \mathbb{R}^{H' \times 3W'}$;
(e-3) the 2D row sequence $X$ is input into the distillation-trained lightweight visual Transformer model to obtain the predicted logits $l^S_d$ of the detection task and the predicted logits $l^S_f$ of the family classification task; if the detection logits indicate the malicious class ($\sigma(l^S_d) > 0.5$), the unknown software is judged to be malware, otherwise it is judged to be benign software; when the distillation-trained lightweight visual Transformer model judges the input unknown software to be malware, the family to which the malware belongs is the family corresponding to the highest value in $l^S_f$.
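An end-to-end inference sketch for step (e) is shown below; `exe_to_rgb` is the visualization sketch given earlier, and the 224x224 resize, the model's input/output signature and the 0.5 sigmoid threshold are all assumptions rather than values stated in the patent.

```python
import numpy as np
import torch

@torch.no_grad()
def classify(model, exe_path: str, families: list):
    """Visualize, flatten each row, run the distilled student model, then
    threshold the detection logit and take the arg-max family."""
    img = exe_to_rgb(exe_path).resize((224, 224))          # scaled RGB image (assumed size)
    x = torch.as_tensor(np.array(img), dtype=torch.float32)
    rows = x.reshape(x.shape[0], -1).unsqueeze(0)          # (1, H', 3*W') row sequence
    det_logit, fam_logits = model(rows)                    # assumed output signature
    if torch.sigmoid(det_logit).item() > 0.5:              # assumed decision threshold
        return "malware", families[int(fam_logits.argmax())]
    return "benign", None
```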
The beneficial effects of the invention are as follows. The method processes only the executable file of the software, based on static analysis, avoiding the time cost of disassembly, dynamic execution or manual feature extraction, and is therefore suitable for detection tasks with high timeliness requirements. A visual Transformer automatically extracts features from the RGB image obtained by visualizing the software, and the pixel values of each row of the image are used as one element of the model's input sequence, which effectively avoids the suboptimal results of CNN-based recognition caused by the lack of correlation between vertical pixels of the visualized image. Malware detection and classification are performed as an ordered multi-task combination, so that malware can be classified while it is detected, in order to generate warnings of different levels for malware of different families and take corresponding measures. Furthermore, compared with a single task (treating benign software as one harmless malware family and jointly performing malware detection and classification), the ordered multi-task design alleviates the negative impact of the relatively difficult malware family classification task on the cost-sensitive malware detection task. Three kinds of knowledge distillation are used to transfer the knowledge of the large-scale teacher model to the small-scale student model, maximizing the performance gain of the student model.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is an example of an executable file being visualized as an RGB image according to the present invention;
FIG. 3 is a schematic structural diagram of the visual Transformer according to the present invention;
FIG. 4 is a schematic structural diagram of the knowledge distillation according to the present invention.
Detailed Description
The invention will be further described with reference to fig. 1 to 4.
As shown in fig. 1, a malware identification method based on visual Transformer includes the following steps:
(a) Acquiring an ImageNet-21K image dataset and an executable file dataset of application software, wherein the executable file dataset comprises an executable file of benign software and a malware executable file comprising a family tag, and visualizing all samples in the executable file dataset into RGB images to construct a malware image dataset.
(b) The method comprises the steps of building a visual Transformer model comprising an X-layer encoder, carrying out classification pre-training on the visual Transformer model by adopting an ImageNet-21K image data set, changing a full connection layer in the visual Transformer model after classification pre-training into an ordered double-task classifier for malicious software detection and family classification, and carrying out fine adjustment on the visual Transformer model by adopting a malicious software image data set.
(c) A lightweight visual Transformer model is built for actual deployment.
(d) The fine-tuned visual Transformer model is taken as the teacher model and the lightweight visual Transformer model as the student model. To make the performance of the lightweight model comparable to that of the large-scale model and increase the feasibility of deploying it on lightweight devices, knowledge distillation is introduced to substantially improve the performance of the lightweight model. Specifically, the self-attention matrices and hidden-layer states of the teacher model and the predicted logits of its double-task classifier are used as supervision information for distillation training of the student model.
(e) The distillation-trained lightweight visual Transformer model is used to discriminate unknown software as benign or malicious and to determine the family label of the malware.
The invention processes only the executable file of the software, based on static analysis, and uses the lightweight model for inference, which ensures detection efficiency and low hardware resource occupation. The visual Transformer automatically extracts features from the visualized image of the executable file, which addresses the problem that the vertical pixels of the visualized image have no correlation; knowledge distillation further improves the performance of the model, ensuring its detection and family-classification accuracy.
In one embodiment of the present invention, as shown in FIG. 2, the step of visualizing all samples in the executable file dataset as RGB images in step (a) comprises:
(a-1) reading the executable file of the application software, which is a binary file, in hexadecimal, and converting the hexadecimal numbers into decimal numbers, so that the executable file of the application software is represented as a decimal value sequence with values in the range [0, 255].
(a-2) denoting the length of the decimal value sequence by $L$, the width $W$ of the visualized image is determined from $L$ by a formula using the floor operation $\lfloor\cdot\rfloor$ (rounding down).
(a-3) three consecutive decimal numbers of the sequence are taken, in order, as the R-channel, G-channel and B-channel values of a single pixel, giving the visualized RGB image of the executable file $I \in \mathbb{R}^{H \times W \times 3}$, where $\mathbb{R}$ is the real-number space, $H$ is the image height and 3 is the number of channels; the visualized RGB images of all executable files constitute the malware image dataset.
In one embodiment of the present invention, as shown in fig. 3, the step (b) comprises the steps of:
(b-1) the visual Transformer model consists, in order, of 12 encoder layers and a multilayer perceptron MLP; the multilayer perceptron MLP of the visual Transformer model is used for classification, and each encoder consists, in order, of a first normalization layer LayerNorm, a Multi-Head self-Attention mechanism, a first residual connection, a second normalization layer LayerNorm, a multilayer perceptron MLP, and a second residual connection.
(b-2) the visualized RGB image $I$ is scaled to obtain the scaled visualized RGB image $I' \in \mathbb{R}^{H' \times W' \times 3}$, where $H'$ is the height and $W'$ is the width of the scaled visualized RGB image; using the Flatten function in the torch library, the pixel values of the $i$-th row of $I'$, $I'_i \in \mathbb{R}^{W' \times 3}$, are flattened into a vector $x_i \in \mathbb{R}^{3W'}$, so that the 3D visualized RGB image $I'$ is converted into a 2D row sequence $X \in \mathbb{R}^{H' \times 3W'}$.
(b-3) each element of the 2D row sequence $X$ is mapped to $D$ dimensions through a linear layer to obtain the row embedding $E \in \mathbb{R}^{H' \times D}$; using the cat function in the torch library, the learnable classification token tensor $x_{\mathrm{class}}$ is concatenated with the row embedding $E$ to obtain the concatenated tensor, which is added to the learnable absolute position embedding $E_{pos}$ to obtain the tensor $Z_0 \in \mathbb{R}^{(H'+1) \times D}$. The learnable classification token tensor $x_{\mathrm{class}}$ and the learnable absolute position embedding $E_{pos}$ are prior art and are essentially learnable parameters.
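The row-embedding step (b-3) can be sketched as a small PyTorch module; dimension names are illustrative, and the zero initialization of the learnable parameters is an assumption.

```python
import torch
import torch.nn as nn

class RowEmbedding(nn.Module):
    """Linear projection of each flattened image row to D dimensions, a
    learnable classification token prepended with torch.cat, and a
    learnable absolute position embedding added (sketch of step (b-3))."""

    def __init__(self, row_len: int, n_rows: int, dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(row_len, dim)                    # E = X W_E + b_E
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable class token
        self.pos_embed = nn.Parameter(torch.zeros(1, n_rows + 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_rows, row_len), the 2D row sequence of each image
        e = self.proj(x)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, e], dim=1) + self.pos_embed
```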
(b-4) the tensor $Z_0$ is input into the first normalization layer LayerNorm of the layer-1 encoder of the visual Transformer model and normalized to obtain the tensor $Z'_0$; the Multi-Head self-Attention mechanism of the layer-1 encoder contains $h$ attention heads; each attention head operates on the tensor $Z'_0$ to extract features from a different perspective, and the per-head results are spliced and fused afterwards. The tensor $Z'_0$ is input into the Multi-Head self-Attention mechanism, and the $i$-th attention head linearly maps $Z'_0$ to obtain the query matrix $Q_i = Z'_0 W_i^Q + b_i^Q$, the key matrix $K_i = Z'_0 W_i^K + b_i^K$ and the value matrix $V_i = Z'_0 W_i^V + b_i^V$, where $W_i^Q$, $W_i^K$, $W_i^V$ are the weight matrices of the linear transformations and $b_i^Q$, $b_i^K$, $b_i^V$ are bias vectors. The embedding fused with global attention is calculated as $\mathrm{head}_i = A_i V_i$; it is a weighted sum of the value matrix, with the attention scores as weights. The attention score is $A_i = \mathrm{Softmax}\!\left(Q_i K_i^{\top} / \sqrt{d_k}\right)$, where $d_k$ is the per-head dimension, $\top$ denotes transposition and $\mathrm{Softmax}(\cdot)$ is the Softmax activation function, which maps the attention scores of each row of the matrix into the range [0, 1] so that each row sums to 1. Using the cat function in the torch library, the globally attended embeddings output by the $h$ attention heads are concatenated, and the concatenation result together with the tensor $Z_0$ is passed, in order, through the first residual connection and the second normalization layer LayerNorm of the layer-1 encoder to output the tensor $U_1$. The tensor $U_1$ is input into the multilayer perceptron MLP of the layer-1 encoder, and the tensor $M_1 = \mathrm{GELU}(U_1 W_1 + b_1) W_2 + b_2$ is computed, where $\mathrm{GELU}(\cdot)$ is the GELU activation function, $W_1$ is the weight matrix of the first-layer neurons of the MLP, $W_2$ is the weight matrix of the second-layer neurons of the MLP, $b_1$ and $b_2$ are the bias vectors of the first-layer and second-layer neurons respectively, and $D_m$ is the embedding dimension of the first-layer neurons of the MLP. The tensors $M_1$ and $U_1$ are input into the second residual connection of the layer-1 encoder, and the output is the output tensor of the layer-1 encoder $Z_1$.
(b-5) the tensor $Z_1$ replaces the tensor $Z_0$ in step (b-4), and step (b-4) is repeated to obtain the output tensor $Z_2$ of the layer-2 encoder.
(b-6) the tensor $Z_2$ replaces the tensor $Z_1$ in step (b-5), and step (b-5) is repeated to obtain the output tensor $Z_3$ of the layer-3 encoder.
(b-7) the output of the $l$-th encoder is taken as the input of the $(l+1)$-th encoder, and step (b-6) is repeated accordingly to obtain the tensor $Z_{12}$ output by the layer-12 encoder, $Z_{12} \in \mathbb{R}^{(H'+1) \times D}$.
(b-8) the vector at position 0 of the tensor $Z_{12}$, i.e., the embedding vector corresponding to the learnable classification token tensor $x_{\mathrm{class}}$, is taken as $z \in \mathbb{R}^{D}$; the embedding vector $z$ is input into the multilayer perceptron MLP of the visual Transformer model, which outputs the tensor $y$; the tensor $y$ is input into the fully connected layer FC to obtain the classification result output by the visual Transformer model.
(b-9) the visual Transformer model is pre-trained for classification using the ImageNet-21K image dataset. This compensates, to a certain extent, for the Transformer's lack of inductive biases such as locality and translation equivariance.
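A hedged sketch of the fine-tuning stage that follows pre-training is given below: the pre-trained classification layer is assumed to have already been replaced by the ordered double-task head, `finetune_loss` is the loss sketch given earlier, and the optimizer choice and hyper-parameters are illustrative rather than taken from the patent.

```python
import torch

def finetune(model, malware_loader, epochs: int = 10, lr: float = 1e-4):
    """Fine-tune the pre-trained visual Transformer on the malware image
    dataset (sketch); batches yield (row_sequences, y_det, y_fam)."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.05)
    for _ in range(epochs):
        for rows, y_det, y_fam in malware_loader:
            det_logit, fam_logits = model(rows)
            loss = finetune_loss(det_logit, fam_logits, y_det, y_fam)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```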
In one embodiment of the present invention, in step (b-3), the 2D row sequence $X$ is input into the linear layer, and the row embedding is calculated as $E = X W_E + b_E$, where $W_E \in \mathbb{R}^{3W' \times D}$ is the weight matrix of the linear mapping layer and $b_E$ is the bias vector.
In an embodiment of the present invention, the step in step (b) of changing the fully connected layer in the classification-pre-trained visual Transformer model into an ordered double-task classifier for malware detection and family classification comprises:
(b-10) the fully connected layer FC in step (b-8) is changed into an ordered double-task classifier, which comprises a detection task for detecting malware and a family classification task for determining the family of the malware, the detection task and the family classification task each being composed of two fully connected layers FC. In addition, because the malware family classification task is performed on the premise that the input is malicious, the output state of the first fully connected layer of the detection task is used as one of the inputs of the second fully connected layer of the family classification task.
(b-11) the tensor $y$ is input into the detection task, and the predicted logits of the detection task $l_d$ are calculated by passing $y$ through its two fully connected layers FC, where $W_{d1}$ and $b_{d1}$ are the weight matrix and bias vector of the first fully connected layer FC of the detection task, and $W_{d2}$ and $b_{d2}$ are the weight matrix and bias vector of the second fully connected layer FC of the detection task.
(b-12) the tensor $y$ is input into the family classification task, and the predicted logits of the family classification task $l_f \in \mathbb{R}^{N_f}$ are calculated by passing $y$, together with the output state of the detection task's first fully connected layer, through the two fully connected layers FC of the family classification task, where $W_{f1}$ and $b_{f1}$ are the weight matrix and bias vector of its first fully connected layer, $W_{f2}$ and $b_{f2}$ are the weight matrix and bias vector of its second fully connected layer, and $N_f$ is the number of families.
In an embodiment of the present invention, the step of fine-tuning the visual Transformer model with the malware image dataset in step (b) comprises:
(b-13) the loss is calculated as the sum of a binary cross-entropy term between the Sigmoid-activated detection logits $\sigma(l_d)$ and the detection task label $y_d$, and a cross-entropy term between the family-classification logits $l_f$ and the one-hot family label $y_f$ of the malicious sample, where $\sigma(\cdot)$ is the Sigmoid activation function, $y_d \in \{0, 1\}$ with 0 denoting benign and 1 denoting malicious, and $y_f$ is the one-hot label of the malicious sample's family.
In one embodiment of the present invention, step (c) comprises the following steps:
(c-1) the lightweight visual Transformer model consists, in order, of 3 encoder layers and a multilayer perceptron MLP; each encoder consists, in order, of a first normalization layer LayerNorm, a Multi-Head self-Attention mechanism, a first residual connection, a second normalization layer LayerNorm, a multilayer perceptron MLP, and a second residual connection; the number of attention heads of its Multi-Head self-Attention mechanism is $h_s$, and the internal embedding dimension of the lightweight visual Transformer model is $D_s$.
(c-2) each element of the 2D row sequence $X$ is mapped to $D_s$ dimensions through a linear layer to obtain the row embedding $E^S$; using the cat function in the torch library, the learnable classification token tensor $x^S_{\mathrm{class}}$ is concatenated with the row embedding $E^S$ to obtain the concatenated tensor, which is added to the learnable absolute position embedding $E^S_{pos}$ to obtain the tensor $Z^S_0 \in \mathbb{R}^{(H'+1) \times D_s}$.
(c-3) the tensor $Z^S_0$ is input into the first normalization layer LayerNorm of the layer-1 encoder of the lightweight visual Transformer model and normalized to obtain the tensor $Z'^S_0$; the Multi-Head self-Attention mechanism of the layer-1 encoder contains $h_s$ attention heads; the tensor $Z'^S_0$ is input into the Multi-Head self-Attention mechanism, and the $i$-th attention head linearly maps $Z'^S_0$ to obtain the query matrix $Q^S_i$, key matrix $K^S_i$ and value matrix $V^S_i$, whose linear transformations have weight matrices $W_i^{Q,S}$, $W_i^{K,S}$, $W_i^{V,S}$ and bias vectors $b_i^{Q,S}$, $b_i^{K,S}$, $b_i^{V,S}$; the embedding fused with global attention is calculated as $\mathrm{head}^S_i = A^S_i V^S_i$ with the attention score $A^S_i = \mathrm{Softmax}\!\left(Q^S_i {K^S_i}^{\top} / \sqrt{d^S_k}\right)$; using the cat function in the torch library, the embeddings output by the $h_s$ attention heads are concatenated, and the concatenation result together with the tensor $Z^S_0$ is passed, in order, through the first residual connection and the second normalization layer LayerNorm of the layer-1 encoder to output the tensor $U^S_1$; the tensor $U^S_1$ is input into the multilayer perceptron MLP of the layer-1 encoder and the tensor $M^S_1 = \mathrm{GELU}(U^S_1 W^S_1 + b^S_1) W^S_2 + b^S_2$ is computed, where $W^S_1$ and $b^S_1$ are the weight matrix and bias vector of the first-layer MLP neurons, $W^S_2$ and $b^S_2$ are the weight matrix and bias vector of the second-layer MLP neurons, and $D^S_m$ is the embedding dimension of the first-layer MLP neurons; the tensors $M^S_1$ and $U^S_1$ are input into the second residual connection of the layer-1 encoder, and the output is the output tensor of the layer-1 encoder $Z^S_1$.
(c-4) the tensor $Z^S_1$ replaces the tensor $Z^S_0$ in step (c-3), and step (c-3) is repeated to obtain the output tensor $Z^S_2$ of the layer-2 encoder.
(c-5) the tensor $Z^S_2$ replaces the tensor $Z^S_1$ in step (c-4), and step (c-4) is repeated to obtain the output tensor $Z^S_3$ of the layer-3 encoder, $Z^S_3 \in \mathbb{R}^{(H'+1) \times D_s}$.
(c-6) the vector at position 0 of the tensor $Z^S_3$, i.e., the embedding vector corresponding to the learnable classification token tensor, is taken as $z^S \in \mathbb{R}^{D_s}$; the embedding vector $z^S$ is input into the multilayer perceptron MLP of the lightweight visual Transformer model, which outputs the tensor $y^S$.
(c-7) the tensor $y^S$ is input into the detection task, and the predicted logits of the detection task $l^S_d$ are calculated by passing $y^S$ through its two fully connected layers FC, whose weight matrices and bias vectors are $W^S_{d1}$, $b^S_{d1}$ and $W^S_{d2}$, $b^S_{d2}$ respectively.
(c-8) the tensor $y^S$ is input into the family classification task, and the predicted logits of the family classification task $l^S_f \in \mathbb{R}^{N_f}$ are calculated by passing $y^S$ through its two fully connected layers FC, whose weight matrices and bias vectors are $W^S_{f1}$, $b^S_{f1}$ and $W^S_{f2}$, $b^S_{f2}$ respectively.
The teacher model is used to supervise the training of the student model so that the student model mimics the teacher model's representations and achieves performance comparable to the teacher model. To make the representation capability of the student model approach the teacher model as closely as possible, three distillation methods are adopted: predicted-logits distillation, self-attention distillation and hidden-layer-state distillation. Predicted-logits distillation uses the predicted logits of the teacher model's two classification layers to supervise the training of the student model. Thus, in one embodiment of the present invention, as shown in FIG. 4, step (d) comprises the following steps:
(d-1) the predicted-logits distillation loss $\mathcal{L}_{pred}$ is calculated from the L2 loss between the temperature-scaled predicted logits of the teacher and student classifiers, where $\alpha$ is an influence factor controlling the ratio of the classification losses, $\mathrm{L2}(\cdot)$ is the L2 loss, $t_d$ is the temperature hyper-parameter for distilling the detection-task classifier, and $t_f$ is the temperature hyper-parameter for distilling the family-classification-task classifier.
(d-2) because the number of attention heads and the embedding dimension of the Multi-Head self-Attention in the teacher model's encoders are inconsistent with those of the student model's encoders, distillation is carried out on the correlations between the attention heads. Specifically, the self-attention distillation loss $\mathcal{L}_{attn}$ is calculated by comparing, row by row, the correlation matrices between the self-attention matrices of the student model and of the teacher model, where $R^S$ is the correlation matrix between the self-attention matrices of the student model with $i$-th row $R^S_i$, and $R^T$ is the correlation matrix between the self-attention matrices of the teacher model with $i$-th row $R^T_i$; $Q^T$, $K^T$ and $V^T$ are the query, key and value matrices spliced from the self-attention heads of the teacher model's Multi-Head self-Attention, $Q^S$, $K^S$ and $V^S$ are the query, key and value matrices spliced from the self-attention heads of the student model's Multi-Head self-Attention, and $\top$ denotes transposition; the correlation matrices are obtained from these spliced matrices and their transposes.
(d-3) the hidden-layer-state distillation loss $\mathcal{L}_{hid}$ is calculated by comparing, row by row, the hidden-layer-state correlation matrix of the student model $H^S$ (with $i$-th row $H^S_i$) and the hidden-layer-state correlation matrix of the teacher model $H^T$ (with $i$-th row $H^T_i$).
(d-4) because the teacher model and the student model have different numbers of encoders, the self-attention distillation and the hidden-layer-state distillation cannot be matched layer by layer; therefore, the layer-4 encoder of the teacher model supervises the layer-1 encoder of the student model, the layer-8 encoder of the teacher model supervises the layer-2 encoder of the student model, and the layer-12 encoder of the teacher model supervises the layer-3 encoder of the student model;
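The fixed layer mapping of step (d-4) can be expressed directly; the helper below assumes the per-layer outputs are collected as 1-indexed lists during the forward passes.

```python
# Teacher encoder 4 supervises student encoder 1, 8 supervises 2, 12 supervises 3.
TEACHER_TO_STUDENT_LAYERS = {4: 1, 8: 2, 12: 3}

def matched_layers(teacher_states, student_states):
    """Yield (teacher, student) pairs of per-layer tensors for the mapped
    encoder layers (a sketch; how per-layer outputs are collected is assumed)."""
    for t_idx, s_idx in TEACHER_TO_STUDENT_LAYERS.items():
        yield teacher_states[t_idx - 1], student_states[s_idx - 1]
```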
(d-5) the total loss of the student model training is calculated as $\mathcal{L} = \mathcal{L}_{pred} + \beta_1 \mathcal{L}_{attn} + \beta_2 \mathcal{L}_{hid}$, where $\beta_1$ is the weight of the self-attention distillation loss and $\beta_2$ is the weight of the hidden-layer-state distillation loss;
(d-6) the lightweight visual Transformer model is iteratively trained with the total loss $\mathcal{L}$ to obtain the distillation-trained lightweight visual Transformer model.
In one embodiment of the present invention, step (e) comprises the following steps:
(e-1) the unknown software is visualized as an RGB image $I$.
(e-2) the visualized RGB image $I$ is scaled into the scaled visualized RGB image $I'$; using the Flatten function in the torch library, the pixel values of the $i$-th row of $I'$ are flattened, so that the 3D visualized RGB image $I'$ is converted into a 2D row sequence $X \in \mathbb{R}^{H' \times 3W'}$.
(e-3) the 2D row sequence $X$ is input into the distillation-trained lightweight visual Transformer model to obtain the predicted logits $l^S_d$ of the detection task and the predicted logits $l^S_f$ of the family classification task; if the detection logits indicate the malicious class ($\sigma(l^S_d) > 0.5$), the unknown software is judged to be malware, otherwise it is judged to be benign software; when the distillation-trained lightweight visual Transformer model judges the input unknown software to be malware, the family to which the malware belongs is the family corresponding to the highest value in $l^S_f$.
The improvements of this patent over the prior art are illustrated by the following tables:
Dataset: samples of 18 ransomware families, namely Babuk, BlackMatter, Cerber, Chaos, Conti, DarkSide, GandCrab, GlobeImposter, LockBit, Locky, Magniber, Makop, MedusaLocker, Nemty, Phobos, Sodinokibi, TeslaCrypt and Thanos, together with the BlackMoon and Gafgyt botnet families, were crawled from a malware sharing platform, for a total of 20 malicious families and 11841 malicious samples. In addition, 9833 benign executable files were collected from a Windows 10 system as the benign category. The experiment divides 80% of the samples of each class into the training set and the remaining 20% into the test set to evaluate model performance. Because the malicious family samples differ in number and exhibit a certain degree of class imbalance, the Macro-F1 value is added to the evaluation indices in addition to accuracy.
Table 1: Performance comparison between the lightweight visual Transformer and classic CNN networks.
Table 2: Performance comparison of the distillation methods.
Table 3: Performance comparison between the ordered double tasks and the single task.
As Table 1 shows, the lightweight visual Transformer outperforms the classic CNN networks, improving the Macro-F1 value and the accuracy by at least 1.71% and 1.42%, respectively. Table 2 shows that, compared with no distillation, each of the three distillation methods brings a certain performance gain to the student model, and distilling with the three methods jointly brings the largest improvement: after joint distillation the student model comes extremely close to the teacher model, with gaps of only 0.30% in Macro-F1 and 0.38% in accuracy. Table 3 shows that the ordered dual task has an advantage over the single task that treats benign software as one harmless malware family and performs detection and classification jointly; the ordered dual task is higher by 0.71% in Macro-F1 and 0.38% in accuracy.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described above, or equivalents may be substituted for elements thereof. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A malware identification method based on a visual Transformer, characterized by comprising the following steps:
(a) Acquiring an ImageNet-21K image dataset and an executable file dataset of application software, wherein the executable file dataset comprises executable files of benign software and executable files of malware carrying family labels, and visualizing all samples in the executable file dataset as RGB images to construct a malware image dataset;
(b) Building a visual Transformer model containing an X-layer encoder, carrying out classification pre-training on the visual Transformer model by adopting an ImageNet-21K image data set, changing a full connection layer in the visual Transformer model after the classification pre-training into an ordered double-task classifier for carrying out malicious software detection and family classification, and carrying out fine tuning on the visual Transformer model by adopting a malicious software image data set;
(c) Constructing a lightweight visual Transformer model for actual deployment;
(d) Taking the fine-tuned visual Transformer model as a teacher model and the lightweight visual Transformer model as a student model, and performing distillation training on the student model by taking the self-attention matrices and hidden-layer states of the teacher model and the predicted logits of its dual-task classifier as supervision information for the student model;
(e) Using the distillation-trained lightweight visual Transformer model to discriminate whether unknown software is benign or malicious and, for malware, to judge its family label.
2. The visual Transformer-based malware identification method of claim 1, wherein the step of visualizing all samples in the executable file dataset as RGB images in step (a) is:
(a-1) reading the executable file of the application software in hexadecimal, and converting the hexadecimal numbers into decimal numbers to represent the executable file of the application software as a decimal value sequence with value range [0,255];
(a-2) denoting the length of the decimal value sequence by $L$ and the width of the visualized image by $W$, where $W$ is determined from the sequence length $L$ and $\lfloor\cdot\rfloor$ denotes rounding down;
(a-3) taking three adjacent decimal numbers in the decimal value sequence in turn as the R-channel value, G-channel value and B-channel value of a single pixel to obtain the visualized RGB image $I \in \mathbb{R}^{h \times W \times 3}$ of the executable file, where $\mathbb{R}$ denotes the real space, $h$ is the height of the image and 3 is the number of channels of the image; the visualized RGB images of all executable files constitute the malware image dataset.
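A small NumPy sketch of the visualization in claim 2: the executable is read as bytes and every three consecutive values become the R, G and B channels of one pixel. The fixed width parameter is an assumption; the patent derives the width from the sequence length with a floor operation whose exact form is not reproduced in this text.

import numpy as np

def executable_to_rgb(path, width=256):
    data = np.fromfile(path, dtype=np.uint8)   # (a-1): decimal values in [0, 255]
    n_pixels = data.size // 3                  # (a-3): three values per pixel
    height = max(n_pixels // width, 1)         # (a-2): height follows from length and width
    img = data[: height * width * 3].reshape(height, width, 3)
    return img                                 # visualized RGB image, shape (h, width, 3)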
3. The visual Transformer-based malware identification method of claim 2, wherein step (b) comprises the steps of:
(b-1) the visual Transformer model sequentially comprises 12 layers of encoders and a multi-layer perceptron MLP, and each encoder sequentially comprises a first normalization layer LayerNorm, a multi-head self-attention mechanism Multi-Head Attention, a first residual connection layer, a second normalization layer LayerNorm, a multi-layer perceptron MLP and a second residual connection layer;
(b-2) scaling the visualized RGB image $I$ to obtain the scaled visualized RGB image $I' \in \mathbb{R}^{H \times W \times 3}$, where $H$ is the height and $W$ is the width of the scaled visualized RGB image; using the Flatten function in the torch library, the pixel values of the $i$-th row of $I'$ are flattened, converting the 3D visualized RGB image $I'$ into a 2D row sequence $X \in \mathbb{R}^{H \times 3W}$;
(b-3) mapping each element of the 2D row sequence $X$ to a $D$-dimensional vector through a linear layer to obtain the row embedding $Z \in \mathbb{R}^{H \times D}$; using the cat function in the torch library, the learnable classification token tensor $x_{class}$ is concatenated with the row embedding $Z$ to obtain the concatenated tensor, and the concatenated tensor is added to the learnable absolute position embedding $E_{pos}$ to obtain the tensor $Z_0 \in \mathbb{R}^{(H+1) \times D}$;
(b-4) the tensor $Z_0$ is input into the first normalization layer LayerNorm of the layer-1 encoder of the visual Transformer model for normalization to obtain the tensor $Z_0'$; the multi-head self-attention mechanism Multi-Head Attention of the layer-1 encoder contains $h$ attention heads; the tensor $Z_0'$ is input into the Multi-Head Attention, and the $i$-th attention head linearly maps $Z_0'$ to obtain the query matrix $Q_i$, the key matrix $K_i$ and the value matrix $V_i$, where $W^Q_i$, $W^K_i$ and $W^V_i$ are the weight matrices of the linear transformations and $b^Q_i$, $b^K_i$ and $b^V_i$ are bias vectors; the embedding fused with global attention is computed by the formula $head_i = \mathrm{Softmax}\left(Q_i K_i^{\top} / \sqrt{d_k}\right) V_i$, where $Q_i K_i^{\top} / \sqrt{d_k}$ is the attention score, $K_i^{\top}$ is the transpose of $K_i$ and Softmax is the Softmax activation function; using the cat function in the torch library, the global-attention-fused embeddings output by the $h$ attention heads are spliced, the splicing result and the tensor $Z_0$ are sequentially input into the first residual connection layer and the second normalization layer LayerNorm of the layer-1 encoder, and the output is the tensor $Z_0''$; the tensor $Z_0''$ is input into the multi-layer perceptron MLP of the layer-1 encoder, and the tensor $\mathrm{MLP}(Z_0'') = \mathrm{GELU}(Z_0'' W_1 + b_1) W_2 + b_2$ is computed, where GELU is the GELU activation function, $W_1$ is the weight matrix of the first-layer neurons of the multi-layer perceptron MLP, $W_2$ is the weight matrix of the second-layer neurons, $b_1$ is the bias vector of the first-layer neurons, $b_2$ is the bias vector of the second-layer neurons and $D_m$ is the embedding dimension of the first-layer neurons of the multi-layer perceptron MLP; the result is input into the second residual connection layer of the layer-1 encoder, and the output is the output tensor $Z_1$ of the layer-1 encoder;
(b-5) the tensor $Z_1$ replaces the tensor $Z_0$ in step (b-4), and step (b-4) is repeated to obtain the output tensor $Z_2$ of the layer-2 encoder;
(b-6) the tensor $Z_2$ replaces the tensor $Z_1$ in step (b-5), and step (b-5) is repeated to obtain the output tensor $Z_3$ of the layer-3 encoder;
(b-7) the output of the $l$-th encoder is taken as the input of the $(l+1)$-th encoder, and step (b-6) is repeated accordingly to obtain the tensor $Z_{12} \in \mathbb{R}^{(H+1) \times D}$ output by the layer-12 encoder;
(b-8) the vector at position 0 of the tensor $Z_{12}$ is the embedded vector $z_{class} \in \mathbb{R}^{D}$ of the learnable classification token tensor $x_{class}$; the embedded vector $z_{class}$ is input into the multi-layer perceptron MLP of the visual Transformer model, and the output is the tensor $z_{out}$; the tensor $z_{out}$ is input into the fully connected layer FC to obtain the classification result output by the visual Transformer model;
(b-9) carrying out classification pre-training on the visual Transformer model by adopting the ImageNet-21K image dataset.
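For orientation, the encoder of (b-1)/(b-4) corresponds to a standard pre-norm Transformer block. The sketch below uses PyTorch's built-in multi-head attention; the dimensions (embedding size 768, 12 heads, MLP size 3072) are illustrative ViT-Base-style values, not figures taken from the patent.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # LayerNorm -> multi-head self-attention -> residual -> LayerNorm -> MLP -> residual
    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, z):
        h = self.norm1(z)
        attn_out, _ = self.attn(h, h, h)   # per-head Softmax(Q K^T / sqrt(d_k)) V, heads concatenated
        z = z + attn_out                   # first residual connection
        z = z + self.mlp(self.norm2(z))    # second residual connection
        return z

encoder = nn.Sequential(*[EncoderBlock() for _ in range(12)])   # 12 layers as in (b-1)
out = encoder(torch.randn(2, 225, 768))                         # (batch, class token + rows, dim)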
4. The visual Transformer-based malware identification method of claim 3, wherein in step (b-3) the 2D row sequence $X$ is input into the linear layer and the row embedding $Z$ is computed by the formula $Z = X W_E + b_E$, where $W_E$ is the weight matrix of the linear mapping layer and $b_E$ is the bias vector.
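A sketch of the row-embedding step of (b-3) and claim 4: each flattened pixel row is mapped to a $D$-dimensional vector by one linear layer ($Z = X W_E + b_E$), a learnable classification token is prepended and a learnable absolute position embedding is added. The concrete sizes below are illustrative assumptions.

import torch
import torch.nn as nn

class RowEmbedding(nn.Module):
    def __init__(self, img_h=224, img_w=224, dim=768):
        super().__init__()
        self.proj = nn.Linear(3 * img_w, dim)                          # linear mapping layer W_E, b_E
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))          # learnable classification token
        self.pos_embed = nn.Parameter(torch.zeros(1, img_h + 1, dim))  # learnable absolute positions

    def forward(self, rows):                        # rows: (batch, H, 3 * W)
        z = self.proj(rows)                         # row embedding Z, (batch, H, dim)
        cls = self.cls_token.expand(z.size(0), -1, -1)
        z = torch.cat([cls, z], dim=1)              # prepend the classification token
        return z + self.pos_embed                   # tensor Z_0, input of the first encoder

z0 = RowEmbedding()(torch.randn(2, 224, 3 * 224))   # (2, 225, 768)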
5. The visual Transformer-based malware identification method of claim 4, wherein the step of changing the fully connected layer in the visual Transformer model after classification pre-training into an ordered dual-task classifier for malware detection and family classification in step (b) comprises the following steps:
(b-10) changing the fully connected layer FC in step (b-8) into an ordered dual-task classifier, wherein the ordered dual-task classifier comprises a detection task for detecting malware and a family classification task for judging the family of the malware, and the detection task and the family classification task each consist of two fully connected layers FC;
(b-11) the tensor $z_{out}$ is input into the detection task, and the predicted logits $p_d$ of the detection task are computed by passing $z_{out}$ through the two fully connected layers of the detection task, where $W_{d1}$ and $b_{d1}$ are the weight matrix and bias vector of the first fully connected layer FC of the detection task, and $W_{d2}$ and $b_{d2}$ are the weight matrix and bias vector of the second fully connected layer FC of the detection task;
(b-12) the tensor $z_{out}$ is input into the family classification task, and the predicted logits $p_c$ of the family classification task are computed by passing $z_{out}$ through the two fully connected layers of the family classification task, where $W_{c1}$ and $b_{c1}$ are the weight matrix and bias vector of the first fully connected layer FC of the family classification task, $W_{c2}$ and $b_{c2}$ are the weight matrix and bias vector of the second fully connected layer FC of the family classification task, and $N_f$ is the number of families.
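The ordered dual-task classifier of (b-10) to (b-12) can be sketched as two small branches, each with two fully connected layers; the hidden width, the absence of an activation between the two layers and the single-logit detection output are assumptions, since only the layer counts are stated.

import torch
import torch.nn as nn

class OrderedDualTaskHead(nn.Module):
    def __init__(self, dim=768, hidden=256, n_families=20):
        super().__init__()
        self.detect = nn.Sequential(nn.Linear(dim, hidden), nn.Linear(hidden, 1))           # detection task
        self.family = nn.Sequential(nn.Linear(dim, hidden), nn.Linear(hidden, n_families))  # family task

    def forward(self, z_out):                        # z_out: class-token embedding after the model MLP
        return self.detect(z_out), self.family(z_out)

p_det, p_fam = OrderedDualTaskHead()(torch.randn(4, 768))   # shapes (4, 1) and (4, 20)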
6. The visual Transformer-based malware identification method of claim 5, wherein the step of fine-tuning the visual Transformer model by using the malware image dataset in the step (b) comprises the steps of:
(b-13) the fine-tuning loss is computed from a binary cross-entropy loss between the Sigmoid-activated detection logits $p_d$ and the detection task label $y_d$ together with a cross-entropy loss between the family classification logits $p_c$ and the one-hot family label $y_f$ of the malicious sample, where Sigmoid is the Sigmoid activation function, $y_d \in \{0, 1\}$ (0 denotes benign and 1 denotes malicious) and $y_f$ is the one-hot label of the malicious sample's family.
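A possible reading of the fine-tuning loss in (b-13), using integer family labels instead of one-hot vectors for brevity: binary cross-entropy on the detection logits plus cross-entropy on the family logits. Adding the two terms with equal weight and restricting the family term to malicious samples are assumptions, since the combined formula is not reproduced in this text.

import torch
import torch.nn.functional as F

def finetune_loss(det_logits, fam_logits, y_det, y_fam):
    # Binary cross-entropy on the Sigmoid-activated detection logits.
    bce = F.binary_cross_entropy_with_logits(det_logits.squeeze(-1), y_det.float())
    mal = y_det.bool()
    # Cross-entropy on the family logits, evaluated on malicious samples only (assumed).
    ce = F.cross_entropy(fam_logits[mal], y_fam[mal]) if mal.any() else det_logits.new_zeros(())
    return bce + ce

det, fam = torch.randn(4, 1), torch.randn(4, 20)
y_d, y_f = torch.tensor([1, 0, 1, 1]), torch.tensor([3, 0, 7, 19])
print(finetune_loss(det, fam, y_d, y_f))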
7. The visual Transformer-based malware identification method of claim 5, wherein step (c) comprises the steps of:
(c-1) the lightweight visual Transformer model sequentially comprises 3 layers of encoders and a multi-layer perceptron MLP; each encoder sequentially comprises a first normalization layer LayerNorm, a multi-head self-attention mechanism Multi-Head Attention, a first residual connection layer, a second normalization layer LayerNorm, a multi-layer perceptron MLP and a second residual connection layer; the number of attention heads of the Multi-Head Attention is $h_s$, and the internal embedding dimension of the lightweight visual Transformer model is $D_s$;
(c-2) mapping each element of the 2D row sequence $X$ to a $D_s$-dimensional vector through a linear layer to obtain the row embedding $Z^s \in \mathbb{R}^{H \times D_s}$; using the cat function in the torch library, the learnable classification token tensor $x^s_{class}$ is concatenated with the row embedding $Z^s$ to obtain the concatenated tensor, and the concatenated tensor is added to the learnable absolute position embedding $E^s_{pos}$ to obtain the tensor $Z^s_0 \in \mathbb{R}^{(H+1) \times D_s}$;
(c-3) the tensor $Z^s_0$ is input into the first normalization layer LayerNorm of the layer-1 encoder of the lightweight visual Transformer model for normalization to obtain the tensor $Z^{s\prime}_0$; the multi-head self-attention mechanism Multi-Head Attention of the layer-1 encoder contains $h_s$ attention heads; the tensor $Z^{s\prime}_0$ is input into the Multi-Head Attention, and the $i$-th attention head linearly maps $Z^{s\prime}_0$ to obtain the query matrix $Q^s_i$, the key matrix $K^s_i$ and the value matrix $V^s_i$, where $W^{sQ}_i$, $W^{sK}_i$ and $W^{sV}_i$ are the weight matrices of the linear transformations and $b^{sQ}_i$, $b^{sK}_i$ and $b^{sV}_i$ are bias vectors; the embedding fused with global attention is computed by the formula $head^s_i = \mathrm{Softmax}\left(Q^s_i (K^s_i)^{\top} / \sqrt{d_k}\right) V^s_i$, where $Q^s_i (K^s_i)^{\top} / \sqrt{d_k}$ is the attention score; using the cat function in the torch library, the global-attention-fused embeddings output by the $h_s$ attention heads are spliced, the splicing result and the tensor $Z^s_0$ are sequentially input into the first residual connection layer and the second normalization layer LayerNorm of the layer-1 encoder, and the output is the tensor $Z^{s\prime\prime}_0$; the tensor $Z^{s\prime\prime}_0$ is input into the multi-layer perceptron MLP of the layer-1 encoder, and the tensor $\mathrm{MLP}(Z^{s\prime\prime}_0) = \mathrm{GELU}(Z^{s\prime\prime}_0 W^s_1 + b^s_1) W^s_2 + b^s_2$ is computed, where $W^s_1$ is the weight matrix of the first-layer neurons of the multi-layer perceptron MLP, $W^s_2$ is the weight matrix of the second-layer neurons, $b^s_1$ is the bias vector of the first-layer neurons, $b^s_2$ is the bias vector of the second-layer neurons and $D^s_m$ is the embedding dimension of the first-layer neurons of the multi-layer perceptron MLP; the result is input into the second residual connection layer of the layer-1 encoder, and the output is the output tensor $Z^s_1$ of the layer-1 encoder;
(c-4) the tensor $Z^s_1$ replaces the tensor $Z^s_0$ in step (c-3), and step (c-3) is repeated to obtain the output tensor $Z^s_2$ of the layer-2 encoder;
(c-5) the tensor $Z^s_2$ replaces the tensor $Z^s_1$ in step (c-4), and step (c-4) is repeated to obtain the output tensor $Z^s_3 \in \mathbb{R}^{(H+1) \times D_s}$ of the layer-3 encoder;
(c-6) the vector at position 0 of the tensor $Z^s_3$ is the embedded vector $z^s_{class} \in \mathbb{R}^{D_s}$ of the learnable classification token tensor $x^s_{class}$; the embedded vector $z^s_{class}$ is input into the multi-layer perceptron MLP of the lightweight visual Transformer model, and the output is the tensor $z^s_{out}$;
(c-7) the tensor $z^s_{out}$ is input into the detection task, and the predicted logits $p^s_d$ of the detection task are computed by passing $z^s_{out}$ through the two fully connected layers of the detection task, where $W^s_{d1}$ and $b^s_{d1}$ are the weight matrix and bias vector of the first fully connected layer FC of the detection task, and $W^s_{d2}$ and $b^s_{d2}$ are the weight matrix and bias vector of the second fully connected layer FC of the detection task;
(c-8) the tensor $z^s_{out}$ is input into the family classification task, and the predicted logits $p^s_c$ of the family classification task are computed by passing $z^s_{out}$ through the two fully connected layers of the family classification task, where $W^s_{c1}$ and $b^s_{c1}$ are the weight matrix and bias vector of the first fully connected layer FC of the family classification task, and $W^s_{c2}$ and $b^s_{c2}$ are the weight matrix and bias vector of the second fully connected layer FC of the family classification task.
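For deployment planning, the teacher and student differ mainly in depth, head count and embedding dimension. The patent fixes the teacher at 12 encoder layers and the student at 3; the remaining numbers below are placeholder values standing in for the symbols $h_s$ and $D_s$ of (c-1).

from dataclasses import dataclass

@dataclass
class ViTConfig:
    depth: int       # number of encoder layers
    heads: int       # attention heads per encoder
    dim: int         # internal embedding dimension
    mlp_dim: int     # hidden size of the encoder MLP

teacher_cfg = ViTConfig(depth=12, heads=12, dim=768, mlp_dim=3072)   # fine-tuned teacher
student_cfg = ViTConfig(depth=3, heads=6, dim=384, mlp_dim=1536)     # lightweight student (placeholders)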
8. The visual Transformer-based malware identification method of claim 7, wherein step (d) comprises the steps of:
(d-1) the predicted-logits distillation loss $L_{pred}$ is computed from the predicted logits of the teacher model and the student model, where $\lambda$ is an influence factor for the ratio of the classification loss, L2 denotes the L2 loss, $T_d$ is the temperature hyper-parameter for distilling the detection-task classifier and $T_c$ is the temperature hyper-parameter for distilling the family-classification-task classifier;
(d-2) the self-attention distillation loss $L_{att}$ is computed from the relation matrices between the self-attention matrices of the student model and the teacher model, where $A^S_i$ is the $i$-th row of the relation matrix $A^S$ between the self-attention matrices of the student model, $A^T_i$ is the $i$-th row of the relation matrix $A^T$ between the self-attention matrices of the teacher model, $Q^T$, $K^T$ and $V^T$ are the query, key and value matrices spliced from the self-attention heads of the teacher model's multi-head self-attention, $Q^S$, $K^S$ and $V^S$ are the query, key and value matrices spliced from the self-attention heads of the student model's multi-head self-attention, and $(\cdot)^{\top}$ denotes transposition;
(d-3) the hidden-layer state distillation loss $L_{hid}$ is computed from the hidden-layer state relation matrices, where $M^S_i$ is the $i$-th row of the hidden-layer state relation matrix $M^S$ of the student model and $M^T_i$ is the $i$-th row of the hidden-layer state relation matrix $M^T$ of the teacher model;
(d-4) supervising the layer 1 encoder of the student model with the layer 4 encoder of the teacher model, supervising the layer 2 encoder of the student model with the layer 8 encoder of the teacher model, supervising the layer 3 encoder of the student model with the layer 12 encoder of the teacher model;
(d-5) the total loss $L_{total}$ of the student model training is computed as a weighted combination of the predicted-logits distillation loss, the self-attention distillation loss and the hidden-layer state distillation loss, where $\alpha$ is the weight of the self-attention distillation loss and $\beta$ is the weight of the hidden-layer state distillation loss;
(d-6) the lightweight visual Transformer model is iteratively trained with the total loss $L_{total}$ to obtain the distillation-trained lightweight visual Transformer model.
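The relation-based distillation of (d-2) to (d-4) is sketched below under a MiniLM-style reading: for each matched teacher/student layer pair (4→1, 8→2, 12→3), a Softmax-normalised relation matrix is built from the concatenated queries and keys (and analogously from the hidden states) and the student relation is pulled toward the teacher relation. The use of KL divergence and the exact normalisation are assumptions, since the patent's formulas are not reproduced in this text.

import torch
import torch.nn.functional as F

def relation(a, b):
    # Softmax-normalised scaled dot-product relation between two matrices.
    return F.softmax(a @ b.transpose(-1, -2) / a.size(-1) ** 0.5, dim=-1)

def relation_distill_loss(q_s, k_s, q_t, k_t):
    # Row-wise alignment of the student relation with the teacher relation (KL assumed).
    r_s, r_t = relation(q_s, k_s), relation(q_t, k_t)
    return F.kl_div(r_s.log(), r_t, reduction="batchmean")

layer_map = {1: 4, 2: 8, 3: 12}   # student layer -> supervising teacher layer, step (d-4)

q_s, k_s = torch.rand(2, 225, 384), torch.rand(2, 225, 384)   # student queries/keys (toy shapes)
q_t, k_t = torch.rand(2, 225, 768), torch.rand(2, 225, 768)   # teacher queries/keys
print(relation_distill_loss(q_s, k_s, q_t, k_t))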
9. The visual Transformer-based malware identification method of claim 3, wherein step (e) comprises the steps of:
(e-1) visualizing the unknown software as an RGB image $I$;
(e-2) scaling the visualized RGB image $I$ to obtain the scaled visualized RGB image $I'$; using the Flatten function in the torch library, the pixel values of each row of $I'$ are flattened, converting the 3D visualized RGB image $I'$ into a 2D row sequence $X$;
(e-3) inputting the 2D row sequence $X$ into the distillation-trained lightweight visual Transformer model to obtain the predicted logits $p_d$ of the detection task and the predicted logits $p_c$ of the family classification task; if $p_d$ indicates the malicious class, the unknown software is judged to be malware, and otherwise the unknown software is judged to be benign software; when the distillation-trained lightweight visual Transformer model judges the input unknown software to be malware, the family to which the malware belongs is judged to be the family corresponding to the highest value in $p_c$.
CN202310063452.3A 2023-02-06 2023-02-06 Malicious software identification method based on visual transducer Active CN115879109B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310063452.3A CN115879109B (en) 2023-02-06 2023-02-06 Malicious software identification method based on visual transducer


Publications (2)

Publication Number Publication Date
CN115879109A true CN115879109A (en) 2023-03-31
CN115879109B CN115879109B (en) 2023-05-12

Family

ID=85758746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310063452.3A Active CN115879109B (en) 2023-02-06 2023-02-06 Malicious software identification method based on visual transducer

Country Status (1)

Country Link
CN (1) CN115879109B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180007074A1 (en) * 2015-01-14 2018-01-04 Virta Laboratories, Inc. Anomaly and malware detection using side channel analysis
US20160358312A1 (en) * 2015-06-05 2016-12-08 Mindaptiv LLC Digital quaternion logarithm signal processing system and method for images and other data types
CN110633570A (en) * 2019-07-24 2019-12-31 浙江工业大学 Black box attack defense method for malicious software assembly format detection model
CN114065199A (en) * 2021-11-18 2022-02-18 山东省计算中心(国家超级计算济南中心) Cross-platform malicious code detection method and system
CN114462039A (en) * 2022-01-27 2022-05-10 北京工业大学 Android malicious software detection method based on Transformer structure
CN114676769A (en) * 2022-03-22 2022-06-28 南通大学 Visual transform-based small sample insect image identification method
CN114694220A (en) * 2022-03-25 2022-07-01 上海大学 Double-flow face counterfeiting detection method based on Swin transform
CN114818826A (en) * 2022-05-19 2022-07-29 石家庄铁道大学 Fault diagnosis method based on lightweight Vision Transformer module
CN114913162A (en) * 2022-05-25 2022-08-16 广西大学 Bridge concrete crack detection method and device based on lightweight transform
CN114937016A (en) * 2022-05-25 2022-08-23 广西大学 Bridge concrete crack real-time detection method and device based on edge calculation and Transformer
CN115563327A (en) * 2022-08-30 2023-01-03 电子科技大学 Zero sample cross-modal retrieval method based on Transformer network selective distillation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SHI CHEN et al.: "Malicious Code Family Classification Method Based on Vision Transformer", 2022 IEEE 10th International Conference on Information, Communication and Networks (ICICN)
XU Zhifeng: "Research on Deep Learning-Based Malware Detection for Windows Systems", Wanfang Dissertations
WANG Zhiwen et al.: "A Survey of Research on Machine Learning-Based Malware Identification", Journal of Chinese Computer Systems

Also Published As

Publication number Publication date
CN115879109B (en) 2023-05-12

Similar Documents

Publication Publication Date Title
CN111738315B (en) Image classification method based on countermeasure fusion multi-source transfer learning
Zhong et al. An end-to-end dense-inceptionnet for image copy-move forgery detection
CN108537742B (en) Remote sensing image panchromatic sharpening method based on generation countermeasure network
CN107392019A (en) A kind of training of malicious code family and detection method and device
CN111274869B (en) Method for classifying hyperspectral images based on parallel attention mechanism residual error network
WO2020046213A1 (en) A method and apparatus for training a neural network to identify cracks
CN109063649B (en) Pedestrian re-identification method based on twin pedestrian alignment residual error network
CN112131967A (en) Remote sensing scene classification method based on multi-classifier anti-transfer learning
CN108090447A (en) Hyperspectral image classification method and device under double branch&#39;s deep structures
CN108021947A (en) A kind of layering extreme learning machine target identification method of view-based access control model
CN115690479A (en) Remote sensing image classification method and system based on convolution Transformer
CN115830531A (en) Pedestrian re-identification method based on residual multi-channel attention multi-feature fusion
CN111626357B (en) Image identification method based on neural network model
CN115631365A (en) Cross-modal contrast zero sample learning method fusing knowledge graph
CN117011883A (en) Pedestrian re-recognition method based on pyramid convolution and transducer double branches
CN116310647A (en) Labor insurance object target detection method and system based on incremental learning
Khan et al. A hybrid defense method against adversarial attacks on traffic sign classifiers in autonomous vehicles
CN113792686A (en) Vehicle weight identification method based on cross-sensor invariance of visual representation
Chulif et al. Herbarium-Field Triplet Network for Cross-domain Plant Identification. NEUON Submission to LifeCLEF 2020 Plant.
CN116977725A (en) Abnormal behavior identification method and device based on improved convolutional neural network
Chen et al. Feature descriptor by convolution and pooling autoencoders
CN116994130A (en) Knowledge distillation-based high-precision light bridge crack identification method
CN115879109A (en) Malicious software identification method based on visual transform
CN114821200B (en) Image detection model and method applied to industrial vision detection field
CN115861306A (en) Industrial product abnormity detection method based on self-supervision jigsaw module

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant