Disclosure of Invention
In view of the above problems, the invention provides a BIOS malicious program detection method and device based on double embedding and model pruning, which combines the image information and the semantic-structure information of the BIOS program, can effectively resist variant viruses, and at the same time effectively improves the detection efficiency of the model by constructing a pruned model.
The invention is realized by the following technical scheme:
The invention provides a BIOS malicious program detection method and device based on double embedding and model pruning, comprising the following steps:
Step 1: read the BIOS image file, construct the original BIOS program data set D1, and, after data cleaning, perform binary translation to obtain data set D2. The specific method is as follows:
Step 1.1: read the BIOS image file to obtain the data to be cleaned, and define the BIOS image program data set as D1 = {d_1, d_2, d_3, …, d_n}, where d_n is the n-th item of data to be cleaned;
Step 1.2: perform data cleaning on data set D1 to obtain data set D1';
Step 1.3: perform binary translation on the cleaned data set D1' to obtain data set D2 = {Doc_1, Doc_2, Doc_3, …, Doc_n}, where Doc_n is the n-th item of data to be processed.
Step 2: convert data set D2 into uncompressed grayscale images using the B2M algorithm, extract the LBP texture and the SIFT features represented by the BoVW bag-of-visual-words model, and input the concatenated features into an SVM classifier. The specific method is as follows:
Step 2.1: read a binary file in BIOS data set D2, convert every 8-bit group into an unsigned integer with value range [0, 255], and convert the result into a one-dimensional array A1;
Step 2.2: define a two-dimensional array A2 with fixed width width = 2^m, and fill it with the values of the one-dimensional array A1 to obtain a fixed-width matrix M1;
Step 2.3: dump the fixed-width two-dimensional matrix as a grayscale image G;
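Steps 2.1-2.3 can be sketched as follows. This is a minimal illustration in Python; the padding of the last row with zeros and the function name are assumptions not fixed by the patent, which only specifies the fixed width 2^m:

```python
# Sketch of the B2M conversion in steps 2.1-2.3: read a binary file as
# unsigned bytes (0-255), treat them as the one-dimensional array A1,
# and fold A1 into a fixed-width matrix M1 whose rows can be dumped as
# an uncompressed grayscale image. The zero-padding of the final row is
# an illustrative assumption.

def b2m_gray_matrix(raw: bytes, m: int = 3) -> list[list[int]]:
    """Fold a byte stream into rows of the fixed width 2**m (step 2.2)."""
    width = 2 ** m                      # fixed width width = 2^m
    a1 = list(raw)                      # one-dimensional array A1 of 0..255 values
    # pad the last row with zeros so the matrix M1 is rectangular
    if len(a1) % width:
        a1.extend([0] * (width - len(a1) % width))
    return [a1[i:i + width] for i in range(0, len(a1), width)]

rows = b2m_gray_matrix(b"\x00\x7f\xff\x10\x20", m=2)   # width 4
```

Each row of `rows` is one scanline of the grayscale image G; a real pipeline would write it out with an image library.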
Step 2.4: perform k-means clustering on all extracted SIFT features to obtain K cluster centers, which serve as the visual word list;
Step 2.5: using the word list to quantize the image, compute, for each SIFT feature point, its distance to every word in the word list;
Step 2.6: obtain the feature vector f of the image, giving the data-set vector sequence F = {f_1, f_2, f_3, …, f_len(D2)}, where len(D2) is defined as the length of data set D2;
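The BoVW quantization of steps 2.4-2.6 can be sketched as below. For brevity the visual word list is fixed rather than learned by k-means, and the descriptors are short toy vectors instead of 128-dimensional SIFT descriptors; all names are illustrative:

```python
# Sketch of steps 2.4-2.6: each SIFT descriptor is assigned to its
# nearest visual word (minimum Euclidean distance to a cluster center),
# and the image is summarized as a K-bin word-count histogram f.

def nearest_word(desc, vocab):
    """Index of the visual word with minimum squared distance to desc."""
    dists = [sum((a - b) ** 2 for a, b in zip(desc, w)) for w in vocab]
    return dists.index(min(dists))

def bovw_histogram(descriptors, vocab):
    """Feature vector f: counts of descriptors quantized to each word."""
    f = [0] * len(vocab)
    for d in descriptors:
        f[nearest_word(d, vocab)] += 1
    return f

vocab = [[0.0, 0.0], [1.0, 1.0]]                       # K = 2 visual words
f = bovw_histogram([[0.1, 0.2], [0.9, 0.8], [1.0, 1.1]], vocab)
```

In the full method, `vocab` would be the K centers produced by k-means over all SIFT features of D2.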
Step 2.7: compute the LBP feature image of the BIOS program image and divide it into blocks;
Step 2.8: compute the histogram of each block's feature image and normalize it;
Step 2.9: arrange the histograms of the block feature images in the spatial order of the blocks to obtain the LBP feature vector u;
Step 2.10: obtain the data-set LBP feature vector sequence U = {u_1, u_2, u_3, …, u_len(D2)};
Step 2.11: concatenate the LBP feature vector and the SIFT feature vector, input the result into the SVM classifier, and output the image vector sequence R = {r_1, r_2, r_3, …, r_len(D2)}.
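The LBP descriptor of steps 2.7-2.9 can be sketched as follows; border pixels are skipped and a single block is used for simplicity, whereas the patent divides the image into several blocks and concatenates their histograms in spatial order:

```python
# Sketch of steps 2.7-2.9: each interior pixel is encoded by comparing
# its 8 neighbors with the center value (an 8-bit LBP code), and the
# image block is summarized as a normalized 256-bin code histogram.

def lbp_code(img, y, x):
    """8-bit LBP code of pixel (y, x): neighbors >= center set a bit."""
    c = img[y][x]
    nbrs = [img[y-1][x-1], img[y-1][x], img[y-1][x+1], img[y][x+1],
            img[y+1][x+1], img[y+1][x], img[y+1][x-1], img[y][x-1]]
    return sum((1 << i) for i, n in enumerate(nbrs) if n >= c)

def lbp_histogram(img):
    """Normalized 256-bin histogram of LBP codes over interior pixels."""
    hist = [0] * 256
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            hist[lbp_code(img, y, x)] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]

u = lbp_histogram([[10, 20, 30], [40, 50, 60], [70, 80, 90]])
```

Concatenating the per-block histograms `u` with the BoVW vector `f` gives the image feature that step 2.11 feeds to the SVM.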
Step 3: input the cleaned BIOS program data set D1' into the embedding layer of Bert, combining token information, segment information and position information to obtain vectors containing program structure and semantics. The specific method is as follows:
Step 3.1: process the data set D1' cleaned from the BIOS program, and define the data set Text = {t_1, t_2, t_3, …, t_len(D1')}, where t_j = {label, d_j}, j < len(D1'), d_j ∈ D1', len(D1') is the length of data set D1', and label is the BIOS program data-set label;
Step 3.2: define a first loop variable i to traverse the Text data set, with initial value i = 1; define len(S_i) as the length of the i-th item, define len(Text) as the data-set length, and unify the fixed text length len_max;
Step 3.3: if i < len(Text), go to step 3.4; otherwise go to step 3.12;
Step 3.4: if len(S_i) + 2 ≤ len_max, zero-pad the sequence; otherwise truncate it, so that all sequence lengths are unified;
Step 3.5: obtain a new sequence T_i, whose length is defined as len(T_i);
Step 3.6: input the token embedding layer, the segment embedding layer and the position embedding layer to obtain vectors v_1, v_2 and v_3; define a second loop variable Na with initial value 1;
Step 3.7: if Na < len(T_i), go to step 3.8; otherwise go to step 3.10;
Step 3.8: define vector V_Na = v_1 + v_2 + v_3;
Step 3.9: Na = Na + 1, go to step 3.7;
Step 3.10: obtain vector y_i = {V_1, V_2, V_3, …, V_len_max};
Step 3.11: i = i + 1, go to step 3.3;
Step 3.12: output the final vector sequence Y = {y_1, y_2, y_3, …, y_len(Text)}.
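The length normalization and triple-embedding sum of step 3 can be sketched as follows. The tiny lookup tables are illustrative stand-ins for Bert's learned embedding matrices, and len_max, the dimension, and all values are assumptions:

```python
# Sketch of step 3: a token sequence is padded with zeros or truncated
# to len_max (step 3.4), then for every position the token, segment and
# position embeddings are summed element-wise, V_Na = v1 + v2 + v3
# (step 3.8), yielding y_i = {V_1 ... V_len_max}.

LEN_MAX = 4
EMB_DIM = 2
token_emb = {0: [0.0, 0.0], 1: [0.1, 0.2], 2: [0.3, 0.1]}    # 0 = padding token
segment_emb = [0.05, 0.05]                                    # single segment
position_emb = [[0.01 * p, 0.02 * p] for p in range(LEN_MAX)]

def embed(tokens):
    """Pad/truncate to LEN_MAX, then sum the three embeddings per position."""
    seq = (tokens + [0] * LEN_MAX)[:LEN_MAX]
    return [[token_emb[t][k] + segment_emb[k] + position_emb[p][k]
             for k in range(EMB_DIM)]
            for p, t in enumerate(seq)]

y1 = embed([1, 2])          # vector y_1 = {V_1 ... V_len_max}
```

In the real model the three embeddings are learned jointly and the sum is further layer-normalized before entering the Transformer stack.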
Step 4: input the obtained vector sequence Y into the pruned 6-layer Transformer, with small-scale textCNNs connected in series across layers after Transformer layers to output simple samples early. The specific method is as follows:
Step 4.1: construct a Bert pruning model with 6 Transformer layers and pass in the vector sequence Y;
Step 4.2: define a third loop variable e with initial value 1, and define the threshold index Speed and the uncertainty Uncertainty;
Step 4.3: if e < len(Y), go to step 4.4; otherwise go to step 4.9;
Step 4.4: pass vector y_e, y_e ∈ Y, into the Transformer layers; define a fourth loop variable g ≤ 3 with initial value 1;
Step 4.5: if loop variable g < 3, execute steps 4.5.1-4.5.3; otherwise go to step 4.7;
Step 4.5.1: output vector Pt at layer 2g of the Transformer, where a small-scale textCNN is connected in series, and input vector Pt into the textCNN network;
Step 4.5.2: output the prediction vector Ps through the convolution layer, pooling layer and Softmax layer of the convolutional neural network;
Step 4.5.3: calculate the uncertainty

Uncertainty = (Σ_{o=1}^{n} Ps_o · log Ps_o) / log(1/n),

where o is a traversal variable and n = len(Ps); if Uncertainty > Speed, pass the sample to the next Transformer layer and go to step 4.6; otherwise output vector Ps;
Step 4.6: g = g + 1, go to step 4.5;
Step 4.7: the last Transformer layer outputs vector Pt, and vector Ps is output through the convolutional neural network;
Step 4.8: e = e + 1, go to step 4.3;
Step 4.9: output the full text vector sequence H = {Ps_1, Ps_2, Ps_3, …, Ps_len(Y)}.
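The early-exit test of step 4.5.3 can be sketched as follows. The normalized-entropy form is an assumption consistent with the variables the patent names (o traverses Ps, n = len(Ps)) and with FastBERT-style branchy inference, where a Speed threshold governs early output:

```python
# Sketch of step 4.5.3: the small textCNN attached after every second
# Transformer layer emits a prediction vector Ps; its normalized entropy
# is taken as Uncertainty (a value in [0, 1]), and the sample exits the
# network early once Uncertainty <= Speed.

import math

def uncertainty(ps):
    """Normalized entropy of the prediction vector Ps."""
    n = len(ps)
    ent = sum(p * math.log(p) for p in ps if p > 0)   # sum Ps_o * log Ps_o
    return ent / math.log(1.0 / n)                    # divide by log(1/n)

def exits_early(ps, speed):
    """True when the sample is confident enough to leave at this branch."""
    return uncertainty(ps) <= speed

confident = exits_early([0.97, 0.02, 0.01], speed=0.3)   # peaked distribution
unsure = exits_early([0.4, 0.35, 0.25], speed=0.3)       # near-uniform distribution
```

A larger Speed threshold lets more samples exit at shallow branches, trading a little accuracy for faster inference.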
Step 5: fuse the image vector sequence R and the text vector sequence H of the BIOS program data set, and output the program detection result based on the fused vectors. The specific method is as follows:
Step 5.1: to splice vector sequences R and H, define a variable q, where Ps_q denotes the q-th vector of sequence H and r_q denotes the q-th vector of sequence R;
Step 5.2: concatenate vector Ps_q with vector r_q;
Step 5.3: obtain a new vector sequence B and perform category prediction at the output layer, realizing detection of BIOS malicious programs.
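The fusion of step 5 can be sketched as below; the linear output layer, its weights, and the 0/1 labeling convention are illustrative assumptions standing in for the trained classifier:

```python
# Sketch of step 5: the image vector r_q (SVM branch) and the text
# vector Ps_q (pruned-Bert branch) are concatenated into a fused vector
# b_q, and the fused vector is scored by a toy linear output layer.

def fuse(r_q, ps_q):
    """b_q: concatenation of image vector r_q and text vector Ps_q."""
    return list(r_q) + list(ps_q)

def predict(b_q, weights, bias=0.0):
    """Toy output layer: label 1 (malicious) if the linear score is positive."""
    score = sum(w * x for w, x in zip(weights, b_q)) + bias
    return 1 if score > 0 else 0

b1 = fuse([0.2, 0.8], [0.1, 0.9])                       # fused vector b_1
label = predict(b1, weights=[1.0, -1.0, 1.0, -1.0])
```

Applying `fuse` to every pair (r_q, Ps_q) yields the sequence B on which category prediction is performed.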
The invention is further realized by the following technical scheme:
The BIOS malicious program detection device based on double embedding and model pruning comprises a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the computer program is loaded into the processor, the above BIOS malicious program detection method based on double embedding and model pruning is realized.
The invention adopts the above technical scheme and has the following beneficial effects:
1. The double embedding and model pruning adopted by the invention are of great significance for the detection and classification of BIOS malicious programs. To address the problem of a single feature type, the B2M algorithm reads the BIOS program as a binary bit stream and converts it into an uncompressed grayscale image, from which the image features of the BIOS program are extracted. BIOS semantic and structural information is extracted through the embedding layer of Bert; a pruning model with 6 Transformer layers is constructed, small-scale textCNN models are connected in series across layers, and uncertainty is introduced so that simple samples are output early, improving efficiency. Finally, the image features and text features of the BIOS program are concatenated, and the BIOS program detection result is output based on the fused features.
2. Fusing the image information and the program text information of the BIOS program enables the model to effectively resist powerful variant viruses during detection;
3. The invention represents image SIFT features with the BoVW model, which benefits large-scale image retrieval and the extensibility of SIFT features, and allows convenient combination with feature vectors of other forms;
4. The invention constructs a cross-layer serial textCNN model on a 6-layer Transformer pruning model, further improving model detection efficiency;
5. The Transformer adopted by the invention can generate dynamic vectors, so that the extracted text information adapts better to context;
6. The invention adopts the Bert model to extract the semantic and structural information of the BIOS program, making the extracted text information richer;
7. Compared with shallow networks, deep networks show excellent performance, and the deep bidirectional language representation of the Bert model achieves higher performance.
Detailed Description
The present invention will be further illustrated with reference to accompanying FIGS. 1-6. It should be understood that these examples are intended to illustrate the invention and not to limit its scope; after reading this disclosure, various equivalent modifications by those skilled in the art fall within the scope defined by the appended claims of the present application.
The following takes a single BIOS image file as an example:
Step 1: read the BIOS image file and perform binary translation, as shown in FIG. 2:
Step 1.1: read the BIOS image file and define the BIOS image program data as d_1;
Step 1.2: perform binary translation on the BIOS image program data d_1.
Step 2: convert the data d_1 into an uncompressed grayscale image using the B2M algorithm, extract the LBP texture and the SIFT features represented by the BoVW bag-of-visual-words model, and input the concatenated features into the SVM classifier, as shown in FIG. 3:
Step 2.1: read the binary file of data d_1, convert every 8-bit group into an unsigned integer with value range [0, 255], and convert the result into a one-dimensional array A1;
Step 2.2: define a two-dimensional array A2 with fixed width width = 2^m, and fill it with the values of the one-dimensional array A1 to obtain a fixed-width matrix M1;
Step 2.3: dump the fixed-width two-dimensional matrix as a grayscale image G;
Step 2.4: perform k-means clustering on all extracted SIFT features to obtain K cluster centers, which serve as the visual word list;
Step 2.5: using the word list to quantize image G, compute, for each SIFT feature point, its distance to every word in the word list;
Step 2.6: obtain the feature vector f1 of image G;
Step 2.7: compute the LBP feature image of BIOS program image G and divide it into blocks;
Step 2.8: compute the histogram of each block's feature image and normalize it;
Step 2.9: arrange the histograms of the block feature images in the spatial order of the blocks to obtain the LBP feature vector u1;
Step 2.10: concatenate the LBP feature vector and the SIFT feature vector, input the result into the SVM classifier, and output vector r_1.
Step 3: input BIOS program data d_1 into the embedding layer of Bert, combining token information, segment information and position information to obtain a vector containing program structure and semantics, as shown in FIG. 4:
Step 3.1: define t_1 = {label, d_1}, where label is the BIOS program data-set tag;
Step 3.2: define len(S_1) as the length of data d_1, and unify the fixed text length len_max;
Step 3.3: if len(S_1) + 2 ≤ len_max, zero-pad the sequence; otherwise truncate it, so that sequence lengths are unified;
Step 3.4: obtain a new sequence T_1;
Step 3.5: input the token embedding layer, the segment embedding layer and the position embedding layer to obtain vectors v_1, v_2 and v_3; define a loop variable Na with initial value 1;
Step 3.6: if Na < len(T_1), go to step 3.7; otherwise go to step 3.9;
Step 3.7: define vector V_Na = v_1 + v_2 + v_3;
Step 3.8: Na = Na + 1, go to step 3.6;
Step 3.9: output vector y_1 = {V_1, V_2, V_3, …, V_len_max}.
Step 4: input the obtained vector into the pruned 6-layer Transformer model, with a small-scale textCNN connected in series across layers after the Transformer layers to output simple samples early, as shown in FIG. 5:
Step 4.1: construct a Bert pruning model with 6 Transformer layers and pass in the vector y_1;
Step 4.2: define the threshold index Speed and the uncertainty Uncertainty;
Step 4.3: pass vector y_1 into the Transformer layers; define a loop variable g ≤ 3 with initial value 1;
Step 4.4: if loop variable g < 3, execute steps 4.4.1-4.4.3; otherwise go to step 4.6;
Step 4.4.1: output vector Pt at layer 2g of the Transformer, where a small-scale textCNN is connected in series, and input vector Pt into the textCNN network;
Step 4.4.2: output the prediction vector Ps_1 through the convolution layer, pooling layer and Softmax layer of the convolutional neural network;
Step 4.4.3: calculate the uncertainty

Uncertainty = (Σ_{o=1}^{n} Ps_o · log Ps_o) / log(1/n),

where n = len(Ps_1); if Uncertainty > Speed, pass the sample to the next Transformer layer and go to step 4.5; otherwise output vector Ps_1;
Step 4.5: g = g + 1, go to step 4.4;
Step 4.6: the last Transformer layer outputs vector Pt, and vector Ps_1 is output through the convolutional neural network.
Step 5: fuse the image vector and the text vector of the BIOS program data set, and output the program detection result based on the fused vector, as shown in FIG. 6:
Step 5.1: concatenate vector Ps_1 with vector r_1;
Step 5.2: obtain a new vector b_1 and perform category prediction at the output layer, detecting malicious viruses in the BIOS program.
The invention can be combined with a computer system to form a BIOS malicious program detection device based on double embedding and model pruning. The device comprises a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the computer program is loaded into the processor, the BIOS malicious program detection method based on double embedding and model pruning is realized.