Disclosure of Invention
In view of the above problems, the invention provides a BIOS malicious program detection method and device based on double embedding and model pruning, which combines the image information and the semantic-structure information of the BIOS program, can effectively resist variant viruses, and at the same time effectively improves the detection efficiency of the model by constructing a pruned model.
The invention is realized by the following technical scheme:
The invention provides a BIOS malicious program detection method and device based on double embedding and model pruning, comprising the following steps:
Step 1: read the BIOS image file, construct the original BIOS program data set D1, and, after data cleaning, perform binary translation to obtain data set D2. The specific method is as follows:
Step 1.1: read the BIOS image file to obtain the data to be cleaned, and define the BIOS image program data set as D1 = {d_1, d_2, d_3, …, d_n}, where d_n is the n-th item of data to be cleaned;
Step 1.2: perform data cleaning on data set D1 to obtain data set D1';
Step 1.3: perform binary translation on the cleaned data set D1' to obtain data set D2 = {Doc_1, Doc_2, Doc_3, …, Doc_n}, where Doc_n is the n-th item of data to be processed.
Step 2: convert data set D2 into uncompressed grayscale images using the B2M algorithm, extract the LBP texture and the SIFT features represented by the BoVW bag-of-visual-words model, and input the concatenated features into an SVM classifier. The specific method is as follows:
Step 2.1: read a binary file in BIOS data set D2, convert every 8-bit group into an unsigned integer with value range [0, 255], and convert the result into a one-dimensional array A1;
Step 2.2: define a two-dimensional array A2 with fixed width width = 2^m, and fill it with the values of the one-dimensional array A1 to obtain a fixed-width matrix M1;
Step 2.3: dump the fixed-width two-dimensional matrix as a grayscale image G;
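Steps 2.1-2.3 can be sketched as follows. This is a minimal illustration in Python; the padding of the last row with zeros and the function name are assumptions not fixed by the patent, which only specifies the fixed width 2^m:

```python
# Sketch of the B2M conversion in steps 2.1-2.3: read a binary file as
# unsigned bytes (0-255), treat them as the one-dimensional array A1,
# and fold A1 into a fixed-width matrix M1 whose rows can be dumped as
# an uncompressed grayscale image. The zero-padding of the final row is
# an illustrative assumption.

def b2m_gray_matrix(raw: bytes, m: int = 3) -> list[list[int]]:
    """Fold a byte stream into rows of the fixed width 2**m (step 2.2)."""
    width = 2 ** m                      # fixed width width = 2^m
    a1 = list(raw)                      # one-dimensional array A1 of 0..255 values
    # pad the last row with zeros so the matrix M1 is rectangular
    if len(a1) % width:
        a1.extend([0] * (width - len(a1) % width))
    return [a1[i:i + width] for i in range(0, len(a1), width)]

rows = b2m_gray_matrix(b"\x00\x7f\xff\x10\x20", m=2)   # width 4
```

Each row of `rows` is one scanline of the grayscale image G; a real pipeline would write it out with an image library.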
Step 2.4: perform k-means clustering on all extracted SIFT features to obtain K cluster centers, which serve as the visual word list;
Step 2.5: using the word list to quantize the image, compute, for each SIFT feature point, its distance to every word in the word list;
Step 2.6: obtain the feature vector f of the image, giving the data-set vector sequence F = {f_1, f_2, f_3, …, f_len(D2)}, where len(D2) is defined as the length of data set D2;
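The BoVW quantization of steps 2.4-2.6 can be sketched as below. For brevity the visual word list is fixed rather than learned by k-means, and the descriptors are short toy vectors instead of 128-dimensional SIFT descriptors; all names are illustrative:

```python
# Sketch of steps 2.4-2.6: each SIFT descriptor is assigned to its
# nearest visual word (minimum Euclidean distance to a cluster center),
# and the image is summarized as a K-bin word-count histogram f.

def nearest_word(desc, vocab):
    """Index of the visual word with minimum squared distance to desc."""
    dists = [sum((a - b) ** 2 for a, b in zip(desc, w)) for w in vocab]
    return dists.index(min(dists))

def bovw_histogram(descriptors, vocab):
    """Feature vector f: counts of descriptors quantized to each word."""
    f = [0] * len(vocab)
    for d in descriptors:
        f[nearest_word(d, vocab)] += 1
    return f

vocab = [[0.0, 0.0], [1.0, 1.0]]                       # K = 2 visual words
f = bovw_histogram([[0.1, 0.2], [0.9, 0.8], [1.0, 1.1]], vocab)
```

In the full method, `vocab` would be the K centers produced by k-means over all SIFT features of D2.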
Step 2.7: compute the LBP feature image of the BIOS program image and divide it into blocks;
Step 2.8: compute the histogram of each block's feature image and normalize it;
Step 2.9: arrange the histograms of the block feature images in the spatial order of the blocks to obtain the LBP feature vector u;
Step 2.10: obtain the data-set LBP feature vector sequence U = {u_1, u_2, u_3, …, u_len(D2)};
Step 2.11: concatenate the LBP feature vector and the SIFT feature vector, input the result into the SVM classifier, and output the image vector sequence R = {r_1, r_2, r_3, …, r_len(D2)}.
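The LBP descriptor of steps 2.7-2.9 can be sketched as follows; border pixels are skipped and a single block is used for simplicity, whereas the patent divides the image into several blocks and concatenates their histograms in spatial order:

```python
# Sketch of steps 2.7-2.9: each interior pixel is encoded by comparing
# its 8 neighbors with the center value (an 8-bit LBP code), and the
# image block is summarized as a normalized 256-bin code histogram.

def lbp_code(img, y, x):
    """8-bit LBP code of pixel (y, x): neighbors >= center set a bit."""
    c = img[y][x]
    nbrs = [img[y-1][x-1], img[y-1][x], img[y-1][x+1], img[y][x+1],
            img[y+1][x+1], img[y+1][x], img[y+1][x-1], img[y][x-1]]
    return sum((1 << i) for i, n in enumerate(nbrs) if n >= c)

def lbp_histogram(img):
    """Normalized 256-bin histogram of LBP codes over interior pixels."""
    hist = [0] * 256
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            hist[lbp_code(img, y, x)] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]

u = lbp_histogram([[10, 20, 30], [40, 50, 60], [70, 80, 90]])
```

Concatenating the per-block histograms `u` with the BoVW vector `f` gives the image feature that step 2.11 feeds to the SVM.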
Step 3: input the cleaned BIOS program data set D1' into the embedding layer of Bert, combining token information, segment information and position information to obtain vectors containing program structure and semantics. The specific method is as follows:
Step 3.1: process the data set D1' cleaned from the BIOS program, and define the data set Text = {t_1, t_2, t_3, …, t_len(D1')}, where t_j = {label, d_j}, j < len(D1'), d_j ∈ D1', len(D1') is the length of data set D1', and label is the BIOS program data-set label;
Step 3.2: define a first loop variable i to traverse the Text data set, with initial value i = 1; define len(S_i) as the length of the i-th item, define len(Text) as the data-set length, and unify the fixed text length len_max;
Step 3.3: if i < len(Text), go to step 3.4; otherwise go to step 3.12;
Step 3.4: if len(S_i) + 2 ≤ len_max, zero-pad the sequence; otherwise truncate it, so that all sequence lengths are unified;
Step 3.5: obtain a new sequence T_i, whose length is defined as len(T_i);
Step 3.6: input the token embedding layer, the segment embedding layer and the position embedding layer to obtain vectors v_1, v_2 and v_3; define a second loop variable Na with initial value 1;
Step 3.7: if Na < len(T_i), go to step 3.8; otherwise go to step 3.10;
Step 3.8: define vector V_Na = v_1 + v_2 + v_3;
Step 3.9: Na = Na + 1, go to step 3.7;
Step 3.10: obtain vector y_i = {V_1, V_2, V_3, …, V_len_max};
Step 3.11: i = i + 1, go to step 3.3;
Step 3.12: output the final vector sequence Y = {y_1, y_2, y_3, …, y_len(Text)}.
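The length normalization and triple-embedding sum of step 3 can be sketched as follows. The tiny lookup tables are illustrative stand-ins for Bert's learned embedding matrices, and len_max, the dimension, and all values are assumptions:

```python
# Sketch of step 3: a token sequence is padded with zeros or truncated
# to len_max (step 3.4), then for every position the token, segment and
# position embeddings are summed element-wise, V_Na = v1 + v2 + v3
# (step 3.8), yielding y_i = {V_1 ... V_len_max}.

LEN_MAX = 4
EMB_DIM = 2
token_emb = {0: [0.0, 0.0], 1: [0.1, 0.2], 2: [0.3, 0.1]}    # 0 = padding token
segment_emb = [0.05, 0.05]                                    # single segment
position_emb = [[0.01 * p, 0.02 * p] for p in range(LEN_MAX)]

def embed(tokens):
    """Pad/truncate to LEN_MAX, then sum the three embeddings per position."""
    seq = (tokens + [0] * LEN_MAX)[:LEN_MAX]
    return [[token_emb[t][k] + segment_emb[k] + position_emb[p][k]
             for k in range(EMB_DIM)]
            for p, t in enumerate(seq)]

y1 = embed([1, 2])          # vector y_1 = {V_1 ... V_len_max}
```

In the real model the three embeddings are learned jointly and the sum is further layer-normalized before entering the Transformer stack.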
Step 4: input the obtained vector sequence Y into the pruned 6-layer Transformer, with small-scale textCNNs connected in series across layers after Transformer layers to output simple samples early. The specific method is as follows:
Step 4.1: construct a Bert pruning model with 6 Transformer layers and pass in the vector sequence Y;
Step 4.2: define a third loop variable e with initial value 1, and define the threshold index Speed and the uncertainty Uncertainty;
Step 4.3: if e < len(Y), go to step 4.4; otherwise go to step 4.9;
Step 4.4: pass vector y_e, y_e ∈ Y, into the Transformer layers; define a fourth loop variable g ≤ 3 with initial value 1;
Step 4.5: if loop variable g < 3, execute steps 4.5.1-4.5.3; otherwise go to step 4.7;
Step 4.5.1: output vector Pt at layer 2g of the Transformer, where a small-scale textCNN is connected in series, and input vector Pt into the textCNN network;
Step 4.5.2: output the prediction vector Ps through the convolution layer, pooling layer and Softmax layer of the convolutional neural network;
Step 4.5.3: calculate the uncertainty

Uncertainty = (Σ_{o=1}^{n} Ps_o · log Ps_o) / log(1/n),

where o is a traversal variable and n = len(Ps); if Uncertainty > Speed, pass the sample to the next Transformer layer and go to step 4.6; otherwise output vector Ps;
Step 4.6: g = g + 1, go to step 4.5;
Step 4.7: the last Transformer layer outputs vector Pt, and vector Ps is output through the convolutional neural network;
Step 4.8: e = e + 1, go to step 4.3;
Step 4.9: output the full text vector sequence H = {Ps_1, Ps_2, Ps_3, …, Ps_len(Y)}.
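The early-exit test of step 4.5.3 can be sketched as follows. The normalized-entropy form is an assumption consistent with the variables the patent names (o traverses Ps, n = len(Ps)) and with FastBERT-style branchy inference, where a Speed threshold governs early output:

```python
# Sketch of step 4.5.3: the small textCNN attached after every second
# Transformer layer emits a prediction vector Ps; its normalized entropy
# is taken as Uncertainty (a value in [0, 1]), and the sample exits the
# network early once Uncertainty <= Speed.

import math

def uncertainty(ps):
    """Normalized entropy of the prediction vector Ps."""
    n = len(ps)
    ent = sum(p * math.log(p) for p in ps if p > 0)   # sum Ps_o * log Ps_o
    return ent / math.log(1.0 / n)                    # divide by log(1/n)

def exits_early(ps, speed):
    """True when the sample is confident enough to leave at this branch."""
    return uncertainty(ps) <= speed

confident = exits_early([0.97, 0.02, 0.01], speed=0.3)   # peaked distribution
unsure = exits_early([0.4, 0.35, 0.25], speed=0.3)       # near-uniform distribution
```

A larger Speed threshold lets more samples exit at shallow branches, trading a little accuracy for faster inference.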
Step 5: fuse the image vector sequence R and the text vector sequence H of the BIOS program data set, and output the program detection result based on the fused vectors. The specific method is as follows:
Step 5.1: to splice vector sequences R and H, define a variable q, where Ps_q denotes the q-th vector of sequence H and r_q denotes the q-th vector of sequence R;
Step 5.2: concatenate vector Ps_q with vector r_q;
Step 5.3: obtain a new vector sequence B and perform category prediction at the output layer, realizing detection of BIOS malicious programs.
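The fusion of step 5 can be sketched as below; the linear output layer, its weights, and the 0/1 labeling convention are illustrative assumptions standing in for the trained classifier:

```python
# Sketch of step 5: the image vector r_q (SVM branch) and the text
# vector Ps_q (pruned-Bert branch) are concatenated into a fused vector
# b_q, and the fused vector is scored by a toy linear output layer.

def fuse(r_q, ps_q):
    """b_q: concatenation of image vector r_q and text vector Ps_q."""
    return list(r_q) + list(ps_q)

def predict(b_q, weights, bias=0.0):
    """Toy output layer: label 1 (malicious) if the linear score is positive."""
    score = sum(w * x for w, x in zip(weights, b_q)) + bias
    return 1 if score > 0 else 0

b1 = fuse([0.2, 0.8], [0.1, 0.9])                       # fused vector b_1
label = predict(b1, weights=[1.0, -1.0, 1.0, -1.0])
```

Applying `fuse` to every pair (r_q, Ps_q) yields the sequence B on which category prediction is performed.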
The invention is further realized by the following technical scheme:
The BIOS malicious program detection device based on double embedding and model pruning comprises a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the computer program is loaded into the processor, the above BIOS malicious program detection method based on double embedding and model pruning is realized.
The invention adopts the above technical scheme and has the following beneficial effects:
1. The double embedding and model pruning adopted by the invention are of great significance for the detection and classification of BIOS malicious programs. To address the problem of a single feature type, the B2M algorithm reads the BIOS program as a binary bit stream and converts it into an uncompressed grayscale image, from which the image features of the BIOS program are extracted. BIOS semantic and structural information is extracted through the embedding layer of Bert; a pruning model with 6 Transformer layers is constructed, small-scale textCNN models are connected in series across layers, and uncertainty is introduced so that simple samples are output early, improving efficiency. Finally, the image features and text features of the BIOS program are concatenated, and the BIOS program detection result is output based on the fused features.
2. Fusing the image information and the program text information of the BIOS program enables the model to effectively resist powerful variant viruses during detection;
3. The invention represents image SIFT features with the BoVW model, which benefits large-scale image retrieval and the extensibility of SIFT features, and allows convenient combination with feature vectors of other forms;
4. The invention constructs a cross-layer serial textCNN model on a 6-layer Transformer pruning model, further improving model detection efficiency;
5. The Transformer adopted by the invention can generate dynamic vectors, so that the extracted text information adapts better to context;
6. The invention adopts the Bert model to extract the semantic and structural information of the BIOS program, making the extracted text information richer;
7. Compared with shallow networks, deep networks show excellent performance, and the deep bidirectional language representation of the Bert model achieves higher performance.
Detailed Description
The present invention will be further illustrated with reference to accompanying FIGS. 1-6. It should be understood that these examples are intended to illustrate the invention and not to limit its scope; after reading this disclosure, various equivalent modifications by those skilled in the art fall within the scope defined by the appended claims of the present application.
The following takes a single BIOS image file as an example:
Step 1: read the BIOS image file and perform binary translation, as shown in FIG. 2:
Step 1.1: read the BIOS image file and define the BIOS image program data as d_1;
Step 1.2: perform binary translation on the BIOS image program data d_1.
Step 2: convert the data d_1 into an uncompressed grayscale image using the B2M algorithm, extract the LBP texture and the SIFT features represented by the BoVW bag-of-visual-words model, and input the concatenated features into the SVM classifier, as shown in FIG. 3:
Step 2.1: read the binary file of data d_1, convert every 8-bit group into an unsigned integer with value range [0, 255], and convert the result into a one-dimensional array A1;
Step 2.2: define a two-dimensional array A2 with fixed width width = 2^m, and fill it with the values of the one-dimensional array A1 to obtain a fixed-width matrix M1;
Step 2.3: dump the fixed-width two-dimensional matrix as a grayscale image G;
Step 2.4: perform k-means clustering on all extracted SIFT features to obtain K cluster centers, which serve as the visual word list;
Step 2.5: using the word list to quantize image G, compute, for each SIFT feature point, its distance to every word in the word list;
Step 2.6: obtain the feature vector f1 of image G;
Step 2.7: compute the LBP feature image of BIOS program image G and divide it into blocks;
Step 2.8: compute the histogram of each block's feature image and normalize it;
Step 2.9: arrange the histograms of the block feature images in the spatial order of the blocks to obtain the LBP feature vector u1;
Step 2.10: concatenate the LBP feature vector and the SIFT feature vector, input the result into the SVM classifier, and output vector r_1.
Step 3: input BIOS program data d_1 into the embedding layer of Bert, combining token information, segment information and position information to obtain a vector containing program structure and semantics, as shown in FIG. 4:
Step 3.1: define t_1 = {label, d_1}, where label is the BIOS program data-set tag;
Step 3.2: define len(S_1) as the length of data d_1, and unify the fixed text length len_max;
Step 3.3: if len(S_1) + 2 ≤ len_max, zero-pad the sequence; otherwise truncate it, so that sequence lengths are unified;
Step 3.4: obtain a new sequence T_1;
Step 3.5: input the token embedding layer, the segment embedding layer and the position embedding layer to obtain vectors v_1, v_2 and v_3; define a loop variable Na with initial value 1;
Step 3.6: if Na < len(T_1), go to step 3.7; otherwise go to step 3.9;
Step 3.7: define vector V_Na = v_1 + v_2 + v_3;
Step 3.8: Na = Na + 1, go to step 3.6;
Step 3.9: output vector y_1 = {V_1, V_2, V_3, …, V_len_max}.
Step 4: input the obtained vector into the pruned 6-layer Transformer model, with a small-scale textCNN connected in series across layers after the Transformer layers to output simple samples early, as shown in FIG. 5:
Step 4.1: construct a Bert pruning model with 6 Transformer layers and pass in the vector y_1;
Step 4.2: define the threshold index Speed and the uncertainty Uncertainty;
Step 4.3: pass vector y_1 into the Transformer layers; define a loop variable g ≤ 3 with initial value 1;
Step 4.4: if loop variable g < 3, execute steps 4.4.1-4.4.3; otherwise go to step 4.6;
Step 4.4.1: output vector Pt at layer 2g of the Transformer, where a small-scale textCNN is connected in series, and input vector Pt into the textCNN network;
Step 4.4.2: output the prediction vector Ps_1 through the convolution layer, pooling layer and Softmax layer of the convolutional neural network;
Step 4.4.3: calculate the uncertainty

Uncertainty = (Σ_{o=1}^{n} Ps_o · log Ps_o) / log(1/n),

where n = len(Ps_1); if Uncertainty > Speed, pass the sample to the next Transformer layer and go to step 4.5; otherwise output vector Ps_1;
Step 4.5: g = g + 1, go to step 4.4;
Step 4.6: the last Transformer layer outputs vector Pt, and vector Ps_1 is output through the convolutional neural network.
Step 5: fuse the image vector and the text vector of the BIOS program data set, and output the program detection result based on the fused vector, as shown in FIG. 6:
Step 5.1: concatenate vector Ps_1 with vector r_1;
Step 5.2: obtain a new vector b_1 and perform category prediction at the output layer, detecting malicious viruses in the BIOS program.
The invention can be combined with a computer system to form a BIOS malicious program detection device based on double embedding and model pruning. The device comprises a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the computer program is loaded into the processor, the BIOS malicious program detection method based on double embedding and model pruning is realized.