CN114329472A - BIOS (Basic Input/Output System) malicious program detection method and device based on double embedding and model pruning - Google Patents

BIOS (Basic Input/Output System) malicious program detection method and device based on double embedding and model pruning

Info

Publication number
CN114329472A
CN114329472A (application CN202111671081.4A)
Authority
CN
China
Prior art keywords
vector
bios
data set
len
program
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111671081.4A
Other languages
Chinese (zh)
Other versions
CN114329472B (en)
Inventor
李翔
张豪杰
赵建洋
谢乾
汪涛
周国栋
陈礼青
寇海洲
高尚兵
束玮
张宁
丁婧娴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huai'an Xinye Electric Power Design Consulting Co ltd
Original Assignee
Jiangsu Zhuoyi Information Technology Co ltd
Nanjing Byosoft Co ltd
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Zhuoyi Information Technology Co ltd, Nanjing Byosoft Co ltd, Huaiyin Institute of Technology filed Critical Jiangsu Zhuoyi Information Technology Co ltd
Priority to CN202111671081.4A priority Critical patent/CN114329472B/en
Publication of CN114329472A publication Critical patent/CN114329472A/en
Application granted granted Critical
Publication of CN114329472B publication Critical patent/CN114329472B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a BIOS (Basic Input/Output System) malicious program detection method and device based on double embedding and model pruning. First, BIOS image files are read to construct an original data set, which is then subjected to binary translation. The translated data set is converted into a two-dimensional matrix by the B2M algorithm, mapping each program file to an uncompressed grayscale image from which image features are extracted. The original data set is also fed into a pruned Bert model with 6 Transformer layers; small-scale TextCNNs are connected in series across layers behind the Transformer layers, and an uncertainty measure is introduced so that simple programs can be output in advance. Finally, the image vector and the text vector of each BIOS program are concatenated, and the detection result is output from the fused vector. By using double embedding of text and grayscale images for feature expansion, the method can effectively resist variant viruses in BIOS programs, and the pruned deep learning model improves detection efficiency, so the method is better suited to practical scenarios.

Description

BIOS (Basic Input/Output System) malicious program detection method and device based on double embedding and model pruning
Technical Field
The invention belongs to the technical field of text classification and multi-feature fusion, and particularly relates to a BIOS malicious program detection method and device based on double embedding and model pruning.
Background
In recent years, the number of malicious code variants has grown explosively. Rapid mutation and obfuscation techniques make these variants increasingly difficult to identify and pose a significant threat to network security, so malicious code detection has become a research hotspot.
Existing malicious program detection methods have the following shortcomings: 1. detection of malicious programs still relies on signature-based methods similar to traditional antivirus, which cannot cope with novel, powerful malicious programs; 2. existing detection cannot keep up with rapidly mutating malware, and the single feature used in algorithmic detection cannot withstand variant forms; 3. algorithmic detection of malicious programs is easily disturbed by obfuscation techniques, which reduces detection accuracy.
Disclosure of Invention
The purpose of the invention is as follows: to solve the above problems, the invention provides a BIOS malicious program detection method and device based on double embedding and model pruning, which can effectively resist variant viruses by combining the image information and the semantic and structural information of a BIOS program, while a pruned model effectively improves detection efficiency.
The invention is realized by the following technical scheme:
the invention provides a BIOS malicious program detection method and device based on double embedding and model pruning, which comprises the following steps:
step 1: reading the BIOS image files, constructing a BIOS program original data set D1, cleaning the data, and performing binary translation on the cleaned data to obtain a data set D2, wherein the specific method comprises the following steps:
step 1.1: reading the BIOS image files to obtain the data to be cleaned, and defining the BIOS image program data set as D1 = {d1, d2, d3, …, dn}, where dn is the nth data item to be cleaned;
step 1.2: performing data cleaning on the data set D1 to obtain a data set D1';
step 1.3: performing binary translation on the cleaned data set D1' to obtain a data set D2 = {Doc1, Doc2, Doc3, …, DocN}, where DocN is the Nth data item to be processed.
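For illustration only (not part of the claimed method), the following Python sketch shows one way steps 1.1-1.3 could be realized; the directory name, the empty cleaning rule and the choice of hexadecimal byte tokens as the target of the binary translation are assumptions, since the patent does not specify them:
# Sketch of step 1 (assumed encoding: raw bytes -> hexadecimal tokens).
from pathlib import Path

def read_bios_image(path):
    """Read a BIOS image file as raw bytes (step 1.1)."""
    return Path(path).read_bytes()

def clean(raw):
    """Placeholder for data cleaning (step 1.2); the concrete cleaning rules are not given."""
    return raw

def binary_translate(raw):
    """Translate the binary stream into a textual token sequence (step 1.3).
    Hexadecimal byte tokens are an assumption made only for this sketch."""
    return " ".join(f"{b:02x}" for b in raw)

files = sorted(Path("bios_images").glob("*.bin"))   # hypothetical directory of image files
D1 = [read_bios_image(f) for f in files]            # original data set D1
D2 = [binary_translate(clean(d)) for d in D1]       # translated data set D2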
step 2: converting the data set D2 into uncompressed grayscale images using the B2M algorithm, extracting LBP texture features and SIFT features represented by a BoVW bag-of-words model, concatenating them, and inputting the concatenated features into an SVM classifier, wherein the specific method comprises the following steps:
step 2.1: reading a binary file from the BIOS data set D2, converting every group of 8 bits into an unsigned integer with values in the range 0-255, and storing the result in a one-dimensional array A1;
step 2.2: defining a two-dimensional array A2 with a fixed width, width = 2^n, and filling A2 with the values of the one-dimensional array A1 to obtain a fixed-width matrix M1;
step 2.3: converting the fixed-width two-dimensional matrix into a grayscale image G;
step 2.4: performing K-means clustering on all extracted SIFT features to obtain K cluster centers serving as the visual vocabulary;
step 2.5: taking the vocabulary as a reference, computing the distance between each SIFT feature point and each word in the vocabulary;
step 2.6: obtaining the feature vector f of each image and thus the data set vector sequence F = {f1, f2, f3, …, f_len(D2)}, where len(D2) is defined as the length of the data set D2;
step 2.7: computing the LBP feature image of each BIOS program image and dividing it into blocks;
step 2.8: computing the histogram of the feature image in each block and normalizing it;
step 2.9: arranging the block histograms in the spatial order of the blocks to obtain the LBP feature vector u;
step 2.10: obtaining the data set LBP feature vector sequence U = {u1, u2, u3, …, u_len(D2)};
step 2.11: concatenating the LBP feature vector and the SIFT feature vector, inputting the result into an SVM classifier, and outputting the vector sequence R = {r1, r2, r3, …, r_len(D2)}.
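For illustration only, the sketch below outlines the core of steps 2.1-2.11 in Python with NumPy, scikit-image and scikit-learn; the fixed width of 256 = 2^8, the 4x4 block layout, the uniform LBP parameters and the number of visual words are assumed values, and the SIFT descriptor extraction itself is only indicated in the usage comments:
# Sketch of steps 2.1-2.11 with assumed parameters (width=256, 4x4 blocks, uniform LBP).
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def b2m_grayscale(raw_bytes, width=256):
    """B2M: read each byte as an unsigned 0-255 value and reshape into a
    fixed-width matrix, i.e. an uncompressed grayscale image (steps 2.1-2.3)."""
    a1 = np.frombuffer(raw_bytes, dtype=np.uint8)      # one-dimensional array A1
    rows = len(a1) // width
    return a1[: rows * width].reshape(rows, width)     # fixed-width matrix M1 / image G

def lbp_block_histogram(image, blocks=4, points=8, radius=1):
    """Per-block normalized LBP histograms concatenated in block order (steps 2.7-2.9)."""
    lbp = local_binary_pattern(image, points, radius, method="uniform")
    n_bins = points + 2
    feats = []
    for rows in np.array_split(np.arange(image.shape[0]), blocks):
        for cols in np.array_split(np.arange(image.shape[1]), blocks):
            hist, _ = np.histogram(lbp[np.ix_(rows, cols)], bins=n_bins, range=(0, n_bins))
            feats.append(hist / max(hist.sum(), 1))    # normalize each block histogram
    return np.concatenate(feats)                       # LBP feature vector u

def bovw_histogram(descriptors, kmeans):
    """Assign each local descriptor (e.g. SIFT) to its nearest visual word and
    return the normalized word-frequency histogram (steps 2.4-2.6)."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1)                   # feature vector f

# Usage sketch: fit the vocabulary on all SIFT descriptors, then train the SVM on
# the concatenated [BoVW-SIFT | LBP] vectors and keep its outputs as the sequence R.
# kmeans = KMeans(n_clusters=64).fit(np.vstack(all_sift_descriptors))
# X = np.stack([np.concatenate([bovw_histogram(d, kmeans), lbp_block_histogram(g)])
#               for d, g in zip(all_sift_descriptors, images)])
# svm = SVC(probability=True).fit(X, labels)
# R = svm.predict_proba(X)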
And step 3: inputting a data set D1' of cleaned BIOS programs into an embedding layer of Bert, and obtaining a vector containing a program structure and semantics by combining token information, segment information and position information, wherein the specific method comprises the following steps:
step 3.1: processing the data set D1' cleaned by the BIOS program, defining the data set Text ═ t1,t2,t3…tlen(D1′)},tj={label,dj},j<len(D1′),djE is D1 ', len (D1 ') is the length of the data set D1 ', and label is the BIOS program data set label;
step 3.2: defining a loop variable i, circularly traversing the Text data set, giving the variable i an initial value of 1, and defining len (S)i) Defining len (text) as the length of the data set for the ith data length, and unifying the fixed text length len _ max;
step 3.3: if i < len (text), skipping to step 3.4, otherwise skipping to step 3.12;
step 3.4: if len (S)i) And (2) len _ max is more than or equal to len _ max, zero filling is carried out on the sequence, otherwise, the sequence is truncated, and the sequence is unified and fixed in length;
step 3.5: obtaining a new sequence TiLength is defined as len (T)i);
Step 3.6: inputting token embedding layer, segment embedding layer and position embedding layer to obtain vector v1,v2And v3Defining a cycle variable Na and assigning an initial value of 1;
step 3.7:if Na < len (T)i) Skipping to step 3.8, otherwise skipping to step 3.10;
step 3.8: definition vector v (na) ═ v1+v2+v3
Step 3.9: na +1, skipping to step 3.7;
step 3.10: obtain a vector yi={V1,V2,V3…V(len_max)};
Step 3.11: i is i +1, and skipping to step 3.3;
step 3.12: outputting the final vector sequence Y ═ Y1,y2,y3…ylen(Text)}。
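As an illustrative sketch of step 3 (the Hugging Face transformers library and the pretrained checkpoint name are assumptions; the patent only names the Bert embedding layer), each program text is padded or truncated to len_max and the token, segment and position embeddings are summed element-wise, V(Na) = v1 + v2 + v3:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # checkpoint name is an assumption
bert = BertModel.from_pretrained("bert-base-chinese")
len_max = 512                                                   # unified fixed text length len_max

def embed(text):
    """Steps 3.2-3.10: pad/truncate to len_max, then sum the token, segment and
    position embeddings, V(Na) = v1 + v2 + v3, for every position Na."""
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=len_max, return_tensors="pt")
    emb = bert.embeddings                                       # Bert embedding layer
    v1 = emb.word_embeddings(enc["input_ids"])                  # token embedding
    v2 = emb.token_type_embeddings(enc["token_type_ids"])       # segment embedding
    v3 = emb.position_embeddings(torch.arange(len_max).unsqueeze(0))  # position embedding
    return v1 + v2 + v3                                         # y_i = {V(1) ... V(len_max)}

# Y = [embed(t) for t in cleaned_texts]   # cleaned_texts: the cleaned data set D1'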
And 4, step 4: inputting the obtained vector sequence into 6 layers of transformers after pruning, and performing cross-layer tandem connection on small-scale TextCNNs behind the transformers for outputting simple samples in advance, wherein the specific method comprises the following steps:
step 4.1: constructing a Bert pruning model of 6 layers of transformers, and transmitting the Bert pruning model into a vector sequence Y;
step 4.2: defining a cycle variable j, wherein an initial value assigned to the j is 1, and defining a threshold index Speed and an Uncertainty;
step 4.3: if j < len (Y), skipping to step 4.4, otherwise skipping to step 4.10;
step 4.4: will vector yjInto the Transformer layer, yjE, defining a cycle variable i, wherein i is less than or equal to 3, and an initial value of i is 1;
step 4.5: if the loop variable i is less than 3, executing the step 4.5.1-4.5.3, otherwise, jumping to the step 4.7;
step 4.5.1: outputting a vector Pt at a 2i layer of a transform, connecting small-scale TextCNN in series at a 2i layer of the transform, and inputting the vector Pt into a TextCNN network;
step 4.5.2: outputting a prediction vector Ps through a convolution layer, a pooling layer and a Softmax layer of the convolutional neural network;
step 4.5.3: calculating uncertainty
Figure BDA0003449520360000031
If Uncertainty > SpeedTransmitting the next layer of transform, jumping to the step 4.6, otherwise outputting the vector Ps;
step 4.6: i is i +1, and skipping to step 4.5;
step 4.7: outputting a vector Pt by a last layer of Transformer layer, and outputting a vector Ps through a convolutional neural network;
step 4.8: j equals j +1, go to step 4.3;
step 4.9: outputting the whole vector sequence H ═ Ps1,Ps2,Ps3…Pslen(Y)}。
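The uncertainty formula of step 4.5.3 appears only as an image in the published document; early-exit schemes of this kind (for example FastBERT) commonly use the normalized entropy of the branch prediction, and the sketch below adopts that choice purely as an assumption. The hidden size, number of classes and the Speed threshold are likewise assumed values:
# Sketch of steps 4.1-4.9: 6 Transformer layers with small TextCNN exit branches
# after layers 2 and 4 and a final branch after layer 6.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNBranch(nn.Module):
    """Small-scale TextCNN used as an early-exit classifier (steps 4.5.1-4.5.2)."""
    def __init__(self, hidden=768, n_classes=2, n_filters=64, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(nn.Conv1d(hidden, n_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, x):                            # x: (batch, seq_len, hidden)
        x = x.transpose(1, 2)                        # Conv1d expects (batch, hidden, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return F.softmax(self.fc(torch.cat(pooled, dim=1)), dim=-1)  # prediction vector Ps

def uncertainty(ps):
    """Assumed uncertainty measure: entropy of Ps normalized to [0, 1] (FastBERT-style)."""
    k = ps.size(-1)
    return -(ps * torch.log(ps + 1e-12)).sum(-1) / torch.log(torch.tensor(float(k)))

class PrunedBertWithEarlyExit(nn.Module):
    """6 pruned Transformer layers; branches after the 2i-th layers output simple samples early."""
    def __init__(self, hidden=768, n_heads=12, n_classes=2, speed=0.3):
        super().__init__()
        self.layers = nn.ModuleList(nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True)
                                    for _ in range(6))
        self.branches = nn.ModuleList(TextCNNBranch(hidden, n_classes) for _ in range(3))
        self.speed = speed                           # threshold Speed

    def forward(self, y):                            # y: one vector y_j of shape (1, len_max, hidden)
        pt = y
        for i in (1, 2):                             # exit branches after Transformer layers 2 and 4
            pt = self.layers[2 * i - 1](self.layers[2 * i - 2](pt))
            ps = self.branches[i - 1](pt)            # prediction vector Ps of this branch
            if uncertainty(ps).max() <= self.speed:  # simple sample: output Ps in advance
                return ps
        pt = self.layers[5](self.layers[4](pt))      # remaining Transformer layers
        return self.branches[2](pt)                  # final prediction vector Ps

# model = PrunedBertWithEarlyExit()
# H = torch.cat([model(y_j) for y_j in Y])           # vector sequence H over the whole data set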
And 5: fusing a BIOS program data set image vector and a text vector, and outputting a program detection result based on the fused vector, wherein the specific method comprises the following steps:
step 5.1: splicing vector sequences R and H, defining variables i and PsiRepresenting the ith vector, r, of the sequence of vectors HiRepresenting the ith vector of the vector sequence R;
step 5.2: stitching vector PsiAnd vector ri
Step 5.3: and obtaining a new vector sequence B, and performing class prediction on an output layer to realize the detection of the BIOS malicious program.
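A minimal sketch of step 5, assuming the image-branch outputs R (step 2), the text-branch outputs H (step 4) and the program labels are available as NumPy arrays; the logistic-regression output layer and the array shapes are placeholders, not taken from the patent:
import numpy as np
from sklearn.linear_model import LogisticRegression

R = np.random.rand(100, 80)                     # placeholder image-branch vectors r_i
H = np.random.rand(100, 2)                      # placeholder text-branch prediction vectors Ps_i
labels = np.random.randint(0, 2, 100)           # placeholder BIOS program labels

B = np.concatenate([R, H], axis=1)              # b_i = [r_i ; Ps_i], the fused vector sequence B
clf = LogisticRegression(max_iter=1000).fit(B, labels)
print(clf.predict(B[:5]))                       # class prediction at the output layer (0/1 coding assumed)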
The invention is realized by the following technical scheme
The device for detecting BIOS malicious programs based on double embedding and model pruning comprises a memory, a processor, and a computer program stored on the memory and executable on the processor, and is characterized in that, when the computer program is loaded into the processor, it implements the BIOS malicious program detection method based on double embedding and model pruning described in steps 1 to 5 above.
By adopting the technical scheme, the invention has the following beneficial effects:
1. The double embedding and model pruning adopted by the invention are of significant value for detecting and classifying BIOS malicious programs. To address the single-feature problem, the B2M algorithm reads the binary bit stream of a BIOS program, converts it into an uncompressed grayscale image, and extracts the image features of the BIOS program. BIOS semantic and structural information is then extracted through the embedding layer of Bert; a pruning model with 6 Transformer layers is constructed, small-scale TextCNN models are connected in series across layers, and an uncertainty measure is introduced so that simple samples can be output in advance, improving efficiency. Finally, the image features and the text features of the BIOS program are concatenated, and the BIOS program detection result is output from the fused features.
2. The fusion of the image information and the program text information of the BIOS program can effectively resist variant viruses with strong functions during model detection;
3. the image SIFT features expressed by the BoVW model are beneficial to large-scale image retrieval and the expandability of the SIFT features is realized, so that the image SIFT features can be conveniently combined with feature vectors in other forms.
4. According to the invention, the construction of a pruning model of 6 layers of transformers and the cross-layer series connection TextCNN model are adopted, so that the efficiency of model detection is better improved;
5. according to the method, the dynamic vector can be generated by adopting the Tranformer, so that the extracted text information can better adapt to the situation;
6. the invention adopts the Bert model to extract the semantic and structural information of the BIOS program, so that the extracted text information is more abundant;
7. the deep network has excellent performance compared with the shallow network, and the deep bidirectional language representation of the Bert model ensures that the Bert model has higher performance.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a flow chart of BIOS program dataset cleaning;
FIG. 3 is a flow chart of extracting BIOS program image features;
FIG. 4 is a flow chart of extracting BIOS program text information through the embedding layer of the Bert model;
FIG. 5 is a flow chart of pruning model construction;
FIG. 6 is a flow chart of vector stitching.
Detailed Description
The present invention is further illustrated below with reference to the accompanying FIGS. 1-6. The embodiments are intended to be illustrative only and not to limit the scope of the invention; various equivalent modifications that occur to those skilled in the art after reading the present disclosure fall within the scope of the appended claims.
The following takes a single BIOS image file as an example:
step 1: reading the BIOS image file, and performing binary translation processing on the BIOS image file, as shown in fig. 2:
step 1.1: reading the BIOS image file and defining the BIOS image program data as d1;
step 1.2: performing binary translation processing on the BIOS image program data d1.
step 2: converting the data d1 into an uncompressed grayscale image using the B2M algorithm, extracting LBP texture features and SIFT features represented by a BoVW bag-of-words model, concatenating them, and inputting the result into an SVM classifier, as shown in FIG. 3:
step 2.1: reading the binary file of the data d1, converting every group of 8 bits into an unsigned integer with values in the range 0-255, and storing the result in a one-dimensional array A1;
step 2.2: defining a two-dimensional array A2 with a fixed width, width = 2^n, and filling A2 with the values of the one-dimensional array A1 to obtain a fixed-width matrix M1;
step 2.3: converting the fixed-width two-dimensional matrix into a grayscale image G;
step 2.4: performing K-means clustering on all extracted SIFT features to obtain K cluster centers serving as the visual vocabulary;
step 2.5: taking the vocabulary as a reference, computing the distance between each SIFT feature point and each word in the vocabulary;
step 2.6: obtaining the feature vector f1 of the image G;
step 2.7: computing the LBP feature image of the BIOS program image G and dividing it into blocks;
step 2.8: computing the histogram of the feature image in each block and normalizing it;
step 2.9: arranging the block histograms in the spatial order of the blocks to obtain the LBP feature vector u1;
step 2.10: concatenating the LBP feature vector and the SIFT feature vector, inputting the result into an SVM classifier, and outputting a vector r1.
step 3: inputting the BIOS program data d1 into the embedding layer of Bert and combining token information, segment information and position information to obtain a vector containing program structure and semantics, as shown in FIG. 4:
step 3.1: defining t1 = {label, d1}, where label is the BIOS program data set label;
step 3.2: defining len(S1) as the length of the data d1 and unifying the fixed text length len_max;
step 3.3: if len(S1) ≤ len_max, padding the sequence with zeros; otherwise truncating it, so that the sequence has the unified fixed length;
step 3.4: obtaining a new sequence T1;
step 3.5: inputting T1 into the token embedding layer, segment embedding layer and position embedding layer to obtain vectors v1, v2 and v3, and defining a loop variable Na with initial value 1;
step 3.6: if Na < len(T1), skipping to step 3.7; otherwise skipping to step 3.9;
step 3.7: defining the vector V(Na) = v1 + v2 + v3;
step 3.8: Na = Na + 1, skipping to step 3.6;
step 3.9: outputting the vector y1 = {V1, V2, V3, …, V(len_max)}.
step 4: inputting the obtained vector into the pruned 6-layer Transformer model, with small-scale TextCNNs connected in series across layers behind the Transformer layers for outputting simple samples in advance, as shown in FIG. 5:
step 4.1: constructing a Bert pruning model with 6 Transformer layers and feeding it the vector y1;
step 4.2: defining a threshold Speed and an uncertainty measure Uncertainty;
step 4.3: feeding the vector y1 into the Transformer layers, and defining a loop variable i, i ≤ 3, with initial value 1;
step 4.4: if the loop variable i < 3, executing steps 4.4.1-4.4.3; otherwise skipping to step 4.6;
step 4.4.1: the 2i-th Transformer layer outputs a vector Pt; a small-scale TextCNN is connected in series after the 2i-th Transformer layer, and the vector Pt is input into the TextCNN network;
step 4.4.2: outputting a prediction vector Ps1 through the convolution layer, pooling layer and Softmax layer of the convolutional neural network;
step 4.4.3: calculating the uncertainty Uncertainty of the prediction vector Ps1 (the formula appears only as an image in the published document); if Uncertainty > Speed, passing the output to the next Transformer layer and skipping to step 4.5; otherwise outputting the vector Ps1;
step 4.5: i = i + 1, skipping to step 4.4;
step 4.6: the last Transformer layer outputs a vector Pt, and the vector Ps1 is output through the convolutional neural network.
step 5: fusing the image vector and the text vector of the BIOS program data, and outputting the program detection result from the fused vector, as shown in FIG. 6:
step 5.1: concatenating the vector Ps1 and the vector r1;
step 5.2: obtaining a new vector b1 and performing class prediction at the output layer to detect malicious viruses in the BIOS program.
The device comprises a memory, a processor, and a computer program stored on the memory and executable on the processor; when the computer program is loaded into the processor, the above BIOS malicious program detection method based on double embedding and model pruning is implemented.

Claims (7)

1. A BIOS malicious program detection method based on double embedding and model pruning, characterized by comprising the following steps:
step 1: reading a BIOS image file, constructing a BIOS program original data set D1, cleaning the data, and performing binary translation on the cleaned data to obtain a data set D2;
step 2: converting the data set D2 into uncompressed grayscale images using the B2M algorithm, extracting LBP texture features and SIFT features represented by a BoVW bag-of-words model, concatenating them, and inputting the concatenated features into an SVM classifier;
step 3: inputting the cleaned BIOS program data set D1' into the embedding layer of the Bert model and combining token information, segment information and position information to obtain vectors containing program structure and semantics;
step 4: inputting the obtained vector sequence into the pruned 6-layer Transformer model, with small-scale TextCNNs connected in series across layers behind the Transformer layers for outputting simple samples in advance;
step 5: fusing the image vector and the text vector of the BIOS program data set, and outputting the program detection result from the fused vector.
2. The method for detecting the BIOS malware based on double embedding and model pruning as claimed in claim 1, wherein the specific method of step 1 is:
step 1.1: reading the BIOS image file, obtaining the data to be cleaned, and defining the BIOS image program data set as D1 = {d1, d2, d3, …, dn}, where dn is the nth data item to be cleaned;
step 1.2: performing data cleaning on the data set D1 to obtain a data set D1';
step 1.3: performing binary translation on the cleaned data set D1' to obtain a data set D2 = {Doc1, Doc2, Doc3, …, DocN}, where DocN is the Nth data item to be processed.
3. The dual-embedding and model-pruning-based BIOS malware detection method of claim 1, wherein the specific method of step 2 is:
step 2.1: reading a binary file from the BIOS data set D2, converting every group of 8 bits into an unsigned integer with values in the range 0-255, and storing the result in a one-dimensional array A1;
step 2.2: defining a two-dimensional array A2 with a fixed width, width = 2^n, and filling A2 with the values of the one-dimensional array A1 to obtain a fixed-width matrix M1;
step 2.3: converting the fixed-width two-dimensional matrix into a grayscale image G;
step 2.4: performing K-means clustering on all extracted SIFT features to obtain K cluster centers serving as the visual vocabulary;
step 2.5: taking the vocabulary as a reference, computing the distance between each SIFT feature point and each word in the vocabulary;
step 2.6: obtaining the feature vector f of each image and thus the data set vector sequence F = {f1, f2, f3, …, f_len(D2)}, where len(D2) is defined as the length of the data set D2;
step 2.7: computing the LBP feature image of each BIOS program image and dividing it into blocks;
step 2.8: computing the histogram of the feature image in each block and normalizing it;
step 2.9: arranging the block histograms in the spatial order of the blocks to obtain the LBP feature vector u;
step 2.10: obtaining the data set LBP feature vector sequence U = {u1, u2, u3, …, u_len(D2)};
step 2.11: concatenating the LBP feature vector and the SIFT feature vector, inputting the result into the SVM classifier, and outputting the vector sequence R = {r1, r2, r3, …, r_len(D2)}.
4. The dual-embedding and model-pruning-based BIOS malware detection method of claim 1, wherein the specific method of step 3 is:
step 3.1: processing the cleaned BIOS program data set D1', defining the data set Text = {t1, t2, t3, …, t_len(D1')}, where tj = {label, dj}, j < len(D1'), dj ∈ D1', len(D1') is the length of the data set D1', and label is the BIOS program data set label;
step 3.2: defining a loop variable i for traversing the Text data set, with initial value 1; defining len(Si) as the length of the ith data item and len(Text) as the length of the data set, and unifying the fixed text length len_max;
step 3.3: if i < len(Text), skipping to step 3.4; otherwise skipping to step 3.12;
step 3.4: if len(Si) ≤ len_max, padding the sequence with zeros; otherwise truncating it, so that all sequences have the unified fixed length;
step 3.5: obtaining a new sequence Ti, whose length is defined as len(Ti);
step 3.6: inputting Ti into the token embedding layer, segment embedding layer and position embedding layer to obtain vectors v1, v2 and v3, and defining a loop variable Na with initial value 1;
step 3.7: if Na < len(Ti), skipping to step 3.8; otherwise skipping to step 3.10;
step 3.8: defining the vector V(Na) = v1 + v2 + v3;
step 3.9: Na = Na + 1, skipping to step 3.7;
step 3.10: obtaining a vector yi = {V1, V2, V3, …, V(len_max)};
step 3.11: i = i + 1, skipping to step 3.3;
step 3.12: outputting the final vector sequence Y = {y1, y2, y3, …, y_len(Text)}.
5. The dual-embedding and model-pruning-based BIOS malware detection method of claim 1, wherein the specific method of step 4 is:
step 4.1: constructing a Bert pruning model with 6 Transformer layers and feeding it the vector sequence Y;
step 4.2: defining a loop variable j with initial value 1, and defining a threshold Speed and an uncertainty measure Uncertainty;
step 4.3: if j < len(Y), skipping to step 4.4; otherwise skipping to step 4.9;
step 4.4: feeding the vector yj, yj ∈ Y, into the Transformer layers, and defining a loop variable i, i ≤ 3, with initial value 1;
step 4.5: if the loop variable i < 3, executing steps 4.5.1-4.5.3; otherwise skipping to step 4.7;
step 4.5.1: the 2i-th Transformer layer outputs a vector Pt; a small-scale TextCNN is connected in series after the 2i-th Transformer layer, and the vector Pt is input into the TextCNN network;
step 4.5.2: outputting a prediction vector Ps through the convolution layer, pooling layer and Softmax layer of the convolutional neural network;
step 4.5.3: calculating the uncertainty Uncertainty of the prediction vector Ps (the formula appears only as an image in the published document); if Uncertainty > Speed, passing the output to the next Transformer layer and skipping to step 4.6; otherwise outputting the vector Ps;
step 4.6: i = i + 1, skipping to step 4.5;
step 4.7: the last Transformer layer outputs a vector Pt, which is passed through the convolutional neural network to output a vector Ps;
step 4.8: j = j + 1, skipping to step 4.3;
step 4.9: outputting the whole vector sequence H = {Ps1, Ps2, Ps3, …, Ps_len(Y)}.
6. The dual-embedding and model-pruning-based BIOS malware detection method of claim 1, wherein the specific method of step 5 is:
step 5.1: concatenating the vector sequences R and H: defining a variable i, where Psi denotes the ith vector of the sequence H and ri denotes the ith vector of the sequence R;
step 5.2: concatenating the vector Psi and the vector ri;
step 5.3: obtaining a new vector sequence B and performing class prediction at the output layer to realize BIOS malicious program detection.
7. A BIOS malicious program detection apparatus based on double embedding and model pruning, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements the BIOS malicious program detection method based on double embedding and model pruning according to any one of claims 1-6.
CN202111671081.4A 2021-12-31 2021-12-31 BIOS malicious program detection method and device based on dual embedding and model pruning Active CN114329472B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111671081.4A CN114329472B (en) 2021-12-31 2021-12-31 BIOS malicious program detection method and device based on dual embedding and model pruning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111671081.4A CN114329472B (en) 2021-12-31 2021-12-31 BIOS malicious program detection method and device based on dual embedding and model pruning

Publications (2)

Publication Number Publication Date
CN114329472A true CN114329472A (en) 2022-04-12
CN114329472B CN114329472B (en) 2023-05-19

Family

ID=81020806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111671081.4A Active CN114329472B (en) 2021-12-31 2021-12-31 BIOS malicious program detection method and device based on dual embedding and model pruning

Country Status (1)

Country Link
CN (1) CN114329472B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105005786A (en) * 2015-06-19 2015-10-28 南京航空航天大学 Texture image classification method based on BoF and multi-feature fusion
WO2021000362A1 (en) * 2019-07-04 2021-01-07 浙江大学 Deep neural network model-based address information feature extraction method
CN111143563A (en) * 2019-12-27 2020-05-12 电子科技大学 Text classification method based on integration of BERT, LSTM and CNN
CN113378163A (en) * 2020-03-10 2021-09-10 四川大学 Android malicious software family classification method based on DEX file partition characteristics
CN111914613A (en) * 2020-05-21 2020-11-10 淮阴工学院 Multi-target tracking and facial feature information identification method
CN113468527A (en) * 2021-06-22 2021-10-01 上海电力大学 Malicious code family classification method based on feature expression enhancement
CN113836903A (en) * 2021-08-17 2021-12-24 淮阴工学院 Method and device for extracting enterprise portrait label based on situation embedding and knowledge distillation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张豪杰 et al.: "Enterprise relation extraction based on Bi-GRU and Self-Attention models", 《工业控制计算机》 (Industrial Control Computer) *

Also Published As

Publication number Publication date
CN114329472B (en) 2023-05-19

Similar Documents

Publication Publication Date Title
CN109165306B (en) Image retrieval method based on multitask Hash learning
JP7193252B2 (en) Captioning image regions
Yan et al. Supervised hash coding with deep neural network for environment perception of intelligent vehicles
JP5774985B2 (en) Image similarity search system and method
Wang et al. A deep semantic framework for multimodal representation learning
CN109033833B (en) Malicious code classification method based on multiple features and feature selection
JP2016042359A (en) Recognition apparatus, real number matrix decomposition method, and recognition method
CN106033426A (en) Image retrieval method based on latent semantic minimum hash
CN116089648B (en) File management system and method based on artificial intelligence
Kishorjit Singh et al. Image classification using SLIC superpixel and FAAGKFCM image segmentation
Zheng et al. Mid‐level deep Food Part mining for food image recognition
Al-Jubouri Content-based image retrieval: Survey
Adnan et al. An improved automatic image annotation approach using convolutional neural network-Slantlet transform
CN112163114A (en) Image retrieval method based on feature fusion
Chen et al. Visual-based deep learning for clothing from large database
Li Image semantic segmentation method based on GAN network and ENet model
Siddiqui et al. A robust framework for deep learning approaches to facial emotion recognition and evaluation
CN111368176A (en) Cross-modal Hash retrieval method and system based on supervision semantic coupling consistency
Solanki et al. Flower species detection system using deep convolutional neural networks
Saaim et al. Light-weight file fragments classification using depthwise separable convolutions
CN116796288A (en) Industrial document-oriented multi-mode information extraction method and system
JP6364387B2 (en) Feature generation apparatus, method, and program
WO2023078009A1 (en) Model weight acquisition method and related system
CN114329472B (en) BIOS malicious program detection method and device based on dual embedding and model pruning
Hu et al. Efficient scene text recognition model built with PaddlePaddle framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230918

Address after: Room 1107, Building B3, Financial Center, No. 16 Shuidukou Avenue, Qingjiangpu District, Huai'an City, Jiangsu Province, 223001

Patentee after: Jiangsu Kewen Enterprise Management Co.,Ltd.

Address before: 223005 Jiangsu Huaian economic and Technological Development Zone, 1 East Road.

Patentee before: HUAIYIN INSTITUTE OF TECHNOLOGY

Patentee before: NANJING BYOSOFT Co.,Ltd.

Patentee before: JIANGSU ZHUOYI INFORMATION TECHNOLOGY Co.,Ltd.

Effective date of registration: 20230918

Address after: 223005 Qingchuang Space 16F-03, Huai'an Ecological and Cultural Tourism Zone, Huai'an City, Jiangsu Province

Patentee after: Huai'an Xinye Power Construction Co.,Ltd.

Address before: Room 1107, Building B3, Financial Center, No. 16 Shuidukou Avenue, Qingjiangpu District, Huai'an City, Jiangsu Province, 223001

Patentee before: Jiangsu Kewen Enterprise Management Co.,Ltd.

CP03 Change of name, title or address

Address after: 223005 Qingchuang Space 16F-03, Huai'an Ecological and Cultural Tourism Zone, Huai'an City, Jiangsu Province

Patentee after: Huai'an Xinye Electric Power Design Consulting Co.,Ltd.

Country or region after: China

Address before: 223005 Qingchuang Space 16F-03, Huai'an Ecological and Cultural Tourism Zone, Huai'an City, Jiangsu Province

Patentee before: Huai'an Xinye Power Construction Co.,Ltd.

Country or region before: China