CN114077479A

CN114077479A - Method for detecting malicious codes of client virtual machine in cloud platform

Info

Publication number: CN114077479A
Application number: CN202111300132.2A
Authority: CN
Inventors: 陈博翰; 丁紫薇; 马桂才; 杨诏钧; 魏立峰; 韩光; 姬一文
Original assignee: Kirin Software Co Ltd
Current assignee: Kirin Software Co Ltd
Priority date: 2021-11-04
Filing date: 2021-11-04
Publication date: 2022-02-22

Abstract

The application discloses a method for detecting malicious codes of a client virtual machine in a cloud platform, which comprises the following steps: step S1, obtaining a memory dump file; step S2, extracting information by virtual machine introspection; step S3, training a model; and step S4, detecting the malicious codes. The method prevents malicious code from attacking an agent inside the client, an attack that would otherwise disable the agent or even let the malicious code bypass the detection software. It improves detection efficiency and detection precision, and does not need to be adapted again for different types of operating systems.

Description

Method for detecting malicious codes of client virtual machine in cloud platform

Technical Field

The application relates to the field of cloud security, in particular to a method for detecting malicious codes of a client virtual machine in a cloud platform.

Background

Cloud computing is a form of distributed computing in which a large data processing program is decomposed, through the network "cloud", into many small programs; these are then processed and analyzed by a system consisting of a plurality of servers, and the results are returned to the user.

In recent years, with the rapid development of cloud computing, the cloud security problem has become more severe. Traditional malicious code detection in a cloud platform mostly obtains the runtime state information of the client by placing an agent inside the client. Attacks by malicious code on that agent therefore cannot be avoided: the agent may be disabled, and the malicious code may even bypass the detection software. Meanwhile, the traditional methods generally detect malicious code through expert analysis or feature-based machine learning classification, which consumes a lot of time on analysis or data preprocessing. In addition, the existing detection methods need to be adapted and changed for different types of operating systems, and cannot detect operating systems without a visual interface well.

Disclosure of Invention

The invention mainly aims to provide a method for detecting malicious codes of a client virtual machine in a cloud platform. The method prevents malicious code from attacking an agent inside the client, an attack that would otherwise disable the agent or even let the malicious code bypass the detection software; it improves detection efficiency and detection precision, and does not need to be adapted again for different types of operating systems.

In order to achieve the above object, the present invention provides a method for detecting malicious codes of a client virtual machine in a cloud platform, which comprises the following steps:

step S1, obtaining a memory dump file: creating and starting a client virtual machine in the cloud platform, and obtaining the memory dump file of the client virtual machine by using the memory dump function of the virtualization platform;

step S2, extracting information by virtual machine introspection: analyzing the memory dump file, and obtaining various running state features of the client virtual machine through the virtual machine introspection technology;

step S3, model training: training on the various running state features in sequence by using a BERT model, to obtain the trained BERT model and the model classification accuracy corresponding to each running state feature;

and step S4, malicious code detection: sequentially inputting the running state features under detection into the trained BERT model to obtain a detection result for each running state feature, assigning a weight to the detection result of each running state feature according to the model classification accuracy corresponding to that feature, multiplying the detection result of each running state feature by its weight, and adding the products to obtain the final detection result.

Optionally, step S1 includes:

step S101, a client virtual machine is created and started in a cloud platform;

step S102, saving a snapshot of the client virtual machine as a recovery point, running normal software and malicious software in the client virtual machine, and simulating scenarios of normal use by a user and of intrusion by malicious software;

step S103, acquiring a memory dump file of the client virtual machine by using the memory dump function of the virtualization platform;

and step S104, restoring the client virtual machine to the recovery point, running the remaining software, and repeating step S103.

Optionally, step S2 includes:

step S201, constructing a symbol table of a client operating system;

step S202, analyzing the memory dump file, acquiring various running state features of the client virtual machine at runtime, and storing the acquired data in different documents according to category.

Optionally, the client virtual machine includes a Linux system virtual machine and a Windows system virtual machine; the type of the client virtual machine is determined, and if the client virtual machine is a Windows system virtual machine, the symbol table is obtained by using Volatility in step S201; if the client virtual machine is a Linux system virtual machine, the symbol table is obtained by using dwarf2json in step S201.
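The choice of symbol table source by operating system type can be sketched as follows. This is a minimal illustration and not the patented implementation: the function name and the file paths are hypothetical, and the `dwarf2json linux --elf ... --system-map ...` form is assumed from the tool's usual command-line usage, with its JSON output expected on standard output.

```python
def symbol_table_command(os_type, kernel_elf="vmlinux", system_map="System.map"):
    """Return the argv that would build a symbol table, or None when none is needed.

    kernel_elf and system_map are hypothetical paths to the guest kernel's
    debug ELF file and System.map file.
    """
    kind = os_type.lower()
    if kind == "windows":
        # Volatility supplies Windows symbols itself, so nothing is generated.
        return None
    if kind == "linux":
        # dwarf2json reads DWARF data from the ELF file and symbols from
        # System.map; its stdout would be redirected to a JSON file.
        return ["dwarf2json", "linux", "--elf", kernel_elf, "--system-map", system_map]
    raise ValueError(f"unsupported guest operating system: {os_type}")
```

Returning the argument list instead of running it keeps the sketch free of side effects; a caller could pass the list to `subprocess.run` and redirect the output to a file.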

Optionally, the Linux system virtual machine adopts the Ubuntu 16.04 operating system, and the Windows system virtual machine adopts the Windows 7 operating system.

Optionally, step S3 includes:

step S301, sorting and labeling the running state features acquired in step S2 to be used as the input data set of the BERT model, and dividing the input data set into a training set and a verification set;

step S302, adjusting the hyper-parameter structure of the BERT model;

and step S303, inputting the input data set into the BERT model for pre-training, completing the two pre-training tasks of Masked LM and NSP, and then training the pre-trained model again with the same input data set, so as to finally obtain the trained BERT classification model and the model classification accuracy corresponding to each running state feature.
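The division into a training set and a verification set in step S301 can be sketched as follows. The detailed description gives an 8:2 ratio; the shuffling, the fixed seed, and the function name are assumptions made for this illustration only.

```python
import random

def split_dataset(samples, train_fraction=0.8, seed=0):
    """Shuffle labeled samples and split them into (training, verification) lists."""
    shuffled = list(samples)
    # A fixed seed (an assumption here) makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical labeled samples: (feature text, label), 1 = malicious, 0 = benign.
samples = [(f"plugin output {i}", i % 2) for i in range(10)]
train_set, verification_set = split_dataset(samples)
```

Every sample lands in exactly one of the two lists, so the verification accuracy is measured on data the model did not see during training.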

Optionally, step S302 adjusts the hyper-parameter structure of the BERT model to:

the number of hidden layers L is 2, the hidden layer size H is 512, and the number of attention heads A is 8.

Optionally, the step S4 includes:

screening n running state features with the highest model classification accuracy;

calculating the weight of the detection result of each running state feature according to the model classification accuracy of the screened running state features, wherein the weight calculation formula is as follows:

$$w_i = \frac{acc_i}{\sum_{j=1}^{n} acc_j}$$

wherein w_i is the weight of the i-th running state feature, acc_i is the model classification accuracy of the i-th running state feature, and $\sum_{j=1}^{n} acc_j$ is the sum of the model classification accuracies of the n running state features, n being an integer greater than zero;

when malicious codes of a client virtual machine to be detected are detected, first the n screened running state features of the client virtual machine to be detected are obtained, then these running state features are classified by the trained BERT model to obtain a detection result for each running state feature, and the final detection result is calculated by the following formula:

$$R_0 = \sum_{i=1}^{n} w_i \, R_i \, r_i$$

wherein R_0 is the probability that malicious code is present in the client virtual machine, w_i is the weight of the i-th feature, R_i is the classification accuracy of the i-th feature of the client virtual machine to be detected, and r_i = 1 indicates that the detection result of the i-th feature is malicious.

Optionally, the screened operating state features include filescan, netscan, malfind, privs, modules, psxview, pslist, svcscan, thrdscan, and mutantscan.

According to the technical scheme, the embodiment of the application has the following advantages:

1. the method can detect malicious software in client virtual machines using different operating systems without being changed or adjusted, and requires no additional configuration of the client operating systems;

2. the method obtains the memory dump file of the client virtual machine and extracts its various running state features with the virtual machine introspection technology, so that the client information is acquired from the hypervisor layer and malicious software is effectively prevented from damaging or bypassing the detection system;

3. the running information is classified with the BERT model, which avoids the complicated feature extraction steps of the prior art (such as sliding windows); no additional analysis or feature extraction is needed, and the time overhead is reduced. Meanwhile, the input representation of BERT combines a word vector, a text (segment) vector and a position vector, which strengthens the context relationship of the running state information and improves detection accuracy. The method achieves a classification accuracy of 99.9% and effectively ensures the security of the cloud platform.

Drawings

In order to express the technical scheme of the embodiment of the invention more clearly, the drawings used for describing the embodiment will be briefly introduced below, and obviously, the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a design architecture diagram of a method for malicious code detection of a guest virtual machine in a cloud platform according to the present invention;

FIG. 2 is a schematic diagram of the operation of the BERT model in an embodiment of the present invention;

FIG. 3 is a flowchart of a method for detecting malicious code of a guest virtual machine in a cloud platform according to the present invention.

Detailed Description

In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The embodiment of the invention provides a method for detecting malicious codes of a client virtual machine in a cloud platform, consisting of a data acquisition part, a virtual machine introspection part and a model training part. The main function of the data acquisition part is to acquire the running data of instances on the cloud platform and to provide data for the subsequent model classification. The actual environment of this part is the OpenStack Train release installed on Ubuntu 18.04. A plurality of instances, namely client virtual machines, are created on the OpenStack cloud platform, and the contents of their dump files at runtime are obtained. The main function of the virtual machine introspection part is to use the virtual machine introspection technology to acquire specific runtime information of the client from the dump file contents extracted on the cloud platform. This part can extract client running information from different systems and is not limited to a single system, so the classification model still shows a good detection rate when facing data from different systems.

The main function of the model training part is to train the BERT model and then, by classification, obtain the likelihood that the client has been attacked by malicious code. Training of the BERT model comprises a pre-training part and a training part. Pre-training lets the vector representation that the model outputs for each character or word describe the overall information of the input text as completely and accurately as possible, which provides better initial values of the model parameters for the subsequent training. After pre-training, the data set is used again for training, and a BERT classification model for malicious code detection is obtained. During detection, the classification accuracy of each feature is obtained from the running state features of the client virtual machine and the trained BERT classification model, and the likelihood that malicious code exists in the client virtual machine is calculated with the weight formula.

Referring to fig. 1, fig. 2 and fig. 3, a method for detecting malicious codes of a client virtual machine in a cloud platform according to an embodiment of the present invention includes the following specific steps.

Step one, obtaining a memory dump file

(1) Create a plurality of client virtual machines in the cloud platform, install different versions of Linux and Windows systems on them, and start the client virtual machines. Specifically, a plurality of client virtual machines are created using OpenStack, the Ubuntu 16.04 (a Linux system) and Windows 7 operating systems are installed on them respectively, and the client virtual machines are started. With two types of virtual machines, client running information can be extracted from different systems, so that the running state features of various operating systems can be detected.

(2) Save a snapshot of the client virtual machine as a recovery point, run normal software and malicious software in the client virtual machine, and simulate scenarios of normal use by a user and of intrusion by malicious software. Specifically, the snapshot function of OpenStack is used to create the snapshots of the client virtual machines.

(3) Acquire a memory dump file of the client virtual machine by using the memory dump function of the virtualization platform, specifically the virsh dump instruction of libvirt.

(4) Restore the client virtual machine to the recovery point, run the remaining software, and repeat step (3).
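The snapshot, dump and restore cycle of steps (2) to (4) can be sketched as a list of commands. This is an illustration under stated assumptions, not the embodiment's tooling: it uses libvirt's own `virsh snapshot-create-as` and `virsh snapshot-revert` in place of the OpenStack snapshot function, adds the `--memory-only` option to `virsh dump`, and the domain name, snapshot name and dump directory are hypothetical. Starting each sample inside the client is outside the sketch.

```python
def collection_commands(domain, sample_names, snapshot="clean", dump_dir="/var/tmp/dumps"):
    """Build the virsh commands for one snapshot/dump/restore collection run."""
    # Step (2): save a recovery point before any sample is run.
    commands = [["virsh", "snapshot-create-as", domain, snapshot]]
    for name in sample_names:
        # The sample (normal or malicious software) would be started in the client here.
        # Step (3): dump the client's memory from the host side.
        commands.append(["virsh", "dump", domain, f"{dump_dir}/{name}.dump", "--memory-only"])
        # Step (4): return to the recovery point before the next sample.
        commands.append(["virsh", "snapshot-revert", domain, snapshot])
    return commands

commands = collection_commands("guest-win7", ["sample_a", "sample_b"])
```

Each command is an argument list that could be handed to `subprocess.run` on the virtualization host; building the list first makes the sequence easy to review before anything is executed.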

Step two, extracting information by virtual machine introspection

After the memory dump file of the client virtual machine is obtained, the runtime state information of the client virtual machine is obtained by using the virtual machine introspection technology. Virtual machine introspection obtains the internal runtime state of a virtual machine from outside the client virtual machine. The steps for performing virtual machine introspection with Volatility are as follows.

(1) Construct a symbol table of the client operating system. For the Windows operating system, the symbol table provided by Volatility can be used; for the Linux operating system, the symbol table is obtained by using dwarf2json, which processes the DWARF and symbol table information from an ELF file together with the symbols from a System.map input file to generate a JSON file that Volatility uses to analyze Linux.

(2) Analyze the memory dump file acquired in the previous step. Using Volatility plugins such as callbacks, dlllist, filescan and handles, acquire the running state features of the client virtual machine at runtime, such as the API (application programming interface) call sequence, memory operations, network activity, registry operations and file operations, and store the acquired data in different documents according to category.
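Running one plugin per feature category and keeping each result in its own document can be sketched as follows. The sketch assumes the Volatility 3 command form `vol -f <dump> <plugin>` with `windows.`-prefixed plugin names; Volatility 2 uses a different syntax with a profile argument, and the source does not say which version was used. The dump path and output directory are hypothetical.

```python
def plugin_jobs(dump_file, plugins, out_dir="features"):
    """Pair each plugin command with the document its output would be saved to."""
    jobs = []
    for plugin in plugins:
        argv = ["vol", "-f", dump_file, plugin]
        # One output document per plugin keeps the feature categories separate.
        jobs.append((argv, f"{out_dir}/{plugin}.txt"))
    return jobs

# Assumed Volatility 3 names for the plugins mentioned in the text.
PLUGINS = ["windows.callbacks", "windows.dlllist", "windows.filescan", "windows.handles"]
jobs = plugin_jobs("guest-win7.dump", PLUGINS)
```

A caller would run each `argv` with `subprocess.run(argv, capture_output=True, text=True)` and write the captured standard output to the paired path.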

The malicious code detection method of this embodiment differs from conventional malicious code detection methods in that classification detection is performed offline. In the above steps, the running state features of the client are acquired from the hypervisor layer, which prevents malicious software from damaging or bypassing the detection system. Moreover, the virtual machine introspection technology works with a plurality of systems, so the required running state information can be extracted from different systems for the subsequent classification and prediction work without re-adaptation.

Step three, model training

The method selects the BERT model. Training of the model is divided into a pre-training stage and a training stage: the pre-training stage helps the BERT model understand word meanings and the relationships between sentences, and the training stage lets the BERT model learn the dependencies between words within a sentence and capture the internal structure of the sentence. For the detection principle of the BERT model, refer to FIG. 2: a Transformer encoder learns context information to enhance the semantic representation of the target word. The flow on the left side of FIG. 2 is the complete process of one Transformer encoder in the right-hand diagram. After data is input into the BERT model, the semantic vector representation of each word in the text is enhanced by a multi-head self-attention module; the input of that module and its output are added through a residual connection, the sum is normalized and serves as the new input, and two linear transformations are then applied to each word to enhance the expressive capacity of the whole model. The data set of the method consists of the running state features on the cloud platform extracted through the virtual machine introspection technology, whose contextual relationships are close. Because the BERT model takes the context of the text into account during classification, which helps improve detection accuracy, the method selects the BERT model as the classification model.
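The encoder flow described above can be written compactly. This is a sketch following the standard Transformer encoder formulation, not a formula given in the source: X (the input word vectors), Z, the weights W_1, W_2, the biases b_1, b_2 and the activation φ are assumed notation, and the second residual connection and normalization of the standard design are left out because the text does not mention them.

```latex
\begin{aligned}
Z &= \mathrm{LayerNorm}\bigl(X + \mathrm{MultiHead}(X)\bigr)
  && \text{residual connection, then normalization} \\
\mathrm{FFN}(Z) &= \phi(Z W_1 + b_1)\, W_2 + b_2
  && \text{two linear transformations applied to each word}
\end{aligned}
```

With the hyper-parameters H = 512 and A = 8 reported for this embodiment, each attention head would operate on 512 / 8 = 64 dimensions under the usual equal split of the hidden size across heads.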

(1) Collect the running state features obtained with the virtual machine introspection technology and label them as input data. After the running state features have been acquired, they need to be sorted and labeled to serve as input data for training the model. The text_dataset_from_directory function is used to read the data set, which is divided into a training set and a verification set in a ratio of 8:2.

(2) Adjust the structure of the BERT model. The pre-training task adjusts the parameters of the BERT model so that the output of the model expresses semantics accurately. The structure of the BERT model used in the method has the following adjustable hyper-parameters: the number of hidden layers L (L = 2, 4, 8, 12), the size of each hidden layer H (H = 128, 256, 512, 768), and the number of attention heads A (A = 2, 4, 8, 12). The model structure with the best classification effect is selected by adjusting these hyper-parameters; in the experiment, the best classification result is obtained with L = 2, H = 512 and A = 8.

(3) Pre-train the BERT model with the data set collected in step two, completing the two pre-training tasks of Masked LM and Next Sentence Prediction (NSP), so that the BERT model better understands the relationships between words and between sentences. Then train the pre-trained model again with the same data set to finally obtain the trained BERT classification model.
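The hyper-parameter search space listed above can be enumerated as follows. The value sets for L, H and A come from the text; the filter requiring H to be divisible by A is an assumption, based on the common practice of splitting the hidden size evenly across attention heads, and the source does not state whether such combinations were excluded.

```python
from itertools import product

LAYERS = (2, 4, 8, 12)         # number of hidden layers L
HIDDEN = (128, 256, 512, 768)  # hidden layer size H
HEADS = (2, 4, 8, 12)          # number of attention heads A

def candidate_configs():
    """Return every (L, H, A) combination whose hidden size splits evenly across heads."""
    return [
        (layers, hidden, heads)
        for layers, hidden, heads in product(LAYERS, HIDDEN, HEADS)
        if hidden % heads == 0  # assumed constraint, see the note above
    ]

configs = candidate_configs()
best_reported = (2, 512, 8)  # configuration the text reports as best
```

Each tuple would be used to build and evaluate one model; the text reports that (2, 512, 8) gave the best classification result in the experiment.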

Step four, malicious code detection

Malicious code detection is divided into two steps. First, the weight of each feature is calculated from the model classification accuracies of the features. Then, malicious code detection is carried out with the weight formula and the trained BERT model to obtain the final detection result.

(1) The first n features (n is a positive integer) with the highest classification accuracy are selected from all the features used in model training. In this embodiment, the ten features with the highest classification accuracy are screened out as the features finally used for malicious code detection. The model classification accuracies of all the features are shown in Table 1.

TABLE 1

Since a feature with higher classification accuracy should have a greater influence on the final detection result, a weight is set for each feature, and the weight calculation formula is as follows:

$$w_i = \frac{acc_i}{\sum_{j=1}^{n} acc_j}$$

wherein w_i is the weight of the i-th feature, acc_i is the model classification accuracy of the i-th feature, and $\sum_{j=1}^{n} acc_j$ is the sum of the model classification accuracies of the n features. As can be seen from the formula, the higher the classification accuracy, the greater the weight of the feature and the greater its impact on the final detection result.
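The feature screening and weight calculation can be sketched as follows. The function name and the accuracy values are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the source.

```python
def accuracy_weights(accuracies, n):
    """Keep the n most accurate features and weight each by its share of the accuracy sum."""
    top = sorted(accuracies.items(), key=lambda item: item[1], reverse=True)[:n]
    total = sum(acc for _, acc in top)
    return {name: acc / total for name, acc in top}

# Hypothetical per-feature model classification accuracies.
weights = accuracy_weights({"pslist": 0.9, "netscan": 0.6, "handles": 0.3}, n=2)
```

Because each weight is an accuracy divided by the sum over the selected features, the weights always add up to 1, and a more accurate feature always receives the larger weight.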

(2) When malicious code detection is carried out on a client virtual machine to be detected (whose system is Ubuntu 16.04), the ten features with the highest classification accuracy (filescan, netscan, malfind, privs, modules, psxview, pslist, svcscan, thrdscan and mutantscan) are obtained from the client virtual machine to be detected. Each feature is classified with the BERT model trained in step three, which yields the classification accuracy of each feature, namely the classification accuracy R_i of malicious code detection using the i-th feature. The classification accuracy of each feature is multiplied by the corresponding weight and detection result, and the products are added to obtain the final detection result, according to the following formula:

$$R_0 = \sum_{i=1}^{n} w_i \, R_i \, r_i$$

In this embodiment, two quantities relate to classification accuracy: acc_i and R_i. acc_i is the output obtained by training the BERT model with the data set as input, that is, the classification accuracy of the model. R_i is the output obtained by classifying the i-th feature data of the virtual machine to be detected with the trained BERT model, that is, the classification accuracy of malicious code detection using the i-th feature. acc_i and R_i are thus the output of model training and the output of classification with the model, respectively. The input behind acc_i includes the running state feature data of both types of client and therefore covers both systems, whereas R_i takes the i-th running state feature of the virtual machine to be detected as input and concerns only the system under test. Neither requires additional calculation steps to obtain.
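The weighted combination of the per-feature results can be sketched as follows. All numbers are hypothetical, and encoding a non-malicious verdict as r_i = 0 is an assumption: the source only states that r_i = 1 marks a malicious result.

```python
def fused_score(weights, accuracies, verdicts):
    """Compute R_0 as the sum over features of w_i * R_i * r_i."""
    return sum(weights[name] * accuracies[name] * verdicts[name] for name in weights)

weights = {"pslist": 0.6, "netscan": 0.4}     # w_i, summing to 1
accuracies = {"pslist": 0.9, "netscan": 0.8}  # R_i reported by the trained model
verdicts = {"pslist": 1, "netscan": 0}        # r_i: 1 = malicious, 0 = benign (assumed)

score = fused_score(weights, accuracies, verdicts)
```

In this example only pslist votes malicious, so the score is 0.6 × 0.9 = 0.54. How the score is turned into a final decision, for example by comparing it with a threshold, is not specified in the source.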

In this embodiment, the memory dump function of the virtualization platform is first used to obtain the memory dump files of the client virtual machine from outside the client virtual machine, and virtual machine introspection software is used to analyze these memory dump files to obtain the running state information of the client virtual machine. This effectively prevents malicious code from damaging or bypassing the detection system, and no adjustment or change is needed for different types of client operating systems. Existing methods, by contrast, need to be adapted and changed for different types of operating systems and cannot detect operating systems without a visual interface well. The method classifies the obtained running state information with the BERT framework, so no additional analysis or feature extraction is needed and the time overhead is reduced. The method achieves a classification accuracy of 99.9% and effectively ensures the security of the cloud platform.

The terms referred to in the present embodiment are explained as follows.

Hypervisor: also known as a virtual machine monitor (VMM); software, firmware, or hardware used to create and run virtual machines.

BERT: short for Bidirectional Encoder Representations from Transformers. The goal of the BERT model is to obtain, by training on large-scale unlabeled corpora, a representation of text that contains rich semantic information, that is, a semantic representation of the text. This representation is then fine-tuned for a specific NLP task and finally applied to that task.

Volatility: a memory forensics tool.

dwarf2json: a tool that obtains the symbol table and saves it as a JSON file.

API: application programming interface.

Masked LM: short for Masked Language Model; a language model pre-training task based on a masking mechanism.

Next Sentence Prediction (NSP): a training task for learning the relationships between sentences.

It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.

Claims

1. A method for detecting malicious codes of a client virtual machine in a cloud platform is characterized by comprising the following steps:

2. The method for detecting malicious code of a client virtual machine in a cloud platform according to claim 1, wherein step S1 includes:

step S101, a client virtual machine is created and started in a cloud platform;

3. The method for detecting malicious code of a client virtual machine in a cloud platform according to claim 1, wherein step S2 includes:

step S201, constructing a symbol table of a client operating system;

4. The method according to claim 3, wherein the client virtual machine includes a Linux system virtual machine and a Windows system virtual machine, the type of the client virtual machine is determined, and if the client virtual machine is a Windows system virtual machine, the symbol table is obtained by using Volatility in step S201; if the client virtual machine is a Linux system virtual machine, the symbol table is obtained by using dwarf2json in step S201.

5. The method for detecting malicious code of a client virtual machine in a cloud platform according to claim 4, wherein the linux system virtual machine adopts an ubuntu16.04 operating system, and the windows system virtual machine adopts a windows 7 operating system.

6. The method for detecting malicious code of a client virtual machine in a cloud platform according to claim 1, wherein step S3 includes:

step S302, adjusting a hyper-parameter structure of the BERT model;

7. The method for detecting malicious codes of a client virtual machine in a cloud platform according to claim 6, wherein the step S302 adjusts the hyper-parameter structure of the BERT model to:

8. The method for detecting malicious code of a client virtual machine in a cloud platform according to claim 1, wherein the step S4 includes:

the sum of the model classification accuracy of n running state features, wherein n is an integer greater than zero;

wherein R_0 is the probability that malicious code is present in the client virtual machine, w_i is the weight of the i-th feature, R_i is the classification accuracy of the i-th feature of the client virtual machine to be detected, and r_i = 1 indicates that the detection result of the i-th feature is malicious.

9. The method for detecting malicious codes of a client virtual machine in a cloud platform according to claim 8, wherein the screened running state features comprise filescan, netscan, malfind, privs, modules, psxview, pslist, svcscan, thrdscan and mutantscan.