CN116861420B - Malicious software detection system and method based on memory characteristics

Info

Publication number: CN116861420B
Application number: CN202310607910.5A
Authority: CN
Inventors: 邹凯; 陈凯枫; 顾颂斐; 韩日富
Original assignee: Guangzhou Trustmo Information System Co ltd
Current assignee: Guangzhou Trustmo Information System Co ltd
Priority date: 2023-05-26
Filing date: 2023-05-26
Publication date: 2024-05-28
Anticipated expiration: 2043-05-26
Also published as: CN116861420A

Abstract

The invention discloses a malicious software detection system and method based on memory characteristics, and relates to the technical field of computer host security. The system comprises a memory acquisition module, a feature processing module, a sample generation module, an off-line training module and a real-time prediction alarm module, wherein the memory acquisition module is used for acquiring memory data when a host runs, the feature processing module is used for carrying out feature engineering processing on the memory data and obtaining corresponding memory features, the sample generation module is used for running multiple malicious software in the host and acquiring memory and extracting memory features, the off-line training module is used for dividing a sample data set and training a two-layer stacked two-class machine learning model by using an evaluation index, and the real-time prediction alarm module is used for dividing different alarm grades. The method establishes a supervised two-layer stacking two-class machine learning model to realize the detection of the operation of the malicious software from the perspective of the memory characteristics; the stacked structure of the two layers of models improves the evaluation index of the models and pays more attention to recall effects of malicious software; the hierarchical alarm is convenient for technical staff to monitor and check problems.

Description

Malicious software detection system and method based on memory characteristics

Technical Field

The invention belongs to the technical field of computer host security, and particularly relates to a system and a method for detecting malicious software based on memory characteristics.

Background

With the rapid development of computer-related industries, malware designed to obtain illegal profit is also being continually updated. Although various protective measures, including firewalls, security detection and antivirus software, are used to protect network hosts, malicious software can still hide itself, bypass these defenses by various technical means, and finally launch an attack on the host.

Although malware can evade other detection software, if it runs on a host it necessarily leaves traces in the host's memory, so studying memory characteristics can help to discover and detect running malware.

However, malware is highly varied, and summarizing detection rules manually involves too much work. Drawing on the development of artificial intelligence technology, the invention acquires the behavioral data features that malicious software leaves in memory while it runs, performs supervised learning on them with a machine learning model, and, once the model is trained, runs inference on the system memory features to detect whether malicious software is running in the host system. The system of the invention can thus discover running malware from the perspective of memory characteristics and further ensure the safety of the host.

Chinese patent CN101989322B discloses a method and system for automatically extracting the memory characteristics of malicious code. The method comprises: running malicious code and performing a memory dump of the newly generated thread information to generate dump files; performing association analysis on the dump files and grouping them; and extracting features from the grouped dump files and testing them. The system comprises: a memory dump module, used for running malicious code and performing a memory dump of the newly generated thread information to generate dump files; an association analysis module, used for performing association analysis on the dump files and grouping them; and a feature extraction and test module, used for extracting features from the grouped dump files and testing them. The whole scheme of that invention is an automatic process that needs no manual participation and takes threads as the basic processing objects, so that fine-grained memory characteristics are extracted more accurately and comprehensively without depending on the experience of analysts, and the memory characteristics finally obtained have a low false negative rate and an extremely low false alarm rate.

In that method, after the malicious code is run, the newly generated threads are dumped, the threads are grouped by similarity analysis, and the memory characteristics of the threads are then extracted. Although the process is automatic, extracting memory features whose similarity exceeds a threshold introduces errors in judging the degree of similarity. In addition, because the thread is the unit of granularity for memory extraction, the number of features is large and a large amount of manual rule intervention is needed when judging feature similarity, so the workload is heavy and management is difficult.

Chinese patent CN114692156B discloses a method and system for detecting malicious code in memory segments. The method comprises: acquiring a memory file to be detected; sequentially performing binary conversion and word segmentation preprocessing on the memory file, and then intercepting a segment based on the optimal combination of segment position and length to obtain a prediction segment; and inputting the prediction segment into an optimal neural network model, which detects whether malicious code has been implanted in the memory file. In the neural network model, an embedding layer raises the dimension of the input prediction segment, convolution layers with different kernel sizes convolve it, the result is pooled, and it is finally passed through a flattening layer and a fully connected layer into the classifier. By learning the latent rules and features of malicious code, the method detects both previously undiscovered viruses and existing viruses.

That method performs binary conversion and word segmentation on the acquired memory code segments and then models and predicts the memory content as natural language data. Its drawbacks are that an optimal memory segment must be searched for, that the data volume is large because modeling takes place after the memory segment is converted into natural language sequence data, and that the deep learning model used is complex, takes a long time to train, and lacks feature interpretability.

Disclosure of Invention

The invention aims to provide a malicious software detection system and method based on memory characteristics. Memory data of a computer host are collected both while malicious software is running and during normal operation, and memory features are extracted from each. A supervised two-layer stacked two-class machine learning model is then built, which performs two-class prediction on the memory data segments gathered in daily monitoring and issues risk warnings about whether malicious software is running, so that running malware is detected from its memory characteristics.

The aim of the invention can be achieved by the following technical scheme:

In a first aspect, an embodiment of the present application provides a malware detection system based on memory features, including a memory acquisition module, a feature processing module, a sample generation module, an offline training module, and a real-time prediction alarm module that are sequentially connected in a communication manner;

The memory acquisition module is used for acquiring memory data when the host operates;

The feature processing module is used for performing feature engineering processing on the acquired memory data to obtain corresponding memory features;

The sample generation module is used for acquiring various malicious software, running the malicious software and normal software in the host, acquiring first memory data through the memory acquisition module and acquiring first memory characteristics through the characteristic processing module; operating the normal software in the host, acquiring second memory data through the memory acquisition module and acquiring second memory characteristics through the characteristic processing module;

the off-line training module is used for dividing a sample data set, taking an F1 score as an evaluation index, training and generating a two-class machine learning model;

The real-time prediction alarm module is used for applying the trained two-class machine learning model to judge whether the host in monitoring is running with the malicious software or not, and dividing different alarm grades according to the probability value judged by the two-class machine learning model;

the system adopts a supervised machine learning algorithm to perform two classification learning tasks, wherein the malicious software and the normal software which are operated in the host are used as first class labels, and the normal software which is operated in the host is used as second class labels;

Wherein the feature engineering process comprises: extracting basic features and carrying out statistical treatment on part of the basic features;

and taking the first memory feature and the second memory feature as memory sample data of the offline training module.

Preferably, the two-class machine learning model is a two-layer stacked two-class machine learning model generated with a model stacking technique using a two-layer stacking scheme; wherein the candidate single models comprise: logistic regression, naive Bayes, support vector machines, random forests, GBDT, XGBoost, and LightGBM.

Preferably, in the two-layer stacked two-classification machine learning model, a first layer model selects a plurality of the single models to be combined, and a second layer model selects one of the single models; and taking the predicted probability value output by the first layer model as the input characteristic of the second layer model.

Preferably, the first-layer single model combination of the two-layer stacked two-class machine learning model adopts: XGBoost, a support vector machine, a random forest, LightGBM, and naive Bayes; and the second-layer single model of the two-layer stacked two-class machine learning model adopts logistic regression.
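The two-layer stacking idea described above can be sketched in a few lines of dependency-free Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: small logistic regressions trained on different feature subsets stand in for the five first-layer models (XGBoost, support vector machine, random forest, LightGBM, naive Bayes), the toy feature values are invented, and the output is assumed to be the probability of label 1 (normal software only). All function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Probability that sample x has label 1; w = [bias, w1, w2, ...]."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def layer1_probs(layer1, x):
    """Probability output of every first-layer model for one sample."""
    return [predict(w, [x[c] for c in cols]) for cols, w in layer1]

def fit_stack(X, y, feature_subsets):
    # Layer 1: one stand-in model per feature subset (assumption: the real
    # system would train five different model types here).
    layer1 = [(cols, fit_logistic([[x[c] for c in cols] for x in X], y))
              for cols in feature_subsets]
    # The first-layer probabilities become the second-layer input features.
    meta_X = [layer1_probs(layer1, x) for x in X]
    layer2 = fit_logistic(meta_X, y)  # second layer: logistic regression
    return layer1, layer2

def predict_stack(model, x):
    layer1, layer2 = model
    return predict(layer2, layer1_probs(layer1, x))

# Invented toy data: label 0 = malware running, label 1 = normal software only.
X = [[0.9, 0.8], [0.8, 0.9], [1.0, 0.7], [0.7, 1.0],
     [0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.3, 0.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_stack(X, y, feature_subsets=[[0], [1]])
print(predict_stack(model, [0.95, 0.9]), predict_stack(model, [0.05, 0.1]))
```

For brevity the second layer here is fitted on the first layer's predictions for its own training data; stacking implementations commonly use out-of-fold predictions instead to limit overfitting, a detail the text does not specify.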

Preferably, the memory acquisition module acquires the memory data by adopting LiME software; the feature processing module performs the feature engineering processing on the memory data based on Volatility software; the statistical processing includes: statistical summation, statistical quotient and statistical averaging.

Preferably, the memory features include: malware discovery class features, module injection class features, handle class features, process class features and interface hook class features; the malware includes: Trojan software, virus software and ransomware.

Preferably, the offline training module comprises a data set dividing unit, a model evaluation unit, a model selection unit and a model storage unit which are sequentially connected in a communication manner;

The data set dividing unit is used for dividing the memory sample data into a training data set and a test data set; wherein the training data set: test dataset = 4:1;

the model evaluation unit is used for adopting the F1 score as an evaluation index of the classification task;

The model selection unit is used for carrying out the two-class machine learning task by adopting a plurality of different models, observing the difference of the F1 scores among the different models, and selecting a model according to test data;

The model storage unit is used for storing the trained two-classification machine learning model as a binary file;

The memory sample data operated by the malicious software is used as a positive class and marked as 0; taking the memory sample data of the normal software operation as a negative class, and marking the negative class as 1;

wherein the F1 score is formulated as: F1-score = 2 × precision × recall / (precision + recall);

wherein precision = number of samples correctly predicted as the positive class / number of samples predicted as the positive class; recall = number of samples correctly predicted as the positive class / number of all positive-class samples.

Preferably, a harmonic coefficient b is added to the formula of the F1 score to adjust the weights of recall and precision, and the formula of the F1 score is adjusted to: F1-score_b = (1 + b × b) × precision × recall / (b × b × precision + recall).
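As a worked illustration of the evaluation index, the sketch below computes precision, recall and the adjusted score for a small invented prediction list. The positive class is label 0 (malware), as stated above. The function names and the value b = 2 are assumptions for illustration only; the text does not state which value of b is used. With b greater than 1 the score weights recall more heavily than precision.

```python
def precision_recall(y_true, y_pred, positive=0):
    """Precision and recall for the positive class (label 0 = malware here)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    predicted_pos = sum(1 for p in y_pred if p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall

def f_beta(precision, recall, b=1.0):
    """F1 score with harmonic coefficient b; b = 1 gives the plain F1 score."""
    denom = b * b * precision + recall
    return (1 + b * b) * precision * recall / denom if denom else 0.0

# Invented example: 4 malware samples (label 0) and 4 normal samples (label 1).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall, f_beta(precision, recall), f_beta(precision, recall, b=2))
```

In this example recall is higher than precision, so the score with b = 2 comes out higher than the plain F1 score.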

Preferably, the real-time prediction alarm module comprises a real-time memory acquisition unit, a real-time feature processing unit, a model real-time analysis unit and a grading alarm unit which are sequentially in communication connection;

The real-time memory acquisition unit is used for acquiring and storing real-time memory data in real time through the memory acquisition module; wherein, the acquisition interval for acquiring the real-time memory data is one minute;

the real-time feature processing unit is used for extracting the features of the acquired real-time memory data through the feature processing module and carrying out statistical processing to acquire corresponding real-time memory features;

The model real-time analysis unit is used for calling the two-layer stacking two-class machine learning model, inputting the real-time memory characteristics as a model sample and outputting a probability value; wherein the probability value is between 0 and 1;

the grading alarm unit is used for distinguishing different risk levels according to the size of the probability value and giving an alarm;

If the probability value is smaller than 0.3, the risk is the highest risk; if the probability value is greater than 0.3 and less than 0.6, the risk is a medium-level risk; if the probability value is greater than 0.6, then it is low-level/risk-free.
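The grading rule can be written as a small function. This is a sketch under two assumptions: the probability is that of label 1 (normal software only), which is why low values mean high risk, and the boundary values 0.3 and 0.6, which the text leaves unassigned, are placed in the lower-risk bucket. The function name and level strings are hypothetical.

```python
def risk_level(prob):
    """Map the model's output probability to an alarm level."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if prob < 0.3:
        return "high"
    if prob < 0.6:   # assumption: exactly 0.3 is treated as medium
        return "medium"
    return "low"     # assumption: exactly 0.6 is treated as low / risk-free

for p in (0.1, 0.45, 0.9):
    print(p, risk_level(p))
```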

In a second aspect, an embodiment of the present application provides a method for detecting malware based on memory features, including the following steps:

s1, collecting memory data of a host in operation;

S2, performing feature engineering processing on the acquired memory data to obtain corresponding memory features;

S3, acquiring multiple types of malicious software, running the malicious software and normal software in the host, acquiring first memory data through the step S1 and extracting first memory features through the step S2; operating the normal software in the host, collecting second memory data through the step S1 and extracting second memory characteristics through the step S2;

Taking the first memory characteristic and the second memory characteristic as sample data of a model;

s4, dividing the sample data into sample data sets to obtain a training data set and a test data set;

Wherein the training data set: test dataset = 4:1;

S5, training and testing to generate a two-class machine learning model by taking the adjusted F1 score as an evaluation index;

wherein the adjusted F1 score is the F1 score with a harmonic coefficient added;

the two-layer stacking two-class machine learning model is generated by adopting a model stacking technology;

S6, acquiring real-time memory data through the step S1 and acquiring real-time memory characteristics through the step S2; inputting the real-time memory characteristics as real-time sample data into the two-layer stacking two-class machine learning model;

The two-layer stacking two-classification machine learning model is applied to judging whether the host runs the malicious software or not;

s7, dividing different risk levels and alarming according to the probability value output after the two-layer stacking two-classification machine learning model is judged;
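The 4:1 division of step S4 could be sketched as follows. The shuffling, the fixed seed and the function name are assumptions added for illustration; the text states only the ratio.

```python
import random

def split_samples(samples, train_parts=4, test_parts=1, seed=0):
    """Shuffle the samples and split them into training and test sets (4:1 by default)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # assumption: random split, fixed seed
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]

# Invented placeholder samples: (feature vector, label) pairs.
samples = [([float(i)], i % 2) for i in range(100)]
train, test = split_samples(samples)
print(len(train), len(test))
```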

The beneficial effects of the invention are as follows:

(1) The method mainly uses, as feature sample data for the model, the memory features that differ depending on whether malicious software is running on a Linux system. Because any software must be loaded into system memory in order to run, running malware can be effectively identified from the perspective of memory feature data.

(2) The invention uses memory feature data generated with Volatility software, together with statistical features derived from them, as input data for the two-class machine learning model. The underlying memory data are collected by the LiME memory collection software, and discriminative memory features are generated by statistical processing for use in the subsequent two-class machine learning model, wherein the statistical processing comprises statistical summation, statistical quotient, statistical averaging and the like.

(3) According to the invention, after a plurality of malicious software samples are collected and run on an experimental system, corresponding memory data are collected, corresponding memory features are extracted, and the memory features are used as learning samples of a two-class machine learning model to perform offline training and testing. The learning samples are marked in the offline training module (the memory samples of the malicious software are used as positive classes and marked as 0, and the memory samples without the malicious software are used as negative classes and marked as 1), so that the performance index of the classification model can be greatly improved by the explicitly marked samples.

(4) The model uses a two-layer model stacking scheme, forming the two-layer stacked two-class machine learning model. The first layer uses a combination of five independent models: XGBoost, a support vector machine, a random forest, LightGBM and naive Bayes; the second layer uses logistic regression. For each piece of sample data, each first-layer model outputs a probability value between 0 and 1; these probability values are then used as the input features of the second-layer model, which outputs the two-class probability value as the output of the whole model. Adopting this model structure greatly improves the evaluation index of the model.

(5) Because malicious software is harmful, the two-class machine learning model should give priority to a high recall rate and find as many of the malware memory samples as possible. A harmonic coefficient b is therefore added to the formula of the F1 score to adjust the weights of recall and precision, increasing the weight of recall so that a better recall effect is obtained. Evaluating the model with this adjusted F1 score improves the recall of malicious software, so that as many malware memory samples as possible are found.

(6) The real-time prediction alarm module uses a two-layer stacked two-class machine learning model which is trained offline to perform feature processing on memory data acquired in real time, and takes the obtained real-time memory features as sample data of the model to perform subsequent sample prediction; and different alarm risk levels are set according to the probability value dividing range of the prediction result, so that the technical staff can monitor and check the problems conveniently.

Drawings

For a better understanding and implementation, the technical solution of the present application is described in detail below with reference to the accompanying drawings.

FIG. 1 is a schematic structural diagram of a malware detection system based on memory features according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of an offline training module according to an embodiment of the present application;

FIG. 3 is a diagram of a two-layer stacked model according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a real-time prediction alarm module according to an embodiment of the present application;

FIG. 5 is a flowchart of steps of a method for detecting malware based on memory features according to an embodiment of the present application.

Detailed Description

For further explanation of the technical means and effects adopted by the present application for achieving the intended purpose, exemplary embodiments will be described in detail herein, examples of which are shown in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of methods and systems that are consistent with aspects of the application as detailed in the accompanying claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to any or all possible combinations including one or more of the associated listed items.

The following detailed description of specific embodiments, features and effects according to the present invention is provided with reference to the accompanying drawings and preferred embodiments.

Example 1

As shown in fig. 1, the invention provides a malicious software detection system based on memory characteristics, which comprises a memory acquisition module, a characteristic processing module, a sample generation module, an offline training module and a real-time prediction alarm module which are sequentially connected in a communication way;

Specifically, the invention uses LiME software to collect the memory of the host while it is running, and performs feature engineering on the collected memory data to obtain the corresponding memory features. Various malicious software is obtained from public network resources; the malicious software and normal software are run in the host, first memory data are collected through the memory acquisition module, and first memory features are obtained through the feature processing module. Then only normal software is run in the host, second memory data are collected through the memory acquisition module, and second memory features are obtained through the feature processing module. The first and second memory features are used as the memory sample data for training in the offline training module. Because the invention performs the two-class learning task with a supervised machine learning algorithm, the case in which malicious software and normal software run in the host is used as the first class label, and the case in which only normal software runs in the host is used as the second class label, which yields a class label indicating whether malicious software is running together with the corresponding memory features. The offline training module divides the sample data set and trains the two-layer stacked two-class machine learning model, using the F1 score adjusted by the harmonic coefficient as the evaluation index. Finally, the trained model is used to judge whether a monitored host is running malicious software, and different alarm levels are assigned according to the probability value output by the model, which helps technicians monitor the host and troubleshoot problems.

The method mainly uses, as feature sample data for the model, the memory features that differ depending on whether malicious software is running on a Linux system. Because any software must be loaded into system memory in order to run, running malware can be effectively identified from the perspective of memory feature data.

The above modules and the contents will be described in detail.

Regarding the memory acquisition module, the module uses LiME (Linux Memory Extractor), memory image extraction software for the Linux system, to extract the memory data. Because the system runs on Linux, LiME is selected as the data acquisition software of the memory acquisition module. LiME is a loadable kernel module that can capture the volatile memory of systems that use Linux as their operating system or are developed on the basis of Linux. The module uses LiME to collect the memory data of the Linux system in raw format (that is, a dump containing all physical memory segments).

Regarding the feature processing module, feature engineering based on Volatility software is performed on the raw-format memory data file obtained by the memory acquisition module, and five broad categories of memory features, comprising twenty-six memory features in total, are extracted from the file. The five broad categories are: malware discovery class features, module injection class features, handle class features, process class features, and interface hook class features.

Specifically, the malware discovery class features are mainly used for detecting possibly malicious executables, and comprise the following memory features: the total amount of virtual memory used, the total amount of process protection, and the total amount of unique code injection.

The module injection class features mainly relate to code injection behavior, and comprise the following memory features: the average number of missing modules per process in the to-be-loaded list; the average number of missing modules per process in the initialization list; and the average number of missing modules per process in the memory list.

The handle class features describe the handles found in memory, and comprise the following memory features: the total number of interface handles, document handles, event handles, desktop handles, key handles, thread handles, directory handles, beacon handles, timer handles, session control handles, and mutant handles.

The process class features can be used for searching possible malicious software related processes in a process list, and the memory features are as follows: process list error rate = number of errors in process list/number of processes of process list; error rate of process pool = number of errors of process pool/number of processes of process list; error rate of thread = number of errors of thread/number of processes of process list; error rate of process id = number of errors of process id/number of processes of process list; error rate of session control = number of errors of session control/number of processes of process list; desktop thread error rate = number of desktop thread errors/number of processes of the process list.

The interface hook class features indicate the number of interface hooks in memory, and comprise the following memory features: the total number of interface hooks, the number of interface hooks of type in_line, and the total number of interface hooks in user mode.

In one embodiment of the present invention, the memory acquisition module acquires memory data using LiME software; the feature processing module performs feature engineering processing on the memory data based on Volatility software; the statistical processing comprises the following steps: statistical summation, statistical quotient and statistical averaging.

The twenty-six memory features are obtained through secondary development based on Volatility software. In this embodiment, Volatility extracts basic features from the collected memory data, and statistical processing such as summation, quotient calculation and averaging is then applied to the basic features to obtain the memory features required by the subsequent model. It should be noted that some of the features generated by Volatility during the secondary development are used directly as basic features, such as the three memory features of the malware discovery class. The remaining features need subsequent statistical processing before they can serve as model features, such as the averaged features of the module injection class, the summed (total) features of the handle class and interface hook class, and the quotient (error rate) features of the process class.
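The three kinds of statistical processing can be illustrated with one derived feature of each kind. This is a minimal sketch: the dictionary keys and the numbers are invented for illustration and do not correspond to actual Volatility plug-in output fields.

```python
def derive_features(basic):
    """Turn per-process basic features into summed, averaged and quotient features."""
    n = basic["process_count"]
    if n <= 0:
        raise ValueError("process_count must be positive")
    return {
        # statistical summation: total number of handles over all processes
        "handles_total": sum(basic["handles_per_process"]),
        # statistical averaging: average number of missing modules per process
        "missing_modules_avg": sum(basic["missing_modules_per_process"]) / n,
        # statistical quotient: errors in the process list / number of processes
        "process_list_error_rate": basic["process_list_errors"] / n,
    }

# Invented basic features for a host with four processes.
basic = {
    "process_count": 4,
    "handles_per_process": [10, 20, 30, 40],
    "missing_modules_per_process": [0, 1, 0, 1],
    "process_list_errors": 1,
}
features = derive_features(basic)
print(features)
```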

Volatility is an open source memory forensics framework that analyzes an exported memory image: by parsing kernel data structures, its plug-ins obtain details of the memory and the running state of the system.

The invention uses memory feature data generated with Volatility software, together with statistical features derived from them, as input data for the two-class machine learning model. The underlying memory data are collected by the LiME memory collection software, and discriminative memory features are generated by statistical processing for use in the subsequent two-class machine learning model, wherein the statistical processing comprises statistical summation, statistical quotient, statistical averaging and the like.

Regarding the sample generation module, because a supervised machine learning algorithm is used for the classification task, the case in which malicious software and normal software run in the system is used as the first class label, the case in which only normal software runs is used as the second class label, and the memory features under the two labels are acquired separately.

In one embodiment provided by the present invention, the memory features include: malware discovery class features, module injection class features, handle class features, process class features and interface hook class features; the malware includes: Trojan software, virus software and ransomware.

First, multiple malware programs that run on Linux are obtained from public resources on the network. Because some relatively old malware targets old system versions or old system vulnerabilities and no longer threatens newer Linux systems, the sample collection selects malware targeting Linux systems that was discovered within the last decade. Three major classes were collected: Trojans, viruses and ransomware. The Trojans include the following families: Zeus, Emotet, Refroso, Scar, Reconyc; the viruses include the following families: 180Solutions, CoolWebSearch, Gator, Transponder, TIBS; the ransomware includes the following families: Conti, MAZE, Shade. Each family has about 100 to 200 samples of different versions, so about 2500 malware samples were collected in total.

Second, each malware sample is run on its own in the Linux system, while some normal Linux software is run at the same time to simulate actual use. Normal software here means commonly used, normally functioning software that is not malware and does not cause malicious damage to the host. After each malware sample starts running, LiME software is used to collect memory data once every 10 seconds, 10 times in total, which yields 10 memory feature samples of that malware in actual operation. Performing these 10 memory acquisitions for all malware versions yields more than 25000 memory samples of malware running in the Linux system, which is a sufficient number.

Meanwhile, memory data of the system when no malware is running (in the invention, when only normal software is running) is needed as the other class of sample data. In this case, one or more commonly used programs are run in the Linux system to simulate actual use. Memory data is likewise captured with LiME software once every 10 seconds, 20 times in total. The number of samples obtained during normal operation is small compared with the number obtained under the first class of labels, where the system runs the various malware. Therefore, to balance the two classes, the number of normal samples is increased both by collecting more normal memory samples and by applying the SMOTE (Synthetic Minority Oversampling Technique) algorithm, which adds non-duplicate synthetic feature samples as a supplement. Finally, 13000 samples of the Linux system's memory state without malware running are obtained; their ratio to the memory samples with malware running is about 1 to 2, which effectively avoids the sample imbalance problem.

SMOTE is an oversampling method that improves on random oversampling. Random oversampling can balance the sample set, but it brings problems: it copies minority-class samples many times, which enlarges the data set, increases the complexity of model training, and easily causes overfitting. For this reason, oversampling usually does not simply copy samples but generates new samples by some method. For each sample x of the minority class, the SMOTE algorithm randomly selects one sample y from the K nearest neighbors of x, and then randomly selects a point on the line segment between x and y as a newly synthesized sample. Synthesizing new samples in this way can reduce the risk of overfitting.
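The interpolation step described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the implementation used in the invention: the function name `smote`, the default `k=5`, and the fixed random seed are all illustrative choices, and Euclidean distance is assumed for the nearest-neighbor search.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from a list of minority-class rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # K nearest neighbors of x among the other minority samples
        others = [s for s in minority if s is not x]
        neighbors = sorted(others, key=lambda s: math.dist(x, s))[:k]
        y = rng.choice(neighbors)
        t = rng.random()  # position on the segment from x to y
        synthetic.append([xi + t * (yi - xi) for xi, yi in zip(x, y)])
    return synthetic


minority_rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_rows = smote(minority_rows, n_new=10, k=2)
```

Because every synthetic row lies on a segment between two existing minority rows, the new samples stay inside the region already occupied by the minority class instead of duplicating existing points.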

Finally, the twenty-six memory features in the five major categories obtained in the feature processing module are extracted for all samples, and the resulting data are stored in csv file format.

It should be noted that, in the sample generation module, the memory data is collected by the memory collection module, and the memory feature is extracted by the feature processing module. The memory data collected in the sample generating module are the first memory data and the second memory data, and the extracted memory features are the first memory features and the second memory features.

In the invention, after multiple malware samples are collected and run on an experimental system, the corresponding memory data are collected and the corresponding memory features are extracted; these memory features serve as learning samples for offline training and testing of the two-class machine learning model. The learning samples are labeled in the offline training module (memory samples with malware running are the positive class and are labeled 0; memory samples without malware running are the negative class and are labeled 1). Explicitly labeled samples can greatly improve the performance indexes of the classification model.

Regarding the offline training module, as shown in fig. 2, in one embodiment provided by the present invention, it includes a data set dividing unit, a model evaluation unit, a model selection unit, and a model saving unit that are sequentially connected in communication;

Specifically:

In the data set dividing unit, the offline training module uses 80% of the collected sample data as the training data set and 20% as the test data set. To prevent random sampling from leaving too few samples of some class of malware, the split is applied per malware sample: 80% of the memory samples of each malware sample are placed in the training data set and 20% in the test data set. After division, 20000 memory samples with malware running and 10400 memory samples without malware running form the two classes of the training set and are used to train the machine learning model. The remaining 5000 memory samples with malware running and 2600 memory samples without malware running form the two classes of the test set and are used to test how well the trained model performs.
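The per-sample 80/20 division described above can be sketched as follows. This is a minimal illustration: the function name `split_per_group`, the `(group_id, feature_row)` tuple layout, and the fixed seed are assumptions, where `group_id` stands for whatever identifies one malware sample (or the normal-software class).

```python
import random
from collections import defaultdict


def split_per_group(samples, train_fraction=0.8, seed=0):
    """Split (group_id, feature_row) pairs so every group is divided 80/20."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for group_id, row in samples:
        by_group[group_id].append(row)
    train, test = [], []
    for group_id, rows in by_group.items():
        rng.shuffle(rows)  # random order inside the group only
        cut = round(len(rows) * train_fraction)
        train.extend((group_id, row) for row in rows[:cut])
        test.extend((group_id, row) for row in rows[cut:])
    return train, test


# Three hypothetical malware samples with 10 memory acquisitions each
samples = [(g, [i]) for g in ("sample_a", "sample_b", "sample_c") for i in range(10)]
train_set, test_set = split_per_group(samples)
```

With 10 acquisitions per malware sample, each one contributes 8 rows to the training set and 2 to the test set, so no malware sample is missing from either set.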

Further, a harmonic coefficient b is added to the formula of the F1 score to adjust the recall weight and the precision weight, and the formula of the F1 score becomes: F1-score_b = (1 + b × b) × precision × recall / (b × b × precision + recall).

In the model evaluation unit, the harmonized F1 score (F1-score_b) is used as the evaluation index of the classification task. Memory samples with malware running are taken as the positive class and labeled 0, and memory samples with only normal software (no malware) running are taken as the negative class and labeled 1.

Wherein F1-score = 2 × precision × recall / (precision + recall);

precision = number of samples predicted as positive that are actually positive / number of samples predicted as positive;

recall = number of samples predicted as positive that are actually positive / number of samples of all positive classes;

In the current application scenario, because malware causes great damage, the two-class model should give priority to a high recall and find all malware memory samples as far as possible. Therefore, a harmonic coefficient b is added to the F1 score formula to adjust the weights of recall and precision and to increase the weight of recall, and the F1 score formula becomes:

F1-score_b = (1 + b × b) × precision × recall / (b × b × precision + recall)

When b is greater than 1, recall is weighted more heavily than precision. Based on experience and actual test conditions, b is set to 2:

F1-score_b = 5 × precision × recall / (4 × precision + recall).
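As a worked check of the formulas above, the following minimal sketch computes precision, recall and F1-score_b from label lists. The function names are illustrative assumptions; the positive class is label 0 (malware running), following the labeling convention of the invention.

```python
def precision_recall(y_true, y_pred, positive=0):
    """Precision and recall for the given positive label (0 = malware running)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    predicted_pos = sum(1 for p in y_pred if p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall


def f1_score_b(precision, recall, b=2.0):
    """F1-score_b = (1 + b*b) * P * R / (b*b * P + R)."""
    denominator = b * b * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b * b) * precision * recall / denominator


high_recall = f1_score_b(precision=0.5, recall=1.0)     # 2.5 / 3.0
high_precision = f1_score_b(precision=1.0, recall=0.5)  # 2.5 / 4.5
```

With b = 2, a model with recall 1.0 and precision 0.5 scores about 0.83, while a model with precision 1.0 and recall 0.5 scores about 0.56, which shows how the index favors finding all malware samples over avoiding false alarms.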

In the invention, because malware is highly harmful, the two-class machine learning model should give priority to a high recall and find all malware memory samples as far as possible. The harmonic coefficient b added to the F1 score formula adjusts the weights of recall and precision and increases the weight of recall, so that a better recall effect is obtained. Evaluating the model with this harmonized F1 score makes the recall of malware better, so that the memory samples of all malware can be found as far as possible.

In one embodiment provided by the invention, the two-class machine learning model adopts a model stacking technique, and a two-layer model stacking scheme is used to generate the two-layer stacked two-class machine learning model; wherein the single models include: logistic regression, naive Bayes, support vector machine, random forest, GBDT, XGBoost, and LightGBM.

Further, in the two-layer stacked two-classification machine learning model, a first layer model selects a plurality of single models to be combined, and a second layer model selects one single model; and taking the predicted probability value output by the first layer model as the input characteristic of the second layer model.

Further, the first layer of the two-layer stacked two-class machine learning model adopts the following combination of single models: XGBoost, support vector machine, random forest, LightGBM, and naive Bayes; the second layer of the two-layer stacked two-class machine learning model adopts a single logistic regression model.

In the model selection unit, each single model is first used on its own for the two-class machine learning task, and the differences in F1-score_b between the models are observed. The results are shown in Table 1 below:

Table 1 F1-score_b of the different single models

Single model	F1-score_b
Logistic regression	0.91
Naive Bayes	0.85
Support vector machine	0.91
Random forest	0.92
GBDT	0.94
XGBoost	0.94
LightGBM	0.95

The F1-score_b of most single models exceeds 0.9 (naive Bayes, at 0.85, is the exception). On this basis, model stacking is considered in order to further improve classification performance. A two-layer model stacking scheme is used: the first layer uses a selection of several single models, each of which represents its output for a piece of sample data as a probability value between 0 and 1; the probability values predicted for the samples are taken as the input features of the second-layer model; the second layer is trained with one single model and finally outputs the two-class prediction result. The test data results are shown in Table 2 below:

Table 2 test data results for two-layer model stacking

Through comparison, the F1-score_b is highest, reaching 0.99, when the first layer combines XGBoost, support vector machine, random forest, LightGBM, and naive Bayes, the second layer uses logistic regression, and the two layers are stacked. The recall is 1, that is, the model successfully finds all malware memory samples in the test data set. The finally retained model therefore adopts this scheme, namely the two-layer stacked two-class machine learning model.

As shown in FIG. 3, the present invention provides a block diagram of the two-layer stacked two-class machine learning model. The first layer of the model combines the single models XGBoost, support vector machine, random forest, LightGBM, and naive Bayes, and the second layer uses a logistic regression model. The memory statistical feature data are input into the model: the XGBoost model outputs the XGBoost predicted probability value, the support vector machine model outputs the support vector machine predicted probability value, the random forest model outputs the random forest predicted probability value, the LightGBM model outputs the LightGBM predicted probability value, and the naive Bayes model outputs the naive Bayes predicted probability value. The predicted probability values output by these five first-layer models are then taken as the input of the second layer, that is, the logistic regression model is trained on them and finally outputs the logistic regression predicted probability value. This logistic regression predicted probability value is also the final predicted probability value output by the whole two-layer stacked model.
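The data flow of the two-layer structure can be sketched with standard-library Python. This is a toy illustration under loud assumptions: the first-layer models here are hypothetical fixed functions standing in for trained XGBoost, support vector machine, random forest, LightGBM and naive Bayes models; only three stand-ins are used instead of five; the data are synthetic; and the logistic regression is a plain gradient-descent version written for this sketch. The source also does not state how the first-layer predictions used for training the second layer are produced (for example, whether out-of-fold predictions are used), so that step is not modeled.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_base_model(feature_index, threshold, scale):
    """Hypothetical stand-in for one trained first-layer model.

    Returns a function mapping a feature row to P(label == 1), i.e. the
    probability that no malware is running under the labeling used here.
    """
    def predict_proba(row):
        return sigmoid(scale * (row[feature_index] - threshold))
    return predict_proba


def first_layer(models, row):
    """One predicted probability per first-layer model."""
    return [model(row) for model in models]


def fit_logistic_regression(rows, labels, epochs=500, lr=0.5):
    """Second layer: batch gradient descent on the log loss."""
    weights, bias, n = [0.0] * len(rows[0]), 0.0, len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for row, label in zip(rows, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, row))) - label
            grad_w = [g + error * v for g, v in zip(grad_w, row)]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def stacked_predict(models, weights, bias, row):
    """Feed first-layer probabilities into the second-layer logistic regression."""
    meta = first_layer(models, row)
    return sigmoid(bias + sum(w * v for w, v in zip(weights, meta)))


# Synthetic two-feature data: label 0 = malware running, label 1 = no malware
rng = random.Random(0)
rows = [[rng.gauss(2, 0.5), rng.gauss(2, 0.5)] for _ in range(50)]
rows += [[rng.gauss(5, 0.5), rng.gauss(5, 0.5)] for _ in range(50)]
labels = [0] * 50 + [1] * 50

models = [make_base_model(0, 3.5, 2.0), make_base_model(1, 3.5, 2.0),
          make_base_model(0, 3.0, 1.0)]
meta_rows = [first_layer(models, row) for row in rows]
weights, bias = fit_logistic_regression(meta_rows, labels)
```

Calling `stacked_predict(models, weights, bias, row)` then returns a single probability per sample, which plays the role of the final output of the whole stacked model.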

It should be noted that GBDT (Gradient Boosting Decision Tree) is a gradient boosting decision tree: its basic structure is a forest formed by decision trees, and its learning method is gradient boosting. LightGBM (Light Gradient Boosting Machine) is a lightweight, open-source gradient boosting framework, one of the frameworks that implement the GBDT algorithm, and it supports efficient parallel training. XGBoost (eXtreme Gradient Boosting) is an extreme gradient boosting tree, which is one implementation of the boosting algorithm.

Regarding the model saving unit, the trained two-layer stacked two-class machine learning model is saved as a binary file, and a script program in the real-time prediction alarm module calls this model file to perform prediction analysis on the memory data acquired in real time.
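Saving a trained model as a binary file and loading it again from a script can be sketched as below. The source does not name the serialization format, so the use of Python's `pickle` module, the file name `stacked_model.bin`, and the dictionary used as a stand-in for the trained model are all assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize the model object to a binary file."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)


def load_model(path):
    """Load a model object back from a binary file (trusted files only)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Stand-in for the trained two-layer stacked model (hypothetical content)
stand_in_model = {"weights": [1.5, -0.5], "bias": 0.25}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "stacked_model.bin"
    save_model(stand_in_model, model_path)
    restored_model = load_model(model_path)
```

Since unpickling can execute arbitrary code, a prediction script built this way should only load model files that it produced itself or otherwise trusts.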

The two-class machine learning model uses a two-layer model stacking scheme, forming the two-layer stacked two-class machine learning model. The first layer combines five single models: XGBoost, support vector machine, random forest, LightGBM, and naive Bayes; the second layer uses logistic regression. For each piece of sample data, the output of each first-layer model is a probability value between 0 and 1; these probability values are used as the input features of the second-layer model, which finally outputs the two-class probability value as the output of the whole model. Adopting this model structure greatly improves the evaluation indexes of the model.

The real-time prediction alarm module acquires real-time memory data of the working Linux system, performs statistical processing on the data to obtain memory feature data, and stores the memory feature data as a csv file. The two-layer stacked two-class machine learning model obtained in the offline training module then gives, for each memory acquisition, the probability that malware is running, and alarms are raised at different risk levels according to the output probability value.

As shown in fig. 4, in one embodiment provided by the present invention, the real-time prediction alarm module includes a real-time memory acquisition unit, a real-time feature processing unit, a model real-time analysis unit and a hierarchical alarm unit that are sequentially connected in communication;

The real-time feature processing unit is used for extracting features of the collected real-time memory data through the feature processing module and carrying out statistical processing to obtain corresponding real-time memory features;

the model real-time analysis unit is used for calling a two-layer stacking two-class machine learning model, inputting real-time memory characteristics as model samples and outputting probability values; wherein the probability value is between 0 and 1;

The hierarchical alarm unit is used for distinguishing different risk levels according to the size of the probability value and giving an alarm;

The following will specifically describe the content of the real-time prediction alarm module:

First, in the real-time memory acquisition unit: because memory acquisition with LiME software itself occupies a certain amount of system resources, the module uses an acquisition interval of 1 minute to collect and store the memory data files.

Second, in the real-time feature processing unit, the memory data are extracted and statistically processed to obtain the corresponding memory feature data, which are stored as a csv file. After this step is completed, the originally collected memory data file can be deleted to save disk space.

Then, in the model real-time analysis unit, a script calls the offline-trained two-layer stacked two-class machine learning model, inputs the feature data sample into the model, and outputs a probability value between 0 and 1. Because memory samples with malware running are labeled 0 and memory samples without malware running are labeled 1, the model output lies in the range [0,1]: the closer the value is to 0, the more likely it is that malware is running; the closer it is to 1, the less likely.

Finally, in the hierarchical alarm unit, different risk levels are distinguished according to the probability value, and an alarm is raised. According to the test data set, when the model output probability is less than 0.3, the actual labels of the samples are all memory samples with malware running, so a real-time analysis result with model output probability less than 0.3 is output as the highest risk; when the model output probability is greater than 0.3 and less than 0.6, the model makes a small number of misjudgments, so the real-time analysis result is output as medium risk, and these samples are subsequently analyzed manually or by other judging modules; when the model output probability is greater than 0.6, the actual labels of the samples are all memory samples without malware running, so the real-time analysis result is output as low/no risk.
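The threshold rule above maps directly onto a small function. This is a minimal sketch: the function name and the level strings are illustrative, and because the text does not say which level applies at exactly 0.3 or 0.6, assigning both boundary values to the medium level is an assumption.

```python
def risk_level(probability):
    """Map the model output probability to an alarm level.

    Values near 0 mean malware is likely running (label 0 = malware).
    Boundary values 0.3 and 0.6 are treated as medium risk (assumption).
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability < 0.3:
        return "high"
    if probability <= 0.6:
        return "medium"
    return "low/none"


levels = [risk_level(p) for p in (0.1, 0.45, 0.9)]
```

Samples that fall into the medium band would be passed on for manual analysis or to other judging modules, as described above.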

The real-time prediction alarm module performs feature processing on the memory data acquired in real time and feeds the resulting real-time memory features, as sample data, to the offline-trained two-layer stacked two-class machine learning model for prediction; different alarm risk levels are set according to the range into which the predicted probability value falls, which makes it convenient for technicians to monitor the system and troubleshoot problems.

It should be noted that, the real-time prediction alarm module performs real-time prediction by acquiring real-time host memory data and performing feature processing, and then a trained two-layer stacked two-class machine learning model is used, so in the module, the real-time memory acquisition unit performs real-time acquisition of memory data by using the memory acquisition module, and the real-time feature processing unit performs feature processing of the memory data by using the feature processing module.

In summary, the invention has the following beneficial effects:

Example 2

As shown in fig. 5, the present invention provides a method for detecting malicious software based on memory characteristics, which includes the following steps:

S1, collecting memory data of a host in operation;

Wherein the training data set: test dataset = 4:1;

Specifically, LiME software is used to collect the memory while the host is running, and feature engineering processing is performed on the collected memory data to obtain the corresponding memory features. Multiple malware samples are obtained from public network resources. With malware and normal software running in the host, first memory data are collected through step S1 and first memory features are extracted through step S2; with only normal software running in the host, second memory data are collected through step S1 and second memory features are extracted through step S2. The first memory features and second memory features obtained serve as the memory sample data for model training. Because the invention uses a supervised machine learning algorithm for the classification learning task, the case in which malware and normal software run in the host is the first class of labels and the case in which only normal software runs in the host is the second class of labels, which gives the classification label of whether malware is running together with the corresponding memory features. The sample data set is then divided, the harmonized F1 score is used as the evaluation index, and the two-layer stacked two-class machine learning model is trained. Finally, the trained model judges whether the monitored host is running malware, and different alarm levels are assigned according to the probability value the model outputs, which helps technicians monitor the system host and troubleshoot problems.

Because malware is highly harmful, the two-class machine learning model should give priority to a high recall and find all malware memory samples as far as possible. The harmonic coefficient b added to the F1 score formula adjusts the weights of recall and precision and increases the weight of recall, so that a better recall effect is obtained. Evaluating the model with this harmonized F1 score makes the recall of malware better, so that the memory samples of all malware can be found as far as possible.

The method performs feature processing on the memory data acquired in real time and feeds the resulting real-time memory features, as sample data, to the offline-trained two-layer stacked two-class machine learning model for prediction; different alarm risk levels are set according to the range into which the predicted probability value falls, which makes it convenient for technicians to monitor the system and troubleshoot problems.

It should be noted that the method for detecting malware based on memory features provided by the present invention is applied to the system for detecting malware based on memory features provided by the present invention, so for parts of this embodiment that are not described or not described in detail, reference may be made to the related descriptions of other embodiments.

It will be clearly understood by those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional units and modules according to needs. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.

In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.

The present invention is not limited to the above embodiments, but is capable of modification and variation in detail, and other modifications and variations can be made by those skilled in the art without departing from the scope of the present invention.

Claims

1. A malicious software detection system based on memory features is characterized in that: the system comprises a memory acquisition module, a characteristic processing module, a sample generation module, an off-line training module and a real-time prediction alarm module which are sequentially connected in a communication way;

The first memory feature and the second memory feature are used as memory sample data of the offline training module;

the two-layer stacking two-class machine learning model is generated by adopting a model stacking technology and using a two-layer model stacking scheme;

In the two-layer stacked two-classification machine learning model, a plurality of single models are selected for combination by a first layer model, and one single model is selected by a second layer model; taking the predicted probability value output by the first layer model as the input characteristic of the second layer model;

the off-line training module comprises a data set dividing unit, a model evaluation unit, a model selection unit and a model storage unit which are sequentially connected in a communication mode;

the model evaluation unit is used for adopting the F1 score as an evaluation index of the two-class learning task;

The model selection unit is used for carrying out the two classification learning tasks by adopting a plurality of different models, observing the difference of the F1 scores among the different models, and selecting a model according to test data;

2. The memory-feature-based malware detection system of claim 1, wherein: the single model includes: logistic regression, naive Bayes, support vector machine, random forest, GBDT, XGBoost, and LightGBM.

3. The memory-feature-based malware detection system of claim 2, wherein: the first layer single model combination of the two-layer stacked two-classification machine learning model adopts: XGBoost, support vector machine, random forest, LightGBM, and naive Bayes; and a second layer single model of the two-layer stacked two-classification machine learning model adopts logistic regression.

4. The memory-feature-based malware detection system of claim 1, wherein: the memory acquisition module acquires the memory data by adopting LiME software; the feature processing module performs the feature engineering processing on the memory data based on Volatility software; the statistical processing includes: statistical summation, statistical quotient and statistical averaging.

5. The memory-feature-based malware detection system of claim 1, wherein: the memory features include: malware discovery class features, module injection class features, handle class features, process class features and interface hook class features; the malware includes: trojan software, virus software and ransomware.

6. The memory-feature-based malware detection system of claim 1, wherein: a harmonic coefficient b is added into the formula of the F1 score to adjust the recall weight and the precision weight, and the formula of the F1 score is adjusted to be: F1-score_b = (1 + b × b) × precision × recall / (b × b × precision + recall).

7. The memory-feature-based malware detection system of claim 6, wherein: the real-time prediction alarm module comprises a real-time memory acquisition unit, a real-time characteristic processing unit, a model real-time analysis unit and a grading alarm unit which are sequentially connected in a communication mode;

8. A method for detecting malicious software based on memory features, applied to a malicious software detection system based on memory features as claimed in any one of claims 1 to 7, characterized in that: the method comprises the following steps:

S1, collecting memory data of a host in operation;

wherein the training data set: test dataset = 4:1;

S6, acquiring real-time memory data through the step S1 and acquiring real-time memory characteristics through the step S2; inputting the real-time memory characteristics as real-time sample data into the two-layer stacking two-class machine learning model;