CN111581640A - Malicious software detection method, device and equipment and storage medium - Google Patents


Publication number
CN111581640A
Authority
CN
China
Prior art date
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number
CN202010256014.5A
Other languages
Chinese (zh)
Inventor
张若愚
Current Assignee
Beijing Lanyun Technologies Co ltd
Original Assignee
Beijing Lanyun Technologies Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Lanyun Technologies Co ltd
Priority to CN202010256014.5A
Publication of CN111581640A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Abstract

A malicious software detection method, a malicious software detection device, malicious software detection equipment and a storage medium are provided. The malicious software detection method includes the following steps: converting the software to be tested into an image; extracting feature information of the image; and processing the feature information through a pre-trained classifier model to obtain a detection result of the software to be tested. According to the scheme provided by this embodiment, feature extraction is more efficient than manual extraction; the trained classifier model can identify unknown and variant malware, and execution efficiency is higher than that of behavior-based malware detection methods.

Description

Malicious software detection method, device and equipment and storage medium
Technical Field
The present disclosure relates to internet technologies, and in particular, to a malware detection method, apparatus, device, and storage medium.
Background
With the development of the internet, malware has become one of the major threats to network security. Malicious software, also called malicious code or a malicious executable file, refers to a class of software that is installed and executed in a system without authorization to achieve an improper purpose. At present, various novel forms of malicious code, such as backdoors, Trojan horses, worms, and zombie programs, have gradually emerged, posing great challenges to computer and network security, causing great economic losses to enterprises and users, and seriously threatening national information security.
Rapidly and accurately classifying malicious code is one of the keys to preventing it. At present, malicious code classification mainly relies on technicians in the field extracting and analyzing feature codes (signatures) of malware based on experience; when the number of malware samples is large, the analysis process is complicated, efficiency is low, and judgment standards are not uniform.
Disclosure of Invention
The embodiment of the application provides a method, a device and equipment for detecting malicious software and a storage medium, and realizes the detection of the malicious software.
The application provides a malicious software detection method, which comprises the following steps:
converting the software to be tested into an image;
extracting feature information of the image;
and processing the characteristic information through a pre-trained classifier model to obtain a detection result of the software to be detected.
In an exemplary embodiment, the converting the software to be tested into the image includes: and converting the software to be tested into a two-dimensional image by using a binary gray-level image conversion algorithm.
In an exemplary embodiment, the characteristic information includes at least one of: the size of the image, the global gray level mean value of the image, the global gray level standard deviation of the image and the local characteristic information of the image; wherein the local feature information includes feature information of a partial region of the image.
In an exemplary embodiment, the local feature information includes at least one of: the gray mean pmu_k of partition k; the gray standard deviation psi_k of partition k, with k ranging from 1 to K; the mean mean(pmu) of all pmu_k; the standard deviation std(pmu) of all pmu_k; the maximum of all pmu_k; the relative position of the partition in which the maximum of all pmu_k is located; the minimum of all pmu_k; the relative position of the partition in which the minimum of all pmu_k is located; the mean mean(psi) of all psi_k; the standard deviation std(psi) of all psi_k; the maximum of all psi_k; the relative position of the partition in which the maximum of all psi_k is located; the minimum of all psi_k; the relative position of the partition in which the minimum of all psi_k is located; the mean of all pmu_k values less than mean(pmu) - 3*std(pmu); the standard deviation of all pmu_k values less than mean(pmu) - 3*std(pmu); the mean of all pmu_k values greater than mean(pmu) + 3*std(pmu); the standard deviation of all pmu_k values greater than mean(pmu) + 3*std(pmu); the mean of all psi_k values less than mean(psi) - 3*std(psi); the standard deviation of all psi_k values less than mean(psi) - 3*std(psi); the mean of all psi_k values greater than mean(psi) + 3*std(psi); and the standard deviation of all psi_k values greater than mean(psi) + 3*std(psi). Here the image is divided into K partitions from top to bottom with a preset sliding step, each partition comprising n rows, where K > 1 and n is less than the total number of rows of the image.
In an exemplary embodiment, n is 3, and the preset sliding step is one row.
In an exemplary embodiment, the classifier model is generated based on a random forest algorithm.
An embodiment of the present application provides a malware detection apparatus, including:
the training module is configured to train to obtain a classifier model;
the conversion module is configured to convert the software to be tested into an image;
a feature extraction module configured to extract feature information of the image;
and the detection module is configured to process the characteristic information through the classifier model to obtain a detection result of the software to be detected.
In an exemplary embodiment, the characteristic information includes at least one of: the size of the image; the global gray mean of the image; the global gray standard deviation of the image; the gray mean pmu_k of partition k; the gray standard deviation psi_k of partition k, with k ranging from 1 to K; the mean mean(pmu) of all pmu_k; the standard deviation std(pmu) of all pmu_k; the maximum of all pmu_k; the relative position of the partition in which the maximum of all pmu_k is located; the minimum of all pmu_k; the relative position of the partition in which the minimum of all pmu_k is located; the mean mean(psi) of all psi_k; the standard deviation std(psi) of all psi_k; the maximum of all psi_k; the relative position of the partition in which the maximum of all psi_k is located; the minimum of all psi_k; the relative position of the partition in which the minimum of all psi_k is located; the mean of all pmu_k values less than mean(pmu) - 3*std(pmu); the standard deviation of all pmu_k values less than mean(pmu) - 3*std(pmu); the mean of all pmu_k values greater than mean(pmu) + 3*std(pmu); the standard deviation of all pmu_k values greater than mean(pmu) + 3*std(pmu); the mean of all psi_k values less than mean(psi) - 3*std(psi); the standard deviation of all psi_k values less than mean(psi) - 3*std(psi); the mean of all psi_k values greater than mean(psi) + 3*std(psi); and the standard deviation of all psi_k values greater than mean(psi) + 3*std(psi). Here the image is divided into K partitions from top to bottom with a preset sliding step, each partition comprising n rows, where K > 1 and n is less than the total number of rows of the image.
The embodiment of the application provides a malicious software detection device, which comprises a memory and a processor, wherein the memory stores a program, and the program is read and executed by the processor to realize the malicious software detection method.
Embodiments of the present application provide a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the above-mentioned malware detection method.
Compared with the related art, the embodiment of the application provides a malicious software detection method, which comprises the following steps: converting the software to be tested into an image; extracting feature information of the image; and processing the characteristic information through a pre-trained classifier model to obtain a detection result of the software to be detected. In the embodiment, the software to be detected is converted into image data, the features of the image are automatically extracted, and the trained model is used for detecting the software to be detected.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. Other advantages of the present application may be realized and attained by the instrumentalities and combinations particularly pointed out in the specification and the drawings.
Drawings
The accompanying drawings are included to provide an understanding of the present disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the examples serve to explain the principles of the disclosure and not to limit the disclosure.
Fig. 1 is a flowchart of a malware detection method according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating conversion of software to be tested into an image according to an embodiment of the present application;
FIG. 3 is a flowchart of a classifier model training method according to an embodiment of the present application;
fig. 4 is a block diagram of a malware detection apparatus according to an embodiment of the present application;
fig. 5 is a block diagram of a malware detection apparatus according to an embodiment of the present application;
fig. 6 is a block diagram of a computer-readable storage medium provided in an embodiment of the present application.
Detailed Description
The present application describes embodiments, but the description is illustrative rather than limiting and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described herein. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Any feature or element of any embodiment may be used in combination with or instead of any other feature or element in any other embodiment, unless expressly limited otherwise.
The present application includes and contemplates combinations of features and elements known to those of ordinary skill in the art. The embodiments, features and elements disclosed in this application may also be combined with any conventional features or elements to form a unique inventive concept as defined by the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive aspects to form yet another unique inventive aspect, as defined by the claims. Thus, it should be understood that any of the features shown and/or discussed in this application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not limited except as by the appended claims and their equivalents. Furthermore, various modifications and changes may be made within the scope of the appended claims.
Further, in describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other orders of steps are possible as will be understood by those of ordinary skill in the art. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Further, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present application.
Traditional antivirus software mainly uses feature code (signature) technology, statically scanning file contents and matching them against a feature library. Feature code extraction is the core of this detection method: code segments with malware features are obtained by reverse-analyzing known malware. This approach requires manual feature extraction, consumes a lot of manpower, cannot detect unknown or variant malware, and requires frequent feature library updates.
Another approach is behavior-based malware detection, which monitors the behavior of a program (API calls, etc.) as it runs, and alerts on and suspends malicious behavior when the program's behavior triggers predefined rules. Compared with feature code detection, its false alarm rate and missed alarm rate are higher, malware can be detected only after it has run in the system, and detection efficiency is low.
Machine Learning (ML) is one of the important methods for implementing artificial intelligence, and mainly studies how to make a computer simulate or implement human Learning behaviors to acquire new knowledge or skills. Most machine learning techniques learn through data or experience to achieve the purposes of improving algorithm performance or problem solving effect and the like. Machine learning is a multidisciplinary comprehensive cross field of knowledge relating to probability theory, statistics, algorithm complexity theory and the like, and related achievements are widely applied to information retrieval, recommendation systems, network security, fraud detection, medical diagnosis and the like.
The embodiment of the application provides a malicious software detection method based on machine learning, the method automatically extracts features by converting software to be detected into two-dimensional image data, the extraction efficiency of the features is higher than that of manual extraction, training is performed by using a machine learning algorithm, a trained model can identify unknown and variant malicious software, and the execution efficiency is higher than that of a behavior-based malicious software detection method.
As shown in fig. 1, an embodiment of the present application provides a malware detection method, including:
step 101, converting software to be tested into an image;
step 102, extracting characteristic information of the image;
Step 103, processing the feature information through a pre-trained classifier model to obtain a detection result of the software to be tested. The detection result is, for example, whether the software to be tested is malware.
In this embodiment, the software to be tested is converted into image data and the image features are extracted automatically, so feature extraction is more efficient than manual extraction; the trained classifier model can identify unknown and variant malware, and execution efficiency is higher than that of behavior-based malware detection methods.
In one embodiment, the converting the software to be tested into the image includes: and converting the software to be tested into a two-dimensional image by using a binary gray-level image conversion algorithm.
In an embodiment, for a given executable file (i.e., the software to be tested), the binary file is read byte by byte, each byte interpreted as an unsigned integer value (ranging from 0 to 255), and the values are combined into rows of fixed length J (in this embodiment J = 256), finally generating an r x J matrix m for the whole file. This matrix is visualized as a grayscale image as shown in fig. 2. It should be noted that 256 is only an example; other values greater or less than 256 may be taken as needed.
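The byte-to-image conversion described above can be sketched as follows (a minimal illustration; the function name and the choice to drop trailing bytes that do not fill a complete row are assumptions, since the patent does not specify the handling):

```python
import numpy as np

def software_to_image(data: bytes, width: int = 256) -> np.ndarray:
    """Interpret a binary file's bytes as unsigned integers (0-255)
    and stack them into an r x J grayscale matrix with J = width.
    Trailing bytes that do not fill a complete row are dropped
    (an assumption; the patent does not state how they are handled)."""
    values = np.frombuffer(data, dtype=np.uint8)
    r = len(values) // width
    return values[: r * width].reshape(r, width)
```

The resulting matrix can then be saved or displayed directly as a grayscale image.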
Malware coding patterns differ from normal coding patterns, and the difference is reflected in the image feature statistics of the converted grayscale image. Therefore, in an exemplary embodiment, the feature information includes at least one of the following: the size of the image, the global gray mean of the image, the global gray standard deviation of the image, and local feature information of the image, where the local feature information includes feature information of partial regions of the image.
The size of the image is, for example, the number of rows r of the image; since the number of columns J is fixed, r can represent the size of the image.
The global gray mean of the image is

mu = (1/(r*J)) * Σ_{i=1..r} Σ_{j=1..J} m(i, j)

where r is the number of rows of the image and J is the number of columns of the image.
The global gray standard deviation of the image is

si = sqrt( (1/(r*J)) * Σ_{i=1..r} Σ_{j=1..J} (m(i, j) - mu)^2 )
With a sliding step of 1 and a window size of n rows (in this embodiment n = 3), the image is divided into p = r - n + 1 partitions from top to bottom: rows 1 to n form the first partition, rows 2 to n + 1 form the second partition, and so on, and the gray mean and gray standard deviation of each partition are calculated. The step size of 1 and window size of 3 used in this embodiment are examples; other step sizes and window sizes may be set, and partitioning may be performed in other manners.
The local feature information includes at least one of:
gray average of partitions
Figure BDA0002437354760000073
Gray scale standard deviation of partitions
Figure BDA0002437354760000074
Where k denotes the kth partition.
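The sliding-window partition statistics above can be computed as in this sketch (the function and variable names are illustrative, not from the patent):

```python
import numpy as np

def partition_stats(img: np.ndarray, n: int = 3):
    """Slide a window of n rows down the image with step 1,
    producing p = r - n + 1 partitions; return the arrays
    pmu (per-partition gray mean) and psi (per-partition
    gray standard deviation)."""
    r = img.shape[0]
    p = r - n + 1
    pmu = np.empty(p)
    psi = np.empty(p)
    for k in range(p):
        block = img[k:k + n, :].astype(float)  # rows k .. k+n-1
        pmu[k] = block.mean()
        psi[k] = block.std()
    return pmu, psi
```

Each partition overlaps its neighbors by n - 1 rows, so local coding anomalies are captured at every row offset.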
Thus, the gray means of the partitions form an array pmu[p] and the gray standard deviations form an array psi[p]. On this basis, to capture the coding characteristics of the partitions, the following features are considered:
the mean of the array pmu, representing the partition coding law: mpmu = mean(pmu), where mean denotes the mean function;
the standard deviation of the array pmu, representing the partition coding law: spmu = std(pmu), where std denotes the standard deviation function;
the mean of the array psi, representing the partition coding fluctuation law: mpsi = mean(psi);
the standard deviation of the array psi, representing the partition coding fluctuation law: spsi = std(psi);
the maximum of the partition coding law: maxpmu = max(pmu), where max denotes the maximum function;
the minimum of the partition coding law: minpmu = min(pmu), where min denotes the minimum function;
the maximum of the partition coding fluctuation law: maxpsi = max(psi);
the minimum of the partition coding fluctuation law: minpsi = min(psi);
the relative position of the partition coding law maximum: rmaxpmu = find(maxpmu)/p, where the find function returns the index of the value in the original array, i.e. the number of the partition in which the maximum is located, and p is the number of partitions (likewise below);
the relative position of the partition coding law minimum: rminpmu = find(minpmu)/p;
the relative position of the partition coding fluctuation law maximum: rmaxpsi = find(maxpsi)/p;
the relative position of the partition coding fluctuation law minimum: rminpsi = find(minpsi)/p.
Furthermore, considering that the coding of malicious activity may exist only locally, partitions whose coding deviates from the global coding law should also be considered. The following features can therefore also be used:
the mean ma1 and standard deviation sa1 of all values in pmu less than mean(pmu) - 3*std(pmu), and the mean ma2 and standard deviation sa2 of all values in pmu greater than mean(pmu) + 3*std(pmu);
the mean mb1 and standard deviation sb1 of all values in psi less than mean(psi) - 3*std(psi), and the mean mb2 and standard deviation sb2 of all values in psi greater than mean(psi) + 3*std(psi).
In a grayscale image, texture is coded differently from most regions of the image and can therefore serve as a feature for identifying the image. In the distribution of image codes, texture appears as outliers of the overall distribution, and according to the three-sigma rule of statistics such outliers can be characterized by mean ± 3*std; the statistical indexes of partition codes falling outside mean(pmu) ± 3*std(pmu) can thus be used as effective features for identifying the image.
In conclusion, all the indexes form a feature vector for describing the texture features of the gray level image: [ r, mu, si, mpmu, spmu, mpsi, spsi, maxpmu, minpmu, maxpsi, minpsi, rmaxpmu, rminpmu, rmaxpsi, rminpsi, ma1, sa1, ma2, sa2, mb1, sb1, mb2, sb2]
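As a sketch, the 23-dimensional feature vector above can be assembled like this (the fallback of 0.0 when no values fall outside mean ± 3*std, and the function name, are assumptions; the patent does not state how empty outlier sets are handled):

```python
import numpy as np

def texture_features(img: np.ndarray, pmu: np.ndarray, psi: np.ndarray) -> np.ndarray:
    """Build the feature vector [r, mu, si, mpmu, spmu, mpsi, spsi,
    maxpmu, minpmu, maxpsi, minpsi, rmaxpmu, rminpmu, rmaxpsi,
    rminpsi, ma1, sa1, ma2, sa2, mb1, sb1, mb2, sb2]."""
    p = len(pmu)
    feats = [img.shape[0],                          # r: number of rows
             float(img.mean()), float(img.std()),   # mu, si
             pmu.mean(), pmu.std(), psi.mean(), psi.std(),
             pmu.max(), pmu.min(), psi.max(), psi.min(),
             (np.argmax(pmu) + 1) / p, (np.argmin(pmu) + 1) / p,
             (np.argmax(psi) + 1) / p, (np.argmin(psi) + 1) / p]
    for arr in (pmu, psi):                          # three-sigma outlier stats
        m, s = arr.mean(), arr.std()
        low = arr[arr < m - 3 * s]
        high = arr[arr > m + 3 * s]
        feats += [low.mean() if low.size else 0.0,
                  low.std() if low.size else 0.0,
                  high.mean() if high.size else 0.0,
                  high.std() if high.size else 0.0]
    return np.array(feats, dtype=float)
```

The relative positions use 1-based partition numbers divided by p, matching the find(...)/p features described above.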
It should be noted that the above features are merely examples, and other features may be used as necessary.
In an embodiment, the classifier model is trained based on a machine learning algorithm, for example generated by a random forest algorithm; generating the classifier model includes training the classifier model and validating the classifier model.
Random forest belongs to the bagging family of Ensemble Learning algorithms; the implementation process is as follows:
using Bootstrap sampling, randomly drawing n1 samples with replacement from the original training set; repeating the sampling k1 times to generate k1 training sets;
for the k1 training sets, respectively training k1 Classification And Regression Tree (CART) decision tree models (the decision tree model can be chosen according to the specific problem, e.g. an ID3 or C4.5 decision tree model);
for a single decision tree model, assuming the number of training sample features is M, randomly selecting m features out of the M at each split and choosing the best of these m features to split on according to the Gini index (if the algorithm is ID3/C4.5, the splitting criterion is the information gain or information gain ratio, i.e. the best feature to split on is selected according to the information gain or information gain ratio);
forming a random forest from the generated decision trees to serve as the classifier model. For a classification problem, the final classification result is determined by voting among the classifiers of the multiple decision trees.
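Using scikit-learn (an assumption; the patent does not name an implementation, only Python or Java as possible languages), the random-forest classifier can be sketched on synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in data: 200 software samples, each a 23-dimensional
# texture feature vector; label 1 = black (malicious), 0 = white (benign).
X = rng.normal(size=(200, 23))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# k1 trees, each considering at most sqrt(M) features per split,
# with the Gini index as the splitting criterion.
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                             criterion="gini", random_state=0)
clf.fit(X, y)
```

clf.predict then returns, for each sample, the majority vote of the trees, matching the voting step described above.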
After training, optimization is performed, including:
performing cross validation on the trained classifier model and calculating the accuracy, detection rate, false alarm rate and AUC (Area Under the Curve) value;
adjusting the number k1 of decision trees in the classifier model and the maximum number of features m of a single decision tree, retraining, and recalculating the above indexes (accuracy, detection rate, false alarm rate and AUC value) until they are optimal.
The above malware detection method can be implemented in Python, Java, or the like.
As shown in fig. 3, the training process of the classifier model includes:
Step 301, converting the collected samples, including normal software (marked as white samples) and malware (marked as black samples), into two-dimensional texture images using the binary-to-grayscale-image conversion algorithm;
step 302, extracting texture features of the texture image;
in this embodiment, features are extracted at both the global and local level to characterize the overall coding law and the local coding law of the software, including, for example, the size of the image, the global gray mean, the global gray standard deviation, the local gray means, and the local gray standard deviations;
step 303, dividing a data set consisting of the feature vectors and the class labels (white samples and black samples) into a test set and a training set;
in this example, a ten-fold cross-assay method was used: randomly dividing the data into 10 parts, taking 9 parts as a training set in turn, and taking the remaining 1 part as a test set; it should be noted that the training set and the test set may be divided in other ways, which is not limited in this application.
Step 304, training a classifier model by using a random forest algorithm;
in other embodiments, the classifier model may be trained using other supervised machine learning algorithms, such as the k-nearest neighbor algorithm (kNN), Support Vector Machine (SVM), logistic regression, decision trees, and so forth;
Step 305, verifying the classifier model with the test set using indexes such as accuracy, detection rate, false alarm rate and AUC; if the classifier model passes the verification, executing step 307, and if it does not pass, executing step 306;
step 306, adjusting parameters, and returning to step 304;
Step 307, training is finished.
As shown in fig. 4, at least one embodiment of the present application provides a malware detection apparatus, including:
a training module 401 configured to train to obtain a classifier model;
a conversion module 402 configured to convert the software to be tested into an image;
a feature extraction module 403 configured to extract feature information of the image;
the detection module 404 is configured to process the feature information through the classifier model to obtain a detection result of the software to be detected.
As shown in fig. 5, an embodiment of the present application provides a malware detection apparatus 50, which includes a memory 510 and a processor 520, where the memory 510 stores a program, and when the program is read and executed by the processor 520, the program implements the malware detection method according to any embodiment.
As shown in fig. 6, an embodiment of the present application provides a computer-readable storage medium 60, where the computer-readable storage medium 60 stores one or more programs 610, and the one or more programs 610 may be executed by one or more processors to implement the malware detection method according to any one of the embodiments.
It should be noted that the scheme provided by the embodiments of the present application is not limited to detecting malware; it can also be used for software classification in general.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, and functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media, as known to those skilled in the art.

Claims (10)

1. A malware detection method, comprising:
converting the software to be tested into an image;
extracting feature information of the image;
and processing the characteristic information through a pre-trained classifier model to obtain a detection result of the software to be detected.
2. The malware detection method of claim 1, wherein converting the software under test into an image comprises: and converting the software to be tested into a two-dimensional image by using a binary gray-level image conversion algorithm.
3. The malware detection method of claim 1, wherein the characteristic information comprises at least one of: the size of the image, the global gray level mean value of the image, the global gray level standard deviation of the image and the local characteristic information of the image; wherein the local feature information includes feature information of a partial region of the image.
4. The malware detection method of claim 3, wherein the local feature information includes at least one of: the gray mean pmu_k of partition k; the gray standard deviation psi_k of partition k, with k ranging from 1 to K; the mean mean(pmu) of all pmu_k; the standard deviation std(pmu) of all pmu_k; the maximum of all pmu_k; the relative position of the partition in which the maximum of all pmu_k is located; the minimum of all pmu_k; the relative position of the partition in which the minimum of all pmu_k is located; the mean mean(psi) of all psi_k; the standard deviation std(psi) of all psi_k; the maximum of all psi_k; the relative position of the partition in which the maximum of all psi_k is located; the minimum of all psi_k; the relative position of the partition in which the minimum of all psi_k is located; the mean of all pmu_k values less than mean(pmu) - 3*std(pmu); the standard deviation of all pmu_k values less than mean(pmu) - 3*std(pmu); the mean of all pmu_k values greater than mean(pmu) + 3*std(pmu); the standard deviation of all pmu_k values greater than mean(pmu) + 3*std(pmu); the mean of all psi_k values less than mean(psi) - 3*std(psi); the standard deviation of all psi_k values less than mean(psi) - 3*std(psi); the mean of all psi_k values greater than mean(psi) + 3*std(psi); and the standard deviation of all psi_k values greater than mean(psi) + 3*std(psi); wherein the image is divided into K partitions from top to bottom with a preset sliding step, each partition comprising n rows, K > 1, and n is less than the total number of rows of the image.
5. The malware detection method of claim 4, wherein n is 3, and the preset sliding step is one row.
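The partition statistics of claims 4 and 5 can be sketched as follows. The function name, the dictionary keys, and the random test image are illustrative assumptions; a window of n = 3 rows slides down the image one row at a time, each window position yielding one partition with gray mean pmu_k and gray standard deviation psi_k, which are then summarized. Only two of the claim's ±3σ tail statistics are shown; the remaining tail features follow the same pattern.

```python
import numpy as np

def partition_features(image: np.ndarray, n: int = 3) -> dict:
    """Slide a window of n rows down the image one row at a time,
    compute per-partition gray mean pmu_k and std psi_k, then summarize."""
    K = image.shape[0] - n + 1                 # number of partitions
    pmu = np.array([image[k:k + n].mean() for k in range(K)])
    psi = np.array([image[k:k + n].std() for k in range(K)])
    feats = {
        "mean_pmu": pmu.mean(), "std_pmu": pmu.std(),
        "max_pmu": pmu.max(), "min_pmu": pmu.min(),
        "mean_psi": psi.mean(), "std_psi": psi.std(),
        "max_psi": psi.max(), "min_psi": psi.min(),
        # cross features: psi of the partition holding the extreme pmu, and vice versa
        "psi_at_max_pmu": psi[pmu.argmax()], "psi_at_min_pmu": psi[pmu.argmin()],
        "pmu_at_max_psi": pmu[psi.argmax()], "pmu_at_min_psi": pmu[psi.argmin()],
    }
    # two of the claim's tail statistics beyond three standard deviations
    lo = feats["mean_pmu"] - 3 * feats["std_pmu"]
    hi = feats["mean_pmu"] + 3 * feats["std_pmu"]
    feats["mean_pmu_low_tail"] = pmu[pmu < lo].mean() if (pmu < lo).any() else 0.0
    feats["mean_pmu_high_tail"] = pmu[pmu > hi].mean() if (pmu > hi).any() else 0.0
    return feats

rng = np.random.default_rng(0)
f = partition_features(rng.integers(0, 256, size=(32, 16), dtype=np.uint8))
print(len(f))
```

Note the overlap: with a one-row step and n = 3, a 32-row image yields K = 30 partitions, so adjacent partitions share two rows.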
6. The malware detection method of any one of claims 1 to 5, wherein the classifier model is generated based on a random forest algorithm.
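Claim 6 names only the random forest algorithm. A minimal sketch using scikit-learn's `RandomForestClassifier` on synthetic two-class feature vectors (standing in for the image statistics above, not the patent's actual training data) might look like:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Toy feature vectors: "benign" samples cluster low, "malicious" cluster high.
X = np.vstack([rng.normal(0.2, 0.05, (50, 4)), rng.normal(0.8, 0.05, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[0.21, 0.19, 0.20, 0.22], [0.79, 0.80, 0.81, 0.78]]))  # [0 1]
```

A random forest is a reasonable fit for this kind of low-dimensional, hand-crafted statistical feature vector: it needs little tuning, handles features on different scales, and exposes feature importances for inspection.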
7. A malware detection apparatus, comprising:
the training module is configured to train to obtain a classifier model;
the conversion module is configured to convert the software to be tested into an image;
a feature extraction module configured to extract feature information of the image;
and the detection module is configured to process the characteristic information through the classifier model to obtain a detection result of the software to be detected.
8. The malware detection device of claim 7, wherein the characteristic information comprises at least one of: the size of the image; the global gray-level mean of the image; the global gray-level standard deviation of the image; the gray-level mean pmu_k of partition k; the gray-level standard deviation psi_k of partition k, k = 1 to K; the mean mean(pmu) of all pmu_k; the standard deviation std(pmu) of all pmu_k; the maximum of all pmu_k; the psi_k of the partition in which the maximum of all pmu_k is located; the minimum of all pmu_k; the psi_k of the partition in which the minimum of all pmu_k is located; the mean mean(psi) of all psi_k; the standard deviation std(psi) of all psi_k; the maximum of all psi_k; the pmu_k of the partition in which the maximum of all psi_k is located; the minimum of all psi_k; the pmu_k of the partition in which the minimum of all psi_k is located; the mean of all pmu_k values less than mean(pmu) - 3 × std(pmu); the standard deviation of all pmu_k values less than mean(pmu) - 3 × std(pmu); the mean of all pmu_k values greater than mean(pmu) + 3 × std(pmu); the standard deviation of all pmu_k values greater than mean(pmu) + 3 × std(pmu); the mean of all psi_k values less than mean(psi) - 3 × std(psi); the standard deviation of all psi_k values less than mean(psi) - 3 × std(psi); the mean of all psi_k values greater than mean(psi) + 3 × std(psi); and the standard deviation of all psi_k values greater than mean(psi) + 3 × std(psi); wherein the image is divided from top to bottom into K partitions with a preset sliding step, each partition comprising n rows, K > 1, and n is less than the total number of rows of the image.
9. A malware detection apparatus comprising a memory and a processor, the memory storing a program that, when read and executed by the processor, implements the malware detection method according to any one of claims 1 to 6.
10. A computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the malware detection method of any one of claims 1 to 6.
CN202010256014.5A 2020-04-02 2020-04-02 Malicious software detection method, device and equipment and storage medium Pending CN111581640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010256014.5A CN111581640A (en) 2020-04-02 2020-04-02 Malicious software detection method, device and equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010256014.5A CN111581640A (en) 2020-04-02 2020-04-02 Malicious software detection method, device and equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111581640A true CN111581640A (en) 2020-08-25

Family

ID=72119166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010256014.5A Pending CN111581640A (en) 2020-04-02 2020-04-02 Malicious software detection method, device and equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111581640A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150052606A1 (en) * 2011-10-14 2015-02-19 Telefonica, S.A. Method and a system to detect malicious software
CN109359439A (en) * 2018-10-26 2019-02-19 北京天融信网络安全技术有限公司 Software detecting method, device, equipment and storage medium
CN110096878A (en) * 2019-04-26 2019-08-06 武汉智美互联科技有限公司 A kind of detection method of Malware
CN110572393A (en) * 2019-09-09 2019-12-13 河南戎磐网络科技有限公司 Malicious software traffic classification method based on convolutional neural network


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG, Chenbin et al.: "Malware Classification Based on Gray-Scale Image Texture Fingerprints", Computer Science *
JIN, Yiling: "Malware Detection in Containers Based on Convolutional Neural Networks", Modern Computer *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112764791A (en) * 2021-01-25 2021-05-07 University of Jinan Incremental updating malicious software detection method and system
CN112764791B (en) * 2021-01-25 2023-08-08 University of Jinan Incremental update malicious software detection method and system

Similar Documents

Publication Publication Date Title
Chen Deep transfer learning for static malware classification
CN112491796A (en) Intrusion detection and semantic decision tree quantitative interpretation method based on convolutional neural network
CN111259219B (en) Malicious webpage identification model establishment method, malicious webpage identification method and malicious webpage identification system
CN112052451A (en) Webshell detection method and device
CN111753290A (en) Software type detection method and related equipment
CN112738092A (en) Log data enhancement method, classification detection method and system
CN112632609A (en) Abnormality detection method, abnormality detection device, electronic apparatus, and storage medium
CN110704841A (en) Convolutional neural network-based large-scale android malicious application detection system and method
US11562133B2 (en) System and method for detecting incorrect triple
Smith et al. Supervised and unsupervised learning techniques utilizing malware datasets
CN115577357A (en) Android malicious software detection method based on stacking integration technology
CN110097120B (en) Network flow data classification method, equipment and computer storage medium
TW202240453A (en) Method and computer for learning correspondence between malicious behaviors and execution trace of malware and method for implementing neural network
CN111581640A (en) Malicious software detection method, device and equipment and storage medium
Jere et al. Principal component properties of adversarial samples
CN110808947B (en) Automatic vulnerability quantitative evaluation method and system
CN112016088A (en) Method and device for generating file detection model and method and device for detecting file
CN111144453A (en) Method and equipment for constructing multi-model fusion calculation model and method and equipment for identifying website data
CN111191238A (en) Webshell detection method, terminal device and storage medium
CN115567224A (en) Method for detecting abnormal transaction of block chain and related product
Mitsuhashi et al. Exploring optimal deep learning models for image-based malware variant classification
CN111695117B (en) Webshell script detection method and device
Lu et al. Multi-class malware classification using deep residual network with Non-SoftMax classifier
Juvonen et al. Anomaly detection framework using rule extraction for efficient intrusion detection
CN113076544A (en) Vulnerability detection method and system based on deep learning model compression and mobile device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200825