CN109492395B

CN109492395B - Method, device and storage medium for detecting malicious program

Info

Publication number: CN109492395B
Application number: CN201811291951.3A
Authority: CN
Inventors: 林思明; 陈腾跃; 王锦江; 吴陈炜; 梁煜麓; 杨心恩; 罗佳
Original assignee: Xiamen Anscen Network Technology Co ltd
Current assignee: Xiamen Anscen Network Technology Co ltd
Priority date: 2018-10-31
Filing date: 2018-10-31
Publication date: 2021-01-12
Anticipated expiration: 2038-10-31
Also published as: CN109492395A

Abstract

The invention provides a method, a device and a storage medium for detecting malicious programs, wherein the method comprises the steps of firstly carrying out static analysis on a sample file to be detected, if the sample file cannot be determined to be malicious, carrying out dynamic analysis and identification on the sample file, if the sample file cannot be determined to be malicious, detecting by using a deep learning detection model, and converting the dynamic analysis behavior and static analysis static information into images suitable for deep learning model identification during deep learning model detection for classification.

Description

Method, device and storage medium for detecting malicious program

Technical Field

The present invention relates to the field of database processing technologies, and in particular, to a method, an apparatus, and a storage medium for detecting a malicious program.

Background

Traditional malicious program detection technology comprises feature code detection, behavior detection, heuristic scanning, machine learning and the like. These techniques rest on different detection principles and differ in implementation overhead and detection range; each has limitations and cannot cope with malicious programs built on new techniques. It is now common to detect malicious programs through statistical and feature-rule analysis combined with virtualization techniques. However, measured against the speed at which malicious programs evolve and their sheer number, conventional detection technology has reached a technical bottleneck and cannot meet detection requirements. The American CIA spyware Hive is a typical example: from 2010 until it was exposed by WikiLeaks in 2017, none of the major online detection websites detected a single sample.

One prior-art technique, feature code detection, collects malicious program samples and extracts their feature codes; during detection, the sample under inspection is compared against these feature codes, and whether it matches determines whether the program is judged to be malicious. This method works only against known malicious programs and is essentially ineffective against widely spread deformed malicious programs, polymorphic malicious programs and the like.

Behavior detection methods in the prior art detect malicious programs by the characteristic behaviors they exhibit while running. If behaviors characteristic of malicious programs appear while a program under inspection runs, the program is judged to be malicious. Behavior detection can find unknown malicious programs, but it suffers from a high false alarm rate and is difficult to implement. In addition, current malicious programs make heavy use of techniques that counter behavior collection in order to interfere with behavior detection, which further weakens traditional behavior detection schemes.

With the development of the internet, malicious programs are appearing faster and faster; according to statistics, most newly appearing malicious programs are deformed or polymorphic malicious programs. Such a new program is usually the designer's modification of an existing one, made to evade detection and removal by antivirus software. As malicious programs multiply at an exponential rate, traditional detection methods are no longer sufficient to secure the operating environment.

Disclosure of Invention

The present invention provides the following technical solutions to overcome the above-mentioned drawbacks in the prior art.

A method of detecting malicious programs, the method comprising:

a static analysis step, in which a sample file to be detected is subjected to static analysis, if the static analysis result is that the sample file contains a malicious program, threat information is generated, if the static analysis result is a credible sample file, the detection is quitted, and if the static analysis result is an undetermined sample file, dynamic analysis detection is performed;

a dynamic analysis and detection step, wherein an undetermined sample file is operated in a sandbox, behaviors generated in the operation process of the undetermined sample file are collected, the undetermined sample file is identified through the behaviors, if the sample file contains a malicious program according to the identification result, threat information is generated, and if the sample file is still the undetermined sample file according to the identification result, deep learning detection is used for detecting the sample file;

a deep learning detection step, namely detecting the sample file which is not determined by using the deep learning detection model, and generating threat information if the detection result is that the sample file contains a malicious program;

and early warning, namely analyzing the threat information, acquiring a threat level, and sending early warning information according to the threat level.

Further, the static analysis uses the following modes: a trusted file identification mode based on full-text hash, a family identification mode based on digital signature, a family identification mode based on fuzzy hash, and/or scanning by a static scanning engine.

Further, the dynamic analysis identifies undetermined sample files through behavioral pattern matching or heuristic algorithms.

Further, the operation of training the deep learning detection model is as follows: acquiring training sample files, and preparing a sample file set which has the same file type and is classified into threat sample files and credible sample files, wherein the number of the threat sample files is consistent with that of the credible sample files; dynamically analyzing a training sample file, respectively operating the threat sample file and the credible sample file in a sandbox, collecting behaviors generated in the operation process to generate a dynamic operation record, statically analyzing the threat sample file and the credible sample to extract static information to generate a static information record, and combining the dynamic operation record and the static information record into a training sample information behavior set; generating a sample image of a training sample file, preprocessing the training sample information behavior set, removing redundant and interference information in the behavior set, and converting the processed training sample information behavior set into the sample image of the training sample file, wherein the sample image comprises a threat sample image and a credible sample image; training a deep learning detection model, carrying out up-down turning, left-right turning, binarization and/or ZCA whitening data enhancement processing on the threat sample image and the credible sample image to obtain an expanded threat sample image and a credible sample image, and carrying out convolution training on the expanded threat sample image and the credible sample image by using a convolution neural network mode to generate the deep learning detection model for detecting the malicious program.

Further, the behaviors generated during running in the dynamic analysis include: API call information, process chain information, and/or network behavior data; the static information includes: string information, import/export information, and/or resource information.

Further, the operation of detecting the sample file which is not determined yet by using the deep learning detection model is as follows: collecting the behavior generated in the running process of the sample file which is not determined yet to generate a dynamic running record, carrying out static analysis on the sample file which is not determined yet to extract static information to generate a static information record, combining the dynamic running record and the static information record into the information behavior set of the sample file which is not determined yet, preprocessing the information behavior set of the sample file which is not determined yet to remove redundant and interference information in the behavior set, converting the processed information behavior set of the sample file which is not determined yet into the sample image of the sample file which is not determined yet, and detecting the sample image of the sample file which is not determined yet by using a deep learning detection model to obtain a detection result.

The invention also provides a device for detecting malicious programs, which comprises:

the static analysis unit is used for carrying out static analysis on the sample file to be detected, generating threat information if the static analysis result is that the sample file contains a malicious program, quitting the detection if the static analysis result is a credible sample file, and carrying out dynamic analysis and detection if the static analysis result is an undetermined sample file;

the dynamic analysis and detection unit is used for running an undetermined sample file in a sandbox, collecting behaviors generated by the undetermined sample file while it runs, identifying the undetermined sample file through the behaviors, generating threat information if the identification result is that the sample file contains a malicious program, and detecting the sample file by using deep learning detection if the identification result is that the sample file is still an undetermined sample file;

the deep learning detection unit is used for detecting the sample file which is not determined yet by using the deep learning detection model, and generating threat information if the detection result is that the sample file contains a malicious program;

and the early warning unit is used for analyzing the threat information, acquiring the threat level and sending out early warning information according to the threat level.

The invention has the technical effects that: according to the invention, static analysis is carried out on the sample file to be detected firstly, if the sample file cannot be determined to be malicious, dynamic analysis and identification are carried out on the sample file, if the sample file cannot be determined to be malicious, a deep learning detection model is used for detection, and the behavior of dynamic analysis and static information of static analysis are converted into images suitable for deep learning model identification for classification during deep learning model detection.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.

Fig. 1 is a flowchart of a method of detecting malicious programs according to an embodiment of the present invention.

Fig. 2 is a block diagram of an apparatus for detecting a malicious program according to an embodiment of the present invention.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

Fig. 1 shows a method of detecting a malicious program according to the present invention, which includes:

a static analysis step S101, performing static analysis on a sample file to be detected, generating threat information if the static analysis result indicates that the sample file contains a malicious program, quitting detection if the static analysis result indicates that the sample file is a credible sample file, and performing dynamic analysis and detection if the static analysis result indicates that the sample file is an undetermined sample file.

A dynamic analysis and detection step S102, running an undetermined sample file in a sandbox, collecting behaviors generated by the undetermined sample file while it runs, identifying the undetermined sample file through the behaviors, generating threat information if the identification result is that the sample file contains a malicious program, and detecting the sample file by using deep learning detection if the identification result is that the sample file is still an undetermined sample file.

And a deep learning detection step S103, detecting the sample file which is not determined by using the deep learning detection model, and generating threat information if the detection result is that the sample file contains a malicious program.

And an early warning step S104, analyzing the threat information, acquiring a threat level, and sending out early warning information according to the threat level.
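The control flow of steps S101 to S103 can be illustrated with a minimal sketch. The `Verdict` enum, the `detect` function, and the shape of the returned threat information are hypothetical names introduced only for illustration; the three stages are passed in as callables because the text does not fix their implementation.

```python
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    MALICIOUS = "malicious"
    TRUSTED = "trusted"
    UNDETERMINED = "undetermined"


Stage = Callable[[bytes], Verdict]


def detect(sample: bytes, static_stage: Stage, dynamic_stage: Stage,
           deep_stage: Stage) -> Optional[dict]:
    """Run S101 -> S102 -> S103; return threat information, or None."""
    # S101: static analysis can end detection early in either direction.
    verdict = static_stage(sample)
    if verdict is Verdict.MALICIOUS:
        return {"stage": "static", "verdict": verdict.value}
    if verdict is Verdict.TRUSTED:
        return None  # credible sample: quit detection

    # S102: only undetermined samples are run in the sandbox.
    verdict = dynamic_stage(sample)
    if verdict is Verdict.MALICIOUS:
        return {"stage": "dynamic", "verdict": verdict.value}

    # S103: the deep learning model handles what is still undetermined.
    verdict = deep_stage(sample)
    if verdict is Verdict.MALICIOUS:
        return {"stage": "deep_learning", "verdict": verdict.value}
    return None
```

Any threat information returned here would then be handed to the early warning step S104. Ordering the stages from cheapest to most expensive means that most samples never reach the sandbox or the model.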

In one embodiment, in the static analysis step S101, different processing flows are selected according to the classification of the sample file: for a credible sample, the detection process ends; for a sample with threats, threat information is generated according to the threat conditions of the sample; and for an unknown sample, dynamic analysis and deep learning detection are further carried out. The static analysis comprises several traditional identification techniques, which can be added, removed, and reordered. The static analysis uses the following modes: a trusted file identification mode based on full-text hash, a family identification mode based on digital signature, a family identification mode based on fuzzy hash, and/or scanning by a static scanning engine (which can be a static scanning engine provided by a third party). These analysis modes can be used in combination during static analysis.
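As a rough illustration of the static analysis step, the sketch below covers only two of the listed modes: trusted file identification by full-text hash and scanning by pluggable engines. SHA-256 is an assumption (the text does not name a hash algorithm), the function names are hypothetical, and digital-signature and fuzzy-hash family identification are omitted.

```python
import hashlib
from typing import Callable, Iterable, Set


def full_text_hash(data: bytes) -> str:
    """Hash of the whole file content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def static_analysis(data: bytes, trusted_hashes: Set[str],
                    scanners: Iterable[Callable[[bytes], bool]]) -> str:
    """Return 'trusted', 'malicious' or 'undetermined' for one sample."""
    # Trusted file identification: exact match against a whitelist of hashes.
    if full_text_hash(data) in trusted_hashes:
        return "trusted"
    # Static scanning engines: each returns True if it flags the sample.
    for scan in scanners:
        if scan(data):
            return "malicious"
    # Nothing conclusive: hand the sample to dynamic analysis.
    return "undetermined"
```

Because the scanners are passed in as a list, modes can be added, removed, or reordered without touching the function itself.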

In one embodiment, in the dynamic analysis and detection step S102, the dynamic analysis identifies the undetermined sample file through behavior pattern matching or a heuristic algorithm and records the behaviors collected while the file runs. If the sample file still cannot be determined from the dynamic analysis result, deep learning detection is applied, which makes use of the recorded runtime behaviors of the undetermined sample file.
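Behavior pattern matching could, for example, be sketched as an ordered-subsequence search over the API calls recorded in the sandbox. The rule set below is purely illustrative: the single rule uses a commonly cited Windows process-injection call sequence and is not taken from the patent, and the function names are hypothetical.

```python
from typing import Dict, List

# Illustrative rule set: each rule is an ordered list of API names.
RULES: Dict[str, List[str]] = {
    "process_injection": [
        "OpenProcess", "VirtualAllocEx",
        "WriteProcessMemory", "CreateRemoteThread",
    ],
}


def contains_in_order(calls: List[str], pattern: List[str]) -> bool:
    """True if pattern occurs in calls in order, gaps allowed."""
    remaining = iter(calls)
    # 'api in remaining' advances the iterator past the first match,
    # so later pattern items must appear after earlier ones.
    return all(api in remaining for api in pattern)


def match_behavior(calls: List[str],
                   rules: Dict[str, List[str]] = RULES) -> List[str]:
    """Names of all rules matched by the recorded call trace."""
    return [name for name, pattern in rules.items()
            if contains_in_order(calls, pattern)]


def dynamic_verdict(calls: List[str]) -> str:
    """'malicious' if any rule matches, otherwise still 'undetermined'."""
    return "malicious" if match_behavior(calls) else "undetermined"
```

A trace that matches no rule is not declared clean; it stays undetermined and the recorded calls are passed on to deep learning detection.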

To detect accurately, the deep learning detection model must be trained before use. Training the model on sample images generated from a training sample information behavior set, which combines dynamic operation records with static information records, is one of the important inventive points of the invention. The deep learning detection model is trained as follows: acquiring training sample files, preparing a sample file set that has the same file type and is classified into threat sample files and credible sample files, wherein the number of threat sample files is consistent with the number of credible sample files; dynamically analyzing the training sample files, running the threat sample files and the credible sample files separately in a sandbox, collecting the behaviors generated during running to generate a dynamic operation record, statically analyzing the threat sample files and the credible sample files to extract static information and generate a static information record, and combining the dynamic operation record and the static information record into a training sample information behavior set; generating sample images of the training sample files, preprocessing the training sample information behavior set to remove redundant and interfering information, and converting the processed behavior set into sample images of the training sample files, wherein the sample images comprise threat sample images and credible sample images; training the deep learning detection model, applying vertical flipping, horizontal flipping, binarization and/or ZCA whitening as data enhancement to the threat sample images and the credible sample images to obtain expanded threat sample images and credible sample images, and performing convolution training on the expanded images with a convolutional neural network to generate the deep learning detection model for detecting malicious programs.
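The text does not specify how a behavior set is encoded as an image. The sketch below assumes one simple, hypothetical scheme: the JSON text of the behavior set is read as bytes, each byte becomes one grayscale pixel, and the result is padded or truncated to a fixed square. It then applies three of the named enhancement operations (vertical flip, horizontal flip, binarization); ZCA whitening is left out to keep the example dependency-free.

```python
import json
from typing import List

Image = List[List[int]]  # rows of grayscale pixels, 0-255


def record_to_image(record: dict, side: int = 16) -> Image:
    """Assumed encoding: one pixel per byte of the JSON text."""
    raw = json.dumps(record, sort_keys=True).encode("utf-8")
    pixels = list(raw[: side * side])             # truncate long records
    pixels += [0] * (side * side - len(pixels))   # zero-pad short ones
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


def flip_vertical(img: Image) -> Image:
    """Up-down flip: reverse the row order."""
    return img[::-1]


def flip_horizontal(img: Image) -> Image:
    """Left-right flip: reverse each row."""
    return [row[::-1] for row in img]


def binarize(img: Image, threshold: int = 64) -> Image:
    """Map pixels to 0 or 255. The threshold is arbitrary; ASCII JSON
    bytes stay below 128, so a lower cut-off than 128 is used here."""
    return [[255 if p >= threshold else 0 for p in row] for row in img]


def augment(img: Image) -> List[Image]:
    """Original image plus its three enhanced variants."""
    return [img, flip_vertical(img), flip_horizontal(img), binarize(img)]
```

Each training image thus yields four images, and threat and credible images are expanded in the same way so that the two classes remain balanced.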

In this embodiment of the invention, the dynamic operation record and the static information record are stored in JSON format, which facilitates model training. JSON (JavaScript Object Notation) is a lightweight data exchange format. Based on a subset of ECMAScript (the JavaScript specification standardized by Ecma International, formerly the European Computer Manufacturers Association), it stores and represents data in a text format that is completely independent of the programming language. Its compact and clear hierarchy makes JSON an ideal data exchange language: it is easy for people to read and write, easy for machines to parse and generate, and efficient to transmit over a network. Because of these advantages, the present invention encapsulates the data in JSON format.
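A combined record in JSON might look like the following sketch. All field names and values are hypothetical; the text only states which kinds of information the two records contain, not their schema.

```python
import json


def build_behavior_set(dynamic_record: dict, static_record: dict) -> dict:
    """Combine the two records into one information behavior set."""
    return {"dynamic": dynamic_record, "static": static_record}


def to_json(behavior_set: dict) -> str:
    """Serialize with sorted keys so equal content gives equal text."""
    return json.dumps(behavior_set, sort_keys=True, ensure_ascii=False)


# Hypothetical records for one sample; names and values are illustrative.
dynamic_record = {
    "api_calls": ["CreateFileW", "WriteFile"],
    "process_chain": ["sample.exe", "cmd.exe"],
    "network": [{"dst": "192.0.2.10", "port": 443}],
}
static_record = {
    "strings": ["http://example.invalid/update"],
    "imports": ["kernel32.dll!CreateFileW"],
    "exports": [],
    "resources": ["RT_ICON"],
}

behavior_set = build_behavior_set(dynamic_record, static_record)
text = to_json(behavior_set)
```

Sorting the keys is a design choice of this sketch: it keeps the serialized text deterministic, which matters if the text is later turned into an image.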

The convolutional neural network used in the method comprises a plurality of convolution-pooling units, each comprising a convolutional layer that applies convolution kernels and a pooling layer for further dimension reduction. After feature detection is complete, the output is fed into a fully connected layer, which determines which class the output features are closest to, and a classifier finally yields the classification result. On this basis, after the threat samples and credible samples have been converted into images, training is carried out on a GPU cluster with the convolutional neural network to generate a model that can be used to detect malicious programs. During training, the model parameters or the training samples are adjusted at any time according to the training effect until the optimal deep learning detection model is obtained. To further improve detection precision, a separate deep learning model can be trained for each file type: for example, the training sample files are divided into PE files, ELF files, Office files, PDF files, and the like, and a deep learning model corresponding to each file type is obtained through training.
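To make the convolution-pooling unit concrete, the following dependency-free sketch computes the forward pass of a single unit on a small image. It is not a training implementation; real training on a GPU cluster would use a deep learning framework. The ReLU activation between convolution and pooling is an assumption, as the text does not name an activation function.

```python
from typing import List

Matrix = List[List[int]]


def conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Valid (no padding), stride-1 cross-correlation, as used in CNNs."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(fmap: Matrix) -> Matrix:
    """Assumed activation: negative responses are clipped to zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool(fmap: Matrix, size: int = 2) -> Matrix:
    """Non-overlapping max pooling; halves each dimension for size=2."""
    return [[max(fmap[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


def conv_pool_unit(image: Matrix, kernel: Matrix) -> Matrix:
    """One convolution-pooling unit: convolve, activate, then pool."""
    return max_pool(relu(conv2d(image, kernel)))
```

Stacking several such units shrinks the image step by step while keeping the strongest feature responses, which is the dimension reduction the pooling layers are described as providing.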

In one embodiment, the behaviors generated during running in the dynamic analysis include: API call information, process chain information, and/or network behavior data; the static information includes: string information, import/export information, and/or resource information.
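Before such records are converted into images, redundant and interfering information is removed. The text does not define what counts as redundant; the sketch below assumes two simple, hypothetical rules for API call information: calls on a caller-supplied ignore list are dropped, and runs of identical consecutive calls are collapsed into one.

```python
from itertools import groupby
from typing import FrozenSet, Iterable, List


def preprocess_calls(calls: Iterable[str],
                     ignore: FrozenSet[str] = frozenset()) -> List[str]:
    """Drop ignored APIs, then collapse consecutive duplicates."""
    kept = [c for c in calls if c not in ignore]
    # groupby groups adjacent equal items; keep one name per run.
    return [name for name, _ in groupby(kept)]
```

For example, a tight loop of hundreds of identical `ReadFile` calls would shrink to a single entry, so it no longer crowds out other behaviors in a fixed-size image.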

In one embodiment, the operation of detecting a sample file that has not yet been determined using the deep learning detection model is as follows: collecting the behaviors generated while the undetermined sample file runs to generate a dynamic operation record; statically analyzing the undetermined sample file to extract static information and generate a static information record; combining the dynamic operation record and the static information record into the information behavior set of the undetermined sample file; preprocessing this information behavior set to remove redundant and interfering information; converting the processed information behavior set into a sample image of the undetermined sample file, in the same way the sample images were generated during model training; and then detecting the sample image with the deep learning detection model to obtain a detection result. During deep learning detection, the deep learning detection model corresponding to the type of the undetermined sample file can be selected; for example, if the sample file is an ELF file, the deep learning model trained for the ELF file type is selected, which further improves detection precision.
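Selecting a model by file type can be sketched with a coarse check of the file's leading magic bytes. The helper names and the model registry are hypothetical, and the checks are deliberately rough: `MZ` only identifies the DOS header of a PE file, and `PK\x03\x04` matches any ZIP archive, not just modern Office documents.

```python
from typing import Dict, Optional, TypeVar

Model = TypeVar("Model")

OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy Office container
ZIP_MAGIC = b"PK\x03\x04"                          # OOXML is ZIP-based


def detect_file_type(data: bytes) -> str:
    """Coarse file-type guess from the first bytes of the sample."""
    if data.startswith(b"\x7fELF"):
        return "ELF"
    if data.startswith(b"MZ"):
        return "PE"
    if data.startswith(b"%PDF"):
        return "PDF"
    if data.startswith(OLE2_MAGIC) or data.startswith(ZIP_MAGIC):
        return "Office"
    return "unknown"


def select_model(data: bytes, models: Dict[str, Model]) -> Optional[Model]:
    """Return the model trained for this file type, or None if absent."""
    return models.get(detect_file_type(data))
```

A caller would typically fall back to a general model, or leave the sample undetermined, when `select_model` returns `None`.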

In the early warning step S104, the threat information is analyzed to obtain a threat level, and early warning information is sent according to the threat level. For example, the threat level may be classified as very serious, serious, general, and the like, and sound or color may be used to represent the early warning information: red indicates a very serious threat level, orange a serious one, green a general one, and so on.
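The mapping from threat level to warning color can be sketched as a lookup table. How the threat level is derived from the threat information is not described in the text; the numeric `score` field and its thresholds below are hypothetical and serve only to make the example complete.

```python
# Color coding as given in the text: red, orange, green.
LEVEL_COLORS = {
    "very serious": "red",
    "serious": "orange",
    "general": "green",
}


def threat_level(score: float) -> str:
    """Hypothetical grading of a 0-1 score; thresholds are assumptions."""
    if score >= 0.8:
        return "very serious"
    if score >= 0.5:
        return "serious"
    return "general"


def early_warning(threat_info: dict) -> dict:
    """Build the early warning information for one threat record."""
    level = threat_level(threat_info["score"])
    return {"level": level, "color": LEVEL_COLORS[level]}
```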

The method performs three-level detection on the sample file to be detected: static analysis, dynamic analysis identification, and detection with the deep learning detection model. During deep learning detection, the behaviors from dynamic analysis and the static information from static analysis are converted into images suitable for recognition by the deep learning model and then classified. This improves the detection rate of malicious programs, greatly improves the recognition rate of widely propagated deformed and polymorphic malicious programs, reduces the false alarm rate, and also improves detection speed.

With further reference to fig. 2, as an implementation of the method shown in fig. 1, the present application provides an embodiment of an apparatus for detecting a malicious program, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 1, and the apparatus may be specifically included in various electronic devices.

Fig. 2 shows an apparatus for detecting a malicious program according to the present invention, which includes:

the static analysis unit 201 is configured to perform static analysis on a sample file to be detected, generate threat information if the static analysis result indicates that the sample file contains a malicious program, quit detection if the static analysis result indicates that the sample file is a trusted sample file, and perform dynamic analysis detection if the static analysis result indicates that the sample file is an undetermined sample file.

The dynamic analysis detection unit 202 is configured to run an undetermined sample file in a sandbox, collect behaviors generated by the undetermined sample file in the running process, identify the undetermined sample file through the behaviors, generate threat information if the sample file contains a malicious program as an identification result, and detect the sample file by using deep learning detection if the sample file is still the undetermined sample file as the identification analysis result.

And the deep learning detection unit 203 is configured to detect a sample file that is not determined yet by using the deep learning detection model, and generate threat information if the detection result indicates that the sample file contains a malicious program.

And the early warning unit 204 is configured to analyze the threat information, obtain a threat level, and send out early warning information according to the threat level.

In one embodiment, in the static analysis unit 201, different processing flows are selected according to the classification of the sample file: for a credible sample, the detection process ends; for a sample with threats, threat information is generated according to the threat conditions of the sample; and for an unknown sample, dynamic analysis and deep learning detection are further carried out. The static analysis comprises several traditional identification techniques, which can be added, removed, and reordered. The static analysis uses the following modes: a trusted file identification mode based on full-text hash, a family identification mode based on digital signature, a family identification mode based on fuzzy hash, and/or scanning by a static scanning engine (which can be a static scanning engine provided by a third party). These analysis modes can be used in combination during static analysis.

In one embodiment, in the dynamic analysis detection unit 202, the dynamic analysis identifies the undetermined sample file through behavior pattern matching or a heuristic algorithm and records the behaviors collected while the file runs. If the sample file still cannot be determined from the dynamic analysis result, deep learning detection is applied, which makes use of the recorded runtime behaviors of the undetermined sample file.

To detect accurately, the deep learning detection model must be trained before use. Training the model on sample images generated from a training sample information behavior set, which combines dynamic operation records with static information records, is one of the important inventive points of the invention. The deep learning detection model is trained as follows: acquiring training sample files, preparing a sample file set that has the same file type and is classified into threat sample files and credible sample files, wherein the number of threat sample files is consistent with the number of credible sample files; dynamically analyzing the training sample files, running the threat sample files and the credible sample files separately in a sandbox, collecting the behaviors generated during running to generate a dynamic operation record, statically analyzing the threat sample files and the credible sample files to extract static information and generate a static information record, and combining the dynamic operation record and the static information record into a training sample information behavior set; generating sample images of the training sample files, preprocessing the training sample information behavior set to remove redundant and interfering information, and converting the processed behavior set into sample images of the training sample files, wherein the sample images comprise threat sample images and credible sample images; training the deep learning detection model, applying vertical flipping, horizontal flipping, binarization and/or ZCA whitening as data enhancement to the threat sample images and the credible sample images to obtain expanded threat sample images and credible sample images, and performing convolution training on the expanded images with a convolutional neural network to generate the deep learning detection model for detecting malicious programs. To further improve detection precision, a separate deep learning model can be trained for each file type: for example, the training sample files are divided into PE files, ELF files, Office files, PDF files, and the like, and a deep learning model corresponding to each file type is obtained through training.

The convolutional neural network used by the device comprises a plurality of convolution-pooling units, each comprising a convolutional layer that applies convolution kernels and a pooling layer for further dimension reduction. After feature detection is complete, the output is fed into a fully connected layer, which determines which class the output features are closest to, and a classifier finally yields the classification result. On this basis, after the threat samples and credible samples have been converted into images, training is carried out on a GPU cluster with the convolutional neural network to generate a model that can be used to detect malicious programs. During training, the model parameters or the training samples are adjusted at any time according to the training effect until the optimal deep learning detection model is obtained.

In the early warning unit 204, the threat information is analyzed to obtain a threat level, and early warning information is sent according to the threat level. For example, the threat level may be classified as very serious, serious, general, and the like, and sound or color may be used to represent the early warning information: red indicates a very serious threat level, orange a serious one, green a general one, and so on.

The device performs three-level detection on the sample file to be detected: static analysis, dynamic analysis identification, and detection with the deep learning detection model. During deep learning detection, the behaviors from dynamic analysis and the static information from static analysis are converted into images suitable for recognition by the deep learning model and then classified. This improves the detection rate of malicious programs, greatly improves the recognition rate of widely propagated deformed and polymorphic malicious programs, reduces the false alarm rate, and also improves detection speed.

For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.

Finally, it should be noted that although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications and equivalent substitutions may be made without departing from the spirit and scope of the invention, and all such modifications are intended to be covered by the appended claims.

Claims

1. A method of detecting malicious programs, the method comprising:

a dynamic analysis and detection step, wherein an undetermined sample file is run in a sandbox, behaviors generated while the undetermined sample file is running are collected, and the undetermined sample file is identified through the behaviors; if the identification result indicates that the sample file contains a malicious program, threat information is generated, and if the identification result indicates that the sample file is still an undetermined sample file, the sample file is detected using a deep learning detection model;

an early warning step, wherein the threat information is analyzed, a threat level is obtained, and early warning information is issued according to the threat level;

wherein the deep learning detection model is trained as follows: acquiring training sample files, namely preparing a set of sample files of the same file type that is divided into threat sample files and credible sample files, the number of threat sample files being equal to the number of credible sample files; dynamically analyzing the training sample files, namely running the threat sample files and the credible sample files separately in a sandbox and collecting the behaviors generated during running to generate dynamic running records, statically analyzing the threat sample files and the credible sample files to extract static information and generate static information records, and combining the dynamic running records and the static information records into a training sample information behavior set; generating sample images of the training sample files, namely preprocessing the training sample information behavior set to remove redundant and interfering information from the behavior set, and converting the processed training sample information behavior set into sample images of the training sample files, the sample images comprising threat sample images and credible sample images; and training the deep learning detection model, namely applying data augmentation comprising vertical flipping, horizontal flipping, binarization and/or ZCA whitening to the threat sample images and the credible sample images to obtain expanded threat sample images and credible sample images, and training a convolutional neural network on the expanded threat sample images and credible sample images to generate the deep learning detection model for detecting malicious programs;

wherein the undetermined sample file is detected using the deep learning detection model as follows: collecting the behaviors generated while the undetermined sample file is running to generate a dynamic running record; statically analyzing the undetermined sample file to extract static information and generate a static information record; combining the dynamic running record and the static information record into an information behavior set of the undetermined sample file; preprocessing the information behavior set of the undetermined sample file to remove redundant and interfering information from the behavior set; converting the processed information behavior set into a sample image of the undetermined sample file; selecting a corresponding deep learning detection model according to the type of the undetermined sample file; and detecting the sample image of the undetermined sample file using the selected deep learning detection model to obtain a detection result.
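By way of illustration only, the preprocessing, image conversion, and part of the data augmentation recited in claim 1 might be sketched as follows. The claim does not specify how an information behavior set is mapped to pixels, so the mapping here (four SHA-256 bytes per record, packed into a fixed-size grayscale grid) is purely an assumption, as are the function names and the reading of "redundant information" as blank or duplicate records. ZCA whitening and the convolutional network itself are omitted.

```python
import hashlib
from typing import List

Image = List[List[int]]  # rows of grayscale pixel values in 0..255


def preprocess(records: List[str]) -> List[str]:
    """Assumed cleanup: drop blank and duplicate records, preserving order."""
    seen = set()
    cleaned = []
    for record in records:
        record = record.strip()
        if record and record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned


def to_image(records: List[str], side: int = 8) -> Image:
    """Assumed mapping: 4 hash bytes per record, zero-padded to side x side."""
    data = b"".join(
        hashlib.sha256(r.encode("utf-8")).digest()[:4] for r in records
    )
    data = data[: side * side].ljust(side * side, b"\x00")
    return [list(data[i * side:(i + 1) * side]) for i in range(side)]


def flip_up_down(img: Image) -> Image:
    """Vertical flip: reverse the order of the rows."""
    return [row[:] for row in reversed(img)]


def flip_left_right(img: Image) -> Image:
    """Horizontal flip: reverse each row."""
    return [row[::-1] for row in img]


def binarize(img: Image, threshold: int = 128) -> Image:
    """Map every pixel to 0 or 255 around a fixed threshold."""
    return [[255 if p >= threshold else 0 for p in row] for row in img]


def augment(img: Image) -> List[Image]:
    """Return the original image plus three augmented variants."""
    return [img, flip_up_down(img), flip_left_right(img), binarize(img)]
```

Under these assumptions, each training sample yields four images, and the same `preprocess` and `to_image` path would be applied unchanged to an undetermined sample before it is passed to the selected model.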

2. The method of claim 1, wherein the static analysis is performed by: trusted file identification based on a full-text hash, family identification based on a digital signature or a fuzzy hash, and/or scanning by a static scanning engine.
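As a hedged illustration of the full-text hash mode named in claim 2, a trusted-file check can be reduced to a set lookup on a digest of the entire file content. The claim does not name a hash algorithm; SHA-256 is assumed here, and `full_text_hash` and `is_trusted` are hypothetical helper names.

```python
import hashlib
from typing import Set


def full_text_hash(data: bytes) -> str:
    """Digest of the whole file content (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def is_trusted(data: bytes, trusted_hashes: Set[str]) -> bool:
    """A file is trusted only if its full-content digest is on the whitelist."""
    return full_text_hash(data) in trusted_hashes
```

Because any single-byte change alters the digest, this mode can only confirm exact copies of known files; the fuzzy hash mode in the same claim exists to cover near-identical variants.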

3. The method of claim 2, wherein the dynamic analysis identifies undetermined sample files through behavioral pattern matching or a heuristic algorithm.

4. The method of claim 3, wherein the behaviors generated during running and collected in the dynamic analysis comprise: API call information, process chain information, and/or network behavior data; and the static information comprises: string information, import and export information, and/or resource information.

5. An apparatus for detecting malicious programs, the apparatus comprising:

a dynamic analysis detection unit, configured to run an undetermined sample file in a sandbox, collect behaviors generated while the undetermined sample file is running, and identify the undetermined sample file through the behaviors, wherein threat information is generated if the identification result indicates that the sample file contains a malicious program, and the sample file is detected using a deep learning detection model if the identification result indicates that the sample file is still an undetermined sample file;

the early warning unit is used for analyzing the threat information, acquiring a threat level and sending out early warning information according to the threat level;

6. The apparatus of claim 5, wherein the static analysis is performed by: trusted file identification based on a full-text hash, family identification based on a digital signature or a fuzzy hash, and/or scanning by a static scanning engine.

7. The apparatus of claim 6, wherein the dynamic analysis identifies undetermined sample files through behavioral pattern matching or a heuristic algorithm.

8. The apparatus of claim 7, wherein the behaviors generated during running and collected in the dynamic analysis comprise: API call information, process chain information, and/or network behavior data; and the static information comprises: string information, import and export information, and/or resource information.

9. A computer-readable storage medium, characterized in that the storage medium has stored thereon computer program code which, when executed by a computer, performs the method of any of claims 1-4.