CN111639337A - Unknown malicious code detection method and system for massive Windows software - Google Patents

Info

Publication number
CN111639337A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010305550.XA
Other languages
Chinese (zh)
Other versions
CN111639337B (en
Inventor
贾晓启
李帅
陈阳
杜海超
白璐
解亚敏
唐静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a method and a system for detecting unknown malicious code in massive volumes of Windows software, and belongs to the technical field of system security. It aims to solve the problem that traditional feature-code-based detection cannot detect unknown malicious code. The invention combines the advantages of dynamic detection and static detection: it uses a deep learning detection technique to detect malicious code with unknown features, and uses a static-feature auxiliary detection method to accelerate detection in massive-sample scenarios, thereby improving detection efficiency.

Description

Unknown malicious code detection method and system for massive Windows software
Technical Field
The invention belongs to the technical field of system security, relates to a malicious code detection method, and particularly relates to an unknown malicious code detection method and system suitable for massive Windows platform software.
Background
The rapid development of computer and internet technology has an increasingly significant influence and has brought great changes to fields such as the economy, culture, politics, medical care and education. However, people who enjoy these benefits must also consider security issues, the most typical being attacks by, and the spread of, malicious code. Malicious code, also known as malware, may also be referred to as adware or spyware. It refers to software that is installed and run on a user's computer or other terminal without explicitly prompting the user or without the user's permission, and that infringes the user's legitimate rights and interests. In the last half of 2018, the 360 Internet Security Center intercepted a cumulative total of about 140 million new malicious program samples. Of these, 140.998 million were new PC-side malicious program samples, an average of 779,000 new samples intercepted per day. PC-side malicious programs thus accounted for 97.9% of all malicious programs, so research on malware for the Windows platform is necessary.
Malware is growing faster and faster, the number of variants keeps increasing, and family features are pronounced. To protect itself, advanced malicious code usually adopts advanced techniques to counter security analysts, using measures such as packing and obfuscation to increase the difficulty of reverse analysis. Malicious code causes great harm and typically exhibits one or more of the following behaviors: forced installation, browser hijacking, theft or modification of user data, malicious collection of user information, malicious uninstallation, malicious bundling, and other malicious behaviors that violate the user's right to know and right to choose. These behaviors seriously infringe the legitimate interests of users and can even bring enormous economic or other losses to users and others. For example, in 2017 the WannaCry ransomware worm infected more than 10 million computers in over 100 countries and regions and caused losses of at least 80 billion dollars. Other classic examples include the Flame virus, the Stuxnet virus, Panda Burning Incense and Dark Cloud III. To avoid greater losses, malicious code needs to be analyzed and effective malicious code detection techniques need to be researched.
Methods for analyzing malicious code can generally be classified into static analysis and dynamic analysis. Static analysis refers to analysis performed without executing the binary program; techniques include static disassembly analysis, static source code analysis, binary statistical analysis and decompilation analysis. The shortcoming of existing static analysis is that obfuscated, packed and polymorphic malicious code is difficult to analyze accurately; traditional static methods have low detection accuracy and perform particularly poorly on malicious code with unknown features. Dynamic analysis refers to the process of identifying malicious code by executing it and using program debugging tools to trace and observe it. Researchers have developed a number of dynamic analysis tools that analyze malicious samples primarily by extracting the API sequence of system calls. The shortcomings of existing dynamic analysis are that it is time-consuming and relatively costly. In addition, traditional machine learning methods require features to be selected manually, which increases the difficulty of manual analysis.
Several methods for detecting malicious code currently exist. Traditional feature-code-based detection extracts the feature codes (signatures) of intercepted samples, records them in a database and then matches against them; however, it depends on the feature library, lags behind new threats, and cannot detect samples with unknown features. Traditional heuristic detection depends on the knowledge and experience of experts and requires manual effort to establish heuristic rules; with massive samples, constructing rules consumes a great deal of time and human resources, so detection efficiency is low. Increasingly, methods using machine learning have emerged, which detect malicious code by extracting effective features and combining them with classification algorithms. In traditional machine learning methods, however, features are screened manually, the quality of feature selection directly affects the detection result, and manual participation is needed when extracting and screening features, so labor costs are high.
Deep neural networks have been shown to have strong learning ability: they can benefit from very large training sets, learn the latent features of malware, and thereby detect samples with unknown features. In addition, they can extract malware features automatically from raw data, so malware signatures do not need to be designed by hand and excessive dependence on expert knowledge is avoided. Deep-learning-based malware detection is efficient to train, with training time linearly related to the amount of malware. Such a detection network can run on a GPU, which is essentially a standard component of every PC, meaning that more malware can be analyzed per unit time. Therefore, with deep learning, the latent features of malicious code can be learned from historical data and unknown malware can be detected.
With the rapid development of the internet in recent years, the number and types of malware have increased rapidly, propagation methods are continually updated, and the resulting losses are growing. Faced with a huge number of samples, quickly and efficiently identifying malware with unknown features becomes a challenge. Traditional feature-code-based detection cannot detect samples with unknown features. Heuristic detection requires manual effort and is inefficient when dealing with large numbers of samples. In addition, features in traditional machine learning methods depend on manual extraction and screening, so labor costs are high. Faced with the large amount of malware on the Windows platform, it is not feasible to rely heavily on expert knowledge and manual analysis, because this consumes enormous time and human resources; a method capable of automatic detection needs to be designed.
Disclosure of Invention
The invention aims to design and implement a deep-learning-based malicious code detection method and system that overcome the inability of traditional feature-code-based detection to detect unknown malicious code, introduce a static auxiliary detection method for massive samples to accelerate detection in large-scale sample scenarios, and have the ability to identify related malicious code, helping malware researchers discover risks in time and avoid greater losses.
The technical scheme adopted by the invention is as follows:
A method for detecting unknown malicious code for massive Windows software comprises the following steps:
preprocessing the target software: screening out Windows platform executable files with a standard format, preliminarily judging whether each file is malicious or benign, and taking the file as a malicious sample if it is malicious;
performing static auxiliary detection on the malicious sample: combining sensitive character strings and the ImpHash value to automatically generate a rule for the malicious sample, constructing a feature library of malicious samples from this rule and existing rules, and judging whether the malicious sample matches a rule in the feature library; the sample is judged malicious if it matches and benign if it does not;
performing dynamic behavior classification on the malicious sample: dynamically running the samples judged benign by static auxiliary detection, acquiring the API (application programming interface) call sequence during dynamic execution, and inputting the API call sequence into a deep neural network model for classification to judge whether the sample is malicious or benign;
finally judging the malicious sample to be malicious software if it is judged malicious by either static auxiliary detection or dynamic behavior classification, and otherwise judging it to be benign software.
Further, the preprocessing checks the format of the target software, screens out Windows platform executable files with a standard format, and then preliminarily determines whether each file is benign or malicious using the online detection tool VirusTotal.
Further, static auxiliary detection uses the malware pattern matching tool yara, with a yara rule base serving as the existing rule base; this comprises the official yara-rules library and a yara rule base converted from ClamAV feature codes.
Further, the sensitive character strings are extracted by first obtaining the printable character strings of the malicious sample, then deleting all strings that occur in benign software collected in advance so that a set of malware strings remains, and finally screening out a set of sensitive strings comprising a certain number of URLs, IPs, hashes, files, sensitive system locations and registry paths.
Further, the rule for the malicious sample is generated in combination with the ImpHash value, a hash created from the library/API names in the import address table and their specific order in the executable file. If files have the same ImpHash value, they are judged to have the same import address table and to have been compiled from the same source code in the same way, so that related malware is identified.
Further, the deep neural network model is a textCNN deep neural network model.
Further, the textCNN deep neural network model comprises a convolution layer, a pooling layer, a concatenation layer, a fully connected layer and a classification layer; the convolution layer comprises three types of convolution kernels, 128 of each type, with heights of 3, 4 and 5 respectively and a width equal to the word-vector width, so the convolution is one-dimensional; and the pooling layer uses max pooling to generate as many feature maps as the total number of convolution kernels.
Further, each API is expanded into a word vector, so that the one-dimensional text of the API call sequence is converted into a two-dimensional matrix, which is then input into the textCNN deep neural network model.
Further, a malicious program analysis system, Cuckoo Sandbox, is set up in a virtualized environment; the malicious sample is run dynamically in Cuckoo Sandbox, which records API call traces and system operations on files and the network, and the API sequence is extracted from the resulting JSON log file.
An unknown malicious code detection system for massive Windows software comprises:
a preprocessing module, used for screening out Windows platform executable files with a standard format, preliminarily judging whether each file is malicious or benign, and taking the file as a malicious sample if it is malicious;
a static auxiliary detection module, used for combining sensitive character strings and the ImpHash value to automatically generate a rule for the malicious sample, constructing a feature library of malicious samples from this rule and existing rules, and judging whether the malicious sample matches a rule in the feature library; the sample is judged malicious if it matches and benign if it does not;
a dynamic behavior classification module, used for dynamically running the samples judged benign by static auxiliary detection, acquiring the API call sequence during dynamic execution, inputting the API call sequence into the deep neural network model for classification, and judging whether the sample is malicious or benign;
a judging module, used for making the final judgment: the malicious sample is finally judged to be malicious software if it is judged malicious by either static auxiliary detection or dynamic behavior classification, and otherwise is judged to be benign software.
The invention has the beneficial effects that:
Based on an understanding of malicious code analysis techniques and deep learning research, the invention provides an unknown malicious code detection method for massive Windows software. It combines the advantages of dynamic detection and static detection, uses deep learning to detect malicious code with unknown features, and uses a static-feature auxiliary detection method to accelerate detection in massive-sample scenarios, improving detection efficiency.
Drawings
FIG. 1 is a flow chart of an unknown malicious code detection method for massive Windows software according to the present invention;
FIG. 2 is a schematic diagram of the static yara rule base construction of the present invention;
FIG. 3 is a diagram of the textCNN model of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples and the accompanying drawings.
The embodiment discloses a method and a system for detecting unknown malicious code for massive Windows software; the processing flow is shown in FIG. 1. The system comprises a preprocessing module, a static auxiliary detection module, a dynamic behavior classification module and a judging module, described in detail below.
1) Pre-processing module
The preprocessing module parses the loaded file, checks its format and judges whether it is a Windows platform executable file with a standard format. If the sample is valid, it is checked with the online detection tool VirusTotal to preliminarily judge whether it is benign or malicious, which completes the labeling work. The labeled samples can be used to construct a proprietary data set. The preprocessing module is mainly used for cleaning and labeling data and for providing input data to the subsequent detection process.
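The format check performed by the preprocessing module can be illustrated with a minimal Python sketch. It assumes that "standard format" means at least a valid DOS header whose e_lfanew field points at a PE signature; the function names are hypothetical, and the VirusTotal labeling step is not shown.

```python
import struct

def is_pe_executable(data: bytes) -> bool:
    """Return True if the bytes start with a plausible Windows PE header."""
    # A DOS header is 64 bytes long and begins with the magic bytes "MZ".
    if len(data) < 64 or data[:2] != b"MZ":
        return False
    # e_lfanew (offset 0x3C, little-endian uint32) is the offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if pe_offset + 4 > len(data):
        return False
    # The PE signature is the four bytes "PE", NUL, NUL.
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"

def make_minimal_header() -> bytes:
    """Build a synthetic header (demonstration only, not a runnable program)."""
    dos = bytearray(64)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 64)  # PE signature directly after the DOS header
    return bytes(dos) + b"PE\x00\x00"

print(is_pe_executable(make_minimal_header()))
print(is_pe_executable(b"#!/bin/sh\necho hello\n"))
```

A production check would also validate the COFF and optional headers; this sketch only rejects files that are obviously not PE executables.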
2) Static auxiliary detection module
For the input samples, static auxiliary detection is performed first. The malware pattern matching tool yara is used, and a yara rule base is constructed to serve as the malware feature library, which speeds up detection in massive-sample scenarios. The static detection stage constructs its own malicious code feature library, which can be divided into two parts. The first is the existing rule base, comprising the official yara-rules library and a yara rule base converted from ClamAV feature codes. The second is a rule base that must be maintained, consisting mainly of automatically generated rules for static detection of malicious samples.
The rules for static detection of malicious samples are generated automatically from feature codes; the technical route is to generate rule files automatically by combining sensitive character strings with ImpHash. For the strings, the printable character strings of a sample are obtained first, then all strings that occur in benign files collected in advance are deleted so that a set of malware strings remains, and finally a set of sensitive strings is screened out, comprising a certain number of URLs, IPs, hashes, files, sensitive system locations, registry paths and the like. Next, ImpHash is used to identify related malware (family variants). ImpHash creates a hash from the library/API names in the import address table and their specific order in the executable file; it is a powerful way to identify related malware, and its value is relatively unique. This is because the compiler's linker generates and constructs the Import Address Table (IAT) according to the specific order of functions in the source file, so the ImpHash changes when the order of function calls changes. If two files have the same ImpHash value, they have the same IAT, which means that the files were compiled from the same source code in the same way. Related malware can therefore be identified using ImpHash.
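The order sensitivity of ImpHash described above can be shown with a simplified sketch. It follows the commonly used convention (lower-case names, library extension dropped, `library.function` entries joined by commas, MD5 digest), which is an assumption since the text does not spell out the exact normalization; the import lists are made-up examples and imports by ordinal are not handled.

```python
import hashlib

def imphash(imports):
    """Hash an ordered list of (library, function) pairs taken from the import table."""
    parts = []
    for library, function in imports:
        name = library.lower()
        # Drop a trailing .dll, .ocx or .sys extension from the library name.
        base, _, ext = name.rpartition(".")
        if base and ext in ("dll", "ocx", "sys"):
            name = base
        parts.append(f"{name}.{function.lower()}")
    # The order of entries is preserved, so reordering imports changes the digest.
    return hashlib.md5(",".join(parts).encode("ascii")).hexdigest()

# Hypothetical import tables, for illustration only.
sample_a = [("KERNEL32.dll", "CreateFileA"), ("KERNEL32.dll", "WriteFile"), ("ADVAPI32.dll", "RegSetValueExA")]
sample_b = [("kernel32.DLL", "createfilea"), ("kernel32.DLL", "writefile"), ("advapi32.DLL", "regsetvalueexa")]
sample_c = list(reversed(sample_a))

print(imphash(sample_a) == imphash(sample_b))  # same table, different letter case
print(imphash(sample_a) == imphash(sample_c))  # same imports, different order
```

In practice the ordered import list would be read from the PE file with a parser rather than written by hand.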
With this self-built malicious code feature library, static preliminary detection is first performed on the input Windows platform exe file. If a rule in the yara rule base matches, the file is known malicious code or a variant of it, and it goes directly to the judging module. Otherwise, it proceeds to the next detection step and automatically enters the dynamic behavior classification module.
3) Dynamic behavior classification module
The dynamic behavior classification module obtains the API sequence of a program at run time by running the program dynamically. The run-time API sequence is then taken as input text, each API is expanded into a word vector, and the result is fed into a textCNN model for training to perform binary classification. Dynamic behaviors are acquired with Cuckoo Sandbox, a malicious program analysis system built on a virtualized environment that can automatically execute a program and analyze its behavior, recording API call traces and system operations on files, the network and so on. After the dynamic behavior information is obtained, the API call sequence is extracted from the JSON file of the result log, a deep neural network model based on textCNN (text convolutional neural network) is constructed for binary classification, and it is finally judged whether the software is malicious.
4) Judging module
The final judging module integrates the detection results of the first two stages and gives the final verdict on maliciousness. If the static auxiliary detection module in the first stage judges the file to be malicious, the final result for the sample is malicious and processing by the second-stage dynamic behavior classification module is not needed, which saves time and improves efficiency. Otherwise, when nothing matches in the first stage, the dynamic behavior classification module must process the sample and give the second-stage detection result. The file is finally judged benign only if both stages judge it to be benign.
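The two-stage decision logic can be summarized in a short Python sketch. The two stage functions are passed in as callables; the stubs used in the demonstration are hypothetical stand-ins for rule matching and for the neural network, not the actual detectors.

```python
from typing import Callable

def judge(sample: bytes,
          static_match: Callable[[bytes], bool],
          dynamic_classify: Callable[[bytes], bool]) -> str:
    """Return the final verdict for one sample."""
    # Stage 1: a rule match is enough to declare the sample malicious,
    # so the expensive dynamic stage is skipped.
    if static_match(sample):
        return "malicious"
    # Stage 2: only samples without a rule match are run dynamically.
    if dynamic_classify(sample):
        return "malicious"
    # Benign only when both stages say benign.
    return "benign"

dynamic_runs = []  # records which samples reached the dynamic stage

def static_stub(sample: bytes) -> bool:
    return sample.startswith(b"KNOWN")  # stand-in for a rule-base match

def dynamic_stub(sample: bytes) -> bool:
    dynamic_runs.append(sample)
    return b"evil" in sample            # stand-in for the classifier output

for s in (b"KNOWN-family-variant", b"new evil sample", b"clean tool"):
    print(s, judge(s, static_stub, dynamic_stub))
```

The short-circuit in stage 1 is what saves time on massive sample sets: only files without a rule match pay the cost of sandbox execution.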
The processing flow of the method and system is shown in FIG. 1; the input is an executable program (exe file) of the Windows platform. First, the samples are processed by the preprocessing module: they are uploaded to VirusTotal in batches by a program for online detection, and the true type of each sample is determined from the returned result, completing the labeling work. If the program is a valid Windows program, it enters the next module, static auxiliary detection.
In the static detection module, a pre-constructed rule base is used for preliminary detection of an input sample. The myyarrowulemaker is a tool for automatically generating rule files; it is used to save labor when constructing the rule base and to raise the degree of automation of the detection system. The construction of the static yara rule base is shown in FIG. 2: a rule file for an input sample is generated from its printable sensitive character strings and import table hash value, without requiring much expert knowledge or effort. If the sample matches a rule in the rule base, it goes directly to the judging module without dynamic detection. Otherwise, when no existing rule matches, it enters the next module, dynamic behavior classification.
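As an illustration of how a rule file might be generated from sensitive strings and an import hash, the following Python sketch extracts printable strings, removes those seen in benign files, keeps the ones matching a few sensitive patterns, and emits yara rule text. The regular expressions, the minimum string length, the rule layout and the `or` condition are all assumptions made for this sketch, and the function names are hypothetical; this is not the rule generator used in the patent.

```python
import re

PRINTABLE = re.compile(rb"[\x20-\x7e]{6,}")  # ASCII runs of at least 6 characters
SENSITIVE = re.compile(
    r"https?://|\b\d{1,3}(?:\.\d{1,3}){3}\b|HKEY_|\\System32\\|\b[0-9a-f]{32}\b",
    re.IGNORECASE,
)

def printable_strings(data: bytes) -> set:
    return {m.group().decode("ascii") for m in PRINTABLE.finditer(data)}

def sensitive_strings(sample: bytes, benign_strings: set) -> list:
    # Drop everything already seen in benign files, then keep URL/IP/registry-like strings.
    candidates = printable_strings(sample) - benign_strings
    return sorted(s for s in candidates if SENSITIVE.search(s))

def make_rule(name: str, strings: list, imphash_value: str) -> str:
    lines = ['import "pe"', "", f"rule {name}", "{"]
    condition = f'pe.imphash() == "{imphash_value}"'
    if strings:
        lines.append("    strings:")
        for i, s in enumerate(strings):
            escaped = s.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'        $s{i} = "{escaped}" ascii')
        condition += " or any of them"
    lines += ["    condition:", f"        {condition}", "}"]
    return "\n".join(lines)

# Synthetic bytes standing in for a sample; the import hash is a placeholder value.
sample = (b"\x00\x01http://evil.example/payload\x00http://www.microsoft.com/update\x00"
          b"Hello common string\x00HKEY_LOCAL_MACHINE\\Software\\Run\x00")
benign = {"http://www.microsoft.com/update", "Hello common string"}
found = sensitive_strings(sample, benign)
print(make_rule("generated_sample_rule", found, "0" * 32))
```

Whether a generated rule should require the strings, the import hash, or both is a design choice that trades detection rate against false alarms; the sketch picks the most permissive combination only for brevity.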
Next, the dynamic behavior classification module submits the input samples to the prepared Cuckoo sandbox environment for dynamic execution, obtaining the API call sequence as one-dimensional text. A word embedding method then converts the API call sequence into a two-dimensional matrix, which is input data that the textCNN model can process, and model training and detection are then completed.
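The conversion from a one-dimensional API sequence to a two-dimensional matrix can be sketched in plain Python. The embedding table here is filled with random numbers purely for illustration (in a real model it would be learned during training), and the vocabulary handling, padding scheme, vector width and API names are assumptions of this sketch.

```python
import random

def build_vocab(sequences):
    """Map every API name seen in training to an integer id."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for seq in sequences:
        for api in seq:
            vocab.setdefault(api, len(vocab))
    return vocab

def embed(seq, vocab, table, max_len):
    """Turn a sequence of API names into a max_len x dim matrix (list of rows)."""
    ids = [vocab.get(api, vocab["<unk>"]) for api in seq[:max_len]]
    ids += [vocab["<pad>"]] * (max_len - len(ids))  # pad short sequences
    return [table[i] for i in ids]

random.seed(0)
train = [["NtCreateFile", "NtWriteFile", "NtClose"], ["RegOpenKeyExW", "NtClose"]]
vocab = build_vocab(train)
dim = 4  # word-vector width n
table = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]
table[0] = [0.0] * dim  # the padding row is all zeros

matrix = embed(["NtCreateFile", "UnknownApi", "NtClose"], vocab, table, max_len=5)
print(len(matrix), "rows x", len(matrix[0]), "columns")
```

Each row of `matrix` is the word vector of one API call, so a sequence of m calls with vector width n gives the m × n input expected by the convolution layer.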
A schematic of the textCNN model is shown in fig. 3:
The input sample is a two-dimensional m × n matrix converted from the API sequence. The deep learning model is based on the textCNN model. First comes the convolution layer, with three types of convolution kernels, A, B and C, 128 of each type; the kernel heights are 3, 4 and 5 respectively and the kernel width equals the word-vector width, i.e. the convolution is one-dimensional. After the convolution layer comes the pooling layer, which uses max pooling to generate a fixed number of feature maps equal to the total number of convolution kernels. These are then joined in the concatenation layer and passed to the fully connected layer, which serves as the input to the classification layer; the softmax classification algorithm is used for classification, and the verdict of the dynamic stage, benign software or malicious software, is finally output.
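To make the layer shapes concrete, here is a dependency-free Python sketch of a forward pass through such a network with untrained random weights. The ReLU activation, the weight initialization, and the merging of the fully connected and classification layers into a single linear layer are assumptions of the sketch; a real implementation would use a deep learning framework and trained parameters.

```python
import math
import random

def conv_max(matrix, kernel, bias):
    """Slide one (h x n) kernel down an (m x n) matrix, apply ReLU, return the maximum."""
    h = len(kernel)
    best = 0.0  # ReLU output is never negative
    for top in range(len(matrix) - h + 1):
        acc = bias
        for r in range(h):
            acc += sum(x * w for x, w in zip(matrix[top + r], kernel[r]))
        best = max(best, acc)
    return best

def textcnn_forward(matrix, kernels, fc_weights, fc_bias):
    # Convolution + max pooling: one pooled value per kernel, concatenated into one vector.
    features = [conv_max(matrix, k, b) for k, b in kernels]
    # Linear layer followed by a numerically stable softmax over the two classes.
    logits = [sum(f * w for f, w in zip(features, ws)) + b
              for ws, b in zip(fc_weights, fc_bias)]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return features, [e / total for e in exps]

random.seed(0)
m, n = 20, 8  # m API calls, word vectors of width n
matrix = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
kernels = []
for h in (3, 4, 5):           # three kernel heights
    for _ in range(128):      # 128 kernels of each height
        weights = [[random.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(h)]
        kernels.append((weights, 0.0))
fc_weights = [[random.uniform(-0.1, 0.1) for _ in range(len(kernels))] for _ in range(2)]
fc_bias = [0.0, 0.0]

features, probs = textcnn_forward(matrix, kernels, fc_weights, fc_bias)
print(len(features), "pooled features; class probabilities:", probs)
```

With 128 kernels for each of the three heights, the concatenated vector has 3 × 128 = 384 entries regardless of the sequence length m, which is why max pooling yields a fixed-size input for the classifier.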
Last comes the judging module, whose function is simple: it only summarizes the verdicts generated in the first two stages to make the final judgment. The software is judged benign only if both stages detect it as benign; otherwise it is judged to be malicious software.
A specific application example is listed below:
The user is a virus analyst who needs to check captured software in batches for malware and is looking for a fast, efficient malware detection method. In this case, the malicious code detection method of the invention can provide technical support for software detection.
The user takes the executable program to be detected as input; this example uses an exe file under Windows. First, the file undergoes data preprocessing, which checks whether the file format meets the exe file specification. If the file format is correct, analysis continues. The file then enters the static auxiliary detection module, which performs static detection on the program, mainly by matching against a precompiled rule base. Because the rule base is built from trusted malicious sample features collected historically, this step detects malicious samples of known types. In addition, because ImpHash is used, variants of malicious code can also be detected.
After detection by the static auxiliary detection module finishes, the detection result of the static stage is generated. If the static stage judges the sample to be malware, it goes directly to the judging stage. Otherwise, the dynamic behavior classification module performs the second stage of detection. First, samples are submitted to Cuckoo Sandbox in batches; after the program exits or a fixed time elapses, the API call sequence and the arguments (arg) of each call are extracted from the behavior log report JSON files in the format {API1 arg11, arg12, arg13}, {API2 arg21, arg22, arg23}, …. For the extracted data, one can choose whether to keep the API arguments and whether to deduplicate. Deduplication handles repeated API sequences such as API1, API1, API2, API2, which are reduced to API1, API2. The sequence is then converted by word embedding into a two-dimensional matrix that the textCNN model can process and fed into the model for detection, giving the result of the dynamic stage.
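The extraction and deduplication options described above can be sketched as follows. The JSON layout used here (`behavior` → `processes` → `calls`, each call with `api` and `arguments` fields) is an assumption modeled on Cuckoo-style reports and may differ between versions; deduplication is interpreted as collapsing consecutive repeats, which is one reading of the API1, API1, API2, API2 example.

```python
import json
from itertools import groupby

def extract_calls(report: dict, keep_args: bool = False) -> list:
    """Flatten the API calls of all processes into one list of tokens."""
    tokens = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            token = call["api"]
            args = call.get("arguments", {})
            if keep_args and args:
                # Keep argument values in the token, e.g. "API1 arg11,arg12".
                token += " " + ",".join(str(v) for v in args.values())
            tokens.append(token)
    return tokens

def dedup_consecutive(tokens: list) -> list:
    """Collapse runs of identical neighbouring tokens into a single token."""
    return [key for key, _ in groupby(tokens)]

# A tiny synthetic report; real reports contain many more fields.
report = json.loads("""
{"behavior": {"processes": [{"calls": [
    {"api": "API1", "arguments": {"a": "arg11", "b": "arg12"}},
    {"api": "API1", "arguments": {"a": "arg11", "b": "arg12"}},
    {"api": "API2", "arguments": {"a": "arg21"}},
    {"api": "API2", "arguments": {"a": "arg21"}}
]}]}}
""")

print(dedup_consecutive(extract_calls(report)))
print(extract_calls(report, keep_args=True))
```

The four variants compared in the dynamic experiment correspond to the combinations of `keep_args` on or off, with or without the `dedup_consecutive` step.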
Finally, the judging stage is entered: the file is finally judged benign only if both stages detect it as benign, and otherwise it is judged malicious. This completes the detection of the batch of software.
The effectiveness of static auxiliary detection was tested experimentally first. The data set is as follows: a crawler collected a total of 41125 malicious Windows programs from about the last 5 years from malsharp, a malicious sample download repository published on the internet, and these were divided in a 4:1 ratio into a malicious sample set ST (31359 samples) for constructing static rules and a malicious sample set M (9766 samples). The benign file data set B consists of 5914 exe files extracted from freshly installed Windows operating systems, from Windows XP to Windows Server 2016. The malicious file test set and the benign file test set were tested separately, and the number of rule matches and the detection time were recorded. The results are shown in Table 1.
TABLE 1 Static auxiliary detection results
Data set | Total samples | Samples detected | Detection rate | Total detection time | Detection time per sample
Malicious sample set M | 9766 | 8844 | 90.56% | 3534.61 seconds | 0.36 second
Benign sample set B | 5914 | 0 | 0% | 2579.44 seconds | 0.43 second
As can be seen from Table 1, the constructed yara rule base achieves a malicious sample detection rate of 90.56% and can identify a certain number of related samples. The rule base's detection rate on benign samples is 0, i.e. the false alarm rate is 0%; because the feature library targets malware, there should be no detections in the benign software data set, so the result is as expected. The per-sample detection times are 0.36 seconds and 0.43 seconds respectively, which is acceptable for massive-sample detection and addresses the problem of low detection efficiency in such scenarios.
In addition, the deep-learning-based malicious code detection method was tested. The training set consisted of 1907 pieces of software, including 1065 malware samples and 842 benign programs; the test set consisted of 200 pieces of software, including 100 malware samples and 100 benign files. The constructed model was trained and tested, and the training time and test set accuracy for the different kinds of training data were recorded, as shown in Table 2:
TABLE 2 Dynamic detection results
Type of input data | Model training time | Accuracy
With parameters, with repetition | 49 minutes 53.29 seconds | 98.5%
Without parameters, with repetition | 42 minutes 15.55 seconds | 97.0%
With parameters, deduplicated | 55 minutes 32.01 seconds | 98.5%
Without parameters, deduplicated | 46 minutes 25.35 seconds | 94.5%
The experiments show that the lowest accuracy, 94.5%, is obtained with input data that has no parameters and is deduplicated, and the highest accuracy, 98.5%, is obtained with input data that includes parameters. This is expected, because data without parameters loses information relative to the original data, which slightly lowers accuracy. Such data may still be used when detection time requirements are strict. Because the test set is unknown to the model, the method can effectively detect malicious software with unknown characteristics.
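To make the four input variants of Table 2 concrete, the sketch below derives them from one small API call trace. The trace, the token format, and the function names are hypothetical. The text does not define how deduplication is performed, so this sketch assumes it means collapsing consecutive identical tokens.

```python
from itertools import groupby

# Hypothetical trace: (API name, arguments) pairs in call order.
trace = [
    ("CreateFileW", ("C:\\tmp\\a.txt",)),
    ("WriteFile", ("handle1",)),
    ("WriteFile", ("handle1",)),
    ("WriteFile", ("handle2",)),
    ("CloseHandle", ("handle1",)),
]


def to_tokens(calls, with_params):
    """Turn a trace into a token sequence, optionally keeping the arguments."""
    tokens = []
    for name, args in calls:
        if with_params:
            tokens.append(name + "(" + ",".join(args) + ")")
        else:
            tokens.append(name)
    return tokens


def collapse_repeats(tokens):
    """Assumed deduplication: keep one token per run of identical neighbours."""
    return [token for token, _ in groupby(tokens)]


variants = {
    "with parameters, with repetition": to_tokens(trace, True),
    "without parameters, with repetition": to_tokens(trace, False),
    "with parameters, deduplicated": collapse_repeats(to_tokens(trace, True)),
    "without parameters, deduplicated": collapse_repeats(to_tokens(trace, False)),
}

for label, tokens in variants.items():
    print(label, len(tokens), tokens)
```

Under this assumption the variant without parameters and with deduplication is the shortest sequence, which is consistent with the remark that it carries the least information.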
Finally, the modules were combined into an automatic detection system and tested. In the training stage, a yara rule base containing 5904 known rules was first constructed in the static auxiliary detection module, and 4000 pieces of training data (1816 benign files and 2184 malicious files) were provided to the dynamic behavior classification module. After training, 2000 pieces of test data (449 benign files and 1551 malicious files) were used in the testing stage. The static auxiliary detection module detected 1042 malicious files and 0 benign files, so its detection rate for malicious files is 67.18% and its false alarm rate is 0%. The static detection phase took 71.79 seconds in total, an average single-sample detection time of 0.035 seconds. The 958 files not detected in the first stage then entered the dynamic behavior classification module, and the final results were produced in the judging stage. Only 9 files were classified incorrectly, so the detection accuracy of this stage is 99.06%. The overall detection accuracy of the method is 99.5%.
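The percentages reported for the combined system can be recomputed from the counts given in the text. The sketch below does this arithmetic; the variable names are illustrative, and it assumes that every file flagged by the static stage was flagged correctly, which matches the stated 0 benign detections.

```python
# Counts taken from the integrated system test described above.
total_files = 2000
malicious_files = 1551
static_hits = 1042        # malicious files matched by the yara rule base
static_seconds = 71.79
dynamic_errors = 9        # files misclassified by the dynamic stage

# Stage 1: share of malicious files caught by static rules.
static_detection_rate = round(100.0 * static_hits / malicious_files, 2)

# Files without a static match are forwarded to the dynamic stage.
forwarded = total_files - static_hits

# Stage 2: accuracy on the forwarded files only.
dynamic_accuracy = round(100.0 * (forwarded - dynamic_errors) / forwarded, 2)

# Overall: only the dynamic-stage errors count as mistakes (assumption above).
overall_accuracy = round(100.0 * (total_files - dynamic_errors) / total_files, 2)

avg_static_seconds = round(static_seconds / total_files, 4)

print(static_detection_rate, forwarded, dynamic_accuracy,
      overall_accuracy, avg_static_seconds)
```

The recomputed overall accuracy is 99.55% and the average static time is about 0.0359 seconds; the text reports these as 99.5% and 0.035 seconds, which looks like truncation.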
The experiments fully show that the method can effectively detect the malicious codes, can realize batch rapid detection in the scene of massive samples, and has high detection efficiency. In addition, the method based on deep learning can realize the detection of unknown characteristic samples.
It is to be understood that the above-described embodiments are only some, and not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Claims (10)

1. An unknown malicious code detection method for massive Windows software is characterized by comprising the following steps:
preprocessing target software: screening out Windows platform executable files with standard formats, preliminarily judging whether the files are malicious or benign, and taking the files as malicious samples if the files are malicious;
carrying out static auxiliary detection on the malicious sample: combining the sensitive character string and the ImpHash value, automatically generating a rule aiming at a malicious sample, constructing a feature library of the malicious sample according to the rule and the existing rule, judging whether the malicious sample matches the rule of the feature library, judging the malicious sample to be malicious if the malicious sample matches the rule of the feature library, and judging the malicious sample to be benign if the malicious sample does not match the rule of the feature library;
carrying out dynamic behavior classification on the malicious samples: dynamically running the malicious samples judged to be benign, acquiring an API (application program interface) call sequence during dynamic running, inputting the API call sequence into the deep neural network model for classification, and judging whether the samples are malicious or benign;
and finally judging the malicious sample as malicious software if the malicious sample is judged to be malicious by either one of static auxiliary detection and dynamic behavior classification, and otherwise judging the malicious sample as benign software.
2. The method of claim 1, wherein the preprocessing method is to check the format of the target software, screen out the Windows platform executable file with a standard format, and perform a preliminary determination of whether it is benign or malicious by using the online detection tool VirusTotal.
3. The method of claim 1, wherein the static auxiliary detection is performed using the malware pattern matching tool yara, and the existing rules are yara rule bases including the official yara-rules library and yara rule bases converted from ClamAV signatures.
4. The method as claimed in claim 1, wherein the sensitive character strings are extracted by first obtaining the printable character strings of the malicious samples, then deleting all character strings that exist in benign software collected in advance so as to retain a malicious software character string set, and finally screening out a sensitive character string set, wherein the set comprises a certain number of URLs, IPs, hashes and files, sensitive system locations and registry paths.
5. The method as claimed in claim 1, wherein the rules for malicious samples are generated in combination with ImpHash values by creating a hash based on the library/API names in the import address table and their specific order in the executable file, determining that files have the same import address table if they have the same ImpHash value, and determining that such files are compiled from the same source code using the same coding scheme, thereby identifying related malware.
6. The method of claim 1, in which the deep neural network model is a textCNN deep neural network model.
7. The method of claim 6, wherein the textCNN deep neural network model comprises a convolutional layer, a pooling layer, a concatenation layer, a fully-connected layer, and a classification layer; the convolutional layer comprises three types of convolution kernels, 128 of each type, with heights of 3, 4 and 5 respectively and a width equal to the width of the word vectors, the convolution being one-dimensional; and the pooling layer uses max pooling to generate as many feature maps as the total number of convolution kernels.
8. The method of claim 6, wherein each API is expanded into a word vector, so that the one-dimensional text of the API call sequence is converted into a two-dimensional matrix, which is then input into the textCNN deep neural network model.
9. The method of claim 1, wherein a malware analysis system Cuckoo Sandbox is established based on a virtualized environment, a malicious sample is dynamically run using the Cuckoo Sandbox, API call traces and file and network system operations are recorded, and API sequences are extracted from the resulting Json log file.
10. An unknown malicious code detection system for massive Windows software is characterized by comprising:
a preprocessing module: used for screening out the Windows platform executable files with standard formats, preliminarily judging whether the files are malicious or benign, and taking the files as malicious samples if the files are malicious;
a static auxiliary detection module: used for combining the sensitive character string and the ImpHash value, automatically generating a rule aiming at a malicious sample, constructing a feature library of the malicious sample according to the rule and the existing rule, judging whether the malicious sample matches the rule of the feature library, judging the malicious sample to be malicious if the malicious sample matches the rule of the feature library, and judging the malicious sample to be benign if the malicious sample does not match the rule of the feature library;
a dynamic behavior classification module: used for dynamically running the malicious samples judged to be benign, acquiring an API call sequence during dynamic running, inputting the API call sequence into the deep neural network model for classification, and judging whether the samples are malicious or benign;
a judging module: used for judging whether the malicious sample is malicious or benign, finally judging the malicious sample as malicious software if it is judged to be malicious by either one of static auxiliary detection and dynamic behavior classification, and otherwise judging it as benign software.
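The two-stage decision described in claims 1, 5 and 10 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the ImpHash helper follows the commonly used convention (lower-cased `library.function` names with the library extension stripped, joined by commas in import order, then MD5), and the sample dictionaries, `static_match` and `dynamic_classify` are hypothetical stand-ins for the yara feature library and the deep neural network.

```python
import hashlib


def imphash(imports):
    """MD5 over the ordered, lower-cased 'library.function' import names."""
    parts = []
    for library, function in imports:
        library = library.lower()
        for ext in (".dll", ".ocx", ".sys"):
            if library.endswith(ext):
                library = library[: -len(ext)]
                break
        parts.append(f"{library}.{function.lower()}")
    return hashlib.md5(",".join(parts).encode("utf-8")).hexdigest()


def final_verdict(sample, static_match, dynamic_classify):
    """Malicious if either stage says so; dynamic stage runs only on static misses."""
    if static_match(sample):
        return "malicious"
    return "malicious" if dynamic_classify(sample) else "benign"


# Hypothetical feature library holding one known ImpHash value.
known_imphashes = {
    imphash([("KERNEL32.dll", "CreateFileW"), ("WS2_32.dll", "connect")])
}


def static_match(sample):
    return imphash(sample["imports"]) in known_imphashes


def dynamic_classify(sample):
    # Stand-in for the neural network: flags one API name for illustration.
    return "WriteProcessMemory" in sample["api_calls"]


sample_a = {"imports": [("kernel32.dll", "CreateFileW"), ("ws2_32.dll", "connect")],
            "api_calls": []}
sample_b = {"imports": [("ws2_32.dll", "connect"), ("kernel32.dll", "CreateFileW")],
            "api_calls": ["OpenProcess", "WriteProcessMemory"]}
sample_c = {"imports": [("user32.dll", "MessageBoxW")],
            "api_calls": ["MessageBoxW"]}

verdicts = [final_verdict(s, static_match, dynamic_classify)
            for s in (sample_a, sample_b, sample_c)]
print(verdicts)
```

Because the hash depends on import order, `sample_b` does not match the static rule even though it imports the same functions as `sample_a`; it is only caught by the dynamic stage, which is the situation the second stage is meant to cover.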
CN202010305550.XA 2020-04-17 2020-04-17 Unknown malicious code detection method and system for massive Windows software Active CN111639337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010305550.XA CN111639337B (en) 2020-04-17 2020-04-17 Unknown malicious code detection method and system for massive Windows software

Publications (2)

Publication Number Publication Date
CN111639337A true CN111639337A (en) 2020-09-08
CN111639337B CN111639337B (en) 2023-04-07

Family

ID=72332703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010305550.XA Active CN111639337B (en) 2020-04-17 2020-04-17 Unknown malicious code detection method and system for massive Windows software

Country Status (1)

Country Link
CN (1) CN111639337B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100180344A1 (en) * 2009-01-10 2010-07-15 Kaspersky Labs ZAO Systems and Methods For Malware Classification
CN108595955A (en) * 2018-04-25 2018-09-28 东北大学 A kind of Android mobile phone malicious application detecting system and method
CN109784056A (en) * 2019-01-02 2019-05-21 大连理工大学 A kind of malware detection method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Haijian; Fang Zhou; Chen Xin: "Malicious APP detection scheme based on deep learning technology" *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257757A (en) * 2020-09-27 2021-01-22 北京锐服信科技有限公司 Malicious sample detection method and system based on deep learning
CN112347479A (en) * 2020-10-21 2021-02-09 北京天融信网络安全技术有限公司 False alarm correction method, device, equipment and storage medium for malicious software detection
CN112347479B (en) * 2020-10-21 2021-08-24 北京天融信网络安全技术有限公司 False alarm correction method, device, equipment and storage medium for malicious software detection
CN112926054A (en) * 2021-02-22 2021-06-08 亚信科技(成都)有限公司 Malicious file detection method, device, equipment and storage medium
CN112926054B (en) * 2021-02-22 2023-10-03 亚信科技(成都)有限公司 Malicious file detection method, device, equipment and storage medium
CN113221109B (en) * 2021-03-30 2022-06-28 浙江工业大学 Intelligent malicious file analysis method based on generation countermeasure network
CN113221109A (en) * 2021-03-30 2021-08-06 浙江工业大学 Intelligent malicious file analysis method based on generation countermeasure network
CN113378156A (en) * 2021-07-01 2021-09-10 上海观安信息技术股份有限公司 Malicious file detection method and system based on API
CN113378156B (en) * 2021-07-01 2023-07-11 上海观安信息技术股份有限公司 API-based malicious file detection method and system
CN113761912A (en) * 2021-08-09 2021-12-07 国家计算机网络与信息安全管理中心 Interpretable judging method and device for malicious software attribution attack organization
CN113761912B (en) * 2021-08-09 2024-04-16 国家计算机网络与信息安全管理中心 Interpretable judging method and device for malicious software attribution attack organization
CN114679331A (en) * 2022-04-11 2022-06-28 北京国联天成信息技术有限公司 AI technology-based malicious code passive detection method and system
CN114679331B (en) * 2022-04-11 2024-02-02 北京国联天成信息技术有限公司 AI technology-based malicious code passive detection method and system
CN116226854A (en) * 2023-05-06 2023-06-06 江西萤火虫微电子科技有限公司 Malware detection method, system, readable storage medium and computer
CN117034275A (en) * 2023-10-10 2023-11-10 北京安天网络安全技术有限公司 Malicious file detection method, device and medium based on Yara engine
CN117034275B (en) * 2023-10-10 2023-12-22 北京安天网络安全技术有限公司 Malicious file detection method, device and medium based on Yara engine

Also Published As

Publication number Publication date
CN111639337B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN111639337B (en) Unknown malicious code detection method and system for massive Windows software
Khan et al. Analysis of ResNet and GoogleNet models for malware detection
US10880328B2 (en) Malware detection
Alazab et al. A hybrid wrapper-filter approach for malware detection
CN107688743B (en) Malicious program detection and analysis method and system
CN109271788B (en) Android malicious software detection method based on deep learning
Sun et al. Malware family classification method based on static feature extraction
KR101858620B1 (en) Device and method for analyzing javascript using machine learning
CN109598124A (en) A kind of webshell detection method and device
RU91213U1 (en) SYSTEM OF AUTOMATIC COMPOSITION OF DESCRIPTION AND CLUSTERING OF VARIOUS, INCLUDING AND MALIMENTAL OBJECTS
Ullah et al. Clone detection in 5G-enabled social IoT system using graph semantics and deep learning model
CN104680065A (en) Virus detection method, virus detection device and virus detection equipment
CN111651768B (en) Method and device for identifying link library function name of computer binary program
Kakisim et al. Sequential opcode embedding-based malware detection method
CN109933977A (en) A kind of method and device detecting webshell data
CN113468524B (en) RASP-based machine learning model security detection method
Lajevardi et al. Markhor: malware detection using fuzzy similarity of system call dependency sequences
CN114595451A (en) Graph convolution-based android malicious application classification method
CN108229168B (en) Heuristic detection method, system and storage medium for nested files
CN113971284B (en) JavaScript-based malicious webpage detection method, equipment and computer readable storage medium
Hang et al. Malware detection method of android application based on simplification instructions
CN110990834A (en) Static detection method, system and medium for android malicious software
CN115545091A (en) Integrated learner-based malicious program API (application program interface) calling sequence detection method
Guo et al. Classification of malware variant based on ensemble learning
Wen et al. CNN based zero-day malware detection using small binary segments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant