CN111639337B

CN111639337B - Unknown malicious code detection method and system for massive Windows software

Info

Publication number: CN111639337B
Application number: CN202010305550.XA
Authority: CN
Inventors: 贾晓启; 李帅; 陈阳; 杜海超; 白璐; 解亚敏; 唐静
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2020-04-17
Filing date: 2020-04-17
Publication date: 2023-04-07
Anticipated expiration: 2040-04-17
Also published as: CN111639337A

Abstract

The invention discloses a method and a system for detecting unknown malicious code in massive Windows software, and belongs to the technical field of system security. It aims to solve the problem that traditional feature-code-based (signature-based) detection cannot detect unknown malicious code. The invention combines the advantages of dynamic detection and static detection: it uses deep learning to detect malicious code with unknown features, and uses a static-feature auxiliary detection method to accelerate detection in massive-sample scenarios, thereby improving detection efficiency.

Description

Unknown malicious code detection method and system for massive Windows software

Technical Field

The invention belongs to the technical field of system security, relates to a malicious code detection method, and particularly relates to an unknown malicious code detection method and system suitable for massive Windows platform software.

Background

The rapid development of computer and Internet technology has had an increasingly significant impact and has brought great changes to fields such as the economy, culture, politics, medical care and education. However, while enjoying these benefits, people inevitably have to consider security issues, most typically the attack and proliferation of malicious code. Malicious code, also known as malware, is sometimes also referred to as adware or spyware. It refers to software that is installed and run on a user's computer or other terminal without explicitly prompting the user or without the user's permission, and that infringes the user's legitimate rights and interests. In the last half of 2018, the 360 Internet Security Center intercepted a cumulative total of 140 million newly added malicious program samples. Of these, about 141 million (14,099.8 ten-thousand) were new PC-side malicious program samples, an average of about 779,000 (77.9 ten-thousand) newly intercepted per day. PC-side malicious programs thus account for 97.9% of all malicious programs, so research on malware under the Windows platform is necessary.

Malware is growing ever faster, the number of variants is increasing, and family characteristics are pronounced. In addition, advanced malicious code usually adopts sophisticated techniques to counter security analysts and protect itself, and measures such as packing and obfuscation increase the difficulty of reverse analysis. Malicious code causes great harm and typically exhibits one or more of the following behaviors: forced installation, browser hijacking, stealing or modifying user data, malicious collection of user information, malicious uninstallation, malicious bundling, and other malicious behaviors that violate the user's right to know and right to choose. These behaviors seriously infringe the legitimate interests of the user and can even bring enormous economic or other losses to the user and to others. For example, in 2017 the WannaCry ransomware worm infected more than 10 million computers in over 100 countries and regions and resulted in losses of at least 80 billion dollars. Other well-known examples include the Flame virus, the Stuxnet virus, Panda Burning Incense and Dark Cloud III. To avoid greater losses, malicious code needs to be analyzed and effective malicious code detection technologies need to be researched.

Methods for analyzing malicious code can generally be divided into static analysis and dynamic analysis. Static analysis refers to analysis performed without executing the binary program; techniques include static disassembly analysis, static source code analysis, binary statistical analysis, decompilation analysis and the like. The shortcomings of existing static analysis are that obfuscated, packed and polymorphic malicious code is difficult to analyze accurately, and that traditional static methods have low detection accuracy, especially for malicious code with unknown features. Dynamic analysis refers to executing the malicious code and using program debugging tools to trace and observe it in order to determine its behavior. Researchers have developed a number of dynamic analysis tools for analyzing malicious samples, primarily by extracting the API sequence of system calls. The shortcomings of existing dynamic analysis are that it is time-consuming and relatively costly. In addition, traditional machine learning methods require features to be selected manually, which increases the difficulty of manual analysis.

Several methods currently exist for detecting malicious code. Traditional feature-code-based (signature-based) detection extracts feature codes from intercepted samples, records them in a database and then performs matching; however, it depends on the feature library, lags behind new threats, and cannot detect samples with unknown features. Traditional heuristic detection depends on expert knowledge and experience and requires manpower to establish heuristic rules; in a massive-sample scenario, constructing rules consumes a great deal of time and human resources, so detection efficiency is low. More recently, machine learning methods have emerged that detect malicious code by extracting effective features and combining them with classification algorithms. However, the features in traditional machine learning methods are screened manually, the quality of feature selection directly affects the detection result, and the manual work needed to extract and screen features makes the labor cost high.

Deep neural networks have proved to have strong learning ability and can benefit from very large training sets; they can learn the latent features of malware and thereby detect samples with unknown features. In addition, they can extract malware features automatically from raw data, so malware signatures do not need to be designed manually and excessive dependence on expert knowledge is avoided. Deep-learning-based malware detection is also efficient to train, with training time linearly related to the amount of malware. Such a detection network can run on a GPU, which is essentially a mandatory component of all PCs, so more malware can be analyzed per unit time. Therefore, a deep learning method can learn the latent features of malicious code from historical data and achieve detection of unknown malware.

With the rapid development of the Internet in recent years, the number and types of malware have increased rapidly, propagation modes are continuously updated, and the resulting losses keep growing. Faced with a massive number of samples, rapidly and efficiently identifying malware with unknown features has become a challenge. Traditional feature-code-based detection cannot detect samples with unknown features. Heuristic detection consumes manpower and is inefficient when dealing with a large number of samples. In addition, the features in traditional machine learning methods depend on manual extraction and screening, so the labor cost is high. Faced with the large volume of malware on the Windows platform, manual analysis that depends heavily on expert knowledge is impractical because it consumes enormous time and human resources; a method capable of automatic detection therefore needs to be designed.

Disclosure of Invention

The invention aims to design and implement a deep-learning-based malicious code detection method and system that overcome the inability of traditional feature-code-based detection to detect unknown malicious code. The invention also introduces a static auxiliary detection method for massive samples, which accelerates detection in large-scale sample scenarios and is able to identify related malicious code, helping malware researchers discover risks in time and avoid greater losses.

The technical scheme adopted by the invention is as follows:

a method for detecting unknown malicious codes for massive Windows software comprises the following steps:

preprocessing target software: screening out Windows platform executable files with standard formats, preliminarily judging whether the files are malicious or benign, and if the files are malicious, taking the files as malicious samples;

carrying out static auxiliary detection on the malicious sample: combining the sensitive character string and the ImpHash value, automatically generating a rule aiming at a malicious sample, constructing a feature library of the malicious sample according to the rule and the existing rule, judging whether the malicious sample matches the rule of the feature library, judging the malicious sample to be malicious if the malicious sample matches the rule of the feature library, and judging the malicious sample to be benign if the malicious sample does not match the rule of the feature library;

carrying out dynamic behavior classification on the malicious sample: dynamically running the malicious samples judged to be benign by the static auxiliary detection, acquiring the API call sequence during dynamic running, inputting the API call sequence into a deep neural network model for classification, and judging whether the sample is malicious or benign;

and finally judging the malicious sample as malicious software if the malicious sample is judged to be malicious by one of static auxiliary detection and dynamic behavior classification, otherwise, judging the malicious sample as benign software.

Further, the preprocessing method is to check the format of the target software, screen out the Windows platform executable file with a standard format, and then preliminarily determine whether the Windows platform executable file is benign or malicious by using an online detection tool VirusTotal.

Further, static auxiliary detection is performed using the malware pattern matching tool yara, and a yara rule base is used as the existing rule base, which comprises the official yara-rules library and a yara rule base converted from ClamAV feature codes.

Further, the sensitive character strings are extracted as follows: the printable character strings of the malicious sample are first obtained; all character strings that also exist in benign software collected in advance are then deleted, leaving a set of malware character strings; finally, a sensitive character string set is screened out, which comprises a certain number of URLs, IPs, hashes, file names, system-sensitive locations and registry paths.
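The extraction step described above can be sketched as follows. This is only an illustration under stated assumptions: the minimum string length of 6, the regular expressions, and the function names `printable_strings` and `sensitive_strings` are hypothetical choices, not values given in the text.

```python
from __future__ import annotations

import re

# Runs of printable ASCII; the minimum length of 6 is an assumed threshold.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{6,}")

# Assumed patterns for the "sensitive" categories named in the text.
SENSITIVE_PATTERNS = {
    "url": re.compile(r"https?://[^\s\"'<>]+", re.I),
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "hash": re.compile(r"\b(?:[0-9a-f]{32}|[0-9a-f]{40}|[0-9a-f]{64})\b", re.I),
    "registry": re.compile(r"\b(?:HKEY_[A-Z_]+|HKLM|HKCU)\\[^\s\"']+", re.I),
    "system_path": re.compile(
        r"[A-Za-z]:\\Windows\\[^\s\"']+|%(?:APPDATA|TEMP|SYSTEMROOT)%[^\s\"']*", re.I
    ),
}


def printable_strings(data: bytes) -> set[str]:
    """Step 1: collect the printable character strings of a sample."""
    return {m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)}


def sensitive_strings(sample: bytes, benign_strings: set[str]) -> dict[str, list[str]]:
    """Steps 2 and 3: drop strings seen in benign software, then keep sensitive ones."""
    candidates = printable_strings(sample) - benign_strings
    result: dict[str, list[str]] = {kind: [] for kind in SENSITIVE_PATTERNS}
    for text in sorted(candidates):
        for kind, pattern in SENSITIVE_PATTERNS.items():
            if pattern.search(text):
                result[kind].append(text)
                break  # assign each string to the first category it matches
    return result
```

In practice the benign string set would be built by running `printable_strings` over the benign software collected in advance and taking the union of the results.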

Further, the rule for the malicious sample is generated in combination with the ImpHash value: a hash is created from the library/API names in the import address table and their specific order in the executable file. If files have the same ImpHash value, they are judged to have the same import address table and to have been compiled from the same source code in the same way, so that related malware is identified.

Further, the deep neural network model is a textCNN deep neural network model.

Further, the textCNN deep neural network model comprises a convolution layer, a pooling layer, a concatenation layer, a fully connected layer and a classification layer. The convolution layer comprises three types of convolution kernels, 128 of each type, with heights of 3, 4 and 5 respectively; the width of each kernel equals the width of the word vectors, so the convolution is one-dimensional. The pooling layer uses max pooling to generate as many feature maps as the total number of convolution kernels.

Further, each API is expanded into a word vector, so that the one-dimensional text of the API call sequence is converted into a two-dimensional matrix, which is then input into the textCNN deep neural network model.

Further, the malicious program analysis system Cuckoo Sandbox is set up in a virtualization environment; the malicious sample is run dynamically in Cuckoo Sandbox, API call traces and system operations on files and the network are recorded, and the API sequence is extracted from the resulting JSON log file.

An unknown malicious code detection system for massive Windows software comprises:

a preprocessing module: used for screening out Windows platform executable files with a standard format, preliminarily judging whether the files are malicious or benign, and taking the files as malicious samples if they are malicious;

a static auxiliary detection module: the method is used for combining the sensitive character string and the ImpHash value, automatically generating a rule aiming at a malicious sample, constructing a feature library of the malicious sample according to the rule and the existing rule, judging whether the malicious sample matches the rule of the feature library, judging the malicious sample to be malicious if the malicious sample matches the rule of the feature library, and judging the malicious sample to be benign if the malicious sample does not match the rule of the feature library;

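a dynamic behavior classification module: used for dynamically running the malicious samples judged to be benign by the static auxiliary detection, acquiring the API call sequence during dynamic running, inputting the API call sequence into the deep neural network model for classification, and judging whether the samples are malicious or benign;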

a judging module: for judging whether a malicious sample is malicious or benign, if the malicious sample is judged to be malicious by one of static auxiliary detection and dynamic behavior classification, finally judging the software to be malicious software, otherwise, judging the software to be benign software.

The invention has the beneficial effects that:

the invention provides an unknown malicious code detection method facing mass Windows software based on understanding of malicious code analysis technology and deep learning research, combines the advantages of dynamic detection and static detection, uses the deep learning detection technology to realize detection of malicious codes with unknown characteristics, uses the static characteristic auxiliary detection method to accelerate detection in the scene of mass samples, and improves the detection efficiency.

Drawings

FIG. 1 is a flow chart of an unknown malicious code detection method for massive Windows software according to the present invention;

FIG. 2 is a schematic diagram of the static yara rule base construction of the present invention;

FIG. 3 is a diagram of the textCNN model of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following examples and the accompanying drawings.

This embodiment discloses a method and a system for detecting unknown malicious code in massive Windows software; the processing flow is shown in FIG. 1. The system comprises a preprocessing module, a static auxiliary detection module, a dynamic behavior classification module and a judging module, which are described in detail below.

1) Pre-processing module

The preprocessing module parses the loaded file, checks its format and judges whether it is a Windows platform executable file with a standard format. If the sample is valid, it is checked with the online detection tool VirusTotal to preliminarily judge whether it is benign or malicious, which completes the labeling work. The labeled samples can be used to construct a proprietary data set. The preprocessing module is mainly used for cleaning and labeling data and for providing input data to the subsequent detection process.
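The format check can be illustrated with a minimal sketch. It assumes that "standard format" means at least a valid DOS header whose `e_lfanew` field points at a `PE\0\0` signature; the function name `is_standard_pe` is hypothetical, and the text does not specify which checks the module actually performs.

```python
import struct


def is_standard_pe(data: bytes) -> bool:
    """Return True if the bytes look like a Windows PE executable.

    Checks only the DOS magic ("MZ") and the PE signature that the DOS
    header points to; a real preprocessing module may validate far more.
    """
    # The DOS header is 0x40 bytes long and starts with the "MZ" magic.
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew: little-endian 32-bit offset of the PE header, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if pe_offset + 4 > len(data):
        return False
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"
```

A file that fails this check would be discarded before VirusTotal labeling and the later detection stages.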

2) Static auxiliary detection module

Static auxiliary detection is performed first on the input samples. The malware pattern matching tool yara is used, and a yara rule base is constructed to serve as the malware feature library, which speeds up detection in massive-sample scenarios. The malicious code feature library built for the static detection stage can be divided into two parts. The first is the existing rule base, comprising the official yara-rules library and the yara rule base converted from ClamAV feature codes. The second is a self-maintained rule base, which mainly holds the static detection rules generated automatically for malicious samples.

The static detection rules for malicious samples are generated by an automatic feature-code generation scheme: the rule file is generated automatically by combining sensitive character strings with ImpHash. For the strings, the printable character strings of a sample are first obtained; all character strings that also exist in benign files collected in advance are then deleted, leaving a set of malware character strings; finally, a sensitive character string set is screened out, comprising a certain number of URLs, IPs, hashes, file names, system-sensitive locations, registry paths and the like. ImpHash is then used to identify related malware (family variants). ImpHash creates a hash from the library/API names in the import address table and their specific order in the executable file; it is a powerful way to identify related malware because its value is relatively unique. This is because the compiler's linker generates and builds the Import Address Table (IAT) according to the specific order of the functions in the source file, so the ImpHash value changes when the order of the function calls changes. If two files have the same ImpHash value, they have the same IAT, which indicates that the files were compiled from the same source code in the same way. Related malware can therefore be identified using ImpHash.
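The order sensitivity of ImpHash can be illustrated with a simplified sketch. It follows the commonly used convention (lowercase names, library extension dropped, `library.function` entries joined by commas, MD5 of the result) and takes the import list as an already parsed input; parsing the PE import table and resolving imports by ordinal are omitted. The function name `imphash_from_imports` is hypothetical. With the third-party `pefile` library, the equivalent value is normally obtained from `pefile.PE(path).get_imphash()`.

```python
import hashlib


def imphash_from_imports(imports):
    """Compute an ImpHash-style digest.

    `imports` is an ordered list of (library_name, function_name) pairs,
    in the order they appear in the import table.
    """
    parts = []
    for library, function in imports:
        lib = library.lower()
        # The usual convention drops the library file extension.
        for ext in (".dll", ".ocx", ".sys"):
            if lib.endswith(ext):
                lib = lib[: -len(ext)]
                break
        parts.append(f"{lib}.{function.lower()}")
    # Order is preserved, so reordering the imports changes the digest.
    return hashlib.md5(",".join(parts).encode("ascii")).hexdigest()
```

Two samples whose import tables list the same functions in the same order yield the same digest, while swapping two entries yields a different one, which is the property the text relies on to group family variants.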

With the self-built malicious code feature library, static preliminary detection is first performed on the input Windows platform exe file. If a rule in the yara rule base is matched, the file is known malicious code or a variant thereof and goes directly to the judging module. Otherwise, the file automatically enters the dynamic behavior classification module for the next stage of detection.

3) Dynamic behavior classification module

The dynamic behavior classification module mainly obtains the API sequence of a program at run time by running the program dynamically. The API sequence is then taken as input text, each API is expanded into a word vector, and the result is fed into a textCNN model for training to complete binary classification. Cuckoo Sandbox is used to collect the dynamic behavior: it is a malicious program analysis system built on a virtualization environment that can automatically execute a program and analyze its behavior, recording API call traces and system operations on files, the network and the like. After the dynamic behavior information is obtained, the API call sequence is extracted from the JSON file of the result log, a textCNN-based deep neural network model is constructed for binary classification, and finally whether the software is malicious is judged.
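Extracting the API sequence from the result log can be sketched as follows. The layout assumed here (`behavior` → `processes` → `calls`, each call carrying an `api` name and an `arguments` mapping) reflects the report format of Cuckoo 2.x as commonly seen and should be checked against the deployed version; the function name `api_sequence` and the textual form used for parameters are hypothetical.

```python
import json


def api_sequence(report: dict, with_args: bool = False) -> list:
    """Flatten the API calls of all processes in a Cuckoo-style report."""
    sequence = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            name = call.get("api")
            if not name:
                continue
            if with_args:
                # Assumed representation: name(key=value,...) with sorted keys.
                args = call.get("arguments", {})
                arg_text = ",".join(f"{key}={args[key]}" for key in sorted(args))
                sequence.append(f"{name}({arg_text})")
            else:
                sequence.append(name)
    return sequence


def load_api_sequence(path: str, with_args: bool = False) -> list:
    """Read a report JSON file from disk and return its API sequence."""
    with open(path, "r", encoding="utf-8") as handle:
        return api_sequence(json.load(handle), with_args)
```

The `with_args` switch corresponds to the "with parameters" and "without parameters" input variants compared in the experiments.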

4) Judging module

The final judging module integrates the detection results of the first two stages and gives the final verdict on maliciousness. If the static auxiliary detection module in the first stage judges the file to be malicious, the final result for the sample is malicious and processing by the dynamic behavior classification module in the second stage is not needed, which saves time and improves efficiency. Otherwise, when nothing is matched in the first stage, the sample is processed by the dynamic behavior classification module, which gives the detection result of the second stage. The file is finally judged to be benign only if both stages judge it to be benign.
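The short-circuit logic of the judging module can be sketched as below. The two detector callables are placeholders supplied by the caller (hypothetical names `static_match` and `dynamic_classify`); the sketch only shows that the expensive dynamic stage runs when, and only when, the static stage finds no match.

```python
from typing import Callable


def judge(
    sample: bytes,
    static_match: Callable[[bytes], bool],
    dynamic_classify: Callable[[bytes], bool],
) -> str:
    """Return "malicious" if either stage flags the sample, else "benign"."""
    # Stage 1: rule matching; a hit ends the analysis immediately.
    if static_match(sample):
        return "malicious"
    # Stage 2: sandbox run plus classifier, reached only when stage 1 found nothing.
    return "malicious" if dynamic_classify(sample) else "benign"
```

Because the static stage costs well under a second per sample in the reported experiments, skipping the sandbox for every static hit is where most of the time saving comes from.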

The processing flow of the method and the system is shown in fig. 1, and the input is an executable program exe file of a Windows platform. Firstly, the samples are processed by a preprocessing module, batch uploading to VirusTotal is realized through programming for online detection, and the real type of the samples is determined according to the returned result, so that the labeling work is completed. And if the program is a legal Windows program, entering a next static auxiliary detection module.

In the static detection module, a pre-constructed rule base is used for preliminary detection of the input sample. The myyarrowulemaker is a tool for automatically generating rule files; it is used when constructing the rule base to save labor and to increase the degree of automation of the detection system. The static yara rule base is constructed as shown in FIG. 2: a rule file for an input sample is generated from its printable sensitive character strings and import table hash value, without requiring much expert knowledge or effort. If the sample matches a rule in the rule base, it goes directly to the judging module without dynamic detection. Otherwise, when no existing rule is matched, it enters the dynamic behavior classification module.
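How a rule file might be generated from sensitive strings and an import hash can be sketched as follows. This is not the tool named above; `build_yara_rule` is a hypothetical helper, and the condition (PE magic, then either an ImpHash match or at least two string matches) is an assumed way of combining the two features. The emitted text uses yara's `pe` module, whose `pe.imphash()` function is only available in builds with hashing support.

```python
import re


def _escape(text: str) -> str:
    """Escape backslashes and double quotes for a yara text string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_yara_rule(name, strings, imphash, min_strings=2):
    """Return the source text of one yara rule for a single sample."""
    # Rule identifiers may only contain ASCII letters, digits and underscores.
    ident = "auto_" + re.sub(r"\W", "_", name, flags=re.ASCII)
    lines = ['import "pe"', "", f"rule {ident}", "{"]
    clauses = [f'pe.imphash() == "{imphash.lower()}"']
    if strings:
        lines.append("    strings:")
        for index, text in enumerate(strings):
            lines.append(f'        $s{index} = "{_escape(text)}" ascii wide nocase')
        clauses.append(f"{min(min_strings, len(strings))} of ($s*)")
    lines.append("    condition:")
    # 0x5A4D is the little-endian "MZ" magic at offset 0.
    lines.append("        uint16(0) == 0x5A4D and (" + " or ".join(clauses) + ")")
    lines.append("}")
    return "\n".join(lines) + "\n"
```

The generated text would be appended to the self-maintained rule base and loaded alongside the existing rules at detection time.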

For an input sample, the dynamic behavior classification module first submits it to the Cuckoo sandbox environment for dynamic execution and obtains its API call sequence as one-dimensional text. The text is then converted into a two-dimensional matrix by word embedding, so that it becomes input data that the textCNN model can recognize, after which training and detection with the model are carried out.
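The conversion from a one-dimensional API sequence to a two-dimensional matrix can be sketched in plain Python. The padding and unknown-token handling, the random initialization, and the names `build_vocab`, `init_embeddings` and `to_matrix` are assumptions made for illustration; in a real model the embedding table would be learned during training.

```python
import random


def build_vocab(sequences):
    """Map every API name seen in training to an integer id."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for sequence in sequences:
        for api in sequence:
            vocab.setdefault(api, len(vocab))
    return vocab


def init_embeddings(vocab_size, dim, seed=0):
    """Create a vocab_size x dim table of small random word vectors."""
    rng = random.Random(seed)
    table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]
    table[0] = [0.0] * dim  # the padding row stays at zero
    return table


def to_matrix(sequence, vocab, table, max_len):
    """Turn an API sequence into a max_len x dim matrix (truncate or pad)."""
    ids = [vocab.get(api, vocab["<unk>"]) for api in sequence[:max_len]]
    ids += [vocab["<pad>"]] * (max_len - len(ids))
    return [list(table[i]) for i in ids]
```

Each row of the resulting matrix is the word vector of one API call, so the matrix has m rows (sequence length) and n columns (word vector width).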

A schematic of the textCNN model is shown in fig. 3:

The input sample is an m x n two-dimensional matrix converted from the API sequence. The deep learning model is based on textCNN. The convolution layer contains three types of convolution kernels, A, B and C, 128 of each type, with heights of 3, 4 and 5 respectively; the width of each kernel equals the width of the word vectors, i.e. the convolution is one-dimensional. After the convolution layer comes the pooling layer, which uses max pooling to generate a fixed number of feature maps equal to the total number of convolution kernels. The concatenation layer then joins these together and passes them to the fully connected layer, whose output serves as the input of the classification layer; a softmax classifier performs the classification, and the result of the dynamic stage is output as benign software or malicious software.
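The forward pass of this architecture can be sketched in plain Python to make the shapes concrete. The weights are random and untrained, the ReLU activation is an assumption (the text does not name one), and the class name `TextCNN` is illustrative; training by backpropagation is omitted. A kernel of height h applied to an m x n matrix yields m - h + 1 responses, max pooling keeps one value per kernel, and concatenation gives 3 x 128 = 384 features.

```python
import math
import random


def conv_max_pool(matrix, kernel, bias):
    """Slide one h x n kernel down the matrix, apply ReLU, keep the maximum."""
    height = len(kernel)
    best = 0.0  # ReLU output is never below zero
    for start in range(len(matrix) - height + 1):
        acc = bias
        for offset in range(height):
            acc += sum(x * w for x, w in zip(matrix[start + offset], kernel[offset]))
        best = max(best, acc)
    return best


class TextCNN:
    def __init__(self, embed_dim, heights=(3, 4, 5), kernels_per_height=128, classes=2, seed=0):
        rng = random.Random(seed)
        # One h x embed_dim weight matrix per kernel: 128 each of heights 3, 4 and 5.
        self.kernels = [
            [[rng.uniform(-0.1, 0.1) for _ in range(embed_dim)] for _ in range(h)]
            for h in heights
            for _ in range(kernels_per_height)
        ]
        self.kernel_bias = [0.0] * len(self.kernels)
        # Fully connected layer from the concatenated features to the classes.
        self.dense = [[rng.uniform(-0.1, 0.1) for _ in self.kernels] for _ in range(classes)]
        self.dense_bias = [0.0] * classes

    def features(self, matrix):
        """Convolution + max pooling + concatenation: one value per kernel."""
        return [conv_max_pool(matrix, k, b) for k, b in zip(self.kernels, self.kernel_bias)]

    def predict_proba(self, matrix):
        """Fully connected layer followed by a numerically stable softmax."""
        feats = self.features(matrix)
        logits = [sum(f * w for f, w in zip(feats, row)) + b
                  for row, b in zip(self.dense, self.dense_bias)]
        peak = max(logits)
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]
```

The two output probabilities would correspond to the benign and malicious classes; which index maps to which class is a labeling convention fixed at training time.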

The last component is the judging module, whose function is simple: it merely summarizes the results generated in the first two stages to make the final judgment. The software is judged to be benign only if the detection results of both stages are benign; otherwise it is judged to be malicious.

A specific application example is listed below:

The user is a virus analyst who needs to check in batches whether captured software is malware, and is looking for a rapid and efficient malware detection method. In this case, the malicious code detection method of the invention can provide technical support for software detection.

The user takes the executable program to be detected as input; in this example, an exe file under Windows. First, the file is preprocessed to check whether its format meets the exe file specification. If the file format is correct, analysis continues. The file then enters the static auxiliary detection module for static detection, which mainly performs rule matching against a precompiled rule base. Because the rule base is based on trusted malicious sample features collected historically, known types of malicious samples can be detected in this step. In addition, because ImpHash is used, variants of malicious code can also be detected.

After the static auxiliary detection module finishes, a detection result for the static stage is generated. If the static stage judges the sample to be malware, it goes directly to the judging stage. Otherwise, the dynamic behavior classification module performs the second stage of detection. First, samples are submitted to Cuckoo Sandbox in batches; after the program exits or hangs, the API call sequence and the parameters (args) of each call are extracted from the behavior log report. The sequence may also be de-duplicated: when a repeated API sequence such as API1, API2, API2 is encountered, only API1, API2 is retained. The sequence is then converted by word embedding into a two-dimensional matrix that the textCNN model can recognize and is fed into the model for detection, giving the result of the dynamic stage.
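The de-duplication step can be sketched as below, assuming it means collapsing consecutive repeats of the same call, which is what the API1, API2, API2 example suggests. The function name `collapse_consecutive` is hypothetical.

```python
from itertools import groupby


def collapse_consecutive(sequence):
    """Collapse runs of identical adjacent API calls into a single call."""
    # groupby yields one (key, run) pair per run of equal adjacent items.
    return [api for api, _run in groupby(sequence)]
```

Under this reading, a call that reappears later in the trace after other calls is kept, so the overall order of behavior is preserved while tight loops are shortened.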

Finally, the judging stage is entered: the file is finally judged to be benign only if the detection results of both preceding stages are benign; otherwise it is judged to be malicious. Detection of the batch of software is thus completed.

The effectiveness of the static auxiliary detection was first tested experimentally. The data set is introduced first: a total of 41125 malicious Windows programs from a period of about 5 years were crawled from malsharp, a malicious sample download repository published on the Internet, and divided in a proportion of 4:1 into a malicious sample set ST (31359 in total) for constructing static rules and a malicious sample set M (9766 in total) for detection. The benign file data set B consists of exe files extracted from freshly installed Windows operating systems, 5914 files in total, covering Windows XP through Windows Server 2016. The malicious file test set and the benign file test set were tested separately, and the number of rule matches and the detection time were recorded. The results are shown in Table 1.

TABLE 1 Static auxiliary detection results

Data set	Total samples	Samples detected	Detection rate	Total detection time	Detection time per sample
Malicious samples M	9766	8844	90.56%	3534.61 seconds	0.36 seconds
Benign samples B	5914	0	0%	2579.44 seconds	0.43 seconds

As can be seen from Table 1, the constructed yara rule base achieves a malicious sample detection rate of 90.56% and can identify a certain number of related samples. The detection rate of the rule base on benign samples is 0, i.e. the false alarm rate is 0%; because the feature library is built to detect malware, there should be no hits in the benign software data set, so this result meets expectations. The detection time for a single sample is 0.36 seconds and 0.43 seconds respectively, which is acceptable in a massive-sample detection scenario and addresses the problem of low detection efficiency in such scenarios.

In addition, the deep-learning-based malicious code detection method was tested. The training set consisted of 1907 pieces of software in total, including 1065 pieces of malware and 842 pieces of benign software; the test set consisted of 200 pieces of software, including 100 pieces of malware and 100 benign files. The constructed model was trained and tested, and the training time and test set accuracy for different training data were recorded, as shown in Table 2:

TABLE 2 Dynamic detection results

Type of input data	Training time	Accuracy
With parameters, with repetition	49 minutes 53.29 seconds	98.5%
Without parameters, with repetition	42 minutes 15.55 seconds	97.0%
With parameters, de-duplicated	55 minutes 32.01 seconds	98.5%
Without parameters, de-duplicated	46 minutes 25.35 seconds	94.5%

The experiments show that the lowest accuracy, 94.5%, is obtained when de-duplicated data without parameters are used for detection, and the highest accuracy, 98.5%, is obtained when input data with parameters are used. This is expected, because data without parameters lose information compared with the original data, resulting in a slight decrease in accuracy. Such data may be used when detection time is the main concern. Because the test set is unknown to the model, the method can effectively detect malware with unknown features.

Finally, a complete automatic detection system was assembled and tested. In the training stage, a yara rule base containing 5904 known rules was first constructed in the static auxiliary detection module, and 4000 training samples (1816 benign files and 2184 malicious files) were provided to the dynamic behavior classification module. In the testing stage after training, 2000 test samples (449 benign files and 1551 malicious files) were used. The static auxiliary detection module flagged 1042 malicious files and 0 benign files, giving a malicious file detection rate of 67.18% and a false alarm rate of 0%. The static detection stage took 71.79 seconds in total, an average of 0.035 seconds per sample. The 958 files not flagged in the first stage then entered the dynamic behavior classification module, and the final results were produced in the judging stage: only 9 files were classified incorrectly, so the detection accuracy of this stage is 99.06%. The overall detection accuracy of the method is 99.5%.
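The reported percentages can be reproduced from the counts in the paragraph above. The sketch assumes that every static hit is a true positive (consistent with the stated 0% false alarm rate), so the only errors in the whole pipeline are the 9 misclassified files of the dynamic stage; the function name `two_stage_metrics` is hypothetical.

```python
def two_stage_metrics(total, malicious, static_hits, dynamic_errors):
    """Derive the stage-level and overall rates from raw counts."""
    dynamic_inputs = total - static_hits  # files passed on to the second stage
    return {
        "static_detection_rate": static_hits / malicious,
        "dynamic_inputs": dynamic_inputs,
        "dynamic_accuracy": 1 - dynamic_errors / dynamic_inputs,
        "overall_accuracy": 1 - dynamic_errors / total,
    }


metrics = two_stage_metrics(total=2000, malicious=1551, static_hits=1042, dynamic_errors=9)
for name, value in metrics.items():
    print(name, value)
```

With these counts, 1042/1551 is about 67.18%, 949/958 is about 99.06%, and 1991/2000 is 99.55%, which matches the figure reported as 99.5%.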

The experiments fully show that the method can effectively detect the malicious codes, can realize batch rapid detection in the scene of massive samples, and has high detection efficiency. In addition, the method based on deep learning can realize the detection of unknown characteristic samples.

It is to be understood that the above-described embodiments are only some, and not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Claims

1. An unknown malicious code detection method for massive Windows software is characterized by comprising the following steps:

preprocessing target software: selecting the Windows platform executable files that have a standard format, preliminarily judging whether each file is malicious or benign, and taking the file as a malicious sample if it is judged malicious;

performing static auxiliary detection on the malicious sample: combining sensitive character strings and the ImpHash value to automatically generate a rule for the malicious sample, constructing a feature library of malicious samples from the generated rule and the existing rules, and judging whether the malicious sample matches a rule in the feature library, the malicious sample being judged malicious if it matches and benign if it does not; wherein the static auxiliary detection is performed using the malware pattern matching tool yara, with a yara rule base serving as the existing rule base, the yara rule base comprising the official yara-rules base and a yara rule base converted from ClamAV feature codes; and wherein the sensitive character strings are extracted by first obtaining the printable character strings of the malicious sample, then deleting all character strings that also occur in benign software collected in advance so that a set of malware character strings remains, and finally selecting from it a set of sensitive character strings, the set comprising a certain number of URLs, IPs, hashes, files, sensitive system locations, and registry paths;

performing dynamic behavior classification on the malicious samples: dynamically running the malicious samples that were judged benign, acquiring the API (application program interface) call sequence during the dynamic run, inputting the API call sequence into a deep neural network model for classification, and judging whether the sample is malicious or benign; wherein a malicious program analysis system, Cuckoo Sandbox, is set up in a virtualization environment, the malicious sample is dynamically run using the Cuckoo Sandbox, API call traces and system operations on files and the network are recorded, and the API sequence is extracted from the resulting JSON log file;
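Extracting the API sequence from the sandbox log can be sketched as below. The sketch assumes the report layout commonly produced by Cuckoo Sandbox, in which calls are listed under `behavior` → `processes` → `calls` and each call carries an `api` field; field names may differ between versions, and the inline report is a made-up example rather than real sandbox output.

```python
import json
from typing import List


def extract_api_sequence(report: dict) -> List[str]:
    """Collect API names in logged order from a Cuckoo-style report.

    Assumes report["behavior"]["processes"][i]["calls"][j]["api"].
    Calls are concatenated process by process; ordering all calls
    globally by timestamp would be a possible refinement.
    """
    sequence = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            api = call.get("api")
            if api:
                sequence.append(api)
    return sequence


# Hypothetical, heavily abbreviated report used only for illustration.
example_report = json.loads("""
{
  "behavior": {
    "processes": [
      {"pid": 100, "calls": [{"api": "NtCreateFile"}, {"api": "NtWriteFile"}]},
      {"pid": 101, "calls": [{"api": "RegOpenKeyExW"}]}
    ]
  }
}
""")

print(extract_api_sequence(example_report))
```

In practice the report would be loaded from the JSON file written by the sandbox for each analysis task instead of from an inline string.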

2. The method of claim 1, wherein the preprocessing checks the format of the target software, selects the Windows platform executable files that have a standard format, and makes a preliminary determination of whether each file is benign or malicious by using the online detection tool VirusTotal.
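A format check of this kind can be sketched as follows. The sketch assumes that "standard format" means a well-formed PE header, that is, an `MZ` signature followed by a `PE\0\0` signature at the offset stored at 0x3C; the VirusTotal lookup is not shown, and the function name is illustrative.

```python
import struct


def looks_like_pe(data: bytes) -> bool:
    """Return True if data starts with a plausible PE header."""
    # The DOS header is at least 0x40 bytes and begins with "MZ".
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew: little-endian 32-bit offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"


# Minimal synthetic header: "MZ", padding, e_lfanew = 0x40, then "PE\0\0".
fake_pe = b"MZ" + b"\x00" * 58 + struct.pack("<I", 0x40) + b"PE\x00\x00"
print(looks_like_pe(fake_pe))
print(looks_like_pe(b"#!/bin/sh\n"))
```

A production filter would typically go further, for example by parsing the section table, but this check is enough to discard files that are not Windows executables at all.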

3. The method of claim 1, wherein the rule for a malicious sample is generated in combination with the ImpHash value as follows: a hash is created from the library/API names in the import address table of the executable file and their specific order; files that have the same ImpHash value are determined to have the same import address table and to have been compiled from the same source code using the same coding scheme, whereby related malware is identified.
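The order-sensitive hash described here can be sketched as below. The sketch follows the widely used ImpHash convention (lowercased `library.function` entries joined by commas and hashed with MD5) and takes an already parsed import list as input; parsing the import table from a PE file and handling imports by ordinal are not shown, and the example imports are illustrative.

```python
import hashlib
from typing import Iterable, Tuple


def imphash(imports: Iterable[Tuple[str, str]]) -> str:
    """Hash an ordered list of (library_name, function_name) imports."""
    parts = []
    for library, function in imports:
        lib = library.lower()
        # Drop the usual module extensions so that "KERNEL32.dll"
        # and "kernel32" contribute the same library name.
        for ext in (".dll", ".ocx", ".sys"):
            if lib.endswith(ext):
                lib = lib[: -len(ext)]
                break
        parts.append(f"{lib}.{function.lower()}")
    return hashlib.md5(",".join(parts).encode("ascii")).hexdigest()


imports_a = [("KERNEL32.dll", "CreateFileW"), ("ADVAPI32.dll", "RegOpenKeyExW")]
imports_b = list(reversed(imports_a))

print(imphash(imports_a))
print(imphash(imports_a) == imphash(imports_b))
```

Because the entries are joined in table order, two files with the same imports in a different order produce different hashes, which is what lets the value act as a fingerprint of how a binary was built.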

4. The method of claim 1, in which the deep neural network model is a textCNN deep neural network model.

5. The method of claim 4, wherein the textCNN deep neural network model comprises a convolutional layer, a pooling layer, a concatenation layer, a fully-connected layer, and a classification layer; the convolutional layer comprises three types of convolution kernels, 128 of each type, with heights of 3, 4 and 5 respectively and a width equal to the width of the word vectors, the convolution being one-dimensional; and the pooling layer uses max pooling to generate the same number of feature maps as the total number of convolution kernels.
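The convolution, pooling and concatenation stages recited in this claim can be illustrated with a dependency-free sketch. It uses randomly initialised kernels instead of trained weights, omits the bias terms and the fully-connected and classification layers, and assumes a ReLU activation, which the claim does not specify; all function names are illustrative.

```python
import random
from typing import Dict, List

Matrix = List[List[float]]


def conv_max_pool(matrix: Matrix, kernel: Matrix) -> float:
    """Slide one kernel down the sequence and max-pool its outputs."""
    height = len(kernel)
    best = 0.0  # starting at 0 applies ReLU (assumed) before pooling
    for start in range(len(matrix) - height + 1):
        total = 0.0
        for k in range(height):
            total += sum(x * w for x, w in zip(matrix[start + k], kernel[k]))
        best = max(best, total)
    return best


def random_kernels(dim: int, heights=(3, 4, 5), per_height: int = 128,
                   seed: int = 0) -> Dict[int, List[Matrix]]:
    """Create per_height kernels of shape (height, dim) for each height."""
    rng = random.Random(seed)
    return {
        h: [[[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(h)]
            for _ in range(per_height)]
        for h in heights
    }


def textcnn_features(matrix: Matrix, kernels: Dict[int, List[Matrix]]) -> List[float]:
    """Concatenate one pooled value per kernel into a feature vector."""
    features = []
    for height in sorted(kernels):
        for kernel in kernels[height]:
            features.append(conv_max_pool(matrix, kernel))
    return features


# Toy input: a sequence of 10 API calls with 8-dimensional word vectors.
rng = random.Random(1)
sequence_matrix = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(10)]
features = textcnn_features(sequence_matrix, random_kernels(dim=8))
print(len(features))
```

Since every kernel spans the full width of the word vectors, it only moves along the sequence axis, which is why the convolution is one-dimensional even though the input is a matrix; the 3 x 128 kernels yield a 384-element vector for the later layers.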

6. The method of claim 4, wherein each API is expanded into a word vector, so that the one-dimensional text of the API call sequence is converted into a two-dimensional matrix, which is then input into the textCNN deep neural network model.
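Turning an API call sequence into such a matrix can be sketched as follows. The vocabulary construction, the fixed length with padding, the mapping of unseen APIs to the padding index, and the random embedding table are all assumptions made for illustration; in a trained model the vectors would be learned.

```python
import random
from typing import Dict, List


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Assign an integer index to every API name; index 0 is padding."""
    vocab = {"<pad>": 0}
    for sequence in sequences:
        for api in sequence:
            vocab.setdefault(api, len(vocab))
    return vocab


def random_embeddings(vocab_size: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Stand-in for a learned embedding table of shape (vocab_size, dim)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]


def to_matrix(sequence: List[str], vocab: Dict[str, int],
              embeddings: List[List[float]], max_len: int) -> List[List[float]]:
    """Map a 1-D API sequence to a (max_len, dim) matrix of word vectors."""
    # Unknown APIs fall back to the padding index (an assumption).
    ids = [vocab.get(api, 0) for api in sequence[:max_len]]
    ids += [0] * (max_len - len(ids))  # pad short sequences
    return [embeddings[i] for i in ids]


sequences = [["NtCreateFile", "NtWriteFile"], ["RegOpenKeyExW"]]
vocab = build_vocab(sequences)
table = random_embeddings(len(vocab), dim=4)
matrix = to_matrix(sequences[0], vocab, table, max_len=5)
print(len(matrix), len(matrix[0]))
```

Each row of the result is the vector of one API call, so the row count is the sequence length and the column count is the word-vector width that the convolution kernels must match.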

7. An unknown malicious code detection system for massive Windows software is characterized by comprising:

a preprocessing module, used for selecting the Windows platform executable files that have a standard format, preliminarily judging whether each file is malicious or benign, and taking the file as a malicious sample if it is judged malicious;

a static auxiliary detection module, used for combining sensitive character strings and the ImpHash value to automatically generate a rule for the malicious sample, constructing a feature library of malicious samples from the generated rule and the existing rules, and judging whether the malicious sample matches a rule in the feature library, the malicious sample being judged malicious if it matches and benign if it does not; wherein the static auxiliary detection is performed using the malware pattern matching tool yara, with a yara rule base serving as the existing rule base, the yara rule base comprising the official yara-rules base and a yara rule base converted from ClamAV feature codes; and wherein the sensitive character strings are extracted by first obtaining the printable character strings of the malicious sample, then deleting all character strings that also occur in benign software collected in advance so that a set of malware character strings remains, and finally selecting from it a set of sensitive character strings, the set comprising a certain number of URLs, IPs, hashes, files, sensitive system locations, and registry paths;

a dynamic behavior classification module, used for dynamically running the malicious samples that were judged benign, acquiring the API call sequence during the dynamic run, inputting the API call sequence into the deep neural network model for classification, and judging whether the samples are malicious or benign; wherein a malicious program analysis system, Cuckoo Sandbox, is set up in a virtualization environment, the malicious sample is dynamically run using the Cuckoo Sandbox, API call traces and system operations on files and the network are recorded, and the API sequence is extracted from the resulting JSON log file;

a judging module, used for judging whether the malicious sample is malicious or benign: the sample is finally judged to be malicious software if either the static auxiliary detection or the dynamic behavior classification judges it to be malicious, and is otherwise judged to be benign software.
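The sensitive-string extraction recited for the static auxiliary detection module can be sketched as follows. The minimum string length, the regular expressions and the category names are illustrative assumptions, not values given in the text, and only four of the listed categories are shown.

```python
import re
from typing import Dict, Set

# Runs of at least 6 printable ASCII bytes (the length is an assumption).
PRINTABLE = re.compile(rb"[\x20-\x7e]{6,}")

# Illustrative patterns for some of the categories named in the claim.
SENSITIVE_PATTERNS = {
    "url": re.compile(r"https?://[^\s\"']+", re.I),
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "hash": re.compile(r"\b(?:[a-f0-9]{32}|[a-f0-9]{40}|[a-f0-9]{64})\b", re.I),
    "registry": re.compile(r"\b(?:HKEY_[A-Z_]+|HKLM|HKCU)\\[^\s\"']+", re.I),
}


def printable_strings(data: bytes) -> Set[str]:
    """Step 1: collect the printable strings of a sample."""
    return {m.group().decode("ascii") for m in PRINTABLE.finditer(data)}


def sensitive_strings(sample: bytes, benign_strings: Set[str]) -> Dict[str, Set[str]]:
    """Steps 2 and 3: drop strings seen in benign software, then keep
    only those that look like URLs, IPs, hashes or registry paths."""
    candidates = printable_strings(sample) - benign_strings
    result: Dict[str, Set[str]] = {}
    for text in candidates:
        for label, pattern in SENSITIVE_PATTERNS.items():
            if pattern.search(text):
                result.setdefault(label, set()).add(text)
    return result


# Synthetic sample bytes; the benign set would come from a benign corpus.
sample = (b"\x00\x01http://evil.example/payload.exe\x00kernel32.dll\x00"
          b"HKLM\\Software\\Microsoft\\Windows\\CurrentVersion\\Run\x00")
print(sensitive_strings(sample, benign_strings={"kernel32.dll"}))
```

The surviving strings, together with the ImpHash value, are the raw material from which a per-sample yara rule could then be written.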